March 01, 2025

LUC #75: Database Sharding Explained — Strategies for Scalable Database Management

This week’s issue brings you:

Database Sharding Explained: Strategies for Scalable Database Management
Kafka Explained (Recap)
How Elasticsearch Works (Recap)
URI vs URL vs URN — Do You Know The Differences? (Recap)

READ TIME: 5 MINUTES

Database Sharding Explained: Strategies for Scalable Database Management

What’s one of the most common bottlenecks for an application?

You guessed it—the database.

To keep our systems performant and scalable in the face of enormous volumes of data and the need for rapid processing, implementing database scaling solutions is crucial.

There are several database scaling solutions. One of the most powerful but also most complex is database sharding.

Today, we’ll explore database sharding, when and where to use it, and best practices.

Let’s dive in!

Understanding Database Sharding

Database sharding divides a database into smaller, more manageable segments known as "shards," which are distributed across various servers.

This approach differs from conventional scaling strategies like replication, which makes duplicate copies of data across several servers, and vertical scaling, which entails boosting the capacity of an already-existing server.

The main benefit of sharding is its capacity to distribute data throughout a network of computers, greatly enhancing scalability and performance.

Sharding takes two forms: horizontal sharding and vertical sharding.

Horizontal Sharding

Horizontal sharding, also known as data partitioning, splits a database by row.

Each shard holds the same schema but contains a different subset of the data.

This is done by applying a consistent sharding key or algorithm to distribute rows across multiple databases or servers.

For example, user data could be sharded based on geographic location or user IDs, so that all data related to a particular region or range of user IDs is stored together.
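As a rough illustration of how a sharding key maps rows to shards, here is a minimal sketch of two common routing schemes: hash-based and range-based. The shard count, the range boundaries, and the function names are all assumptions made up for this example.

```python
import bisect
import hashlib

NUM_SHARDS = 4  # hypothetical shard count

def shard_for_user(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Hash-based routing: a stable hash of the key, modulo the shard count."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Hypothetical boundaries: IDs below 1,000,000 go to shard 0, the next
# million to shard 1, and so on. Everything at or above the last bound
# lands on the final shard.
RANGE_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]

def shard_for_user_by_range(user_id: int) -> int:
    """Range-based routing: find the first range that contains the ID."""
    return bisect.bisect_right(RANGE_UPPER_BOUNDS, user_id)
```

Hash-based routing tends to spread rows evenly, while range-based routing keeps neighbouring IDs together, which can help range queries but risks hot shards if new IDs all land in the last range.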

Vertical Sharding

Vertical sharding divides a database by table. This method separates different tables or groups of tables into distinct databases, with each shard holding a subset of the tables.

Vertical sharding is particularly useful when certain tables grow significantly larger or are accessed more frequently than others. By isolating heavily accessed tables, vertical sharding can reduce the load on a single database server and improve performance for specific queries.
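In application code, vertical sharding often boils down to a lookup from table name to database. The sketch below assumes hypothetical table and database names; a real system would usually keep this mapping in configuration.

```python
# Hypothetical mapping of tables to the database (shard) that owns them.
TABLE_TO_DATABASE = {
    "users": "db-accounts",
    "orders": "db-orders",
    "order_items": "db-orders",    # kept with orders so joins stay local
    "page_views": "db-analytics",  # large, heavily written table isolated
}

def database_for_table(table: str) -> str:
    """Return the database that holds the given table."""
    try:
        return TABLE_TO_DATABASE[table]
    except KeyError:
        raise ValueError(f"no shard configured for table {table!r}") from None
```

Grouping tables that are frequently joined (here `orders` and `order_items`) onto the same database avoids turning those joins into cross-database queries.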

Both sharding techniques facilitate horizontal scaling, i.e., adding more machines to a system to spread the load.

How they do this differs.

Vertical sharding is table-centric, making it well-suited for databases where specific tables are disproportionately larger or heavily queried.

Horizontal sharding, on the other hand, is data-centric, which makes it better suited for evenly distributing a large volume of similar data across several servers.

Database Scaling Techniques That Should Be Exhausted Before Sharding

Before you start sharding your database, an important principle should be kept in mind.

You shouldn’t implement premature optimizations or attempt to scale your app before it’s actually needed. Implementing scaling solutions introduces complexities such as:

Adding new features takes longer
The system becomes more complex with more pieces and variables involved
Code can be more difficult to test
Finding and resolving bugs becomes harder

You should only accept these trade-offs if your app is at capacity. Keep the system simple, and don’t introduce scaling complexities unless they’re warranted.

Database sharding is complex. Several more straightforward solutions might address performance issues effectively:

Vertical scaling

Adding resources to your existing server may provide a short-term improvement in performance, but it has cost and scalability limits.

Database and query optimization

Significant speed improvements can be achieved here, with minimal complexity.

Connection pooling

Application speed can be enhanced and overhead can be significantly decreased by managing database connections more effectively.
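To make the idea concrete, here is a minimal sketch of a connection pool built on Python's standard library. It uses SQLite only so the example is self-contained; the `ConnectionPool` class and its parameters are assumptions for illustration, and production systems would normally rely on the pooling provided by their driver or a dedicated proxy.

```python
import queue
import sqlite3
from contextlib import contextmanager

class ConnectionPool:
    """Keeps a fixed number of open connections and lends them out."""

    def __init__(self, size: int, database: str = ":memory:"):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            # Connections are opened once, up front, and then reused.
            self._idle.put(sqlite3.connect(database, check_same_thread=False))

    @contextmanager
    def connection(self, timeout: float = 5.0):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the caller raised.
            self._idle.put(conn)

pool = ConnectionPool(size=2)
with pool.connection() as conn:
    print(conn.execute("SELECT 1").fetchone())  # (1,)
```

The saving comes from skipping the connect and teardown cost on every request, and from capping how many connections the database has to serve at once.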

Read replicas

Enabling read replicas can assist in offloading read activities from the primary database, improving read performance.
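A simple way to picture read replicas is a router that sends writes to the primary and rotates reads across replicas. The sketch below is an assumption-laden toy: the `ReplicaRouter` class and server names are hypothetical, and treating any statement that starts with `SELECT` as a read is a deliberate simplification.

```python
import itertools

class ReplicaRouter:
    """Chooses a target server for each SQL statement."""

    def __init__(self, primary: str, replicas: list[str]):
        self.primary = primary
        # Round-robin over replicas; None means "no replicas configured".
        self._replicas = itertools.cycle(replicas) if replicas else None

    def target_for(self, sql: str) -> str:
        # Simplification: anything starting with SELECT counts as a read.
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and self._replicas is not None:
            return next(self._replicas)
        return self.primary

router = ReplicaRouter("primary", ["replica-1", "replica-2"])
print(router.target_for("SELECT * FROM users"))     # replica-1
print(router.target_for("INSERT INTO users ..."))   # primary
```

Replicas that are updated asynchronously can lag behind the primary, so reads that must see a just-written value are often sent to the primary anyway.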

Caching

A powerful yet simple solution. Utilizing caching to store frequently used data can significantly reduce database load.
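One common pattern is cache-aside: check the cache first, and only query the database on a miss. Here is a minimal in-process sketch, assuming a hypothetical `load_from_db` callable and a time-to-live in seconds; real deployments often use a shared cache such as Redis or Memcached instead.

```python
import time

class CacheAside:
    """Serves repeated reads from memory and falls back to a loader."""

    def __init__(self, load_from_db, ttl_seconds: float = 60.0):
        self._load = load_from_db   # hypothetical function that hits the DB
        self._ttl = ttl_seconds
        self._entries = {}          # key -> (expires_at, value)

    def get(self, key):
        now = time.monotonic()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]         # cache hit: no database call
        value = self._load(key)     # cache miss or expired entry
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key) -> None:
        """Call after a write so the next read reloads fresh data."""
        self._entries.pop(key, None)
```

The time-to-live bounds how stale a cached value can get, and explicit invalidation after writes narrows that window further.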

Database partitioning

Splitting large tables into smaller, more manageable pieces inside the same database can enhance data management without the need for sharding.

Why Shard a Database?

Sharding becomes relevant when applications hit scalability ceilings and performance bottlenecks that simpler approaches can't mitigate.

By distributing data across multiple servers, it reduces the load on any single server, enhances response times, and offers a scalable architecture that grows with the application.

Challenges and Considerations

Despite its advantages, database sharding poses several issues and complications.

These include choosing the right sharding key, handling cross-shard transactions, and maintaining data consistency between shards.

Careful preparation and implementation are essential for effectively navigating these tricky obstacles.

Best Practices for Database Sharding

Sharding key selection

To prevent unequal load distribution, use a sharding key that distributes data uniformly among shards.
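One way to sanity-check a candidate key before committing to it is to measure how evenly sample data would spread. The sketch below assumes hash-based placement and made-up sample keys; `load_skew` is a hypothetical helper that returns the busiest shard's row count divided by the average, so 1.0 means a perfectly even spread.

```python
import hashlib
from collections import Counter

def stable_shard(key: str, num_shards: int) -> int:
    """Hash-based placement (an assumption for this sketch)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def load_skew(keys: list[str], num_shards: int) -> float:
    """Ratio of the fullest shard's row count to the mean row count."""
    if not keys:
        raise ValueError("need at least one key to measure skew")
    counts = Counter(stable_shard(k, num_shards) for k in keys)
    mean = len(keys) / num_shards
    return max(counts.values()) / mean

# Low-cardinality key: most rows share one value, so one shard gets most rows.
by_country = ["US"] * 900 + ["DE"] * 100
# High-cardinality key: every row has a distinct value.
by_user_id = [str(i) for i in range(10_000)]

print(load_skew(by_country, 4))  # well above 1.0
print(load_skew(by_user_id, 4))  # close to 1.0
```

Keys with few distinct values, or with a few very popular values, tend to produce hot shards no matter how good the hash function is.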

Consistent hashing

Use consistent hashing for shard distribution so that adding or removing a shard relocates only a small fraction of the keys, rather than reshuffling most of the data.
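For readers who have not seen it before, here is a minimal sketch of a consistent hash ring with virtual nodes. The class name, the shard names, and the choice of 100 virtual nodes per shard are assumptions for illustration, not a recommendation.

```python
import bisect
import hashlib

def _position(value: str) -> int:
    """Map a string to a point on the ring."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

class ConsistentHashRing:
    def __init__(self, shards, vnodes: int = 100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (position, shard) pairs
        for shard in shards:
            self.add_shard(shard)

    def add_shard(self, shard: str) -> None:
        # Each shard appears at many points to even out the spread.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_position(f"{shard}#{i}"), shard))

    def shard_for(self, key: str) -> str:
        if not self._ring:
            raise ValueError("ring has no shards")
        # First ring entry at or after the key's position, wrapping around.
        idx = bisect.bisect_right(self._ring, (_position(key), ""))
        return self._ring[idx % len(self._ring)][1]

ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c", "shard-d"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.shard_for(k) for k in keys}
ring.add_shard("shard-e")
moved = [k for k in keys if ring.shard_for(k) != before[k]]
print(len(moved))  # only a fraction of the 1000 keys change shard
```

With plain `hash(key) % num_shards`, going from four to five shards would remap most keys. On the ring, the only keys that move are the ones the new shard takes over.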

Monitoring and automation

Implement monitoring to track shard performance, and use automation for shard maintenance and data rebalancing.

Limit cross-shard transactions

As cross-shard transactions can impede performance and complicate processes, it is best to design the application to minimize them.
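Key choice largely determines how many shards a single operation touches. The sketch below assumes a hypothetical modulo-based placement and made-up order data, and compares sharding order items by item ID versus by order ID.

```python
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical shard count

def shard_of(key: int) -> int:
    """Toy placement rule: key modulo shard count."""
    return key % NUM_SHARDS

def group_by_shard(rows, key_of):
    """Group rows by the shard their sharding key maps to."""
    groups = defaultdict(list)
    for row in rows:
        groups[shard_of(key_of(row))].append(row)
    return dict(groups)

# Three items that belong to the same order (made-up data).
order_items = [
    {"order_id": 7, "item_id": 101},
    {"order_id": 7, "item_id": 102},
    {"order_id": 7, "item_id": 103},
]

by_item = group_by_shard(order_items, key_of=lambda r: r["item_id"])
by_order = group_by_shard(order_items, key_of=lambda r: r["order_id"])
print(len(by_item), len(by_order))  # 3 1
```

Sharding by order ID keeps the whole order on one shard, so updating it is a local transaction. Sharding by item ID spreads the same order across three shards.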

Shard proximity

To improve access times for applications that are sensitive to latency, take into account the physical location of shards.

Extensive testing

Test the sharding technique extensively under real-world conditions to detect and resolve any concerns.

Future growth planning

Design the sharding system with expansion in mind, allowing for expected changes in data volume and access patterns.

Wrapping Up

Database sharding provides a scalable framework for applications that need to grow beyond a centralized system.

While it is very powerful, it’s also one of the most complex database scaling solutions.

Therefore, more straightforward scaling solutions should be exhausted prior to implementing sharding. If sharding is taken on, it should be carefully planned and implemented to navigate the tricky challenges it comes with.

Kafka Explained (Recap)

Kafka is an open-source distributed streaming platform designed for building real-time data pipelines and streaming applications.

Kafka operates as a distributed pub-sub messaging system, allowing applications to publish and subscribe to real-time or near-real-time data feeds.

The high throughput, scalability, fault-tolerance, durability, and ecosystem Kafka provides have made it a very popular choice for use cases where real-time data feeds are required.

The key components of Kafka include Producer, Consumer, Broker, Topic, and Partition.

Kafka has many use cases, from aggregating data from different sources to monitoring and real-time analytics.

Check out the full post for an extended explanation of Kafka.

How Elasticsearch Works (Recap)

Elasticsearch stands out as a key tool in search and analytics, valued for its real-time data processing. As a core component of the ELK stack, it integrates seamlessly with data visualization tools and log processors, enhancing its utility.

To get a better picture of how it works, let’s look at its workflow:

𝟭) Data ingestion — begins by importing data in JSON format via Logstash, Beats, or direct input.

𝟮) Indexing — data is indexed using an inverted index that facilitates rapid text searches and links terms to document locations.

𝟯) Sharding and replication — distributes data across nodes to enhance fault tolerance and availability.

𝟰) Searching — utilizes a query DSL for efficient data retrieval from the inverted index.

𝟱) Analysis and aggregations — allows for complex data analysis and insights into trends.

𝟲) Results retrieval — delivers query results in near real-time, optimizing response efficiency.
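The inverted index in step 2 is the piece that makes text search fast. Below is a toy sketch of the idea in plain Python. It is not Elasticsearch's implementation or API; the tokenizer, function names, and sample documents are assumptions for illustration.

```python
import re
from collections import defaultdict

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return dict(index)

def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Return IDs of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

docs = {
    1: "Sharding splits data",
    2: "Replication copies data",
    3: "Sharding and replication",
}
index = build_inverted_index(docs)
print(search(index, "data"))                  # {1, 2}
print(search(index, "sharding replication"))  # {3}
```

A lookup goes straight from a term to the documents that contain it, instead of scanning every document for a match.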

Check out the full post for an extended explanation of Elasticsearch.

URI vs URL vs URN — Do You Know The Differences? (Recap)

URI (Uniform Resource Identifier): A general identifier for a resource, either by location, name, or both. URLs and URNs are subtypes of URIs.

URL (Uniform Resource Locator): A type of URI that specifies how to locate a resource, including the protocol (e.g., HTTPS), domain, and path (e.g., https://example.com/path). URLs enable access to resources.

URN (Uniform Resource Name): A type of URI that uniquely identifies a resource by name within a namespace (e.g., urn:isbn:0361450721 for a book), but does not specify how to locate it.

Every URL is a URI, but not every URI is a URL. URNs identify, URLs locate, and URIs encompass both.
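Python's standard `urllib.parse.urlsplit` makes the distinction easy to poke at. The `classify` function below is a rough heuristic written for this example, not a standards-compliant classifier: it treats the `urn` scheme as a URN, and anything with both a scheme and a network location as a URL.

```python
from urllib.parse import urlsplit

def classify(uri: str) -> str:
    """Rough heuristic: label a string as URN, URL, or plain URI."""
    parts = urlsplit(uri)
    if parts.scheme.lower() == "urn":
        return "URN"   # names a resource within a namespace
    if parts.scheme and parts.netloc:
        return "URL"   # says how and where to reach the resource
    return "URI"       # an identifier we cannot narrow down further

print(classify("https://example.com/path"))  # URL
print(classify("urn:isbn:0361450721"))       # URN
```

For the URL, `urlsplit` reports the scheme `https`, the network location `example.com`, and the path `/path`. For the URN it reports the scheme `urn` and no network location, which matches the idea that a URN names a resource without saying where to find it.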

That wraps up this week’s issue of Level Up Coding’s newsletter!

Join us again next week where we’ll explore and visually distill more important engineering concepts.
