|
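This week’s issue brings you: Database Sharding Explained, Kafka Explained (Recap), How Elasticsearch Works (Recap), and URI vs URL vs URN. |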
|
READ TIME: 5 MINUTES |
|
|
|
|
Database Sharding Explained: Strategies for Scalable Database Management |
What’s one of the most common bottlenecks for an application? |
You guessed it—the database. |
To keep our systems performant and scalable as data volumes and processing demands grow, we need database scaling solutions. |
There are several database scaling solutions. One of the most powerful but also most complex is database sharding. |
Today, we’ll explore database sharding, when and where to use it, and best practices. |
Let’s dive in! |
Understanding Database Sharding |
Database sharding divides a database into smaller, more manageable segments known as "shards," which are distributed across various servers. |
This approach differs from conventional scaling strategies like replication, which makes duplicate copies of data across several servers, and vertical scaling, which entails boosting the capacity of an already-existing server. |
The main benefit of sharding is its capacity to distribute data throughout a network of computers, greatly enhancing scalability and performance. |
Sharding is done via two approaches: horizontal sharding and vertical sharding. |
Horizontal Sharding |
Horizontal sharding, also known as data partitioning, splits a database by row. |
Each shard holds the same schema but contains a different subset of the data. |
This is done by applying a consistent sharding key or algorithm to distribute rows across multiple databases or servers. |
For example, user data could be sharded based on geographic location or user IDs, so that all data related to a particular region or range of user IDs is stored together. |
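
To make the "range of user IDs" idea concrete, here is a minimal Python sketch of range-based routing. The shard names and range boundaries are made up for illustration; a real system would load them from configuration or a lookup service.

```python
import bisect

# Hypothetical exclusive upper bounds of each shard's user ID range.
# IDs at or above the last bound fall into the final shard.
RANGE_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]
SHARD_NAMES = ["users_shard_0", "users_shard_1", "users_shard_2", "users_shard_3"]


def shard_for_user(user_id: int) -> str:
    """Return the name of the shard that stores the given user ID."""
    if user_id < 0:
        raise ValueError("user_id must be non-negative")
    # bisect_right counts how many bounds are <= user_id,
    # which is exactly the index of the shard that owns it.
    return SHARD_NAMES[bisect.bisect_right(RANGE_UPPER_BOUNDS, user_id)]


print(shard_for_user(42))         # users_shard_0
print(shard_for_user(1_000_000))  # users_shard_1
```

Every shard holds the same schema; only the routing function decides which rows live where.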
|
Vertical Sharding |
Vertical sharding divides a database by table. This method separates different tables or groups of tables into distinct databases, with each shard holding a subset of the tables. |
Vertical sharding is particularly useful when certain tables grow significantly larger or are accessed more frequently than others. By isolating heavily accessed tables, vertical sharding can reduce the load on a single database server and improve performance for specific queries. |
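
A rough sketch of how an application might route queries under vertical sharding: each table is mapped to the database that owns it. The table and database names here are hypothetical.

```python
# Hypothetical mapping of tables to the databases that own them.
# The heavily accessed "page_views" table is isolated on its own server.
TABLE_TO_DATABASE = {
    "users": "db_accounts",
    "profiles": "db_accounts",
    "orders": "db_orders",
    "page_views": "db_analytics",
}


def database_for_table(table: str) -> str:
    """Return the database (shard) that holds the given table."""
    try:
        return TABLE_TO_DATABASE[table]
    except KeyError:
        raise KeyError(f"no shard configured for table {table!r}") from None


print(database_for_table("orders"))  # db_orders
```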
|
Both sharding techniques facilitate horizontal scaling, i.e., adding more machines to a system to spread the load. |
How they do this differs. |
Vertical sharding is table-centric, making it well-suited for databases where specific tables are disproportionately larger or heavily queried. |
Horizontal sharding, on the other hand, is data-centric, which makes it better suited for evenly distributing a large volume of similar data across several servers. |
Database Scaling Techniques That Should Be Exhausted Before Sharding |
Before you start sharding your database, an important principle should be kept in mind. |
You shouldn’t implement premature optimizations or attempt to scale your app before it’s actually needed. Implementing scaling solutions introduces complexities such as: |
Adding new features takes longer
The system becomes more complex with more pieces and variables involved
Code can be more difficult to test
Finding and resolving bugs becomes harder
|
You should only accept these trade-offs if your app is at capacity. Keep the system simple; don’t introduce scaling complexity unless it’s warranted. |
|
Database sharding is complex. Several more straightforward solutions might address performance issues effectively: |
Vertical scaling |
Adding resources to your existing server may provide a short-term improvement in performance, but it has cost and scalability limits. |
Database and query optimization |
Adding the right indexes and rewriting slow queries can yield significant speed improvements with minimal added complexity. |
Connection pooling |
Reusing a pool of open database connections, instead of opening a new one for every request, significantly reduces overhead and can speed up the application. |
Read replicas |
Read replicas offload read queries from the primary database, improving read performance. |
Caching |
A powerful yet simple solution. Utilizing caching to store frequently used data can significantly reduce database load. |
Database partitioning |
Splitting large tables into smaller, more manageable pieces inside the same database can enhance data management without the need for sharding. |
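
Of the options above, caching is the easiest to sketch in a few lines. The example below is a minimal cache-aside pattern with a time-to-live, written in Python. The `TTLCache` class and `load_user_from_db` function are illustrative names, not part of any library, and a production setup would more likely use a dedicated cache server.

```python
import time


class TTLCache:
    """Cache-aside helper: serve from memory, fall back to a loader on a miss."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]      # cache hit: the database is not touched
        value = loader(key)      # cache miss or expired: query the database
        self._entries[key] = (now + self._ttl, value)
        return value


db_calls = []


def load_user_from_db(user_id):
    """Stand-in for a real database query; records each call."""
    db_calls.append(user_id)
    return {"id": user_id}


cache = TTLCache(ttl_seconds=60)
cache.get_or_load(7, load_user_from_db)
cache.get_or_load(7, load_user_from_db)
print(len(db_calls))  # 1, because the second read was served from the cache
```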
Why Shard a Database? |
Sharding becomes relevant when applications hit scalability ceilings and performance bottlenecks that simpler approaches can't mitigate. |
By distributing data across multiple servers, it reduces the load on any single server, enhances response times, and offers a scalable architecture that grows with the application. |
Challenges and Considerations |
Despite its advantages, database sharding poses several issues and complications. |
These include choosing the right sharding key, handling cross-shard transactions, and maintaining data consistency between shards. |
Careful preparation and implementation are essential for effectively navigating these tricky obstacles. |
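
To illustrate why the sharding key matters, the sketch below compares two candidate keys on synthetic data. The user population and its country split are invented for the example; the point is only that a low-cardinality, skewed key concentrates rows on one shard while a high-cardinality key spreads them out.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4  # hypothetical shard count


def stable_shard(value: str, num_shards: int = NUM_SHARDS) -> int:
    """Hash a key to a shard index; stable across processes, unlike hash()."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def imbalance(shard_ids: list, num_shards: int = NUM_SHARDS) -> float:
    """Largest shard size divided by the ideal, perfectly even size."""
    counts = Counter(shard_ids)
    ideal = len(shard_ids) / num_shards
    return max(counts.values()) / ideal


# Synthetic users: 70% "US", 20% "DE", 10% "JP" (made-up proportions).
users = [
    (uid, "US" if uid % 10 < 7 else "DE" if uid % 10 < 9 else "JP")
    for uid in range(10_000)
]

by_country = imbalance([stable_shard(country) for _, country in users])
by_user_id = imbalance([stable_shard(str(uid)) for uid, _ in users])

# Sharding by country puts at least 70% of rows on one shard (ratio >= 2.8);
# sharding by user ID stays close to the ideal ratio of 1.0.
print(f"country key: {by_country:.2f}, user ID key: {by_user_id:.2f}")
```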
Best Practices for Database Sharding |
Sharding key selection |
To prevent unequal load distribution, use a sharding key that distributes data uniformly among shards. |
Consistent hashing |
Use consistent hashing to assign data to shards, so that adding or removing a shard relocates only a small fraction of the keys. |
Monitoring and automation |
Implement monitoring to track shard performance, and use automation for shard maintenance and data rebalancing. |
Limit cross-shard transactions |
As cross-shard transactions can impede performance and complicate processes, it is best to design the application to minimize them. |
Shard proximity |
To improve access times for applications that are sensitive to latency, take into account the physical location of shards. |
Extensive testing |
Test the sharding technique extensively under real-world conditions to detect and resolve any concerns. |
Future growth planning |
Design the sharding system with expansion in mind, allowing for expected changes in data volume and access patterns. |
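
To make the consistent hashing recommendation above concrete, here is a minimal Python sketch of a hash ring with virtual nodes. The class name, shard names, and the choice of 100 virtual nodes per shard are assumptions for illustration, not a reference implementation.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    def __init__(self, shards, vnodes: int = 100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (position, shard name)
        for shard in shards:
            self.add_shard(shard)

    def add_shard(self, shard: str) -> None:
        # Each shard is placed at many positions to even out the distribution.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{shard}#{i}"), shard))

    def shard_for(self, key: str) -> str:
        if not self._ring:
            raise ValueError("ring has no shards")
        # First virtual node clockwise from the key, wrapping at the end.
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


keys = [f"user:{i}" for i in range(10_000)]
ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
before = {k: ring.shard_for(k) for k in keys}

ring.add_shard("shard-d")
after = {k: ring.shard_for(k) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
# For comparison: keys that would move with plain "hash modulo shard count".
naive_moved = sum(1 for k in keys if _hash(k) % 3 != _hash(k) % 4)
print(f"consistent hashing moved {moved} keys, modulo hashing {naive_moved}")
```

With the ring, the only keys that move are the ones the new shard takes over. With modulo hashing, most keys change shards whenever the shard count changes.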
Wrapping Up |
Database sharding provides a scalable framework for applications that need to grow beyond a centralized system. |
While it is very powerful, it’s also one of the most complex database scaling solutions. |
Therefore, more straightforward scaling solutions should be exhausted prior to implementing sharding. If sharding is taken on, it should be carefully planned and implemented to navigate the tricky challenges it comes with. |
|
Kafka Explained (Recap) |
Kafka is an open-source distributed streaming platform designed for building real-time data pipelines and streaming applications. |
Kafka operates as a distributed pub-sub messaging system, allowing applications to publish and subscribe to real-time or near-real-time data feeds. |
The high throughput, scalability, fault-tolerance, durability, and ecosystem Kafka provides have made it a very popular choice for use cases where real-time data feeds are required. |
The key components of Kafka include Producer, Consumer, Broker, Topic, and Partition. |
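
As a toy model of how topics and partitions relate, the Python sketch below keeps each partition as an append-only list and routes messages by key. It is not Kafka's client API: `ToyTopic` is a made-up class, and CRC32 stands in for the hash function that Kafka's default partitioner actually uses.

```python
import zlib


class ToyTopic:
    """In-memory stand-in for a topic split into ordered partitions."""

    def __init__(self, name: str, num_partitions: int):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key: str) -> int:
        # Same key -> same partition, which preserves per-key ordering.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def publish(self, key: str, value: str):
        """Producer side: append to a partition, return (partition, offset)."""
        partition = self.partition_for(key)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1

    def consume(self, partition: int, offset: int):
        """Consumer side: read every message from the given offset onward."""
        return self.partitions[partition][offset:]


topic = ToyTopic("orders", num_partitions=3)
for i in range(3):
    topic.publish("user-1", f"event-{i}")

partition = topic.partition_for("user-1")
events = [value for key, value in topic.consume(partition, 0) if key == "user-1"]
print(events)  # ['event-0', 'event-1', 'event-2']
```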
Kafka has many use cases, from aggregating data from different sources to monitoring and real-time analytics. |
Check out the full post for an extended explanation of Kafka. |
|
|
How Elasticsearch Works (Recap) |
Elasticsearch stands out as a key tool in search and analytics, valued for its real-time data processing. As a core component of the ELK stack, it integrates seamlessly with data visualization tools and log processors, enhancing its utility. |
To get a better picture of how it works, let’s look at its workflow: |
𝟭) Data ingestion — begins by importing data in JSON format via Logstash, Beats, or direct input. |
𝟮) Indexing — data is indexed using an inverted index that facilitates rapid text searches and links terms to document locations. |
𝟯) Sharding and replication — distributes data across nodes to enhance fault tolerance and availability. |
𝟰) Searching — utilizes a query DSL for efficient data retrieval from the inverted index. |
𝟱) Analysis and aggregations — allows for complex data analysis and insights into trends. |
𝟲) Results retrieval — delivers query results in near real-time, optimizing response efficiency. |
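
The inverted index in step 2 is the core idea, and it fits in a short Python sketch. This is a simplified illustration with a naive tokenizer and made-up documents; Elasticsearch's real analysis, storage, and relevance scoring are far more involved.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs: dict) -> dict:
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Return the IDs of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Kafka is a distributed streaming platform",
    2: "Elasticsearch is a distributed search engine",
    3: "Sharding splits a database",
}
index = build_inverted_index(docs)
print(search(index, "distributed"))         # {1, 2}
print(search(index, "distributed search"))  # {2}
```

Because the index maps terms to documents, a search never has to scan the documents themselves.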
Check out the full post for an extended explanation of Elasticsearch. |
|
|
URI vs URL vs URN — Do You Know The Differences? |
URI (Uniform Resource Identifier): A general identifier for a resource, which may identify it by location, by name, or both. URLs and URNs are subtypes of URIs. |
URL (Uniform Resource Locator): A type of URI that specifies how to locate a resource, including the protocol (e.g., HTTPS), domain, and path (e.g., https://example.com/path ). URLs enable access to resources. |
URN (Uniform Resource Name): A type of URI that uniquely identifies a resource by name within a namespace (e.g., urn:isbn:0361450721 for a book), but does not specify how to locate it. |
Every URL is a URI, but not every URI is a URL. URNs identify, URLs locate, and URIs encompass both. |
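
Python's standard `urllib.parse.urlsplit` can show the difference in structure. The `classify` helper below is a deliberately simplified heuristic written for this example (anything with the `urn` scheme is a URN, anything with a scheme and a network location is a URL); it is not a complete implementation of the relevant standards.

```python
from urllib.parse import urlsplit


def classify(uri: str) -> str:
    """Roughly label an identifier as a URN, a URL, or just a URI."""
    parts = urlsplit(uri)
    if parts.scheme.lower() == "urn":
        return "URN"   # names a resource within a namespace
    if parts.scheme and parts.netloc:
        return "URL"   # says how (scheme) and where (host) to fetch it
    return "URI"       # an identifier this heuristic cannot narrow down


url = urlsplit("https://example.com/path")
print(url.scheme, url.netloc, url.path)  # https example.com /path

urn = urlsplit("urn:isbn:0361450721")
print(urn.scheme, urn.path)              # urn isbn:0361450721

print(classify("https://example.com/path"))  # URL
print(classify("urn:isbn:0361450721"))       # URN
```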
|
|
That wraps up this week’s issue of Level Up Coding’s newsletter! |
Join us again next week where we’ll explore and visually distill more important engineering concepts. |
|