How does Kafka know which message it processed last? A deep dive into offset tracking
Let’s say it’s Friday. Not party Friday, but Black Friday. You’re working on a busy e-commerce system that handles thousands of orders per minute. Suddenly, the service responsible for billing crashes. Until it recovers, new orders keep piling up. How do you resume processing after the service restarts?

Typically, you use a messaging system to accept incoming requests and then process them gradually. Messaging systems have durable storage to keep messages until they’re delivered; Kafka can even keep them longer, based on a defined retention policy. If we’re using Kafka, the service can resubscribe to the topic when it restarts. But which messages should it process?

One naive approach to ensure consistency might be reprocessing messages from the topic’s earliest position. That might prevent missed events, but it could also lead to a massive backlog of replayed data and a serious risk of double processing. Would you really want to read every message on the topic from the very beginning, risking duplicate charges and triggering actions that have already been handled? Wouldn’t it be better to pick up exactly where you left off, with minimal overhead and no guesswork about what you’ve already handled?

Not surprisingly, that’s what Kafka does: it has built-in offset tracking. What’s an offset? Offsets let each consumer record its precise position in the message stream (a logical position within a topic partition). That ensures that restarts or redeployments don’t force you to re-ingest everything you’ve ever consumed. Services that consume message streams can restart or be redeployed for countless reasons: you might update a container in Kubernetes, roll out a patch on a bare-metal server, or autoscale in a cloud environment.

In this article, we’ll look at how Kafka manages offsets under the hood, the failure scenarios you must prepare for, and how offsets help you keep your system consistent, even when services are constantly starting and stopping. We’ll also see how other technologies tackle a similar challenge. As always, we take a specific tool and try to extend the lessons to the wider architecture level. Check also the previous articles in this series.

Let’s do a thought experiment. Stop for a moment and consider how you would implement offset storage for a messaging system. Remember your findings, or better yet, note them down. Are you ready? Let’s now discuss the various options we could use, and see how Kafka’s offset storage evolved and why.

1. Early Attempts at Storing Offsets

Database Storage. A straightforward way to track what a consumer has processed is to store the last processed message ID in a relational database or key-value store. This can be fine if you have a single consumer, but you run into trouble as soon as you introduce multiple consumers. Each consumer must update the same record, which raises the risk of race conditions and pushes you toward distributed locking or expensive transactions. Plus, if one consumer crashes mid-update, another might not know which offset is the latest.

Local Files. Another idea is to write the current offset to a local file on the machine running the consumer. This might spare you the overhead of a database, but it quickly becomes unmanageable if the machine dies or you need to scale out. Each new consumer instance has its own file, and there’s no easy way to keep them all consistent.

As soon as you have more than one consumer, or you want failover without losing track of progress, these approaches break down. That’s why Kafka uses a different model.
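Before we look at where those offsets live, it helps to see what this model looks like from the application side. Below is a minimal sketch using the kafkajs client (the client library, broker address, topic, and group id are illustrative assumptions, not something prescribed by the article): the consumer only declares which group it belongs to, and after a restart it resumes from that group’s last committed offsets instead of re-reading the whole topic.

```typescript
import { Kafka } from "kafkajs";

// Illustrative sketch: client library (kafkajs), broker address, topic and
// group id are all assumptions made for this example.
const kafka = new Kafka({ clientId: "billing-service", brokers: ["localhost:9092"] });
const consumer = kafka.consumer({ groupId: "billing" });

async function main() {
  await consumer.connect();
  // fromBeginning: false => a brand-new group starts at the latest offset;
  // an existing group always resumes from its last committed offset.
  await consumer.subscribe({ topics: ["orders"], fromBeginning: false });

  await consumer.run({
    // Offsets are committed automatically in the background; how often is a
    // trade-off between replay size after a crash and commit overhead.
    autoCommit: true,
    autoCommitInterval: 5000, // ms
    eachMessage: async ({ topic, partition, message }) => {
      // Process the order here (e.g. charge the customer).
      console.log(`${topic}[${partition}] @ ${message.offset}: ${message.value?.toString()}`);
    },
  });
}

main().catch(console.error);
```

When the billing service from our Black Friday scenario comes back up, the group coordinator hands it the last committed offsets for its partitions, and processing continues from there; how often those offsets get committed is a trade-off we’ll revisit below.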
2. From ZooKeeper to a special topic

Historically, Kafka itself used Apache ZooKeeper to store consumer offsets. ZooKeeper is a distributed key-value store with additional coordination features. It allows distributed systems to store and retrieve data across multiple servers, acting like a centralized database that helps different parts of a system share and manage configuration and state. Unlike a simple key-value store, it provides features like atomic writes, watches (notifications), and hierarchical naming, making it particularly useful for configuration management and synchronization in distributed environments.

Using ZooKeeper to store offsets might have worked fine when you had only a few consumer groups or infrequent commits, but it wasn’t designed to handle the constant stream of updates in larger deployments. Every offset commit triggered writes to ZooKeeper, and as the number of consumer groups and partitions grew, those writes multiplied, creating performance bottlenecks and stability concerns. ZooKeeper was never intended for high-throughput offset tracking, so teams running sizable Kafka clusters began encountering scaling issues, ranging from slower commit latencies to coordination timeouts, when offset commits overloaded ZooKeeper. To solve these problems, Kafka introduced a dedicated internal topic, __consumer_offsets, as the place where offset commits are stored.
Version 0.9 still allowed using ZooKeeper through the offsets.storage=zookeeper setting.

3. How the offsets topic works

Topic Structure and Partitioning

By default, the __consumer_offsets topic is partitioned into a fixed number of partitions (often 50), with each partition managing offset commits for a subset of consumer groups. This design ensures that offset storage can scale horizontally: no single partition is overloaded by too many commits.

In a simplified flow, a consumer sends an offset commit to the Kafka broker, which appends a commit event to the appropriate partition of __consumer_offsets.

You should always consider how frequently you commit offsets to get the desired performance. Committing after every message keeps replays to a minimum but burdens the broker with frequent writes. Committing in larger batches cuts down on overhead but means a crash forces you to reprocess a bigger chunk of messages. Monitoring metrics like consumer lag, commit latency, and rebalance frequency helps you tune these factors for your workloads.

Message Format

Under the hood, each offset commit is stored as a small message. While the actual format is internal to Kafka, you can think of it as a simplified TypeScript-esque interface:
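Here is a minimal sketch, with simplified, illustrative field names rather than Kafka’s exact internal schema:

```typescript
// Simplified, illustrative view of an offset commit record.
// Field names are assumptions for readability, not Kafka's exact internal schema.

// The record key identifies whose position is being stored...
interface OffsetCommitKey {
  group: string;      // consumer group id
  topic: string;      // topic the group is reading
  partition: number;  // partition within that topic
}

// ...and the value carries the position itself.
interface OffsetCommitValue {
  offset: number;          // next position the group should read from
  metadata: string;        // optional, client-provided context
  commitTimestamp: number; // when the commit was made
}
```

Because each commit is keyed by group, topic, and partition, Kafka can compact the topic and keep only the latest committed offset for each combination.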