͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏

Forwarded this email? Subscribe here for more

Send the gift of Architecture Weekly this holiday season

Send a gift

Distributed Locking: A Practical Guide

Oskar Dudycz

Dec 23

∙

Preview

READ IN APP

Was your data ever mysteriously overwritten? No? Think again. Have you noticed conflicting updates to the same data? Still nope? Lucky you! I had cases when my data changed, and it looked like Marty was travelling back to the future. Once it was correct, then it was wrong, then correct again. Most of the time, twin services were writing to the sample place.

In distributed systems, coordination is crucial. Whenever you have multiple processes (or services) that could update or read the same data simultaneously, you risk data corruption, race conditions, or unwanted duplicates. The popular solution is to use distributed locks - a mechanism ensuring only one process can operate on a resource at a time.

A distributed lock ensures that if one actor (node, service instance, etc.) changes a shared resource—like a database record, file, or external service—no other node can step in until the first node is finished. While the concept is straightforward, implementing it across multiple machines requires careful design and a robust failure strategy.

Today, we’ll try to discuss it and:

Explain why distributed locking is needed in real-world scenarios.
Explore how popular tools and systems implement locks (Redis, ZooKeeper, databases, Kubernetes single-instance setups, etc.).
Discuss potential pitfalls—like deadlocks, lock expiration, and single points of failure—and how to address them.

By the end, you should have a decent grasp of distributed locks, enough to make informed decisions about whether (and how) to use them in your architecture.

1. Why Distributed Locks Matter

When you scale an application to multiple machines or microservices, each one might update the same resource simultaneously. That can lead to writers overwriting each other. It’s the classic concurrency problem: who’s in charge of updating shared data? Without a strategy to coordinate these updates, you risk inconsistent results.

For example, you might have:

Multiple Writers are updating the same table row. One process believes it has the latest data and changes, but another process overwrites it moments later.
That’s a common issue when updating read models. We need to ensure a single subscriber for a specific read model. When multiple microservices listen to an event stream but only one should update a secondary database, a lock ensures they don’t collide.
At-Least-Once Messaging leading to repeated deliveries of the same message, with multiple consumers each processing it as if it were unique.
A Global Resource that can only support one writer at a time, such as a large file or an external API endpoint.
Scheduled jobs: If you schedule a job on multiple nodes to ensure reliability, you still want only the first node that grabs the lock to run it, preventing duplicates.

A distributed lock is the simplest way to say, “Only one node can modify this resource until it’s finished.” Other nodes wait or fail immediately rather than attempting parallel writes.

It can also be seen as Leader Election Lite: If you only need “one node at a time” for a particular task, a plain lock is sometimes enough - there is no need for a full-blown leader election framework.

Without locks, you can get unpredictable states - like a read model flipping from correct to incorrect or a file partially overwritten by multiple workers. Locks sacrifice a bit of parallelism for the certainty that no two nodes update the same resource simultaneously. In many cases, that’s the safest trade-off, especially if data correctness is paramount.

There are other ways to handle concurrency (like idempotent actions or write-ahead logs). Still, a distributed lock is often the most direct solution when you truly need to avoid simultaneous writes.

2. How Locks Typically Work

Lock Request
A node determines it needs exclusive access to a shared resource and communicates with a lock manager (e.g., Redis, ZooKeeper, or a database) that supports TTL or ephemeral locks.
Acquire Attempt
- If no lock currently exists, the node creates one, for instance by setting a Redis key with a time-to-live or by creating an ephemeral zNode in ZooKeeper.
- If another node is holding the lock, this node either waits, fails immediately, or retries, depending on your chosen policy.
Critical Operation
After acquiring the lock, the node proceeds with the operation that must not be run in parallel—such as updating a database row, writing to a shared file, or sending a limited external API request.
Standard Release
Once finished, the node explicitly removes the lock record (e.g., deleting the Redis key or calling an unlock function). At this point, other contenders are free to acquire the lock.
Crash or Automatic Release
If the node fails or loses its session, the TTL expires, or the ephemeral mechanism detects the node’s absence. The lock manager then removes the lock, automatically merging into the same path as a standard release so a new node can acquire it. This avoids leaving the system stuck with a “zombie” lock that never gets freed.

The basic flow would look like:

And the acquisition part with TTL handling:

3. Tools for Distributed Locking

There are many tools for distributed locking; let's check the most popular for certain categories.

Redis (and clones): You store a “lock key” in Redis with a time limit attached. If the lock key doesn’t exist, you create it and claim the lock. Once your task finishes, you remove the key. If your process crashes, Redis removes the key after the time limit expires, so no one stays locked out forever. This method is quick and works well if you already rely on Redis, but you should be cautious about network splits or cluster issues.
ZooKeeper / etcd: These tools consistently keep data across a group of servers. You write a small record saying, “I own this lock,” and if you go offline, the system notices and removes your record automatically. ZooKeeper and etcd are often used for more advanced cluster coordination or leader election, so they’re more complex to set up than Redis. However, they offer strong consistency guarantees.
Database Locks: If you have a single database (like Postgres or MySQL) where all your data lives, you can let the database control concurrency. You request a named lock, and the database blocks any other request for that same lock name until you release it. It’s convenient if you already use a single DB for everything, but it may not scale well across multiple databases or regions, and frequent locking could cause contention.
Kubernetes Single-Instance: Instead of coordinating multiple replicas, you literally run only one replica of your service. Because there’s only one service process, you avoid concurrency altogether at the application level. It’s the simplest approach - no lock code is needed. The downside is that you lose parallel processing and can’t spin up more replicas for high availability or scaling.

Distributed locks all share a common goal: ensure only one node does a particular thing at any given time. However, each tool mentioned approaches the problem with distinct designs, strengths, and failover behaviours.

Let’s look at each tool’s big-picture purpose—why you’d even consider it—then move on to how it implements (or approximates) a lock. Lastly, let’s discuss a few technical details that matter once you start coding or troubleshooting...

Continue reading this post for free in the Substack app

Claim my free post

Or upgrade your subscription. Upgrade to paid

Like

Comment

Restack

On getting the meaningful discussions, and why that's important

Thursday, December 19, 2024

To put our design into practice, we need to be able to persuade our colleagues, stakeholders, and other peers. Without the ability to explain and persuade, even the best design will not be applied. And

The Write-Ahead Log: The underrated Reliability Foundation for Databases and Distributed systems

Tuesday, December 10, 2024

The write-ahead log (WAL) is everywhere. Yet, many people miss it and are not aware of it. The simple idea powers reliability in databases, messaging systems, and distributed systems. Let's discuss

Applying Observability: From Strategy to Practice with Hazel Weakly

Monday, December 2, 2024

Have you considered applying observability but struggled to match the strategy with the tooling? Or maybe you were lost on how to do it? I have something for you!I had a great discussion with Hazel

Get full 30-day access to all Architecture Weekly content for free

Friday, November 29, 2024

Hey! Thank you for being a subscriber; that's much appreciated. ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏

Deduplication in Distributed Systems: Myths, Realities, and Practical Solutions

Monday, November 25, 2024

This week, we'll discuss the deduplication strategies. We'll see whether they're useful and consider scenarios where you may need them. We'll also do a reality check with the promises

Data Science Weekly - Issue 588

Thursday, February 27, 2025

Curated news, articles and jobs related to Data Science, AI, & Machine Learning ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏

💎 Issue 458 - Why Ruby on Rails still matters

Thursday, February 27, 2025

This week's Awesome Ruby Newsletter Read this email on the Web The Awesome Ruby Newsletter Issue » 458 Release Date Feb 27, 2025 Your weekly report of the most popular Ruby news, articles and

📱 Issue 452 - Three questions about Apple, encryption, and the U.K

Thursday, February 27, 2025

This week's Awesome iOS Weekly Read this email on the Web The Awesome iOS Weekly Issue » 452 Release Date Feb 27, 2025 Your weekly report of the most popular iOS news, articles and projects Popular

💻 Issue 451 - .NET 10 Preview 1 is now available!

Thursday, February 27, 2025

This week's Awesome .NET Weekly Read this email on the Web The Awesome .NET Weekly Issue » 451 Release Date Feb 27, 2025 Your weekly report of the most popular .NET news, articles and projects

💻 Issue 458 - Full Stack Security Essentials: Preventing CSRF, Clickjacking, and Ensuring Content Integrity in JavaScript

Thursday, February 27, 2025

This week's Awesome Node.js Weekly Read this email on the Web The Awesome Node.js Weekly Issue » 458 Release Date Feb 27, 2025 Your weekly report of the most popular Node.js news, articles and

Architecture Weekly - Distributed Locking: A Practical Guide

Distributed Locking: A Practical Guide

1. Why Distributed Locks Matter

2. How Locks Typically Work

3. Tools for Distributed Locking

Continue reading this post for free in the Substack app

Older messages

On getting the meaningful discussions, and why that's important

The Write-Ahead Log: The underrated Reliability Foundation for Databases and Distributed systems

Applying Observability: From Strategy to Practice with Hazel Weakly

Get full 30-day access to all Architecture Weekly content for free

Deduplication in Distributed Systems: Myths, Realities, and Practical Solutions

You Might Also Like

Data Science Weekly - Issue 588

💎 Issue 458 - Why Ruby on Rails still matters

📱 Issue 452 - Three questions about Apple, encryption, and the U.K

💻 Issue 451 - .NET 10 Preview 1 is now available!

💻 Issue 458 - Full Stack Security Essentials: Preventing CSRF, Clickjacking, and Ensuring Content Integrity in JavaScript

💻 Issue 458 - TypeScript types can run DOOM

💻 Issue 453 - Linus Torvalds Clearly Lays Out Linux Maintainer Roles Around Rust Code

💻 Issue 376 - Top 10 React Libraries/Frameworks for 2025 🚀

February 27th 2025

📱 Issue 455 - How Swift's server support powers Things Cloud