When Logs and Metrics Aren't Enough: Discovering Modern Observability
It's 2 a.m., and your phone rings. Your system is experiencing performance degradation. Tasks are piling up in the queues, users are seeing timeouts, and your monitoring dashboard shows that the connection pool is at its limit. Your first instinct? Check the logs. But traditional logging gives you just a fragmented picture: messages without the broader context. You're left wondering: What's really going on? Why are queues growing? Why are requests taking longer to process? Are there specific queries or requests that are slowing things down? And crucially, how do all these isolated problems connect?

This is typically the moment when we realise that our system's condition is not easy to diagnose. It'd be great if we had more data and could understand the context and causation of what has happened. Understanding not just what went wrong but why and how is one of the foundational features of an observable system—and of a system that's not a nightmare to run in production.

Nowadays, that's even more challenging. The complexity of our distributed systems is at its peak. Our problems are rarely isolated to one part of the stack; we need to correlate information from various places to diagnose the issues.

In the last two releases, we took a break and analysed how S3 can be used for more than storing files and how much the cloud costs. Today, let's return to the previous series and discuss the typical challenges of distributed systems. We'll continue to use managing a connection pool for database access as an example and discuss the issues and potential troubleshooting using modern observability tools and techniques like OpenTelemetry, Grafana Tempo, etc.
We'll explore how traditional monitoring approaches like logs and metrics fail and how observability—through traces, spans, metrics, and span events—gives you the full picture and helps you understand your system deeply.

Connection Pool

As discussed in detail earlier, a connection pool manages reusable database connections shared among multiple requests. Instead of opening and closing a connection for each request, we borrow a connection from the pool and return it after the database operation. Thanks to that, we're cutting the latency required to establish a connection and getting a more resilient and scalable solution.

That's okay, but how do you know if it's working correctly? Let's start with the traditional approach to monitoring this setup: metrics and logs.

Traditional Monitoring with Metrics and Logs

Metrics

In a traditional system, metrics provide real-time, aggregated data that helps you track the overall health of your system. For example, in a connection pool, you could monitor key metrics such as:

- the number of active connections (against the pool limit),
- the number of tasks waiting in the queue,
- the average time it takes to acquire a connection.
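If the pool from the earlier articles exposes counters for these values, publishing them could be a few lines with the OpenTelemetry metrics API. The sketch below is just an illustration, in TypeScript; the metric names and the pool's field names are assumptions, and it presumes a MeterProvider has been configured elsewhere:

import { metrics } from '@opentelemetry/api';

// Stand-in for the pool built earlier in this series (field names are assumptions).
declare const pool: { activeCount: number; limit: number; queueLength: number };

const meter = metrics.getMeter('connection-pool');

meter
  .createObservableGauge('db.pool.connections.active', {
    description: 'Connections currently borrowed from the pool',
  })
  .addCallback((result) => result.observe(pool.activeCount));

meter
  .createObservableGauge('db.pool.queue.size', {
    description: 'Tasks waiting for a connection',
  })
  .addCallback((result) => result.observe(pool.queueLength));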
You can imagine the basic metrics dashboard:
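For illustration (the numbers are, of course, made up), it could be as plain as:

Active connections:    97 / 100
Queue size:            12 tasks waiting
Avg. connection wait:  350 ms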
It could be updated in real time, brighter and shinier, with colours and animations, etc. Still, the value will be similar to those three lines of text. Such a dashboard gives you an aggregated view of how your system is performing right now. If the queue size is growing or the wait time for acquiring connections increases, it's a red flag that something's wrong. However, these metrics are often too high-level to diagnose the root cause. You know what happened and that you need to take action, but you don't understand why.

Logs

Along with metrics, logs are traditionally used to track events in real time. You'd log information like:

- when a task entered the queue and when it left it,
- when a connection was acquired and released,
- when a query started and completed (and whether it failed).
Example log entries for a connection pool might look like this:
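(The format, task ids, and timings below are invented for illustration.)

[2024-09-16 02:03:11.120] INFO  Task 42 entered the queue (queue size: 12)
[2024-09-16 02:03:12.450] INFO  Task 42 acquired connection 7 (waited 1330 ms)
[2024-09-16 02:03:12.455] INFO  Query started on connection 7
[2024-09-16 02:03:14.910] INFO  Query completed on connection 7 (2455 ms)
[2024-09-16 02:03:14.912] INFO  Connection 7 released by task 42 (queue size: 11)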
Logs give you granular details about what's happening in the system at a specific point in time. They allow you to trace individual events, but because they're fragmented, you often must piece them together manually to understand how a task moves through the system. And if you've ever tried to correlate logs in production for a complex system, you know that it quickly stops being funny.

The Pitfalls of Metrics and Logs: Why They're Not Enough

At first glance, metrics and logs seem sufficient for monitoring such systems, but they often don't give enough context to solve the problem quickly. They show you what's happening but not why or how. Let's break down some specific real-world challenges you might face with traditional metrics and logs.

Issue 1: Connection Pool Exhaustion

You're monitoring a connection pool with a limit of 100 active connections, and suddenly, your metrics dashboard shows that all 100 connections are in use. The queue size is growing, and users are experiencing delays. Traditional metrics and logs might show you:

- that every connection in the pool is in use,
- that the queue of waiting tasks keeps growing,
- that the time to acquire a connection is climbing.
You now know the symptoms, but you still don't know the root cause:

- Which tasks or queries are holding connections the longest?
- Are a few requests hogging connections while the rest starve?
- Is the pool simply too small, or are slow queries the real culprit?
To diagnose the problem with logs, you would need to stitch together events for each task manually:

- find the entry where the task joined the queue,
- find the entry where it finally acquired a connection,
- find the matching query start and completion entries,
- find the line where the connection was released,
- and repeat that for every task, across the interleaved output of hundreds of concurrent requests.
This manual process is slow and inefficient, especially in high-concurrency systems.

Issue 2: Long Query Execution Time

You notice that queries take longer than expected to execute, contributing to the exhaustion of the connection pool. The logs show when a query was started and completed, but they don't tell you why it took so long. You might have logs that say:
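(Again, invented for illustration:)

[2024-09-16 02:03:12.455] INFO  Query started on connection 7
[2024-09-16 02:03:14.910] INFO  Query completed on connection 7 (duration: 2455 ms)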
But what's missing from this?

- Why was the query slow in the first place: lock contention, a large result set, a missing index?
- Which request triggered it, and how long did that request wait in the queue first?
- How long did the task hold the connection before and after the query ran?
The logs provide fragmented data, but they don't let you see the entire request lifecycle—how long the task spent waiting in the queue, how long it held the connection, and why the query was slow.

Issue 3: Task Starvation in Queues

Imagine your system is queuing incoming requests, and you notice specific tasks get stuck in the queue for long periods. You're tracking queue size and connection usage, but something isn't adding up. Some tasks are being processed almost immediately, while others are waiting far too long. With metrics, you know the queue size is growing, but you can't see:

- which tasks are stuck and for how long,
- why some tasks are picked up immediately while others wait,
- whether a few long-running tasks are starving the rest of the queue.
Logs only show you when tasks entered the queue and when they were processed. They don't provide real-time visibility into why tasks are being delayed.

Moving Beyond Logs and Metrics with Observability

You need observability to solve these problems and understand the entire lifecycle of requests in your system. Observability goes beyond just gathering data; it's about understanding how different components interact and why things are happening. With observability tools like traces, spans, metrics, and span events, you can get a complete, real-time picture of what's happening inside your system. Let's explain how traces, spans, and span events give you this deeper insight.

What Is a Trace?

A trace is like a call stack for distributed systems. It follows a request as it travels through multiple services and components, recording every important interaction along the way. For example, in a connection pooling system, a trace would capture:

- the moment the request was enqueued and how long it waited for a connection,
- when a connection was acquired and how long it was held,
- the query (or queries) executed on that connection,
- the moment the connection was released and the request completed.
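For illustration, a single slow request might produce a trace like this (span names and timings are made up; the connection-acquired span is modelled as covering the whole time the connection is held, until it's released):

handle-request                 2.10 s
├── queue-wait                 1.20 s
└── connection-acquired        0.88 s
    └── query-execution        0.75 s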
Instead of isolated events, a trace links them together to show the complete lifecycle of a request.

What Is a Span?

A span represents a single step in that trace—a unit of work. Each span captures:

- the name of the operation,
- when it started and ended (and therefore how long it took),
- attributes describing the work, such as the query text or the connection id,
- a reference to its parent span, which ties it into the trace,
- optionally, span events and a status, marking things like errors or retries.
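As a minimal sketch (not the final instrumentation we'll build in the next article), creating such a span with the OpenTelemetry API in TypeScript could look like this; the span name, attribute names, and the pool's shape are assumptions for illustration:

import { trace, SpanStatusCode } from '@opentelemetry/api';

// Stand-in for the pool built earlier in this series (the shape is an assumption).
interface PooledConnection { id: string }
declare const pool: {
  acquire(): Promise<PooledConnection>;
  release(connection: PooledConnection): Promise<void>;
};

const tracer = trace.getTracer('connection-pool');

// Wraps borrowing a connection, so the span's duration equals how long it was held.
async function withConnection<T>(
  taskId: string,
  work: (connection: PooledConnection) => Promise<T>,
): Promise<T> {
  return tracer.startActiveSpan('connection-acquired', async (span) => {
    span.setAttribute('task.id', taskId);
    const connection = await pool.acquire();
    span.setAttribute('db.connection.id', connection.id);
    try {
      // Child spans created inside (e.g. for the query) attach to this one automatically.
      return await work(connection);
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      await pool.release(connection);
      span.end();
    }
  });
}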
Think of spans as the building blocks of a trace. They give detailed insights into specific parts of the request lifecycle. The basic OpenTelemetry setup can look as follows.
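Here is a minimal sketch of such a setup in Node.js/TypeScript, assuming traces are exported over OTLP to an OpenTelemetry Collector or a Grafana Tempo endpoint; the service name and URL are placeholders to adjust for your environment:

import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  serviceName: 'connection-pool-demo',
  traceExporter: new OTLPTraceExporter({
    // OTLP/HTTP endpoint of an OpenTelemetry Collector or Grafana Tempo
    url: 'http://localhost:4318/v1/traces',
  }),
  // auto-instruments common libraries (http, pg, etc.) alongside our manual spans
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// flush any remaining spans on shutdown
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});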
Don't worry. We'll get into the full instrumentation in the next article, but for now, let's focus on how we can use this data to investigate one of the issues outlined above. Let's walk through a detailed example of how to slice and dice data using OpenTelemetry and Grafana Tempo to diagnose a connection pool exhaustion issue. Tempo allows you to query traces and see the filtered results.

Why Grafana Tempo? It's one of the most popular and mature OSS tools, and I had to choose one for this article's needs. Other tools like Honeycomb are also capable of similar investigations. I'll guide you step by step through drilling into the traces, correlating spans, and finding the root cause. I'll try to show you an interactive troubleshooting session on how to discover and fix problems using observability tooling.

Scenario: Diagnosing Connection Pool Exhaustion

You're noticing intermittent slowdowns in your system, and the connection pool is hitting its limit. Users are seeing delays, and your logs show that the queue size is growing. Traditional logs and metrics give some insights, but you're unsure which queries or tasks are causing the connection pool to be exhausted.

Step 1: Visualising the Connection Pool Usage in Grafana

You've already instrumented your connection pool and database queries with OpenTelemetry, and all the traces are being sent to Grafana Tempo. Your first task is visualising the traces related to connection pool usage and identifying unusually long-held connections. We can start by finding long-lived connections by querying all spans where the connection-acquired span took longer than 1 second to release the connection.
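In Tempo's query language, TraceQL, that search could look roughly like this (assuming the span is named connection-acquired, as in the sketch above, and that its duration covers holding the connection until release):

{ name = "connection-acquired" && duration > 1s }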
You may find that several tasks are holding onto connections for longer than 1 second, particularly during peak times. ...