When Logs and Metrics Aren't Enough: Discovering Modern Observability
It's 2 a.m., and your phone rings. Your system is experiencing performance degradation. Tasks are piling up in the queues, users are seeing timeouts, and your monitoring dashboard shows that the connection pool is at its limit. Your first instinct? Check the logs. But traditional logging gives you just a fragmented picture: messages without the broader context. You're left wondering: What's really going on? Why are the queues growing? Why are requests taking longer to process? Are there specific queries or requests that are slowing things down? And crucially, how do all these isolated problems connect?

This is typically the moment when we realise that our system's condition is not easy to diagnose. It'd be great if we had more data and could understand the context and causation of what has happened. Understanding not just what went wrong but why and how is one of the foundational features of an observable system—and of a system that's not a nightmare to run in production.

Nowadays, that's even more challenging. The complexity of our distributed systems is at its peak. Our problems are rarely isolated to one part of the stack. We need to correlate information from various places to diagnose the issues.

In the last two releases, we took a break and analysed how S3 could be used for more than storing files and how much it costs. Today, let's return to the previous series and discuss the typical challenges of distributed systems. We'll continue to use managing a connection pool for database access as an example and discuss the issues and potential troubleshooting using modern observability tools and techniques like OpenTelemetry, Grafana Tempo, etc.
We'll explore how traditional monitoring approaches like logs and metrics fail and how observability—through traces, spans, metrics, and span events—gives you the full picture and helps you understand your system deeply.

Connection Pool

As discussed in detail earlier, a connection pool manages reusable database connections shared among multiple requests. Instead of opening and closing a connection for each request, we borrow a connection from the pool and return it after the database operation. Thanks to that, we're cutting the latency required to establish a connection and getting a more resilient and scalable solution. That's okay, but how do you know if it's working correctly? Let's start with the traditional approach to monitoring this setup: metrics and logs.

Traditional Monitoring with Metrics and Logs

Metrics

In a traditional system, metrics provide real-time, aggregated data that helps you track the overall health of your system. For example, in a connection pool, you could monitor the following key metrics:

- the number of active connections (how close you are to the pool limit),
- the queue size (how many requests are waiting for a connection),
- the average wait time for acquiring a connection.
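If you were to record those metrics yourself, a few OpenTelemetry instruments would do. Here's a minimal sketch; the metric names and the hooks called by the pool are illustrative, not the exact code from this series:

```typescript
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('connection-pool');

// how many connections are currently checked out of the pool
const activeConnections = meter.createUpDownCounter('db.pool.connections.active');
// how many requests are currently waiting for a free connection
const queueSize = meter.createUpDownCounter('db.pool.queue.size');
// how long requests wait before they get a connection
const waitTime = meter.createHistogram('db.pool.wait_time', { unit: 'ms' });

// hypothetical hooks called by the pool implementation
export function onWaitingForConnection() {
  queueSize.add(1);
}

export function onConnectionAcquired(waitedMs: number) {
  queueSize.add(-1);
  activeConnections.add(1);
  waitTime.record(waitedMs);
}

export function onConnectionReleased() {
  activeConnections.add(-1);
}
```

Without a configured metrics SDK, these API calls are no-ops, so such instrumentation is cheap to add even before the full setup is in place.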
You can imagine the basic metrics dashboard built on top of them: updated in real-time, brighter and shinier, with colours and animations, etc. Still, its value will be similar to those three lines of text above. Such a dashboard gives you an aggregated view of how your system is performing right now. If the queue size is growing or the wait time for acquiring connections increases, it's a red flag that something's wrong. However, these metrics are often too high-level to diagnose the root cause. You know what happened and that you need to take action, but you don't understand why.

Logs

Along with metrics, logs are traditionally used to track events in real-time. You'd log information like:

- when a task was added to the queue,
- when a connection was acquired and released,
- when a query started and completed.
Example log entries for a connection pool might look like this (the exact format, timestamps, and IDs are, of course, illustrative):
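```text
12:00:01.123 INFO  [task-187] Waiting for a connection (queue size: 12)
12:00:01.875 INFO  [task-187] Connection #42 acquired after 752 ms
12:00:02.301 INFO  [task-187] Query started: SELECT * FROM orders WHERE ...
12:00:04.118 INFO  [task-187] Query completed in 1817 ms
12:00:04.120 INFO  [task-187] Connection #42 released
```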
Logs give you granular details about what's happening in the system at a specific point in time. They allow you to trace individual events, but because they're fragmented, you often must piece them together manually to understand how a task moves through the system. And if you've ever tried to correlate logs in production for a complex system, you know it's no laughing matter.

The Pitfalls of Metrics and Logs: Why They're Not Enough

At first glance, metrics and logs seem sufficient for monitoring these systems, but they often don't give enough context to solve the problem quickly. They show you what's happening but not why or how. Let's break down some specific real-world challenges you might face with traditional metrics and logs.

Issue 1: Connection Pool Exhaustion

You're monitoring a connection pool with a limit of 100 active connections, and suddenly, your metrics dashboard shows that all 100 connections are in use. The queue size is growing, and users are experiencing delays. Traditional metrics and logs might show you:

- that all 100 connections are currently in use,
- that the queue of waiting requests keeps growing,
- individual log entries about connections being acquired, released, or waited for.
You now know the symptoms, but you still don't know the root cause:

- Which tasks or queries are holding connections for too long?
- Are some queries unusually slow, or is a task simply not releasing its connection?
- How do the long-held connections relate to the growing queue and the delays users see?
To diagnose the problem with logs, you would need to stitch together events for each task manually:

- find the log entry where the task started waiting for a connection,
- find the entry where it finally acquired one,
- find the matching query start and completion entries,
- and repeat that for every task running around the same time, correlating them by task or request ID.
This manual process is slow and inefficient, especially in high-concurrency systems.

Issue 2: Long Query Execution Time

You notice that queries take longer than expected to execute, contributing to the exhaustion of the connection pool. The logs show when a query was started and completed, but they don't tell you why it took so long. You might have logs that say something like this (again, the entries are illustrative):
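```text
12:10:03.412 INFO  [task-291] Query started: SELECT * FROM orders WHERE customer_id = $1
12:10:08.937 INFO  [task-291] Query completed in 5525 ms
```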
But what's missing from this? The logs provide fragmented data, but they don't let you see the entire request lifecycle—how long the task spent waiting in the queue, how long it held the connection, and why the query was slow.

Issue 3: Task Starvation in Queues

Imagine your system is queuing incoming requests, and you notice specific tasks get stuck in the queue for long periods. You're tracking queue size and connection usage, but something isn't adding up. Some tasks are being processed almost immediately, while others are waiting far too long. With metrics, you know the queue size is growing, but you can't see:

- which tasks are stuck in the queue and for how long,
- whether the stuck tasks have something in common (e.g., the same endpoint or query),
- whether tasks are processed in order, or some are starved while others jump ahead.
Logs only show you when tasks entered the queue and when they were processed. They don't provide real-time visibility into why tasks are being delayed.

Moving Beyond Logs and Metrics with Observability

You need observability to solve these problems and understand the entire lifecycle of requests in your system. Observability goes beyond just gathering data - it's about understanding how different components interact and why things are happening. With observability tools like traces, spans, metrics, and span events, you can get a complete, real-time picture of what's happening inside your system. Let's explain how observability tools like traces, spans, and span events give you this deeper insight.

What Is a Trace?

A trace is like a call stack for distributed systems. It follows a request as it travels through multiple services and components, recording every important interaction along the way. For example, in a connection pooling system, a trace would capture:

- the moment the request started waiting in the queue for a connection,
- the moment a connection was acquired from the pool,
- the execution of the database query,
- and the release of the connection back to the pool.
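Rendered in a tracing UI, such a trace could look roughly like this (the span names and timings are purely illustrative):

```text
HTTP POST /orders ..................... 2,350 ms
├── queue.wait ........................ 1,100 ms
├── db.connection.acquire ................ 15 ms
├── db.query (SELECT ...) ............. 1,200 ms
└── db.connection.release ................. 2 ms
```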
Instead of isolated events, a trace links them together to show the complete lifecycle of a request.

What Is a Span?

A span represents a single step in that trace—a unit of work. Each span captures:

- the name of the operation (e.g., acquiring a connection or running a query),
- when it started and ended, and thus its duration,
- attributes with additional context (e.g., the query text or the connection id),
- and its relation to the parent span, which links it into the trace.
Think of spans as the building blocks of a trace. They give detailed insights into specific parts of the request lifecycle. The basic OpenTelemetry setup can look as follows.
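For this article, something along these lines is enough. It's a minimal Node.js sketch; the OTLP endpoint, service name, pool interface, and span name are illustrative assumptions, not the exact code from this series:

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { trace, SpanStatusCode } from '@opentelemetry/api';

// Export traces over OTLP/HTTP, e.g. to an OpenTelemetry Collector
// or directly to Grafana Tempo's OTLP endpoint.
const sdk = new NodeSDK({
  serviceName: 'webapi',
  traceExporter: new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' }),
});
sdk.start();

const tracer = trace.getTracer('connection-pool');

// Hypothetical pool interface, standing in for the pool built earlier in the series.
type ConnectionPool<Connection> = {
  acquire(): Promise<Connection>;
  release(connection: Connection): Promise<void>;
};

// Wrap borrowing a connection in a span, so the whole time a task
// holds the connection shows up as 'connection-acquired' in the trace.
export async function withPooledConnection<Connection, Result>(
  pool: ConnectionPool<Connection>,
  run: (connection: Connection) => Promise<Result>,
): Promise<Result> {
  return tracer.startActiveSpan('connection-acquired', async (span) => {
    try {
      const connection = await pool.acquire();
      try {
        return await run(connection);
      } finally {
        await pool.release(connection);
      }
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}
```

Because the span is active while run executes, any spans started inside it (for example, by the database driver's auto-instrumentation) become its children, so connection usage and the queries executed on that connection end up in the same trace.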
Don't worry. We'll get into the full instrumentation in the next article, but for now, let's focus on how we can use this data to investigate one of the issues outlined above. Let's walk through a detailed example of how to slice and dice data using OpenTelemetry and Grafana Tempo to diagnose a connection pool exhaustion issue. Tempo allows you to query traces and see the filtered results.

Why Grafana Tempo? It's one of the most popular and mature OSS tools, and I had to choose one for this article's needs. Other tools, like Honeycomb, are also capable of doing a similar investigation. I'll guide you step-by-step through drilling into the traces, correlating spans, and finding the root cause. I'll try to show you an interactive troubleshooting session on how to discover and fix problems using observability tooling.

Scenario: Diagnosing Connection Pool Exhaustion

You're noticing intermittent slowdowns in your system, and the connection pool is hitting its limit. Users are seeing delays, and your logs show that the queue size is growing. Traditional logs and metrics show some insights, but you're unsure which queries or tasks are causing the connection pool to be exhausted.

Step 1: Visualising the Connection Pool Usage in Grafana

You've already instrumented your connection pool and database queries with OpenTelemetry, and all the traces are being sent to Grafana Tempo. Your first task is visualising the traces related to connection pool usage and identifying unusually long connections. We can start by finding long-lived connections by querying for all traces where the connection-acquired span took longer than 1 second to release the connection.
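In Tempo's TraceQL, that query could look roughly like this (assuming the span is named connection-acquired, as in the sketch above):

```
{ name = "connection-acquired" && duration > 1s }
```

Tempo returns the matching traces, and you can open each one to see which queries ran while the connection was held.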
You may find that several tasks are holding onto connections for longer than 1 second, particularly during peak times. ...