When Logs and Metrics Aren't Enough: Discovering Modern Observability
It's 2 a.m., and your phone rings. Your system is experiencing performance degradation. Tasks are piling up in the queues, users are seeing timeouts, and your monitoring dashboard shows that the connection pool is at its limit. Your first instinct? Check the logs. But traditional logging gives you just a fragmented picture: messages without the broader context. You're left wondering: What's really going on? Why are the queues growing? Why are requests taking longer to process? Are there specific queries or requests that are slowing things down? And crucially, how do all these isolated problems connect?

This is typically the moment when we realise that our system's condition is not easy to diagnose. It'd be great if we had more data and could understand the context and causation of what has happened. Understanding not just what went wrong but why and how is one of the foundational features of an observable system—and of a system that's not a nightmare to run in production.

Nowadays, that's even more challenging. The complexity of our distributed systems is at its peak. Our problems are rarely isolated to one part of the stack. We need to correlate information from various places to diagnose the issues.

In the last two releases, we took a break and analysed how S3 could be used for more than storing files and how much it costs. Today, let's return to the previous series and discuss the typical challenges of distributed systems. We'll continue to use managing a connection pool for database access as an example and discuss the issues and potential troubleshooting using modern observability tools and techniques like OpenTelemetry, Grafana Tempo, etc.
We'll explore how traditional monitoring approaches like logs and metrics fail and how observability—through traces, spans, metrics, and span events—gives you the full picture and helps you understand your system deeply.

Connection Pool

As discussed in detail earlier, a connection pool manages reusable database connections shared among multiple requests. Instead of opening and closing a connection for each request, we borrow a connection from the pool and return it after the database operation. Thanks to that, we're cutting the latency required to establish a connection and getting a more resilient and scalable solution. That's okay, but how do you know if it's working correctly? Let's start with the traditional approach to monitoring this setup: metrics and logs.

Traditional Monitoring with Metrics and Logs

Metrics

In a traditional system, metrics provide real-time, aggregated data that helps you track the overall health of your system. For example, in a connection pool, you could monitor the following key metrics:

- the number of active connections (how close you are to the pool limit),
- the queue size (how many requests are waiting for a connection),
- the average wait time for acquiring a connection.
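If you were to record those metrics yourself, a few OpenTelemetry instruments would do. Here's a minimal sketch; the metric names and the hooks called by the pool are illustrative, not the exact code from this series:

```typescript
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('connection-pool');

// how many connections are currently checked out of the pool
const activeConnections = meter.createUpDownCounter('db.pool.connections.active');
// how many requests are currently waiting for a free connection
const queueSize = meter.createUpDownCounter('db.pool.queue.size');
// how long requests wait before they get a connection
const waitTime = meter.createHistogram('db.pool.wait_time', { unit: 'ms' });

// hypothetical hooks called by the pool implementation
export function onWaitingForConnection() {
  queueSize.add(1);
}

export function onConnectionAcquired(waitedMs: number) {
  queueSize.add(-1);
  activeConnections.add(1);
  waitTime.record(waitedMs);
}

export function onConnectionReleased() {
  activeConnections.add(-1);
}
```

Without a configured metrics SDK, these API calls are no-ops, so such instrumentation is cheap to add even before the full setup is in place.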
You can imagine the basic metrics dashboard built on top of them: updated in real-time, brighter and shinier, with colours and animations, etc. Still, its value will be similar to those three lines of text above. Such a dashboard gives you an aggregated view of how your system is performing right now. If the queue size is growing or the wait time for acquiring connections increases, it's a red flag that something's wrong. However, these metrics are often too high-level to diagnose the root cause. You know what happened and that you need to take action, but you don't understand why.

Logs

Along with metrics, logs are traditionally used to track events in real-time. You'd log information like:

- when a task was added to the queue,
- when a connection was acquired and released,
- when a query started and completed.
Example log entries for a connection pool might look like this (the exact format, timestamps, and IDs are, of course, illustrative):
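```text
12:00:01.123 INFO  [task-187] Waiting for a connection (queue size: 12)
12:00:01.875 INFO  [task-187] Connection #42 acquired after 752 ms
12:00:02.301 INFO  [task-187] Query started: SELECT * FROM orders WHERE ...
12:00:04.118 INFO  [task-187] Query completed in 1817 ms
12:00:04.120 INFO  [task-187] Connection #42 released
```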
Logs give you granular details about what's happening in the system at a specific point in time. They allow you to trace individual events, but because they're fragmented, you often must piece them together manually to understand how a task moves through the system. And if you've ever tried to correlate logs in production for a complex system, you know it's no laughing matter.

The Pitfalls of Metrics and Logs: Why They're Not Enough

At first glance, metrics and logs seem sufficient for monitoring these systems, but they often don't give enough context to solve the problem quickly. They show you what's happening but not why or how. Let's break down some specific real-world challenges you might face with traditional metrics and logs.

Issue 1: Connection Pool Exhaustion

You're monitoring a connection pool with a limit of 100 active connections, and suddenly, your metrics dashboard shows that all 100 connections are in use. The queue size is growing, and users are experiencing delays. Traditional metrics and logs might show you:

- that all 100 connections are currently in use,
- that the queue of waiting requests keeps growing,
- individual log entries about connections being acquired, released, or waited for.
You now know the symptoms, but you still don't know the root cause:

- Which tasks or queries are holding connections for too long?
- Are some queries unusually slow, or is a task simply not releasing its connection?
- How do the long-held connections relate to the growing queue and the delays users see?
To diagnose the problem with logs, you would need to stitch together events for each task manually:

- find the log entry where the task started waiting for a connection,
- find the entry where it finally acquired one,
- find the matching query start and completion entries,
- and repeat that for every task running around the same time, correlating them by task or request ID.
This manual process is slow and inefficient, especially in high-concurrency systems.

Issue 2: Long Query Execution Time

You notice that queries take longer than expected to execute, contributing to the exhaustion of the connection pool. The logs show when a query was started and completed, but they don't tell you why it took so long. You might have logs that say something like this (again, the entries are illustrative):
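```text
12:10:03.412 INFO  [task-291] Query started: SELECT * FROM orders WHERE customer_id = $1
12:10:08.937 INFO  [task-291] Query completed in 5525 ms
```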
But what's missing from this? The logs provide fragmented data, but they don't let you see the entire request lifecycle—how long the task spent waiting in the queue, how long it held the connection, and why the query was slow.

Issue 3: Task Starvation in Queues

Imagine your system is queuing incoming requests, and you notice specific tasks get stuck in the queue for long periods. You're tracking queue size and connection usage, but something isn't adding up. Some tasks are being processed almost immediately, while others are waiting far too long. With metrics, you know the queue size is growing, but you can't see:

- which tasks are stuck in the queue and for how long,
- whether the stuck tasks have something in common (e.g., the same endpoint or query),
- whether tasks are processed in order, or some are starved while others jump ahead.
Logs only show you when tasks entered the queue and when they were processed. They don't provide real-time visibility into why tasks are being delayed.

Moving Beyond Logs and Metrics with Observability

You need observability to solve these problems and understand the entire lifecycle of requests in your system. Observability goes beyond just gathering data - it's about understanding how different components interact and why things are happening. With observability tools like traces, spans, metrics, and span events, you can get a complete, real-time picture of what's happening inside your system. Let's explain how observability tools like traces, spans, and span events give you this deeper insight.

What Is a Trace?

A trace is like a call stack for distributed systems. It follows a request as it travels through multiple services and components, recording every important interaction along the way. For example, in a connection pooling system, a trace would capture:

- the moment the request started waiting in the queue for a connection,
- the moment a connection was acquired from the pool,
- the execution of the database query,
- and the release of the connection back to the pool.
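Rendered in a tracing UI, such a trace could look roughly like this (the span names and timings are purely illustrative):

```text
HTTP POST /orders ..................... 2,350 ms
├── queue.wait ........................ 1,100 ms
├── db.connection.acquire ................ 15 ms
├── db.query (SELECT ...) ............. 1,200 ms
└── db.connection.release ................. 2 ms
```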
Instead of isolated events, a trace links them together to show the complete lifecycle of a request.

What Is a Span?

A span represents a single step in that trace—a unit of work. Each span captures:

- the name of the operation (e.g., acquiring a connection or running a query),
- when it started and ended, and thus its duration,
- attributes with additional context (e.g., the query text or the connection id),
- and its relation to the parent span, which links it into the trace.
Think of spans as the building blocks of a trace. They give detailed insights into specific parts of the request lifecycle. The basic OpenTelemetry setup can look as follows.
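For this article, something along these lines is enough. It's a minimal Node.js sketch; the OTLP endpoint, service name, pool interface, and span name are illustrative assumptions, not the exact code from this series:

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { trace, SpanStatusCode } from '@opentelemetry/api';

// Export traces over OTLP/HTTP, e.g. to an OpenTelemetry Collector
// or directly to Grafana Tempo's OTLP endpoint.
const sdk = new NodeSDK({
  serviceName: 'webapi',
  traceExporter: new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' }),
});
sdk.start();

const tracer = trace.getTracer('connection-pool');

// Hypothetical pool interface, standing in for the pool built earlier in the series.
type ConnectionPool<Connection> = {
  acquire(): Promise<Connection>;
  release(connection: Connection): Promise<void>;
};

// Wrap borrowing a connection in a span, so the whole time a task
// holds the connection shows up as 'connection-acquired' in the trace.
export async function withPooledConnection<Connection, Result>(
  pool: ConnectionPool<Connection>,
  run: (connection: Connection) => Promise<Result>,
): Promise<Result> {
  return tracer.startActiveSpan('connection-acquired', async (span) => {
    try {
      const connection = await pool.acquire();
      try {
        return await run(connection);
      } finally {
        await pool.release(connection);
      }
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}
```

Because the span is active while run executes, any spans started inside it (for example, by the database driver's auto-instrumentation) become its children, so connection usage and the queries executed on that connection end up in the same trace.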
Don't worry. We'll get into the full instrumentation in the next article, but for now, let's focus on how we can use this data to investigate one of the issues outlined above. Let's walk through a detailed example of how to slice and dice data using OpenTelemetry and Grafana Tempo to diagnose a connection pool exhaustion issue. Tempo allows you to query traces and see the filtered results.

Why Grafana Tempo? It's one of the most popular and mature OSS tools, and I had to choose one for this article's needs. Other tools, like Honeycomb, are also capable of doing a similar investigation. I'll guide you step-by-step through drilling into the traces, correlating spans, and finding the root cause. I'll try to show you an interactive troubleshooting session on how to discover and fix problems using observability tooling.

Scenario: Diagnosing Connection Pool Exhaustion

You're noticing intermittent slowdowns in your system, and the connection pool is hitting its limit. Users are seeing delays, and your logs show that the queue size is growing. Traditional logs and metrics show some insights, but you're unsure which queries or tasks are causing the connection pool to be exhausted.

Step 1: Visualising the Connection Pool Usage in Grafana

You've already instrumented your connection pool and database queries with OpenTelemetry, and all the traces are being sent to Grafana Tempo. Your first task is visualising the traces related to connection pool usage and identifying unusually long connections. We can start by finding long-lived connections by querying for all traces where the connection-acquired span took longer than 1 second to release the connection.
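In Tempo's TraceQL, that query could look roughly like this (assuming the span is named connection-acquired, as in the sketch above):

```
{ name = "connection-acquired" && duration > 1s }
```

Tempo returns the matching traces, and you can open each one to see which queries ran while the connection was held.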
You may find that several tasks are holding onto connections for longer than 1 second, particularly during peak times. ...