Dear founder,
“I didn’t see it coming.”
I had to admit that to myself a few times recently.
Over the last couple of weeks, I've run into several issues with Podscan that only caught me off guard because I didn't really have any observability into my system.
At least that’s what I know now.
🎧 Listen to this on my podcast.
Because it always takes a while to see the bottom part of an iceberg.
With Podscan, most scaling issues only show themselves in a delayed fashion; the problems that came up were consequences of issues much further down the line that happened much earlier. This experience has taught me that observability isn’t just a nice-to-have for my data software business — it’s crucial.
As I spend more time building this heavily database-driven, AI-based business on top of technologies that may be rather new and untested, I’m realizing the importance of robust observability. This is especially true given that I’m building an architecture I’ve never built before.
Podscan is the biggest thing I’ve ever done. Everything I learn is through doing, running into challenges, and facing them head-on.
So, let me share my early-stage learnings about system observability in this distributed, data-centric system of mine. Even if you don't have a software business or don't operate with millions of data feeds every single day, there's still something insightful in here: lessons that I will, not just likely but guaranteed, carry into my future business efforts.
The Illusion of Visibility
Let's talk about being overconfident in your ability to see problems.
For the longest time, I thought, “I’ll see when things go wrong, I’ll notice it.” But as my system got more complex, with more moving parts and individual components with varying scaling capacities, I realized I needed to find ways to either automatically detect and mitigate problems or recognize them early on.
Ideally, I want to spot trends, patterns, or moving thresholds, so I can see that if something keeps running the way it does and slowly becomes more problematic, I'll have an issue two weeks from now. That kind of foresight allows me to deal with potential problems proactively.
But if I knew what problems were to come, I’d make sure they never happen. Yet, issues arise.
If that level of prediction isn’t possible, then I need monitoring in place that immediately alerts me of a problem — either in the making or right at my doorstep. I’m trying to build all of these systems, but as usual with observability, one of the core problems is that often you don’t precisely know what to observe.
The Complexity of Modern Systems
I have a really sizable database, a search engine with its own database, backend servers with their own databases, and backend servers with no databases but a lot of cache. The question becomes: what do I need to observe to see if there are issues? Should I look into every single database, or are one or two enough for me to see if things might go out of whack size-wise, or if the data is corrupt or inserted incorrectly?
Obviously, observing every single thing everywhere is impossible due to performance reasons. So, the first question that always comes up is: what are the things that could cause trouble?
Identifying Potential Issues
Sometimes, potential issues are extremely clear right from the start. For example, when you take a whole list of items and operate on the entire list at once, that tends to become problematic once you're at scale.
If you test it on your local computer with 100 or 1,000 items in your database, it’s clearly not a problem. There’s enough space in RAM. But what if you’re loading 100,000 items? Or 2 million items? There might still be enough space in your memory, but the operation on each of these items will consume memory and time. Is it going to take the one second it takes on your local computer, or is it going to take 10 minutes and block the database?
This might not be immediately obvious from the code, and you’ll have to learn this by recognizing it as things scale. But every single time I find such a glitch, I realize that I could have seen it coming.
Setting Up Monitoring Systems
And that’s a tooling question. I can prepare for this.
The important thing is to understand if there are easy targets right from the get-go that I should be monitoring. If so, I try to set up some kind of reporting system that has an intake where it can reliably push reporting information. There are many options for this, like Prometheus, Grafana, or even the ELK stack.
It’s always better to have data in your monitoring system than not to have it in there. Even when everything is perfectly fine.
For example, you could push the number of items in a particular database as a JSON object into your chosen system whenever you interact with the database. Then, you can use visualization tools to show either the number or a trending graph of where that number is going and what the delta is between the last time you checked and a week or two months ago.
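In Laravel terms, a minimal version of that push could look something like the sketch below. The table name, metric label, and collector URL are placeholders; you'd point this at whatever intake your monitoring stack actually exposes:

```php
<?php

use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Http;

// Count the rows we care about. The table name is just an example.
$episodeCount = DB::table('podcast_episodes')->count();

// Push the reading as JSON to whatever intake your monitoring stack exposes.
// The config key and payload shape are placeholders, not a real endpoint.
Http::post(config('services.metrics.intake_url'), [
    'metric'      => 'podcast_episodes.count',
    'value'       => $episodeCount,
    'recorded_at' => now()->toIso8601String(),
]);
```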
And, depending on one very important thing, this information might make you sleep soundly or panic slightly.
The Importance of Context
The number alone won't really help you. Its current value tends to be a very binary thing - either it's bad or it's good. If you have zero items in your database, that tends to be a problem. If you have 200 million items in a database, that might be a problem too. But it doesn't really matter if it's 5,000 or 6,000 or 20,000 or 30,000 - there's a spectrum between these extremes. Those thresholds - too little, too much, or just right - are for you to determine and constantly adjust. A newly founded SaaS startup might need a warning when the number of projects eclipses a few hundred, while a two-decade-old SaaS business can easily expect thousands of new projects to be created every month. Revise your thresholds whenever you find yourself hitting them repeatedly.
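A sketch of what such a threshold check might look like, with the bands pulled from config so they're easy to revise as the business grows. The table and config keys here are made up for illustration:

```php
<?php

use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Log;

// Thresholds live in config so they can be adjusted without touching this code.
$count = DB::table('projects')->count();
$min   = config('observability.projects.min', 1);      // zero rows is almost always a problem
$max   = config('observability.projects.max', 200000); // runaway growth is a different problem

if ($count < $min || $count > $max) {
    Log::warning('Project count outside expected band', [
        'count' => $count,
        'min'   => $min,
        'max'   => $max,
    ]);
}
```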
Learning from Experience
Some issues aren’t as easily found. They’re things you learn as you build the business and the product. I experienced this a couple of months ago when I had a major memory leak in Podscan. It was debilitating, causing the RAM of my server (which was a pretty sizable machine with 30 gigabytes of RAM) to be exhausted very quickly.
Every time I restarted the process and the server, more and more memory would be consumed. Almost immediately. Fortunately, I found a way to dampen the leak so that I could spend my time investigating it.
I eventually traced it to my internal caching logic - the metrics caching that I built, ironically, to figure out when things go wrong. It was implemented in a way that would load a lot of unnecessary data into RAM, where it would stick around for a while. If enough processes were started with this data, they would cascade into a memory leak.
I fixed the underlying caching logic, and everything returned to normal. But now, whenever I build something that does averaging or summarization or adds items to a list, I try to figure out how big that list can get and whether I can keep it from growing too large, either by chunking, by using partials, or by paginating, wherever something is potentially massive in memory.
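As a rough illustration of the chunking approach in Laravel: chunkById() is the framework's built-in batching helper, while the Episode model, the query, and the chunk size of 500 are stand-ins I'm using for this example:

```php
<?php

use App\Models\Episode; // stand-in model for whatever list you're processing

// Loading everything at once keeps the entire result set in RAM:
// $episodes = Episode::all(); // fine with 1,000 rows, painful with 2 million

// Processing in fixed-size chunks bounds memory no matter how large the table grows.
Episode::query()
    ->whereNull('transcribed_at')
    ->chunkById(500, function ($episodes) {
        foreach ($episodes as $episode) {
            // per-item work goes here
        }
    });
```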
I've trained myself to see this as I code. Before that happened, it was like, "Oh, it'll be fine. Those processes will eventually release the memory." And that was true - they just weren't fast enough about it. The system got overwhelmed because I simply had too many moving parts for it to self-regulate.
I have found that chat-based LLM platforms like Anthropic's Claude are pretty good at spotting performance bottlenecks in code. I sometimes throw the full source of a core component of my app into Claude and tell it to investigate the 5 biggest potential performance risks or bugs in that code. Just seeing how Claude argues for its answers often lets me spot something I hadn't thought of before. I certainly hope that IDEs will integrate this as a constant background process soon.
But for now, I still have to hunt for bottlenecks myself.
Common Issues in Complex Systems
Generally, the problems you run into that might cascade into chaos are resource problems. It's rarely ever a logic problem that causes massive issues. And even if it is, it then becomes a resource problem. Resource problems that are code-bound, where code impacts the performance of the system itself (not data integrity or accuracy), tend to be issues of:
- Compute power: You’re causing too much compute, and the system gets locked up.
- Memory availability: You have too many things in memory, causing the system to lock up.
- Disk issues: Either in terms of operations (reading or writing more data than your disk can serve or persist fast enough) or size (running out of hard drive space).
If you add GPUs into the mix, the resource problem becomes even more pronounced, because GPUs have been a bottleneck in every system I've encountered.
The Power of Queue Systems
The solution to most of these problems is to have queue systems in place. You can queue almost anything, and every framework out there lets you push work into background processes. PHP has Laravel's queue system (with Horizon on top), Ruby has Sidekiq, Python has Celery, and JS has BullMQ. You can queue any regular calculation or operation in your system unless it belongs to a request that needs the result as an immediate answer.
For example, if you’re building something that generates AI-based images in the background, you can put a placeholder loading image into your application and then fetch generated images as they’re created. You can do a lot of things in the background, restricting your queue to running only a couple of these operations at the same time to prevent runaway memory leaks or resource constraints that could affect other parts of your product.
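In Laravel, such a background job is just a class that implements ShouldQueue. The class name and payload below are invented for this example, but the structure is the framework's standard one:

```php
<?php

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

class GenerateImage implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public function __construct(public int $requestId) {}

    public function handle(): void
    {
        // The slow, resource-hungry image generation happens here,
        // off the request path, while the user sees the placeholder.
    }
}

// In a controller: respond immediately, push the heavy work onto a dedicated queue.
// GenerateImage::dispatch($imageRequest->id)->onQueue('images');
```

How many of these run at the same time is then a worker configuration question (in Horizon, the maxProcesses setting per queue), which is exactly the knob that keeps one runaway workload from starving the rest of the system.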
Queue systems are also innately measurable. For Laravel, which I'm using for Podscan right now, the monitoring layer is called Laravel Horizon. It comes with its own dashboard where you can see how many processes are running, what they're doing, and how many jobs are waiting in the queue.
These tools come out of the box with these features, so it’s definitely very useful to build queuing as a first-class citizen into your application. If your resources are doing alright, then the queue will be immediate anyway. If you have a resource problem at any point, the queue will help you deal with it until you find a solution.
Most queuing tools also have notification systems for when queues are overwhelmed or when there are too many items in any given queue. This observability is built-in from day one, and it doesn’t cost you anything.
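In Horizon's case, that boils down to a threshold in config/horizon.php (the 'waits' option defines how many seconds count as a "long wait" per queue) and a notification route in the service provider. A minimal sketch, with the email address and queue name as placeholders:

```php
<?php

namespace App\Providers;

use Laravel\Horizon\Horizon;
use Laravel\Horizon\HorizonApplicationServiceProvider;

class HorizonServiceProvider extends HorizonApplicationServiceProvider
{
    public function boot(): void
    {
        parent::boot();

        // Horizon notifies this address when a queue's jobs wait longer than the
        // thresholds configured under 'waits' in config/horizon.php,
        // e.g. 'redis:transcriptions' => 300.
        Horizon::routeMailNotificationsTo('alerts@example.com');
    }
}
```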
The Importance of Historical Data
Whenever it comes to observability, you want to be able to see historical data from when things are doing well and from when they’re not, so you can compare. It doesn’t really help to only know the last 10 minutes of metrics when you deal with an avalanche of errors without knowing what the normal state looks like. Having access to historical data at any time will help you investigate problems and solutions to see if they’re doing what they’re supposed to be doing.
Personally, I try to track all my relevant system metrics on a per-minute basis, persisting them into a database that exists exclusively for those metrics. Every minute, a process runs that pulls the size of all the queues I have, the number of items I’ve cached over the last hour, the number of transcriptions I’ve run, the number of API requests I’ve had, the number of customer signups - all of these things exist in a database where every minute, I get new information about every single one of these numbers.
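The collection job itself is unglamorous: a scheduled closure that runs every minute and writes one row per snapshot. A simplified sketch, where SystemMetric and the queue names are stand-ins for my actual tables and queues:

```php
<?php

// Inside the scheduler's schedule() method (or routes/console.php on newer Laravel versions).

use App\Models\SystemMetric;
use App\Models\User;
use Illuminate\Support\Facades\Queue;

$schedule->call(function () {
    SystemMetric::create([
        'transcription_queue_size' => Queue::size('transcriptions'),
        'default_queue_size'       => Queue::size('default'),
        'signups_last_hour'        => User::where('created_at', '>=', now()->subHour())->count(),
        'recorded_at'              => now(),
    ]);
})->everyMinute();
```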
This has been extremely helpful for me to plot them out and graph them so I can, over time, see developments. It’s really interesting with Podscan to see the ups and downs of the transcripts per hour, not just the transcripts that I transcribe, but even just how many new podcast episodes I detect every hour. This changes throughout the day as new podcast episodes are usually released more between 9 AM and noon in the United States than they are at night.
In the morning and around noon, I get a massive avalanche of podcast episodes coming in, which affects my own queues, and I can see them growing. Then around the evening, those queues shrink considerably, and I can work through the backlog. That creates a new situation, because there's a different quality setting I want to use for transcribing those backlog episodes, so I can use more of my capacity to transcribe more of them at the same time, which in turn affects my stats and metrics.
And I know all of this because I have the data.
Learning from Mistakes
Recently, I’ve started adding a lot of things that I didn’t track before. For example, one of them is external systems that my system interacts with. Over the last week or so, I had a massive problem with my search database because one of my users told me that they had a problem with certain items missing from the database.
I traced it back to a queue on that database itself, on my Meilisearch instance, being overwhelmed. There were a couple million items in that queue. I didn’t really know why, and because I wasn’t tracking it, I didn’t know what the queue looked like a couple days ago or a month ago.
All I could really do was empty the queue, and in doing that, I had to re-import a significant number of items. Had I known about this development, had I been able to see where that number was at any given point along the way, I probably could have dealt with it much earlier. I probably would not have run into a situation where data went unsynchronized for more than a day or two.
This experience taught me that I needed to track this information and make it part of the logic in my application itself. If you have different kinds of queues but your application only really understands one of them, it will keep sending data to the other queue even when that queue is unavailable or overwhelmed.
So I taught the system to see both.
Now, I'm not sending new items to synchronize to my search database if it's already handling a sizable queue. That queue needs to drop below a certain threshold before I send more items. This gives me back pressure and overflow prevention, which I needed to build into my system to keep the interaction between my internal system (my server) and the external system (the search engine's queue) reliable.
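In rough Laravel terms, that check looks something like the sketch below. The request shape follows Meilisearch's /tasks API, but the config keys, the threshold, and the SyncEpisodeToSearchIndex job are placeholders I'm using for illustration, not my actual code:

```php
<?php

use Illuminate\Support\Facades\Http;

// Ask Meilisearch how many tasks are still enqueued and compare against a threshold.
function searchQueueIsBackedUp(): bool
{
    $response = Http::withToken(config('services.meilisearch.key'))
        ->get(config('services.meilisearch.host') . '/tasks', [
            'statuses' => 'enqueued',
            'limit'    => 1,
        ]);

    $enqueued = $response->json('total', 0);

    return $enqueued > config('services.meilisearch.max_enqueued_tasks', 100000);
}

// Only push new documents when the external queue has drained below the threshold,
// e.g. before dispatching a (hypothetical) sync job:
// if (! searchQueueIsBackedUp()) {
//     SyncEpisodeToSearchIndex::dispatch($episode->id);
// }
```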
This is very common for any distributed architecture. And since we’re talking about externalities, let’s talk about the fact that sometimes, things are out of your control.
The Importance of Alerting
Observability is great, but it’s not only about visualization and hoping that you find patterns. It’s also about literally alerting you and getting you out of whatever you’re doing if there’s an actual problem with your business or product — or one of the vendors you use.
From a single function inside your codebase failing to Amazon Web Services data centers being flooded, you need to know.
If the RAM of any server I’m operating on is over 80% for five minutes, I want to be informed. I want to be sent an email, just as much as if my domain goes down for a minute. Or when AWS has connectivity issues. For certain critical issues, I want to not only get an email but also be paged or get a call.
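A hosted monitor can handle most of this, but even a self-rolled version of the "RAM above 80% for five minutes" rule fits into a short check that the scheduler runs every minute. A minimal sketch; the /proc/meminfo parsing, the thresholds, and the recipient address are assumptions, not my actual setup:

```php
<?php

use Illuminate\Support\Facades\Cache;
use Illuminate\Support\Facades\Mail;

// Read current memory usage from the kernel (Linux only).
$meminfo = file_get_contents('/proc/meminfo');
preg_match('/MemTotal:\s+(\d+)/', $meminfo, $total);
preg_match('/MemAvailable:\s+(\d+)/', $meminfo, $available);

$usedPercent = 100 - ($available[1] / $total[1] * 100);

if ($usedPercent > 80) {
    // Count how many consecutive minutes we've been above the threshold.
    $minutesHigh = Cache::increment('ram-high-minutes');

    if ($minutesHigh >= 5) {
        Mail::raw("RAM has been above 80% for {$minutesHigh} minutes.", function ($message) {
            $message->to('alerts@example.com')->subject('High memory usage');
        });
    }
} else {
    // Back below the threshold: reset the streak.
    Cache::forget('ram-high-minutes');
}
```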
This alerting system recently helped me when I was re-importing all of this data into my search engine queue. I was importing so much data that Horizon, which monitors the queues on my application server, kept the succeeded jobs around for too long, with too much data in them.
The queue was growing, and the RAM of the system was growing into the 80% range because the queue was configured to keep old, successful jobs and their data in memory. When you import tens of thousands of items at the same time, that’s a lot of stuff to keep in memory, even when it succeeds.
I got an email around 10 PM, just after I went to bed. I could very quickly jump out, figure out what was going on, make a configuration change to get those finished items out of memory almost immediately, and then go back to bed. Had I not done this, I probably could have experienced actual server faults because memory would have been so high that my application might have been starved for memory — and that’s when things tend to break down.
Avoided because of an alerting email.
Final Thoughts
Nobody has the means to observe all the things all the time. Make sure you see critical errors and make sure you don’t see non-critical ones. If you log everything, you will not look at it at all. And if you look at nothing, you won’t be able to track things as they happen.
Always look at how things might scale. Make sure you don’t overwhelm your system in the future by loading too much data, and be prepared to adjust your thresholds. Observability isn’t just about seeing what’s happening now - it’s about predicting and preventing issues before they become critical problems.
Building and maintaining a complex, distributed system is a journey of constant learning and adaptation.
A few years from now, I’ll probably have lots more to say about monitoring and alerting.
But it’s worth looking into today. At any scale. By implementing robust observability practices, you’re not just solving today’s problems — you’re setting yourself up for success as your system grows and evolves.
If you want to track your brand mentions on podcasts, please check out podscan.fm — and tell your friends!
Thank you for reading this week’s essay edition of The Bootstrapped Founder. Did you enjoy it? If so, please spread the word and share this issue on Twitter.
If you want to reach tens of thousands of creators, makers, and dreamers, you can apply to sponsor an episode of this newsletter. Or just reply to this email!