Dear founder,
As founders, we often focus on scaling our businesses in terms of customers, revenue, or team size.
But what happens when the data your business relies on scales faster than anything else?
That’s the challenge I very recently faced with Podscan.fm, and it nearly brought the entire system to its knees.
Nearly!
If you’re building a business that deals with large volumes of external data, buckle up – this story might save you from some sleepless nights.
🎧 Listen to this on my podcast.
Three crucial lessons before we dive in — all hard-earned learnings from getting things wrong initially:
- Observability is king: From day one, implement robust logging and monitoring. You need to know exactly what your system is doing at all times, especially when dealing with data at scale.
- Queuing systems are your best friends: Build systems that can handle pressure without crumbling. Message queues and job workers are essential for managing overwhelming workloads.
- Database interactions matter: Be extremely careful with database queries, especially as your data grows. Simple operations like counting items can become major bottlenecks (a quick sketch of one common workaround follows right after this list).
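To make that last point concrete: the essay doesn’t show Podscan’s actual schema, so here’s a minimal sketch of a common workaround, using a hypothetical `episodes` table and SQLite standing in for the real database. Instead of running `COUNT(*)` over millions of rows on every request, you maintain a small denormalized counter in the same transaction as the insert.

```python
import sqlite3

# Hypothetical schema: a large `episodes` table plus a tiny `counters` table
# that caches row counts so hot paths never scan millions of rows.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE episodes (id INTEGER PRIMARY KEY, podcast_id INTEGER, title TEXT);
    CREATE TABLE counters (name TEXT PRIMARY KEY, value INTEGER NOT NULL);
    INSERT INTO counters (name, value) VALUES ('episodes', 0);
""")

def add_episode(podcast_id: int, title: str) -> None:
    """Insert an episode and bump the cached counter in the same transaction."""
    with db:  # one atomic transaction
        db.execute("INSERT INTO episodes (podcast_id, title) VALUES (?, ?)",
                   (podcast_id, title))
        db.execute("UPDATE counters SET value = value + 1 WHERE name = 'episodes'")

def episode_count() -> int:
    """Constant-time lookup instead of a full table scan via SELECT COUNT(*)."""
    (count,) = db.execute("SELECT value FROM counters WHERE name = 'episodes'").fetchone()
    return count

add_episode(1, "Pilot")
add_episode(1, "Episode 2")
print(episode_count())  # -> 2
```

The trade-off is a little extra write work in exchange for constant-time reads; approximate counts from the database’s own statistics are another option when exactness doesn’t matter.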
You can probably imagine that I, a self-proclaimed “at-best-1x-probably-0.5x-developer”, ran into each of these issues head-first while building Podscan. And you’d be right.
Now, let me take you through the rollercoaster of the past few weeks, where a single overlooked bug cascaded into a full-blown crisis – and how I clawed my way back to stability.
The Calm Before the Storm
Podscan.fm does three main things:
- It ingests data by scanning all podcasts out there for new episodes.
- It downloads and transcribes these episodes.
- It makes the information from transcripts available to users through alerts, APIs, and webhooks.
In essence: collection, transcription, and distribution. I’ve been juggling these three balls since the beginning, and for a while, things were running smoothly. We had a system that could handle hundreds of thousands of podcasts, randomly checking throughout the day for new episodes.
But as we grew, cracks started to appear. Our main application server, responsible for both podcast checking and serving web requests, began to strain under the load, even though transcription had already been delegated to a standalone backend server fleet. With millions of podcasts to monitor, the sheer volume was overwhelming the main server’s resources. It was clear: we needed a more scalable solution.
The Grand Plan: Divide and Conquer
Three weeks ago, I decided to rebuild a critical part of Podscan.fm. The idea was simple: create a dedicated microservice for podcast feed checking. This would take the load off our main server and allow us to scale the checking process independently. Instead of having to cut corners and only check certain feeds every day, I could spin up three checking servers and check most podcasts several times a day!
I was excited. This new system would be distributed, running on multiple servers in different locations. It would be optimized for one task: constantly scouring the web for new podcast episodes. No more performance impacts on our main application. It seemed perfect.
As I coded, I even added some fancy logic to determine the best times and frequencies for checking each feed. I felt like a true optimization wizard.
Little did I know, I had just planted a time bomb in our system.
The Silent Killer: A Bug in Disguise
Here’s where scale becomes a truly wicked problem. When you’re testing with a few hundred or even a thousand items, everything can look fine. If you see roughly the right number of log entries, you assume all is well and your code is working.
But what if you’re missing 20% of your checks? Or 50%? Or even 70%? It’s not always obvious when you’re dealing with large numbers.
That’s exactly what happened. My new scheduling logic had a subtle flaw that caused a substantial number of podcasts to be checked far less frequently than intended – or sometimes not at all. But because we were still ingesting tens of thousands of new episodes daily, I didn’t notice the problem when the code was deployed.
Confident in my “working” system, I moved on to other tasks. I spent a week improving our data extraction system, completely oblivious to the ticking time bomb.
The Unraveling
It started with a few user reports. Some customers noticed their favorite shows weren’t updating as frequently. Most just used our manual update feature as a workaround. But one user, Nikita, who runs Masterplan and uses Podscan to build a medical education platform offering concise summaries of leading medical podcasts, went above and beyond. He meticulously documented the missing episodes and feeds, presenting me with undeniable evidence that something was very wrong.
At first, I brushed it off as isolated incidents. But Nikita was relentless, providing regular updates on the state of these problematic feeds. As the data piled up, I could no longer ignore the truth: our core ingestion system was fundamentally broken.
I dug into the logs, and what I found chilled me to the bone. Feeds that should have been checked multiple times a day weren’t being called at all. I even built code fragments into my system to provide extra logging when Nikita’s podcasts were handled. Nothing. The system wasn’t even attempting to scan them.
My “optimized” scheduling logic was failing spectacularly.
The Moment of Truth
One morning, I woke up with a realization. I had dreamt of math, after so many days of not understanding why things had broken. I rushed to my computer, tested my hypothesis, and felt a mix of relief and dread wash over me. I had found the bug – a simple math error in how we scheduled checks throughout the day.
For nearly two weeks, we had been operating at 30-40% capacity without realizing it. I quickly rewrote the checking logic (using ClaudeAI to help me spot errors I might have missed this time), implemented extensive logging, and held my breath as I deployed the fix.
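The newsletter doesn’t spell out what the math error actually was, so the following is a purely hypothetical sketch of the class of bug (not Podscan’s code): a unit mix-up in the “is this feed due?” calculation that makes most feeds look like they’re never due for a check.

```python
# Hypothetical illustration of the *class* of bug, not Podscan's actual code:
# a unit mix-up that makes feeds look "not due yet" almost all the time.
import time

CHECK_INTERVAL_HOURS = 6

def is_due_buggy(last_checked_at: float, now: float) -> bool:
    # Bug: elapsed time is measured in seconds, but the interval was
    # converted hours -> seconds twice, so a 6-hour interval silently
    # becomes a 900-day one. Nothing crashes; feeds just never come due.
    interval = CHECK_INTERVAL_HOURS * 3600 * 3600
    return now - last_checked_at >= interval

def is_due_fixed(last_checked_at: float, now: float) -> bool:
    interval_seconds = CHECK_INTERVAL_HOURS * 3600
    return now - last_checked_at >= interval_seconds

now = time.time()
eight_hours_ago = now - 8 * 3600
print(is_due_buggy(eight_hours_ago, now))   # False -> feed never looks due
print(is_due_fixed(eight_hours_ago, now))   # True  -> checked as intended
```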
Over the next few days, I watched with growing excitement as our system started behaving correctly again. Feeds were being reliably checked every 4-6 hours, just as intended. Even Nikita reached out to tell me his alerts were coming in regularly again.
But my elation was short-lived. I had solved one problem, only to create a much bigger one.
The Flood
Remember all those missed podcast episodes from the past two weeks? They were about to hit our system like a tidal wave.
In a single day, we experienced an influx of content that would normally be spread over two weeks. Our carefully provisioned systems, sized for what we thought was normal load, were suddenly overwhelmed.
- Transcription queues overflowed
- Data extraction services buckled under the strain
- Alert systems fell behind, frustrating our users (and me!)
It was like watching a series of dominoes fall. Each part of our pipeline that had been running smoothly at partial capacity now faced a 400-500% increase in workload.
And here’s the kicker: unlike many SaaS businesses that scale with paying customers, Podscan’s core ingestion work is relatively static. We aim to process all podcasts, regardless of our customer count. This means we can’t simply throttle or delay processing – our value proposition depends on comprehensive, timely coverage.
Most software businesses have the luxury of scaling their operations along a metric they can influence: the number of customers, the number of projects, maybe even the number of files hosted on a platform. But it’s always a number that increases in tandem with the business itself.
But the moment you work with external data —data created by others but desired by your customers— you run into scaling problems very quickly.
My friends over at Fathom Analytics can probably tell a few stories about this as well. They’re bootstrapping a Google Analytics alternative, and they have customers who have millions of pageviews per day. Maybe even per hour! THAT is external data — it comes at a volume that you have no control over, but you have to support it to keep that customer.
And that was what I had to do. Stabilize things. Keep my customers. Show that I can handle the millions of podcasts out there.
Fighting the Fire
The next few weeks were a blur of firefighting and optimization:
- Rebalancing resources: I shifted servers from transcription to extraction, trying to clear the backlog without completely starving other processes.
- Scaling AI services: Our context-aware alerting system, which uses AI to filter relevant mentions, was suddenly processing 5x the normal volume. I had to quickly provision additional AI resources while keeping an eye on costs — mind you, spinning up new servers isn’t an option right now. So I had to take resources from transcription and extraction, further skewing the balance of the whole system.
- Queue management: I implemented more sophisticated priority queues, ensuring that the most critical podcasts (based on user interest and update frequency) were processed first (see the first sketch after this list). This is a tough one: it’s hard to ignore a podcast just because it’s not popular. But for the overall health of the system, it was unavoidable.
- Adaptive systems: I built checks into the system that could automatically adjust to load fluctuations, preventing future backlogs from spiraling out of control (see the second sketch after this list).
- Database optimizations: As our data volume swelled, even simple count queries began to slow down. I had to refactor several database interactions to maintain performance.
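Here’s a minimal sketch of the priority-queue idea from the list above. The scoring function is a made-up placeholder; the essay only says that user interest and update frequency went into the ranking.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class FeedJob:
    priority: float                     # lower value = processed sooner
    feed_url: str = field(compare=False)

def priority_for(followers: int, episodes_per_week: float) -> float:
    # Hypothetical scoring: feeds that users care about and that update often
    # bubble to the front; everything else still gets processed, just later.
    return -(followers * 1.0 + episodes_per_week * 10.0)

queue: list[FeedJob] = []
heapq.heappush(queue, FeedJob(priority_for(0, 0.2), "https://example.com/tiny-show"))
heapq.heappush(queue, FeedJob(priority_for(1200, 3.0), "https://example.com/daily-news"))
heapq.heappush(queue, FeedJob(priority_for(40, 1.0), "https://example.com/weekly-chat"))

while queue:
    job = heapq.heappop(queue)
    print("processing", job.feed_url)  # daily-news first, tiny-show last
```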
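And a small sketch of the adaptive-throttling idea, assuming a hypothetical `queue_depth()` metric: when the downstream backlog grows, the feed checker slows itself down instead of piling more work onto an already overloaded pipeline.

```python
import random
import time

def queue_depth() -> int:
    # Placeholder for a real metric (e.g. pending transcription jobs);
    # randomized here only so the sketch runs on its own.
    return random.randint(0, 20_000)

BASE_DELAY_SECONDS = 1.0
MAX_BACKLOG = 10_000

def adaptive_delay() -> float:
    """Scale the pause between feed checks with the current backlog."""
    depth = queue_depth()
    pressure = min(depth / MAX_BACKLOG, 5.0)      # cap the slowdown factor
    return BASE_DELAY_SECONDS * (1.0 + pressure)  # 1s when idle, up to 6s under load

for _ in range(3):
    time.sleep(adaptive_delay())
    print("checked one batch of feeds")
```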
Throughout this process, I was acutely aware of the (semi-)bootstrapped nature of Podscan. I couldn’t just throw unlimited resources at the problem. Every optimization had to balance performance with cost-effectiveness.
Lessons Learned
As the dust settles and Podscan regains stability, I’m left with several hard-won insights:
- Expect the unexpected scale: When dealing with external data sources, your scale isn’t determined by your customer count. Plan for the full scope of data you might encounter, and provision resources that can handle it, even when things are a bit shaky.
- Test at true scale: Don’t be fooled by tests on a small subset of your data. Find ways to validate your systems against realistic data volumes — and you have to understand that even a small percentage of a large number of items is itself a large number of items. Scale is hard for humans, particularly when it’s all digital information.
- Implement circuit breakers: Build safeguards that can detect and mitigate unusual spikes in data volume or processing time. At the very least, they should get your attention so you can intervene.
- Decouple critical systems: Our problems were compounded because a slowdown in one area (ingestion) cascaded to user-facing features (alerts). Design your architecture to isolate potential failure points.
- Invest in observability: The sooner you can detect anomalies, the easier they are to fix. Comprehensive logging and monitoring are not optional at this scale. Once you have a system up and running, compare current data against expected data, track error rates, and have them reported to you when things misbehave (a small sketch of such a check follows after this list).
- Build flexible infrastructure: The ability to quickly reallocate resources between services saved us from complete meltdown. Design your systems with this flexibility in mind. The code for the transcription server, the extraction logic, and the context-aware question inference all runs in the same application, so all I had to do was change a configuration value to shift resources. I could even automate that depending on load.
- Understand your bottlenecks: Every system has limits. Know where yours are and have plans to address them before they become critical. Graph your performance data, and keep an eye on systems that are at capacity.
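As a minimal sketch of the “compare current data against expected data” advice from the observability point above: none of Podscan’s internal tooling is shown in this essay, so the metric source and the alert hook below are placeholders, but the shape of the check is the point — measure today’s feed checks against a rolling baseline and page a human when the ratio drops.

```python
from statistics import mean

def checks_completed_today() -> int:
    return 41_000            # placeholder: read this from your logs or metrics store

def baseline_daily_checks(history: list[int]) -> float:
    return mean(history)     # e.g. completed checks over the last two weeks

def send_alert(message: str) -> None:
    print("ALERT:", message)  # placeholder: email, Slack, PagerDuty, ...

history = [118_000, 121_500, 119_800, 120_200, 122_000]
expected = baseline_daily_checks(history)
actual = checks_completed_today()

# If today's volume drops well below the rolling baseline, a human gets pinged
# long before a user has to file a bug report about missing episodes.
if actual < 0.8 * expected:
    send_alert(f"Feed checks at {actual / expected:.0%} of baseline ({actual} vs ~{expected:.0f})")
```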
The Road Ahead
This experience has been humbling, but it’s also rekindled my passion for building resilient, scalable systems. Podscan is now more robust than ever, with clearer insights into our operational limits and capabilities. All it cost me was two weeks and a severe chunk of my sanity.
For those of you building data-intensive businesses, remember: the challenges you’ll face aren’t always obvious from the start. Stay vigilant, be prepared to adapt quickly, and never stop learning from your system’s behavior.
Building at this scale is not for the faint of heart. But for those willing to embrace the complexity, the rewards – in terms of technical knowledge and the ability to provide unique value to users – are immeasurable.
Now, if you’ll excuse me, I have a few million podcasts to process.
If you want to track your brand mentions on podcasts, please check out podscan.fm — and tell your friends!
Thank you for reading this week’s essay edition of The Bootstrapped Founder. Did you enjoy it? If so, please spread the word and share this issue on Twitter.
If you want to reach tens of thousands of creators, makers, and dreamers, you can apply to sponsor an episode of this newsletter. Or just reply to this email!