One interesting aspect of our work in fixing the horrifying AWS bill is that we inadvertently stumble into the midst of various organizations' disaster recovery plans. "Turn off the DR site" may be sensible from an API perspective (after all, it's generally a bunch of idle resources!), but it's not at all tenable from a business perspective.

This week, we explore the basic problem with DR plans in general.

Your Disaster Recovery Plan is a Joke Written by Clowns

If you take a look somewhere in an engineering VP or Director's office, you'll find a binder that hasn't been touched in a while labeled "DR / BCM Plan."

Disaster Recovery / Business Continuity Management planning is an important thing to take into consideration. But the clowns you work for have almost certainly screwed it up well into the realm of absurdity.

Why these plans exist

These plans start with the best of intentions. "What happens if our site falls over?" is absolutely the kind of thing that responsible businesses and also Facebook need to ask. In fact, when I was shopping around for The Duckbill Group's insurance policy, one of the questions was "Do you have a DR plan?" The next sentence was "please attach a copy of it," so you couldn't just skate by.

Further, if your data center or cloud service provider reaches out with a "Hey, so our facility is now a smoking hole in the ground because it turns out that powering it with what is in effect a giant compressed bomb had some failure modes we didn't fully anticipate," you're going to want to have at least a rough idea of what to do next.

No, not "update your résumé and look for a new job," you coward. We'll get to that part later.

Why they're jokes

The problem with these plans is that they betray a severe lack of understanding about how failures work. As an environment grows and its applications become world-spanning, it's less a question of whether the site is up or down and more a question of "How down is it?"

Knowing when to activate the DR plan is never as clear-cut as it is in tabletop exercises. If your provider fails to communicate with you about what's going on, do you activate the plan or try to wait it out?

DR plans also suffer from the conceit that they're able to predict the scale and scope of any given outage. "Surely if the database server fails, it won't do so in a manner that corrupts its replica" is one expression of this, and a common one.

But there's a darker one.

If you're in AWS's us-tirefire-1 and you test your plan to make the poor life decision of migrating to Ohio, that's going to work pretty well during your DR exercise. It's likely to work far less well in the event of a regional AWS outage because roughly half of the internet will be attempting to do the exact same thing.

Did your DR plan account for EC2 instance provisioning to take 45 minutes? Did it account for EBS latency well above normal? The "herd of elephants" problem will stampede you to death if you're not careful, and there's no good way to test for this in advance.
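To put rough numbers on the stampede, here is a minimal Python sketch that compares a failover timeline from a calm drill against the same timeline during a regional pile-on. Everything in it is an assumption for illustration: the step names, the drill durations, the contention multipliers, and the 60-minute recovery target are all made up. Only the 45-minute EC2 provisioning figure echoes the question above.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    drill_minutes: float      # duration seen during a quiet DR exercise (assumed)
    contention_factor: float  # assumed slowdown when everyone fails over at once


def recovery_minutes(steps, under_contention=False):
    """Sum the step durations, optionally scaled by each step's slowdown."""
    total = 0.0
    for step in steps:
        factor = step.contention_factor if under_contention else 1.0
        total += step.drill_minutes * factor
    return total


def meets_rto(steps, rto_minutes, under_contention=False):
    """True if the whole plan finishes within the recovery time objective."""
    return recovery_minutes(steps, under_contention) <= rto_minutes


# Hypothetical plan: 5 minutes of provisioning in the drill becomes 45 in the stampede.
PLAN = [
    Step("provision EC2 instances", 5, 9.0),
    Step("restore EBS volumes", 20, 3.0),
    Step("repoint DNS", 10, 1.0),
]

if __name__ == "__main__":
    for contended in (False, True):
        label = "regional outage" if contended else "quiet drill"
        minutes = recovery_minutes(PLAN, contended)
        verdict = "ok" if meets_rto(PLAN, 60, contended) else "blown"
        print(f"{label}: {minutes:.0f} minutes, 60-minute target {verdict}")
```

With these invented numbers the plan passes the drill and misses the target badly during the real thing. The multipliers are guesses by construction, which is the point: you can't measure them ahead of time.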

DR plans are also snapshots of fixed points in time. If you're at a shop that does quarterly DR tests—spoiler: almost none do, despite what they claim in their audit attestations—what happens is you attempt to run the DR plan from last quarter and it runs into a problem and fails. You fix that, move forward another step or two, and hit a different problem. You keep iterating on your DR plan until it works, and you get to check the blessed box on the form.

And then your next commit to production breaks your DR plan again.

Unless you're testing your DR plan continually, it's almost certainly going to break in hilarious fashion right when you need it most.
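One cheap way to keep yourself honest here is to treat the plan as unproven the moment production changes after the last drill that actually passed. A minimal Python sketch of that check follows. The function name and both timestamps are hypothetical, and where you would get those timestamps (deploy logs, a drill tracker) is left as an assumption.

```python
from datetime import datetime, timezone


def dr_plan_is_stale(last_successful_drill, last_production_deploy):
    """True if production changed after the plan was last proven to work."""
    return last_production_deploy > last_successful_drill


# Hypothetical timestamps: the drill passed, then somebody shipped the next day.
drill = datetime(2020, 7, 1, 12, 0, tzinfo=timezone.utc)
deploy = datetime(2020, 7, 2, 9, 30, tzinfo=timezone.utc)

if __name__ == "__main__":
    if dr_plan_is_stale(drill, deploy):
        print("DR plan is unverified against what is currently in production")
```

This says nothing about whether the plan works. It only flags that nobody has checked since the last change.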

Scope

Any DR plan that isn't written by complete clowns is going to have to address up front exactly what the scale and scope of its applicability is. "We lost the primary database" is a common and great example of what your DR plan should cover. "Three quarters of the world is destroyed by an asteroid" is going to have different answers—and for almost all of us, our sites will be down because we'll all have bigger problems to worry about for the foreseeable future.

Even things in the middle of these two extremes—such as "AWS loses a major region for a month"—are likely to be hilariously out of touch just because they fail to account for human behaviors.

The human element

I once worked in a regulated environment where I was a key employee with respect to the DR plan. "Here's our offsite location well outside of San Francisco in case the city isn't able to sustain work; in that event we'll all rendezvous here within four hours of the disaster being declared."

Unless this is your first encounter with my personality, you can probably guess how that conversation went.

"Yes, excuse me! One question for you folks, and it's just a minor thing really. None of our computers live in San Francisco; they're cloud hosted very far away in undisclosed locations managed by AWS. Can you identify a single scenario—any scenario at all—in which AWS lost a region, San Francisco was uninhabitable for work purposes, and a single employee here gave anything remotely resembling a crap about work instead of, y'know, their families? Further, let's assume that this hit-the-lottery-jackpot-three-weeks-in-a-row scenario happens; exactly which of our employees do you believe are dumb enough to continue working for their existing salaries rather than becoming multi-million dollar a month consultants for a number of companies who suddenly have far, far, far more expensive problems than we will? I don't recall 'hire people who are incredibly intelligent about everything except knowing their own market worth' as being in our charter. Did I miss that paragraph?"

And then, suddenly, I wasn't invited to DR planning meetings anymore.

At some point, "this is ridiculous; I quit" is going to be your staff's response—and they'll be right.

DR plans tend to skip over this entirely and lose sight of the bigger picture. Sure, okay—you have a policy that three of your executives can’t all travel on the same plane (strangely, there’s no such policy about them riding in the same car), but half of your engineering team will quit the second you mention Azure.

Our DR policy

The Duckbill Group's DR policy states, in effect, that we back up our data a couple of different ways. We're fully remote, so should any employee's internet stop working, they can presumably work from a coffee shop or tether from a phone. Should the multiple cities in which our Cloud Economists reside suddenly become unsuitable for work, we are prepared to operate on the assumption that nobody is going to care overly much about their AWS bills that month.

In effect, we take a realistic view that doesn't depend upon our employees sacrificing themselves or their families' well being in extremis. We didn't expect Pete Cheslock to keep working after I messed up drop-shipping his company car because we're human beings. At some scale, you’ve gotta have a business continuity plan that transcends individuals—heck, we do ourselves!—but that flat out can’t come at the expense of overlooking people’s basic humanity.

If your employer's DR plan is written by clowns and assumes you'll prioritize them over your family, I suggest you find a new place to work.

I’m Corey Quinn

I help companies address their horrifying AWS bills by both reducing the dollars spent and helping them understand what they’re paying for.

Screaming in the Cloud & AWS Morning Brief

In addition to this newsletter, I host two podcasts: Screaming in the Cloud, about the business of cloud computing, featuring me talking to folks who are good at things; and AWS Morning Brief, a show exclusively about AWS with my snark at full-tilt.
