[Last Week in AWS Extras]: Multi-Cloud is the Worst Practice

 

One interesting aspect of our work in fixing the horrifying AWS bill is that we inadvertently stumble into the midst of various organizations' disaster recovery plans. "Turn off the DR site" may be sensible from an API perspective (after all, it's generally a bunch of idle resources!), but it's not at all tenable from a business perspective.

 

This week, we explore the basic problem with DR plans in general.

 

 


 

 

Your Disaster Recovery Plan is a Joke Written by Clowns

If you take a look somewhere in an engineering VP or Director's office, you'll find a binder that hasn't been touched in a while labeled "DR / BCM Plan."

 

Disaster Recovery / Business Continuity Management planning is an important thing to take into consideration. But the clowns you work for have almost certainly screwed it up well into the realm of absurdity.

Why these plans exist

These plans start with the best of intentions. "What happens if our site falls over?" is absolutely the kind of thing that responsible businesses and also Facebook need to ask. In fact, when I was shopping around for The Duckbill Group's insurance policy, one of the questions was "Do you have a DR plan?" The next sentence was "please attach a copy of it," so you couldn't just skate by.

 

Further, if your data center or cloud service provider reaches out with a "Hey, so our facility is now a smoking hole in the ground because it turns out that powering it with what is in effect a giant compressed bomb had some failure modes we didn't fully anticipate," you're going to want to have at least a rough idea of what to do next.

 

No, not "update your résumé and look for a new job," you coward. We'll get to that part later.

Why they're jokes

The problem with these plans is that they betray a severe lack of understanding about how failures work. As an environment grows and its applications become world-spanning, it's less a question of whether the site is up or down and more a question of "How down is it?"

 

Knowing when to activate the DR plan is never as clear-cut as it is in tabletop exercises. If your provider fails to communicate with you about what's going on, do you activate the plan or try to wait it out?
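To make "How down is it?" concrete, here is a minimal sketch of one way to turn it into a number and an activation rule. Everything in it is an assumption for illustration: the region names, the traffic shares, the 50% threshold, and the 30-minute patience window are all made up, and a real rule would need far more judgment than two comparisons.

```python
from typing import Dict, Tuple


def weighted_downness(regions: Dict[str, Tuple[float, float]]) -> float:
    """Return the share of total traffic that is currently failing.

    `regions` maps a region name to (traffic_share, error_rate), where the
    traffic shares are assumed to sum to 1.0 and error rates are in [0, 1].
    """
    return sum(share * error_rate for share, error_rate in regions.values())


def should_activate(
    downness: float,
    minutes_degraded: float,
    threshold: float = 0.5,        # hypothetical: "half of traffic is failing"
    patience_minutes: float = 30,  # hypothetical: how long we wait it out first
) -> bool:
    """Activate only when things are bad enough AND have stayed bad long enough."""
    return downness >= threshold and minutes_degraded >= patience_minutes


if __name__ == "__main__":
    # Hypothetical snapshot: one region is fully down, one is flaky, one is fine.
    snapshot = {
        "us-tirefire-1": (0.6, 1.0),
        "ohio": (0.3, 0.1),
        "elsewhere": (0.1, 0.0),
    }
    downness = weighted_downness(snapshot)
    print(f"downness: {downness:.2f}")
    print("activate after 10 minutes?", should_activate(downness, 10))
    print("activate after 45 minutes?", should_activate(downness, 45))
```

The point of writing it down is less the formula than the argument it forces: somebody has to pick the threshold and the waiting period before the outage, not during it.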

 

DR plans also suffer from the conceit that they're able to predict the scale and scope of any given outage. "Surely if the database server fails, it won't do so in a manner that corrupts its replica" is one expression of this, and a common one.

 

But there's a darker one.

 

If you're in AWS's us-tirefire-1 and you test your plan to make the poor life decision of migrating to Ohio, that's going to work pretty well during your DR exercise. It's likely to work far less well in the event of a regional AWS outage because roughly half of the internet will be attempting to do the exact same thing.

 

Did your DR plan account for EC2 instance provisioning to take 45 minutes? Did it account for EBS latency well above normal? The "herd of elephants" problem will stampede you to death if you're not careful, and there's no good way to test for this in advance.
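You can't load-test a regional stampede, but you can at least do the arithmetic. The sketch below is a back-of-the-envelope estimate, not a model of AWS behavior: the step names and drill timings are hypothetical, and the slowdown factor of 9 is just what you get if a provisioning step that took an assumed 5 minutes in a quiet drill takes 45 minutes during a real event.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RecoveryStep:
    name: str
    drill_minutes: float  # duration observed during a quiet DR exercise
    contended: bool       # True if the step competes for shared provider capacity


def estimate_recovery_minutes(steps: List[RecoveryStep], slowdown: float = 1.0) -> float:
    """Sum step durations, stretching only the contended steps by `slowdown`."""
    if slowdown < 1.0:
        raise ValueError("slowdown must be >= 1.0")
    return sum(
        step.drill_minutes * (slowdown if step.contended else 1.0)
        for step in steps
    )


def meets_rto(steps: List[RecoveryStep], rto_minutes: float, slowdown: float = 1.0) -> bool:
    """Check the estimate against a recovery time objective."""
    return estimate_recovery_minutes(steps, slowdown) <= rto_minutes


# Hypothetical plan and timings, for illustration only.
PLAN = [
    RecoveryStep("promote database replica", 10, contended=False),
    RecoveryStep("provision EC2 instances", 5, contended=True),
    RecoveryStep("restore EBS volumes from snapshots", 20, contended=True),
    RecoveryStep("repoint DNS", 5, contended=False),
]

if __name__ == "__main__":
    for factor in (1, 3, 9):
        total = estimate_recovery_minutes(PLAN, slowdown=factor)
        verdict = "ok" if meets_rto(PLAN, rto_minutes=60, slowdown=factor) else "blown"
        print(f"slowdown x{factor}: {total:.0f} minutes, 60-minute RTO {verdict}")
```

With these made-up numbers, the plan that sails through the exercise in 40 minutes takes four hours once the contended steps are nine times slower. The real factor is unknowable in advance, which is rather the point.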

 

DR plans are also snapshots of fixed points in time. If you're at a shop that does quarterly DR tests—spoiler: almost none do, despite what they claim in their audit attestations—what happens is you attempt to run the DR plan from last quarter and it runs into a problem and fails. You fix that, move forward another step or two, and hit a different problem. You keep iterating on your DR plan until it works, and you get to check the blessed box on the form.

 

And then your next commit to production breaks your DR plan again.

 

Unless you're testing your DR plan continually, it's almost certainly going to break in hilarious fashion right when you need it most.
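One way to read "continually" is "on every change, like any other test." The sketch below is a minimal, assumed shape for that: each step of the plan is a callable that raises when it fails, and the runner stops at the first broken step so you learn which one your last commit took out. The step names and the `check_*` functions are hypothetical placeholders; real ones would actually restore a backup or stand up a replica.

```python
from typing import Callable, List, Optional, Tuple

# A drill step is a human-readable name plus a callable that raises on failure.
Step = Tuple[str, Callable[[], None]]


def run_drill(steps: List[Step]) -> Optional[str]:
    """Run steps in order; return a description of the first failure, or None."""
    for name, action in steps:
        try:
            action()
        except Exception as exc:  # any failure means the plan is broken here
            return f"{name}: {exc}"
    return None


# Hypothetical checks standing in for real restore and failover steps.
def check_backup_restores() -> None:
    pass  # e.g. restore last night's backup into a scratch database


def check_replica_promotes() -> None:
    raise RuntimeError("replica is missing the column added in the last deploy")


if __name__ == "__main__":
    failure = run_drill([
        ("restore backup", check_backup_restores),
        ("promote replica", check_replica_promotes),
    ])
    print("DR drill passed" if failure is None else f"DR drill FAILED at {failure}")
```

Wired into a pipeline that runs on every merge, a failure like this shows up the day the plan breaks instead of the day you need it.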

Scope

Any DR plan that isn't written by complete clowns is going to have to address up front exactly what the scale and scope of its applicability is. "We lost the primary database" is a common and great example of what your DR plan should cover. "Three quarters of the world is destroyed by an asteroid" is going to have different answers—and for almost all of us, our sites will be down because we'll all have bigger problems to worry about for the foreseeable future.

 

Even things in the middle of these two extremes—such as "AWS loses a major region for a month"—are likely to be hilariously out of touch just because they fail to account for human behaviors.

The human element

I once worked in a regulated environment where I was a key employee with respect to the DR plan. "Here's our offsite location well outside of San Francisco in case the city isn't able to sustain work; in that event we'll all rendezvous here within four hours of the disaster being declared."

 

Unless this is your first encounter with my personality, you can probably guess how that conversation went.

 

"Yes, excuse me! One question for you folks, and it's just a minor thing really. None of our computers live in San Francisco; they're cloud hosted very far away in undisclosed locations managed by AWS. Can you identify a single scenario—any scenario at all—in which AWS lost a region, San Francisco was uninhabitable for work purposes, and a single employee here gave anything remotely resembling a crap about work instead of, y'know, their families? Further, let's assume that this hit-the-lottery-jackpot-three-weeks-in-a-row scenario happens; exactly which of our employees do you believe are dumb enough to continue working for their existing salaries rather than becoming multi-million dollar a month consultants for a number of companies who suddenly have far, far, far more expensive problems than we will? I don't recall 'hire people who are incredibly intelligent about everything except knowing their own market worth' as being in our charter. Did I miss that paragraph?"

 

And then, suddenly, I wasn't invited to DR planning meetings anymore.

 

At some point, "this is ridiculous; I quit" is going to be your staff's response—and they'll be right.

 

DR plans tend to skip over this entirely and lose sight of the bigger picture. Sure, okay—you have a policy that three of your executives can’t all travel on the same plane (strangely, there’s no such policy about them riding in the same car), but half of your engineering team will quit the second you mention Azure.

Our DR policy

The Duckbill Group's DR policy states, in effect, that we back up our data a couple of different ways. We're fully remote, so should any employee's internet stop working, they can presumably work from a coffee shop or tether from a phone. Should the multiple cities in which our Cloud Economists reside suddenly become unsuitable for work, we are prepared to operate on the assumption that nobody is going to care overly much about their AWS bills that month.

 

In effect, we take a realistic view that doesn't depend upon our employees sacrificing themselves or their families' well-being in extremis. Because we're human beings, we didn't expect Pete Cheslock to keep working after I messed up drop-shipping his company car. At some scale, you’ve gotta have a business continuity plan that transcends individuals (heck, we do ourselves!), but that flat-out can’t come at the expense of overlooking people’s basic humanity.

 

If your employer's DR plan is written by clowns and assumes you'll prioritize them over your family, I suggest you find a new place to work.

 

 


 
 
 
I’m Corey Quinn

I help companies address their horrifying AWS bills by both reducing the dollars spent and helping them understand what they’re paying for.

 
 

 




 
                                                           
