[Last Week in AWS Extras]: Multi-Cloud is the Worst Practice

 

One interesting aspect of our work in fixing the horrifying AWS bill is that we inadvertently stumble into the midst of various organizations' disaster recovery plans. "Turn off the DR site" may be sensible from an API perspective (after all, it's generally a bunch of idle resources!), but it's not at all tenable from a business perspective.

 

This week, we explore the basic problem with DR plans in general.

 

 


 

 

Your Disaster Recovery Plan is a Joke Written by Clowns

If you take a look somewhere in an engineering VP or Director's office, you'll find a binder that hasn't been touched in a while labeled "DR / BCM Plan."

 

Disaster Recovery / Business Continuity Management planning is an important thing to take into consideration. But the clowns you work for have almost certainly screwed it up well into the realm of absurdity.

Why these plans exist

These plans start with the best of intentions. "What happens if our site falls over?" is absolutely the kind of thing that responsible businesses and also Facebook need to ask. In fact, when I was shopping around for The Duckbill Group's insurance policy, one of the questions was "Do you have a DR plan?" The next sentence was "please attach a copy of it," so you couldn't just skate by.

 

Further, if your data center or cloud service provider reaches out with a "Hey, so our facility is now a smoking hole in the ground because it turns out that powering it with what is in effect a giant compressed bomb had some failure modes we didn't fully anticipate," you're going to want to have at least a rough idea of what to do next.

 

No, not "update your résumé and look for a new job," you coward. We'll get to that part later.

Why they're jokes

The problem with these plans is that they betray a severe lack of understanding about how failures work. As an environment grows and its applications become world-spanning, it's less a question of whether the site is up or down and more a question of "How down is it?"

 

Knowing when to activate the DR plan is never as clear-cut as it is in tabletop exercises. If your provider fails to communicate with you about what's going on, do you activate the plan or try to wait it out?

 

DR plans also suffer from the conceit that they're able to predict the scale and scope of any given outage. "Surely if the database server fails, it won't do so in a manner that corrupts its replica" is one expression of this, and a common one.

 

But there's a darker one.

 

If you're in AWS's us-tirefire-1 and you test your plan to make the poor life decision of migrating to Ohio, that's going to work pretty well during your DR exercise. It's likely to work far less well in the event of a regional AWS outage because roughly half of the internet will be attempting to do the exact same thing.

 

Did your DR plan account for EC2 instance provisioning to take 45 minutes? Did it account for EBS latency well above normal? The "herd of elephants" problem will stampede you to death if you're not careful, and there's no good way to test for this in advance.
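To put rough numbers on the stampede, here is a minimal sketch that compares a rehearsed recovery time against the same runbook under contention. Every step name, duration, multiplier, and the 60-minute recovery time objective below is a made-up assumption for illustration; only the 45-minute provisioning figure comes from the question above.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    rehearsed_minutes: float  # what the step took in a quiet DR exercise
    contention_factor: float  # hypothetical slowdown when everyone fails over at once


def estimated_recovery_minutes(steps, under_contention):
    """Sum step durations, optionally scaled by each step's contention factor."""
    total = 0.0
    for step in steps:
        factor = step.contention_factor if under_contention else 1.0
        total += step.rehearsed_minutes * factor
    return total


# Hypothetical runbook. The numbers are invented, except that
# 5 minutes * 9 gives the 45-minute provisioning time mentioned above.
RUNBOOK = [
    Step("provision EC2 instances", rehearsed_minutes=5, contention_factor=9),
    Step("restore database from snapshot", rehearsed_minutes=30, contention_factor=3),
    Step("repoint DNS", rehearsed_minutes=5, contention_factor=1),
]

RTO_MINUTES = 60  # hypothetical recovery time objective

rehearsed = estimated_recovery_minutes(RUNBOOK, under_contention=False)
stampede = estimated_recovery_minutes(RUNBOOK, under_contention=True)

print(f"rehearsed: {rehearsed:.0f} min, meets RTO: {rehearsed <= RTO_MINUTES}")
print(f"stampede:  {stampede:.0f} min, meets RTO: {stampede <= RTO_MINUTES}")
```

With these invented inputs the rehearsal passes comfortably and the real event misses badly. The multipliers are guesses you cannot measure in advance, so the sketch cannot predict anything; it only shows how the gap opens up.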

 

DR plans are also snapshots of fixed points in time. If you're at a shop that does quarterly DR tests—spoiler: almost none do, despite what they claim in their audit attestations—what happens is you attempt to run the DR plan from last quarter and it runs into a problem and fails. You fix that, move forward another step or two, and hit a different problem. You keep iterating on your DR plan until it works, and you get to check the blessed box on the form.

 

And then your next commit to production breaks your DR plan again.

 

Unless you're testing your DR plan continually, it's almost certainly going to break in hilarious fashion right when you need it most.
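One rough way to keep yourself honest about this is to treat the DR plan as unproven the moment production changes after the last passing test. The sketch below assumes you can obtain two timestamps (last passing DR test, last production deploy) from somewhere; the function name, the parameters, and the 90-day default standing in for "quarterly" are all hypothetical.

```python
from datetime import datetime, timedelta


def dr_plan_is_stale(last_passing_test, last_prod_deploy, now,
                     max_age=timedelta(days=90)):
    """Return True if the DR plan should be considered unproven.

    The plan is stale if production was deployed after the last passing
    DR test, or if that test is older than max_age (hypothetical default
    of roughly one quarter).
    """
    deployed_since_test = last_prod_deploy > last_passing_test
    too_old = (now - last_passing_test) > max_age
    return deployed_since_test or too_old


# Hypothetical timeline: the test passed, then someone shipped to production.
test_passed = datetime(2020, 7, 1, 12, 0)
deployed = datetime(2020, 7, 2, 9, 30)
checked_at = datetime(2020, 7, 3, 9, 0)

print(dr_plan_is_stale(test_passed, deployed, checked_at))
```

By this definition, a shop that deploys daily and tests quarterly has a stale plan nearly all the time, which is the point.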

Scope

Any DR plan that isn't written by complete clowns is going to have to address up front exactly what the scale and scope of its applicability is. "We lost the primary database" is a common and great example of what your DR plan should cover. "Three quarters of the world is destroyed by an asteroid" is going to have different answers—and for almost all of us, our sites will be down because we'll all have bigger problems to worry about for the foreseeable future.

 

Even things in the middle of these two extremes—such as "AWS loses a major region for a month"—are likely to be hilariously out of touch just because they fail to account for human behaviors.

The human element

I once worked in a regulated environment where I was a key employee with respect to the DR plan. "Here's our offsite location well outside of San Francisco in case the city isn't able to sustain work; in that event we'll all rendezvous here within four hours of the disaster being declared."

 

Unless this is your first encounter with my personality, you can probably guess how that conversation went.

 

"Yes, excuse me! One question for you folks, and it's just a minor thing really. None of our computers live in San Francisco; they're cloud hosted very far away in undisclosed locations managed by AWS. Can you identify a single scenario—any scenario at all—in which AWS lost a region, San Francisco was uninhabitable for work purposes, and a single employee here gave anything remotely resembling a crap about work instead of, y'know, their families? Further, let's assume that this hit-the-lottery-jackpot-three-weeks-in-a-row scenario happens; exactly which of our employees do you believe are dumb enough to continue working for their existing salaries rather than becoming multi-million dollar a month consultants for a number of companies who suddenly have far, far, far more expensive problems than we will? I don't recall 'hire people who are incredibly intelligent about everything except knowing their own market worth' as being in our charter. Did I miss that paragraph?"

 

And then, suddenly, I wasn't invited to DR planning meetings anymore.

 

At some point, "this is ridiculous; I quit" is going to be your staff's response—and they'll be right.

 

DR plans tend to skip over this entirely and lose sight of the bigger picture. Sure, okay—you have a policy that three of your executives can’t all travel on the same plane (strangely, there’s no such policy about them riding in the same car), but half of your engineering team will quit the second you mention Azure.

Our DR policy

The Duckbill Group's DR policy states, in effect, that we back up our data a couple of different ways. We're fully remote, so should any employee's internet stop working, they can presumably work from a coffee shop or tether from a phone. Should the multiple cities in which our Cloud Economists reside suddenly become unsuitable for work, we are prepared to operate on the assumption that nobody is going to care overly much about their AWS bills that month.

 

In effect, we take a realistic view that doesn't depend upon our employees sacrificing themselves or their families' well being in extremis. We didn't expect Pete Cheslock to keep working after I messed up drop-shipping his company car because we're human beings. At some scale, you’ve gotta have a business continuity plan that transcends individuals—heck, we do ourselves!—but that flat out can’t come at the expense of overlooking people’s basic humanity.

 

If your employer's DR plan is written by clowns and assumes you'll prioritize them over your family, I suggest you find a new place to work.

 

 


 
 
 
Corey

I’m Corey Quinn

I help companies address their horrifying AWS bills by both reducing the dollars spent and helping them understand what they’re paying for.

 
 

Screaming in the Cloud & AWS Morning Brief

In addition to this newsletter, I host two podcasts: Screaming in the Cloud, about the business of cloud computing, featuring me talking to folks who are good at things; and AWS Morning Brief, a show about exclusively AWS with my snark at full-tilt.

 
 
