SRE Weekly Issue #222
Articles
This article in a nutshell:
- Nines don’t matter if users aren’t happy (h/t Charity Majors)
- Chaos engineering
Kolton Andrus — Gremlin
I hadn’t heard of this distinction before. If you haven’t either, click through to find out more.
Ayende Rahien — RavenDB
In our experience, the three big sources of production stress are:
- Toil
- Bad monitoring
- Immature incident handling procedures
Cheryl Kang — Google
ProPublica picks apart the incident in exhaustive detail, showing how multiple problems interwoven in the organization contributed to this tragedy.
Robert Faturechi, Megan Rose and T. Christian Miller — ProPublica
There’s a great review of Rasmussen’s safety boundary model, which I wasn’t previously familiar with. A system moves between three boundaries:
- the boundary to economic failure
- the boundary of unacceptable work load
- the boundary of functionally acceptable performance
Lorin Hochstein
This one includes a really nifty graph showing how reliable your N backend microservices need to be in order to hit a given reliability target R.
Bill Duncan
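The arithmetic behind a graph like this can be sketched in a few lines. This is a simplified model and not necessarily the one the linked article uses: it assumes every request must succeed on all N backends, that the backends fail independently, and that they all share the same reliability r. Under those assumptions R = r^N, so each backend needs r = R^(1/N). The function name is hypothetical.

```python
def required_backend_reliability(target: float, n_services: int) -> float:
    """Per-service reliability r needed so that r ** n_services == target.

    Assumptions (not from the article): all services sit in series on the
    request path, fail independently, and have identical reliability.
    """
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    if n_services < 1:
        raise ValueError("n_services must be at least 1")
    return target ** (1.0 / n_services)


if __name__ == "__main__":
    # Show how the per-service requirement tightens as N grows.
    for n in (1, 5, 10, 50):
        r = required_backend_reliability(0.999, n)
        print(f"N={n:3d}  required per-service reliability = {r:.6f}")
```

With a target of 99.9% and 10 backends, for example, each backend has to reach roughly 99.99%, which is why deep dependency chains make reliability targets hard to hit.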
Here are the results of the survey I linked here a couple weeks ago. There are some interesting and surprising results, well worth a read.
Rich Burroughs — FireHydrant
A commonly used CA’s root certificate expired, causing some havoc. Even though Sectigo did everything right, some software didn’t handle the transition to the new root well.
Paul Ducklin — Naked Security
Outages
- PagerDuty
- Coinbase
- Coinbase had an outage on June 1. Click for their post-incident analysis.
- Robinhood
- Robinhood’s status page doesn’t show history, so I can’t verify this one.
- iCloud
- eBay
- eBay’s status page also doesn’t show history, so I can’t verify this one either.
- Lloyds and Halifax (bank)
- Adobe Cloud
- Squarespace
- Their followup post discusses the large-scale DDoS that contributed to the outage.
- HostedGraphite
- Telegram