SRE Weekly - SRE Weekly Issue #236
Articles
A nice juicy post-incident report from the archives. Remember the first time you took down production?
Mads Hartmann — Glitch
While testing a new power transmission link, it was accidentally overloaded by a factor of ~14x, with far-reaching but ultimately well-managed effects.
Thanks to Jesper Lundkvist for this one.
As Facebook moved from a static to an auto-scaled web pool, they had to try to predict their expected demand as accurately as possible.
Daniel Boeve, Kiryong Ha, and Anca Agape — Facebook
The key lesson involves ensuring that your migrations avoid using parts of the production code, which could cause their action to change down the line inadvertently.
Frank Lin — Octopus Deploy
Cloudflare uses an interesting multi-layered approach to mitigating attacks.
Omer Yoachimik — Cloudflare
The availability/reliability distinction in this article is thought-provoking.
Emily Arnott — Blameless
2020 has shown the value of adaptive capacity. 2021 will show whether or not adaptive capacity can be sustained.
This article (not a video or podcast despite the name) also focuses on the increasing importance of learning from incidents.
Dr. Richard Cook — Adaptice Capacity Labs
What is resilience engineering? What does a resilience engineer do? Are there principles of resilience engineering? If so, what are they? What makes it possible to engineer resilience?
This academic paper uses a case study to show how a company engineered the resilience of their system in response to a series of incidents.
Richard I. Cook and Beth Adele Long — Applied Ergonomics
Outages
- Google Drive
- This is a post-analysis for two outages, one from this past week and the other from the week before.
- Discord
- Fastly
- Gandi
-
Postmortem regarding the Network Incident from September 15, 2020 on IAAS and PAAS FR-SD3, FR-SD5, and FR-SD6
A layer 2 network loop was accidentally introduced, on two separate occasions.
Sébastien Dupas — Gandi
-
- Azure
- This was an outage on Sept. 14 in the UK South region. A cooling system was shut off in error during a maintenance procedure.
|
Older messages
SRE Weekly Issue #235
Monday, September 14, 2020
View on sreweekly.com A message from our sponsor, StackHawk: Adding application security tests to your CI pipeline is simple. It typically takes <30 minutes to setup automated testing so you can be
SRE Weekly Issue #234
Monday, September 7, 2020
View on sreweekly.com Last Sunday, there was a major backbone Internet provider outage after I finished putting SRE Weekly together. There were so many outages that I'm not even going to bother
SRE Weekly Issue #233
Monday, August 31, 2020
View on sreweekly.com A message from our sponsor, StackHawk: Did you catch the GitLab Commit keynote by StackHawk Founder Joni Klippert? View on demand now to learn about how security got left behind,
SRE Weekly Issue #231
Tuesday, August 25, 2020
View on sreweekly.com I have a special treat for you this week: 7 detailed incident reports! Just a note, I'll be on vacation next week, so I'll see you in two weeks on August 23. A message
SRE Weekly Issue #232
Tuesday, August 25, 2020
View on sreweekly.com A message from our sponsor, StackHawk: Is your company adopting GraphQL? Adding security testing is simple. Watch this 20 minute walk through to see how easy it is to get up and
You Might Also Like
Youre Overthinking It
Wednesday, January 15, 2025
Top Tech Content sent at Noon! Boost Your Article on HackerNoon for $159.99! Read this email in your browser How are you, @newsletterest1? 🪐 What's happening in tech today, January 15, 2025? The
eBook: Software Supply Chain Security for Dummies
Wednesday, January 15, 2025
Free access to this go-to-guide for invaluable insights and practical advice to secure your software supply chain. The Hacker News Software Supply Chain Security for Dummies There is no longer doubt
The 5 biggest AI prompting mistakes
Wednesday, January 15, 2025
✨ Better Pixel photos; How to quit Meta; The next TikTok? -- ZDNET ZDNET Tech Today - US January 15, 2025 ai-prompting-mistakes The five biggest mistakes people make when prompting an AI Ready to
An interactive tour of Go 1.24
Wednesday, January 15, 2025
Plus generating random art, sending emails, and a variety of gopher images you can use. | #538 — January 15, 2025 Unsub | Web Version Together with Posthog Go Weekly An Interactive Tour of Go 1.24 — A
Spyglass Dispatch: Bromo Sapiens
Wednesday, January 15, 2025
Masculine Startups • The Fall of Xbox • Meta's Misinformation Off Switch • TikTok's Switch Off The Spyglass Dispatch is a newsletter sent on weekdays featuring links and commentary on timely
The $1.9M client
Wednesday, January 15, 2025
Money matters, but this invisible currency matters more. ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏
⚙️ Federal data centers
Wednesday, January 15, 2025
Plus: Britain's AI roadmap
Post from Syncfusion Blogs on 01/15/2025
Wednesday, January 15, 2025
New blogs from Syncfusion Introducing the New .NET MAUI Bottom Sheet Control By Naveenkumar Sanjeevirayan This blog explains the features of the Bottom Sheet control introduced in the Syncfusion .NET
The Sequence Engineering #469: Llama.cpp is The Framework for High Performce LLM Inference
Wednesday, January 15, 2025
One of the most popular inference framework for LLM apps that care about performance. ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏
3 Actively Exploited Zero-Day Flaws Patched in Microsoft's Latest Security Update
Wednesday, January 15, 2025
THN Daily Updates Newsletter cover The Kubernetes Book: Navigate the world of Kubernetes with expertise , Second Edition ($39.99 Value) FREE for a Limited Time Containers transformed how we package and