SRE Weekly - SRE Weekly Issue #455

View on sreweekly.com

A message from our sponsor, FireHydrant:

FireHydrant Retrospectives are now more customizable and collaborative than ever with custom templates, AI-generated answers, collaborative editing... all exportable to Google Docs and Confluence. See how our retros can save you 2+ hours on every incident.

https://firehydrant.com/blog/welcome-to-your-new-retrospective-experience-more-customizable-collaborative/

This article has 6 methods to mitigate thundering herd problems, including pretty diagrams with each.

  Sid

Some thoughts on the "second victim" concept. As a note, I was one of the participants in the discussion on which this article is based.

  Fractal Flame

Written in response to a question about the big CrowdStrike outage earlier this year, this article asks: do we need to start using safer languages?

  Kode Vicious — ACM Queue

This one used a cool technique I haven't seen yet: they hardcoded a cutoff time into the old and new systems, so they both automatically cut over simultaneously.

   Md Riyadh, Jia Long Loh, Muqi Li, and Pu Li — Grab

Here's a great writeup of a problem with the UK flight system involving a latent bug. Among several cool takeaways, I really liked the way the official incident report didn't try to pretend this weird bug could have been foreseen and prevented.

  Chris Evans — incident.io

This game day ended up way more serious than intended and exposed a serious Kubernetes configuration flaw, causing a real outage. Oops!

  Lawrence Jones

It's all fun and games until someone accidentally uses too much DTAZ (data transfer between availability zones). Good monitoring story, too!

  Grzegorz Skołyszewski — Prezi

OpenAI posted this writeup of an incident earlier this week. They tried to deploy detailed monitoring for their Kubernetes cluster, but the monitoring system overloaded the Kubernetes API.

  OpenAI

And here's Lorin Hochstein's analysis of OpenAI's incident writeup, including a recurring theme:

This is a great example of unexpected behavior of a subsystem whose primary purpose was to improve reliability.

  Lorin Hochstein







This email was sent to you
why did I get this?    unsubscribe from this list    update subscription preferences
SRE Weekly, a production of Tinker Tinker Tinker, LLC · PO Box 253 · South Lancaster, MA 01561-0253 · USA

Older messages

SRE Weekly Issue #454

Tuesday, December 10, 2024

View on sreweekly.com Nine entire years ago, I threw together a few "issues" with my favorite SRE articles, installed Wordpress, and added a subscription form, with no clue what I was doing.

SRE Weekly Issue #453

Monday, December 2, 2024

View on sreweekly.com A message from our sponsor, FireHydrant: Why migrate from PagerDuty? Empower team-level ownership, reduce costs, decouple alerts from incidents, automate incidents end-to-end...to

SRE Weekly Issue #452

Monday, November 25, 2024

View on sreweekly.com A message from our sponsor, FireHydrant: Practice Makes Prepared: Why Every Minor System Hiccup Is Your Team's Secret Training Ground. https://firehydrant.com/blog/the-hidden-

SRE Weekly Issue #451

Monday, November 18, 2024

View on sreweekly.com A message from our sponsor, FireHydrant: Practice Makes Prepared: Why Every Minor System Hiccup Is Your Team's Secret Training Ground. https://firehydrant.com/blog/the-hidden-

SRE Weekly Issue #450

Monday, November 11, 2024

View on sreweekly.com A message from our sponsor, FireHydrant: Practice Makes Prepared: Why Every Minor System Hiccup Is Your Team's Secret Training Ground. https://firehydrant.com/blog/the-hidden-

You Might Also Like

Data Science Weekly - Issue 588

Thursday, February 27, 2025

Curated news, articles and jobs related to Data Science, AI, & Machine Learning ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏

💎 Issue 458 - Why Ruby on Rails still matters

Thursday, February 27, 2025

This week's Awesome Ruby Newsletter Read this email on the Web The Awesome Ruby Newsletter Issue » 458 Release Date Feb 27, 2025 Your weekly report of the most popular Ruby news, articles and

📱 Issue 452 - Three questions about Apple, encryption, and the U.K

Thursday, February 27, 2025

This week's Awesome iOS Weekly Read this email on the Web The Awesome iOS Weekly Issue » 452 Release Date Feb 27, 2025 Your weekly report of the most popular iOS news, articles and projects Popular

💻 Issue 451 - .NET 10 Preview 1 is now available!

Thursday, February 27, 2025

This week's Awesome .NET Weekly Read this email on the Web The Awesome .NET Weekly Issue » 451 Release Date Feb 27, 2025 Your weekly report of the most popular .NET news, articles and projects

💻 Issue 458 - Full Stack Security Essentials: Preventing CSRF, Clickjacking, and Ensuring Content Integrity in JavaScript

Thursday, February 27, 2025

This week's Awesome Node.js Weekly Read this email on the Web The Awesome Node.js Weekly Issue » 458 Release Date Feb 27, 2025 Your weekly report of the most popular Node.js news, articles and

💻 Issue 458 - TypeScript types can run DOOM

Thursday, February 27, 2025

This week's Awesome JavaScript Weekly Read this email on the Web The Awesome JavaScript Weekly Issue » 458 Release Date Feb 27, 2025 Your weekly report of the most popular JavaScript news, articles

💻 Issue 453 - Linus Torvalds Clearly Lays Out Linux Maintainer Roles Around Rust Code

Thursday, February 27, 2025

This week's Awesome Rust Weekly Read this email on the Web The Awesome Rust Weekly Issue » 453 Release Date Feb 27, 2025 Your weekly report of the most popular Rust news, articles and projects

💻 Issue 376 - Top 10 React Libraries/Frameworks for 2025 🚀

Thursday, February 27, 2025

This week's Awesome React Weekly Read this email on the Web The Awesome React Weekly Issue » 376 Release Date Feb 27, 2025 Your weekly report of the most popular React news, articles and projects

February 27th 2025

Thursday, February 27, 2025

Curated news all about PHP. Here's the latest edition Is this email not displaying correctly? View it in your browser. PHP Weekly 27th February 2025 Hi everyone, Laravel 12 is finally released, and

📱 Issue 455 - How Swift's server support powers Things Cloud

Thursday, February 27, 2025

This week's Awesome Swift Weekly Read this email on the Web The Awesome Swift Weekly Issue » 455 Release Date Feb 27, 2025 Your weekly report of the most popular Swift news, articles and projects