SRE Weekly Issue #390
Many apologies to my email subscribers, who have seen two accidental re-sends of old issues recently due to a weird glitch in my automation. I think I’ve gotten a handle on it, and I’ll run an internal retrospective of this incident, of course.
Articles
Is it really SRE vs platform engineer? Or is there a way platforms can take site reliability to the next level?
Jennifer Riggins — The New Stack
A surgeon delves into the key component that allows a group of skilled individuals to work effectively and safely together, using the term “heed” to describe this special interaction.
Sidenote: in a hilarious coincidence this article managed to spoil me on a movie I was in the middle of watching (Arrival) — but it also put me in a really cool mindset to watch the rest of the film.
Dr. Rob Poston
More details on Square’s outage from a couple weeks ago (it was DNS).
Square
Azure had an interesting outage in its Australia East region involving a power failure and the order in which cooling units were restored.
Microsoft Azure
Asking "How did it make sense at the time?" is how you unlock the hidden essence of an incident. This talk compares two public incident reports to show what it looks like when you dig into this question and when you don’t.
Jacob Scott — InfoQ
In this air accident, the pilots made a seemingly inexplicable mistake.
This sentence really stood out to me, especially after reading the “How Did It Make Sense at the Time?” article:
When we inexplicably grab the wrong utensil when cooking or accidentally start taking our dirty dishes to the bathroom instead of the kitchen, we should be thankful that we aren’t responsible for a plane full of people.
Admiral Cloudberg
There’s an interesting failure mode in this one that might stand out for the Kafka admins among us:
The Kafka consumer ended up stuck in a loop, unable to stabilize fast enough before timing out and restarting the coordination process.
Jakub Oleksy — GitHub
After explaining the difference between the ITIL terms “incident management” and “problem management”, this article goes into a discussion of recent trends and whether it still makes sense to draw a distinction between the two.
Luis Gonzalez — incident.io