SRE Weekly Issue #441
This post aims to shed some light on why we migrated to Prometheus, as well as outline some of the technical challenges we faced during the process.
Eddie Bracho — Mixpanel
Amazon posted this thorough summary of a multi-service outage at the end of July. The impact stems from a complex distributed system failure in Kinesis.
Amazon
This team shows what they did to ferret out and eliminate occurrences of N+1 DB queries triggered by a single request in their Django app.
Gonzalo Lopez — Mixpanel
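For readers new to the term, here is a minimal sketch of the N+1 pattern itself, using only Python's built-in sqlite3 module rather than Django. The `author` and `book` tables and both function names are hypothetical, and this is not the method from the linked article. In Django, the usual fix is `select_related` or `prefetch_related` on the queryset, which has the same effect as the join below.

```python
import sqlite3


def setup():
    """Create a tiny in-memory database (hypothetical schema for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE book (
            id INTEGER PRIMARY KEY,
            title TEXT,
            author_id INTEGER REFERENCES author(id)
        );
        INSERT INTO author VALUES (1, 'Ada'), (2, 'Grace');
        INSERT INTO book VALUES (1, 'A', 1), (2, 'B', 2), (3, 'C', 1);
        """
    )
    return conn


def titles_n_plus_one(conn):
    """One query for the books, then one extra query per book for its author."""
    queries = 1
    books = conn.execute("SELECT title, author_id FROM book ORDER BY id").fetchall()
    result = []
    for title, author_id in books:
        (name,) = conn.execute(
            "SELECT name FROM author WHERE id = ?", (author_id,)
        ).fetchone()
        queries += 1
        result.append((title, name))
    return result, queries


def titles_joined(conn):
    """The same data fetched with a single JOIN."""
    rows = conn.execute(
        "SELECT book.title, author.name FROM book "
        "JOIN author ON author.id = book.author_id ORDER BY book.id"
    ).fetchall()
    return rows, 1
```

Both functions return the same rows. The first issues one query plus one per book, so its cost grows with the size of the result set, while the second always issues one.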
The folks at incident.io explain how they baked observability into the infrastructure for their new on-call tool.
Note for folks using screen readers: there's a picture without alt-text that contains the following important text:
- Overview dashboard
- System dashboard
- Logs
- Tracing
It's right after this sentence:
Those pieces fit together something like this:
Martha Lambert — incident.io
An overview of deterministic simulation testing (DST), which was a new concept for me. It involves running simulations to find faults in a distributed system.
Phil Eaton
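To make the idea concrete, here is a toy sketch of the core trick in simulation testing: every source of nondeterminism (event order, message drops, reordering) is driven by one seeded random number generator, so any failing run can be replayed exactly from its seed. The primary/replica model and the `simulate` function are hypothetical illustrations and far simpler than anything in the linked article.

```python
import random


def simulate(seed, steps=50):
    """Run one simulated schedule; the seed fully determines what happens."""
    rng = random.Random(seed)  # the only source of nondeterminism
    primary, replica = 0, 0
    in_flight = []  # replication messages sent but not yet delivered
    trace = []
    for _ in range(steps):
        event = rng.choice(["write", "deliver", "drop"])
        if event == "write":
            primary += 1
            in_flight.append(primary)
        elif event == "deliver" and in_flight:
            # Pick any pending message, which models network reordering.
            msg = in_flight.pop(rng.randrange(len(in_flight)))
            replica = max(replica, msg)
        elif event == "drop" and in_flight:
            # Lose a pending message, which models an unreliable network.
            in_flight.pop(rng.randrange(len(in_flight)))
        # Invariant checked after every step: the replica never gets ahead.
        assert replica <= primary, f"invariant violated with seed {seed}"
        trace.append((event, primary, replica))
    return trace
```

A test harness would typically loop over many seeds and report the seed of any run that trips the invariant, so the failure can be reproduced by calling `simulate` with that seed again.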
If you build software that people depend on and are not operationally responsible for it (particularly on-call): you should be. 🛑
I like the way this one draws from the author's experience, plus the emphasis on feedback loops.
Amin Astaneh
Retries help increase service availability. However, if not done right, they can have a devastating impact on the service and elongate recovery time.
Rajesh Pandey
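As a rough illustration of what "done right" often means in practice, here is a sketch of capped exponential backoff with full jitter and a bounded number of attempts. These are common mitigations, not necessarily the ones the linked article recommends. The function name, parameter names, and defaults are all assumptions chosen for the example.

```python
import random
import time


def retry_with_backoff(
    op,
    max_attempts=4,
    base_delay=0.1,
    max_delay=2.0,
    retry_on=(ConnectionError, TimeoutError),
    sleep=time.sleep,
    rng=random.random,
):
    """Call op(), retrying a bounded number of times on retryable errors."""
    for attempt in range(max_attempts):
        try:
            return op()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: surface the error to the caller
            # Exponential backoff, capped so delays cannot grow without bound.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random amount in [0, cap) so that many
            # clients failing at once do not all retry at the same moment.
            sleep(rng() * cap)
```

The `sleep` and `rng` parameters are injectable only so the behavior can be tested without real waiting. The bounded attempt count and the jitter are the parts that keep retries from piling extra load onto a service that is already struggling.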
Keepalive pings are critical in any system that uses TCP, since connections can hang at any point. I've been meaning to write this one for years!
Lex Neva — Honeycomb
Full disclosure: Honeycomb is my employer.
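One kernel-level form of this is TCP keepalive, which can be switched on per socket. The sketch below assumes that is an acceptable stand-in for illustration; the linked article may be discussing application-level pings instead. The helper name and the timing values are hypothetical. The `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` options are not available on every platform, so the code sets them only where Python's socket module exposes them.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keepalive probes for sock and tune them where supported."""
    # SO_KEEPALIVE asks the kernel to probe an idle connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    tuning = (
        ("TCP_KEEPIDLE", idle),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # failed probes before giving up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:  # skip options this platform does not define
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

With the example values and on a platform that supports all three options, a peer that silently disappears would be detected after roughly `idle + interval * count` seconds, instead of the connection hanging indefinitely.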