Incident Review and Postmortem Best Practices
👋 Hi, this is Gergely with this month’s free edition of the Pragmatic Engineer Newsletter. In every issue, I cover challenges at big tech and high-growth startups through the lens of engineering managers and senior engineers.
A survey of how companies deal with incidents today, and a peek into the best practices of the future.
One reason incidents are important is that they often reveal the real state of products, teams or organizations, which is often very different from the imaginary picture that engineering leaders have in their heads. Transparent incident reports and a good incident-handling strategy can inject much-needed realism into the development process. It’s hard to brush aside incidents that have caused specific damage. Indeed, these incidents are often powerful ways to make the case for work that would otherwise be delayed indefinitely.
The only certain thing about outages is that they will always happen. Everything else is up to us. How much effort do we put into preventing them, or into mitigating them quickly, or learning from them?
I’ve talked with dozens of engineers at a variety of companies and have found that no one is fully happy with how they are handling incidents. In this issue we cover:
The oncall process – monitoring and alerting – is beyond the scope of this article. We cover what happens starting from when an outage or incident is confirmed.
Note: This article mentions vendors specializing in incident handling-related services. I have not been paid to mention these companies. My newsletter is fully independent, does not offer sponsorships and I have no commercial affiliation with the vendors mentioned.
How Companies Handle Incidents
More than 60 teams have shared information on how they respond to outages, and what happens afterwards. See aggregated survey results here. 98.5% of companies sharing details have an incident management process in place. The only exception was a Series C property management company which used an ad-hoc process on top of email to manage incidents.
I was expecting more variation in the process, the tooling and the steps, but there were large overlaps across companies. This might be explained by selection bias, meaning the survey was mostly filled out by engineers whose teams already take incidents seriously. The majority of companies sharing data follow a process that can be summarized like this:
What tooling do teams use when handling incidents? Based on survey responses, these were the most common mentions: Less common approaches that a few companies mentioned:
Incident Handling Best Practices
What are practices that stand out in their effectiveness, when we think of incidents?
Blameless reviews/postmortems are worth talking more about. When doing a root cause analysis, avoid making it seem like a single person is responsible for the incident. Most outages will be caused by configuration or code changes that someone made, and it’s very easy to find out who it was. However, it could have been anyone else on the team. Instead of directly or indirectly putting the blame on one person, go further and look at why systems allowed anyone to make those changes without providing feedback. If the conditions that allowed the incident to happen are unaddressed, these conditions could trip up someone else on the team in future.
Some people resist the idea of “pushing” the use of blameless postmortems across an organization. “Will this not lead to a lack of accountability?” they ask. But accountability and a blameless culture are two separate things in my view, which can – and do – go hand in hand. Accountability means people take responsibility for their work, and when things inevitably go wrong, they take ownership of fixing things. A blameless approach recognizes and embraces that it’s counterproductive to fault someone for doing something they were unaware would cause an issue. This is especially the case when those people take accountability for addressing the reasons that led to the issue.
Well-run incident review meetings are key to sharing context with all involved, and to coming away with better learnings. Here is a structure that some of the best-run review discussions followed:
I sat down with Chris Evans, co-founder and CPO at incident.io, formerly head of platform at Monzo. I asked him which other practices he’s seen adopted by engineering teams which are great at incident handling. Here’s what he shared:
Beyond the Best Practices of Today
Observability platform Honeycomb stood out in their responses to my survey. They were one of the few firms which are moving away from templates and do not track action items. Unlike most other teams which want to do this, they actually took the step.
From many other teams, I would have interpreted this as trying out something that might or might not work. However, Honeycomb handles huge amounts of data, and provides stricter Service Level Agreements (SLAs) than most of their competitors. They also take reliability very seriously, so much so that a 5-minute delay in data processing is an outage they publicly report.
I asked Honeycomb engineer Paul Osman why they made this shift, and how it’s working out. Paul shared how two years ago, they noticed that action items coming up during reviews were not particularly interesting findings; they were mostly things the team was going to do, anyway.
Focusing on the learning during incident reviews, over explicitly tracking action items, has been the biggest shift, Paul shared. Teams and individual engineers still create their own tasks, not just for incident reviews, but for everyday work. They’ll also create those tasks even if there’s not an incident review. Paul said:
‘For us, the review is more about “who knew what when and how did they know it?” and “how did our systems surprise us?” instead of “what action items can we extract from this?”
‘Most of the interesting insights we’ve found are in a different category from tasks. They’re things like “new traffic patterns can show up as red herrings when debugging” or “Hey, this person is needed whenever we have a problem with this system, we should schedule someone to pair with them so they can go on vacation”.’
Honeycomb have left behind some practices that outlived their usefulness, and seem to be better off for it.
However, I’d add that they’re still a relatively small group – around 35 engineers – at a place where everyone operates with high autonomy.
Moving on from the Five Whys method for gathering more information is another approach used by teams which are great at incident handling. The Five Whys is still considered a best practice by many teams and is a common way to run the root-cause analysis process. The idea here is to ask “why” in succession, going deeper and uncovering more information each time. The framework is very easy to get started with when teams don’t do much digging into incidents. However, as Andrew Hatch at LinkedIn shares in the talk Learning More from Complex systems, there are risks to relying on the Five Whys:
John Allspaw also advocates against using the Five Whys in his article The Dangers of The Five Whys. He says that asking “why” is a sure way to start looking for a “who”, and to start looking for who is to blame for the outage. Instead of asking “why?” multiple times, consider asking “how did it make sense for someone to do what they did?”. Then dig deeper into the context that made those choices logical for the people acting. Here are some other questions to consider asking to run healthy debriefs.
Take a socio-technical systems approach to understand the outage. A root cause analysis or the Five Whys approach can be appealing because they simplify a post-incident effort, but they will not address the deeper systemic issues within your system, says Laura Maguire, Head of Research at Jeli.io.
Socio-technical systems approaches look at both the technical aspects – what broke – and the ‘social’ aspects: how the incident was handled. This gives an oncall team deeper information about software component failures. It also surfaces relevant organizational factors, such as who has specialized knowledge about different aspects of the system and which stakeholders need updates at what frequencies, in order to minimize impacts to their areas of the business. Some guidance for conducting a systemic investigation:
Incident Review Practices of Tomorrow
I sat down to talk about the future of incident management with John Allspaw, who has been heavily involved in this space for close to a decade. He was engineer #9 at Flickr at the time of Flickr’s massive growth phase. He then became the CTO of Etsy, where he worked for seven years.
For the past decade, John has been going deeper and deeper into how to build more resilient systems. He enrolled in Lund University in Sweden in the Human Factors and Safety program, a program where some of the leading thinkers in resilience engineering seem to have crossed paths. He has founded Adaptive Capacity Labs, which partners with organizations that want to further improve how they handle and learn from incidents.
As we talked for an hour, I kept being surprised by the depth of his understanding of resilient systems. It slowly made sense why he spent so much time studying non-tech related subjects, all connected to resilient engineering systems. John also exposed a world of decades-old research outside tech that I would have never thought to look to as inspiration to build better, more reliable systems, but which perhaps we all should.
Here’s a summary of our conversation, with my questions followed by John’s answers.
Why have you been spending so much time on the incident space?
‘Incidents create attention energy around them. When something goes wrong, people pay far more attention to everything around the event than when it’s business as usual. This is also why incidents and outages can be a catalyst to kicking off larger changes within any organization, not just tech companies.’
What is your view on industry best practices which many companies follow, like templates, incident reviews, follow-up items?
‘Incident handling practices in the industry are well-intentioned, and point in the right direction. However, these practices are often poorly calibrated. We talk a lot about learning from incidents, and some learning is certainly happening.
However, it’s not happening as efficiently as it could.
‘We often confuse fixing things fast with learning. Take the incident review that most companies follow. There’s an hour, at most, to go through multiple incidents. There is often more focus on generating follow-up items than learning from the incident. In fact, most people seem to think that by generating follow-up items, learning will also happen. However, this is far from the case.
‘When we talk with companies, we ask them to describe how they handle incidents. They jump in, talk about how they detect outages, how they respond, who plays what role, and which tools they use. They’ll often mention cliches like “we never let a good incident go to waste.”
‘However, most teams cannot describe a major incident in detail. We typically ask them to talk us through a specific incident, instead of sharing their generic approach. The answer is almost always along the lines of there’s a document, or a ticketing system where this is written down. However, when we find and start reading this document, it’s usually a disappointment and does a poor job of conveying takeaways.
‘Most incidents are written to be filed, not to be read or learned from. This is what we come across again and again. Teams go through incidents, they file a report, pat themselves on the back, and think they’ve learned from it. In reality, the learning has been a fraction of what it could have been.
‘The current incident handling approaches are only scratching the surface of what we could be doing. In several ways, tech is behind several other industries in how we architect reliable systems.’
What Tech Can Learn From Other Industries
What can we as software engineers learn from other industries?
‘Luckily, we have decades of studies on incidents and reliable systems across several industries. And there are plenty of learnings that apply to tech.
‘A common myth is that distributing learnings from incidents is the biggest blocker to improving more. Many teams and people will believe that if only they find a better way to share incident learnings – like make them easier to search or email them out to a larger group – then this will solve the issue of the organization improving from them.
‘However, this belief has been refuted by research many times. The key challenge is the author of a document cannot predict what will be novel or interesting for the reader. Whoever is writing the incident summary will not be able to tell what information will be well-known to the reader. The person writing the incident summary will also often not write down things they assume everyone else knows. However, many readers will not be familiar with them.
‘Studies repeatedly show that experts have a hard time describing what makes them an expert. This applies to incidents; experienced engineers who mitigate incidents efficiently will have a hard time describing what it was that allowed them to act as swiftly as they did.
‘Much of how we handle incidents is tacit knowledge, that which is not explicit. The question of “how do we build a better incident handling culture?” is not too different from “how do we help people become experts on a topic?”. And the answer needs to go beyond writing things down. A good example is how, to learn to skate, you cannot just rely on reading books about skating.’
Before I talked with John, I doubted that tech had anything to learn from other industries. ‘We’re in software, in tech, building things that have never been built before’, I thought. Look at mobile phones, cloud computing or Snapchat; none of this had ever existed before! However, the more we talked, the more it struck home how outages or incidents, stressful situations when something goes wrong, are not unique to software. In fact, often they don't have anything to do with software.
These types of unexpected, disruptive events have been happening since before the invention of the wheel. So of course it makes sense there is accumulated knowledge on how to prepare people for an incident they are yet to experience.
My biggest takeaway talking with John was how a written culture is not enough to create a great incident culture. Writing things down is important, but it won’t cut it alone.
I remember the best oncall onboarding process we had at Uber. It had nothing to do with documentation. It was a simulation of an incident. The facilitating team deliberately disrupted a non-critical service in a way similar to an incident that had happened in the past. The people being onboarded knew this was not a real incident, but they were called into a Zoom call in which one of them was named as Incident Commander and the facilitators played along. The team then did a debrief and analyzed what they could have done differently.
This exercise achieved far more than any studying of documentation could have. I do not have data to base this on, but it felt to me like the people who went through this simulation were far more prepared for the real thing. It felt like they had more confidence going on their first oncall, as they had already been through an incident.
This approach got me thinking. ‘Is this why the military does training exercises, despite the high cost?’ John did not answer, but asked the question: ‘Why do you think they do it? Would they do so if reading books or watching videos got them a similar result?’
There’s a lot that we in the software tech industry can learn about how to build resilient systems, by learning about how other industries have been building resilient systems.
The Opportunity to Build More Resilient Systems
As our conversation closed, John mentioned how he thinks the software industry is, in some ways, ahead of most other industries in building resilient systems.
He told me how he invited a person to the Velocity conference, one focused on resilience, performance and security in tech. This person spent decades researching resilient systems in healthcare, and assumed they would teach the audience. Instead, this person was amazed at how much they had to learn from tech. As John put it:
‘No other industry has as much, nor as detailed incident data available, as tech does. After an outage happens, engineers have access to code and configuration changes, logs and analytics, often down to the millisecond. You have all this data without having to do much preparation, or go through obstacles to get this data.
‘Compare this to, for example, the medical field. There, you have to do huge amounts of preparation to get data on what happens in an operating theatre. You need to install cameras and microphones. You get permission from all parties to record. You operate the recording equipment, then transfer all the data for processing after the operation.
‘Access to data is far more strict in, for example, the aviation industry. Let’s say you want to investigate a plane near-miss incident. To access the logs, you need to start a special process, and after many approvals, you can talk with the pilot only in the presence of a union representative, and potentially a lawyer. If you forgot to ask something relevant, you need to start the process again.
‘Most engineers believe progress in the software industry means progress in how we build software. However, I believe the real progress is how we get more and more data to see how incidents unfold to the point that we’ll be able to answer the question: what made this incident difficult?’
Software organizations already have all the data they need to improve how they operate, we both concluded. This is a massive advantage compared to all other industries. I was left wondering, do we realize the privileged position tech puts us in, of having both the data to work with, and the autonomy to do so?
And for those who do realize, will we take the opportunity to create much more resilient systems, challenging the best practices of 2021, and pushing ahead to a world where we use incidents to learn and adapt, not just track action items?
Great Incident Review Examples
Few things better show the privileged status of tech than how we already have access to some of the best incident reviews ever written. Companies such as Cloudflare, GitLab and many others have made these available for anyone who wants to read them. Do keep in mind that public blog posts about incidents are not the same as incident analyses intended for an organization to learn from, and will often not represent the whole story.
The Verica Open Incident Database built by Courtney Nash is the most exhaustive public incidents database you can find, and one I’d suggest browsing and bookmarking. Additionally, here are a few incident analyses that I especially enjoyed reading.
Conclusions
We’ve covered a lot of ground in this issue of the newsletter. Depending on where your team is with incidents, I’ll leave you with one of these pieces of advice:
Additional Resources
Resources:
Further related reading:
Further related videos:
Thanks to Chris, John, Laura and Paul for their input in this article, and Alexandru, John, Julik, Kurt, Miljian, Michał and Marco for their review comments.