Incident management in the real world

In our current Software-as-a-Service world, on-call and incident management have become very prevalent because of how constantly SaaS software is deployed. When this field was new, incident management culture borrowed heavily from other industries: airlines, automotive supply chains, defense, medicine, and others. We learned and borrowed a lot from them and added our own touches.

However, I have started seeing some concerning trends where the theory of incident management has diverged from the reality of how incidents work. A lot of the conversation assumes that what has worked, or not worked, in a particular set of organizations (usually FAANG) is what is best for everyone else. To start with, it's an open secret that most best practices from large engineering organizations aren't even practiced consistently inside those organizations. Yet all the books, articles, and talks lead many engineers to attempt these best practices in good faith, only to end up frustrated and disappointed when they don't work, because they were never tested in the real world. I want to share my view on some of the examples I have personally observed, in the hope that it can lead to some course correction.

First, a reminder that the most important goals of incident management practices are more reliable and resilient systems, and more productive engineers. Incident management best practices need to serve a business or user need; they cannot exist in isolation. Reliable systems also usually mean happier users and engineers.

Process recommendations that are anti-patterns

Here are some examples of one-size-fits-all recommendations that don't work, or are actively bad advice, depending on your organization. At a minimum, I did not find these theories to hold up in the real world.

  1. No leadership/managers in incident reviews - The recommendation that leadership not attend incident reviews, and an overall tendency to treat management as incompetent or adversarial. Leadership needs to be invested in understanding the systems their teams own and in keeping them reliable. If they are not allowed to be present, they cannot understand the ways in which systems they are ultimately responsible for are failing. While it's sadly true that leadership can be adversarial or create a blameful culture in some organizations, assuming this as the default and shutting them out of the room prevents teams from getting better.
  2. No discussion of follow-ups or mitigations. This comes from a place of trying to keep the focus on learning, but it gets in the way of the other, more important goal: ensuring that the underlying systems are improved. To make systems more reliable, there needs to be an appropriate investment in tracking the work that stems from the learning. If everyone in an incident review leaves understanding the system perfectly but no action is tracked and followed up on, the group is doomed to see repeat incidents. This excellent blog goes into it in far more detail.
  3. No root cause analysis or no root cause analysis reports. In most systems, users are a key part of the equation, and in the focus on incident management best practices, user needs sometimes get lost. A key example is the emphasis on there being no root cause and, by extension, no root cause analysis. This comes from a good place: it's understood that hunting for a single "smoking gun" isn't how most complicated software systems fail; they fail in complex and unpredictable ways. However, when users demand a root cause report, what they really want to know is that the problem they encountered was taken seriously, has been remediated, and is unlikely to happen again. Ignoring users' needs, or worse, condescending to them, only alienates users, who care not that you followed a blameless process but that the systems they rely on stay up.

These examples were the easy part, because in most cases, people try to follow a particular best practice, find it doesn't work in reality and move on. What follows are broader topics where I have found the general consensus approach around incident management research to not work in reality.

Metrics are good actually

There is a lot of conversation about bad, useless, or harmful metrics in incident management. It should be obvious that a metric that purely tracks and incentivizes something like "number of incidents" isn't a good idea. But this very reasonable idea has been extended into pushback against all metrics. A mantra commonly repeated is Goodhart's law: "When a measure becomes a target, it ceases to be a good measure." However, this rests on a misunderstanding of the law itself and its misuse, which this article goes into in depth. The most important point from the linked article is this guardrail: "Avoiding Goodhart's Law requires you to also give people the space to improve the system." This is very relevant to incident management, where a cornerstone of improving underlying systems is giving the teams that own them the space to make them more resilient. Focusing on useful metrics, with guardrails against misuse, allows for better use of data than ignoring data completely.
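To make the distinction concrete, here is a minimal sketch of what a more useful metric than a raw incident count might look like: trending median time-to-mitigate per service, so a team can see whether its own system is getting easier to recover. The data shape and service names here are entirely hypothetical, just for illustration.

```python
from datetime import datetime
from statistics import median

# Hypothetical incident records: (service, detected_at, mitigated_at).
incidents = [
    ("checkout", datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 40)),
    ("checkout", datetime(2024, 2, 11, 14, 5), datetime(2024, 2, 11, 16, 0)),
    ("search",   datetime(2024, 1, 20, 2, 30), datetime(2024, 1, 20, 2, 45)),
]

def mitigation_minutes(records, service):
    """Median time-to-mitigate in minutes for one service."""
    durations = [
        (mitigated - detected).total_seconds() / 60
        for svc, detected, mitigated in records
        if svc == service
    ]
    return median(durations) if durations else None

# Trend this per team over time, rather than comparing raw incident
# counts across teams, which mainly incentivizes under-reporting.
print(mitigation_minutes(incidents, "checkout"))  # 77.5
```

The guardrail from the quoted article still applies: a metric like this only helps if the owning team is also given the time and space to act on what it shows.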

A little learning is a dangerous thing or beware of performative learning

The learning about complex systems cannot be decoupled from working on the actual system. For anyone who has both worked on live systems and participated in incident learning, the impact of learning activities, in descending order, usually is:

  • Working on the system during the incident 
  • Following the incident live
  • Fixing follow ups in the system
  • Participating in learning discussions post incident
  • Reading about the discussions in an async manner

If investments are disproportionately allocated to the lower items, it doesn’t actually help with the goal of learning or making the underlying systems more resilient. Learning is only the path to the real goal of making systems more reliable and the best learning happens by working on the system. 

To summarize, incident management best practices fundamentally need to be grounded in the world around them. They need to account for the people, teams, businesses, and users they serve. I close with this comic, which sometimes reminds me of the difference between theory and practice in many different fields.