Preventing Incidents, Part 2: Everything Else
Normalization of deviance and other things you should think about
Disclaimer: my opinions are informed by my time at Stripe and AWS, but my thoughts are my own and not necessarily shared by my current or former employers.
This is part two of my reflections on incidents so far in 2023. You can find part one here.
To recap: consuming a large amount of information about how systems fail is one of the greatest privileges of working at Stripe (and, formerly, at Amazon Web Services). This year, I set a goal for myself to sit in on as many of these incident discussions as possible and try to extract some common threads to share back to the organization. I’m taking the opportunity now to share some of the observations I’ve made so far that are generalizable and appropriate to publish externally.
I devoted part one to testing. For part two, I will go into some other areas.
Change management
Change management often gets a bad rap. When I speak of change management, I am specifically talking about the policies, procedures, and tooling used to manage infrastructure changes and software deployments. This usually centers around a form that should be filled out to describe a planned change and track its execution, but it also encompasses the policies and practices around when such a form is used, who must approve changes, and so on.
Change management has this reputation because it is often perceived as (or actually becomes) a bureaucratic process that gets in the way of productivity. However, when applied appropriately, change management can be a powerful tool for de-risking work that can’t be de-risked by other mechanisms such as testing.
Lack of change management can lead to issues such as missed steps during rollout, an inability to detect when something has gone wrong, and an inability to roll back.
What can be done?
There are two separate points to discuss: when to use a more formal change management process, and what such a process should look like.
In my personal experience, a formal change management process is more effective, and more respected by engineering teams, when it is used sparingly and limited to one-off changes that are not a routine part of operations.
Routine changes, such as software deploys, are best handled by automated tooling that has safeguards built in and takes humans out of the loop. Automation should be used wherever possible, with a manual change management process reserved for the activities that genuinely must be performed by hand.
There are much richer resources out there on developing a change management process, so I will just cover the major points.
A change management document minimally requires:
Description of a change to be performed
Procedure to follow to safely complete that change
Review / approval process for the description and procedure
Tracking of the change while it is being executed
This can be done in a dedicated tool, a Google Doc, or anything in between.
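To make that concrete, here is a rough sketch, in Python and purely for illustration, of the fields such a change record needs to carry. The names (`ChangeRecord`, `ChangeStatus`, and so on) are mine, not any particular tool’s schema; most teams will track the same information in a form or a doc rather than in code.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class ChangeStatus(Enum):
    DRAFT = "draft"
    APPROVED = "approved"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    ROLLED_BACK = "rolled_back"


@dataclass
class ChangeRecord:
    """A minimal change record: description, procedure, approvals, and tracking."""
    title: str
    description: str                    # what is changing, and why
    risks: list[str]                    # known ways this could go wrong (pre-mortem output)
    procedure: list[str]                # ordered steps to execute the change
    rollback: list[str]                 # ordered steps to undo the change
    approvers: list[str] = field(default_factory=list)
    status: ChangeStatus = ChangeStatus.DRAFT
    log: list[tuple[datetime, str]] = field(default_factory=list)

    def record(self, event: str) -> None:
        """Track execution by timestamping each event as it happens."""
        self.log.append((datetime.now(timezone.utc), event))
```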
What is most critical, however, is getting the description and procedure right.
A description should of course discuss the change being made. However, it must also acknowledge the potential risks. It may go as far as to link to a pre-mortem exercise, in which the team performing the change has brainstormed all the ways it could go wrong. This is critical information both for the person writing the procedure and for anyone reviewing the procedure for soundness.
The procedure itself must include:
An outline of the steps required to perform the change. Enough detail should be in the procedure such that one does not need to rely on outside resources in order to execute the change. If this is not possible, necessary outside links must be provided in the procedure. This allows everyone reviewing the document to be sure that their understanding of the steps to be performed matches what actually will be done. I also recommend breaking the steps into sections or “acts”, with clear breakpoints where the procedure can be safely paused (or must be paused).
Rollback instructions, preferably for each “act”. The rollback instructions provide details on the steps needed to move the system back to the previous state. This is a place where it is easy to get lazy, but the rollback instructions must be written with the same rigor as the procedure. (You’ll be glad this is the case if you’re actually executing the rollback instructions.)
Observability. For each section, it should be clear which signals will be monitored to check whether the system is healthy or whether a rollback must be performed. The procedure should include not only the signals to be monitored, but also what healthy and unhealthy signals look like.
Lastly, if the procedure that is about to be executed is new (and especially if it involves any bespoke tooling), it should be tested in a pre-production environment. Not only should the happy path be tested, but any rollback procedures as well. The gold standard for a rollback procedure is one that you’ve tested in production.1 If possible, game-day failure scenarios during the rollout and ensure the rollback procedures handle them. A rollback option that hasn’t been tested may as well not exist.
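To make the idea of “acts” with rollbacks and pause points concrete, here is a minimal sketch of an act-based procedure runner. It is purely illustrative: `Act`, `run_change`, and the health checks are hypothetical names, not real tooling, and a real change would usually be driven by a human following the document rather than a script.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Act:
    """One section of a change procedure, with its own rollback and health check."""
    name: str
    execute: Callable[[], None]      # forward steps for this act
    rollback: Callable[[], None]     # undoes this act; written with the same rigor
    is_healthy: Callable[[], bool]   # the observability signal: safe to continue?


def run_change(acts: list[Act]) -> None:
    """Execute acts in order; on an unhealthy signal, roll back in reverse order."""
    completed: list[Act] = []
    for act in acts:
        print(f"--- {act.name}: executing")
        act.execute()
        if not act.is_healthy():
            print(f"--- {act.name}: unhealthy signal, rolling back")
            for done in reversed(completed + [act]):
                done.rollback()
            raise RuntimeError(f"change aborted during {act.name!r}")
        completed.append(act)  # safe breakpoint: the procedure can pause here
```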
Observability
A lack of observability will rarely be the root cause of an issue, but the ability to notice when something has gone wrong, and to quickly identify what that something is, can certainly mean the difference between minor and major impact.
What can be done?
Modern tools and libraries can make creating extensive dashboards and alarms easy, but it is important to build with intention, and continuously refine the signals you’re looking at.
Dashboards should “tell a story” about what is going on. The “so what” of every graph on a dashboard should be clear; if it isn’t self-evident, it should at least be spelled out in a written description. Graphs which don’t provide a useful signal should be removed, or at least demoted to a less visible location where they won’t confuse someone trying to understand the state of a service.
There should be a high signal-to-noise ratio. Developers should not become accustomed to graphs which “look bad” or behave erratically even when the service is behaving as expected, because that habit significantly lowers the chance that someone notices a real issue in the future.
Ideally, your tooling should let you quickly dive into the data when a deviation is apparent. Error rates up? It should be possible to get to the logs of a failed request relatively quickly. Latency up? The faster you can get to a latency breakdown, the better.
Alarms that actively alert operators must also have a high signal-to-noise ratio. Folks should not become accustomed to ignoring alarms; at the same time, alarms must reliably fire when an actual issue occurs.
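As a toy illustration of what “writing down the thresholds” might look like, here is a sketch of an alarm definition where the unhealthy threshold, the sustain period, and the runbook link are all explicit. This is not any monitoring product’s API; every name, number, and URL below is made up.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    """A toy alarm spec: the point is that thresholds and the runbook are explicit."""
    name: str
    metric: str
    unhealthy_above: float       # what "bad" looks like, written down
    sustained_minutes: int       # require the signal to persist, to cut noise
    runbook_url: str             # every page should point at a runbook

    def should_page(self, per_minute_samples: list[float]) -> bool:
        """Page only if every sample in the trailing window breaches the threshold."""
        window = per_minute_samples[-self.sustained_minutes:]
        return len(window) == self.sustained_minutes and all(
            sample > self.unhealthy_above for sample in window
        )


# Hypothetical example: page when checkout 5xx rate exceeds 1% for five straight minutes.
checkout_5xx = Alarm(
    name="checkout-5xx-rate",
    metric="http.5xx.rate",
    unhealthy_above=0.01,
    sustained_minutes=5,
    runbook_url="https://runbooks.example.com/checkout-5xx",
)
```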
I also strongly believe the user experience of the tooling you use to implement observability is critical. If creating a dashboard or alarms is difficult or confusing, folks are much less likely to create the right observability signals or keep them up to date.2
Lastly, always have a plan B for when your monitoring itself goes down. In the worst case, this may look like “SSH to some critical services and tail logs,” but have something in mind and, most importantly, write down what you would do.
Runbooks
Runbooks in this context are procedures written for responding to alarms. Like observability, good runbooks may not prevent something from going wrong, but they can certainly help stop things from going from bad to worse.
What can be done?
All alerts to developers must be accompanied by a runbook.
The most important trait of a good runbook is that it provides clear, unambiguous guidance for what to do. Remember, your runbook may be followed by a relatively inexperienced developer at 1 am.3 No runbook can identify a remediation for every single possible type of failure, but it should at the very least provide clear guidance on where to find logs and metrics. If your runbook can get you to the bottom of the stack trace, or point you at the broken dependency and the contact for that dependency, that is pretty good.
Having a template can reduce the friction of writing runbooks while also providing some consistency. While at Stripe, I worked with another engineer to develop a template for runbooks going forward. We came up with the following sections (a rough sketch of the template follows the list):
1-2 sentence description of the detector and what it is telling you.
1-2 sentence description of the potential impact, so the responding operator understands the severity.
Temporary callouts about ongoing work that may impact this detector. For example: “we’re currently rolling out feature x, which could trigger this alert. If this alert fires, start by turning off the y feature flag.” These are not expected to be a permanent part of the runbook.
Troubleshooting steps. These are expected to be run in order. Each step is also written in a clear “if this, then that” format. We also used a visual language to clearly label links to logs and graphs, so that they’d stick out, and folks would not be surprised at where a link is taking them.
Escalation instructions for when the runbook has not resolved the issue and next steps are not otherwise clear.
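Here is a rough skeleton of that template, written out as a Python string constant purely so there is something concrete to copy from. The section names mirror the list above, and the placeholder text is mine, not Stripe’s actual template.

```python
# A skeleton of the runbook sections described above. Section names and
# placeholders are illustrative; adapt them to your own tooling.
RUNBOOK_TEMPLATE = """\
# <Detector name>

## What this detector means
<1-2 sentences describing the detector and what it is telling you.>

## Potential impact
<1-2 sentences so the responder understands the severity.>

## Temporary callouts
<Notes about ongoing work that may affect this detector. Remove when no longer relevant.>

## Troubleshooting steps
1. If <symptom>, then <action>. [LOGS: <link>] [GRAPH: <link>]
2. If <symptom>, then <action>. [LOGS: <link>]

## Escalation
<Who to contact, and how, if the steps above have not resolved the issue.>
"""
```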
Lastly, schedule time (at least annually) for the team to review its runbooks and update them so that all the information in them is still accurate.
Incident reporting
By “incident reporting” I mean the process of documenting, reviewing, and sharing lessons learned from incidents. Every incident is an opportunity to identify new safety mechanisms that need to be built, practices that need to be changed, and so on.
What can be done?
Establish a culture of following up on incidents, identifying causes, planning remediation work, and sharing out lessons. What exactly this looks like will depend on the existing institutions of a company. Also establish a bar for what constitutes an incident, and what level of review is appropriate for each level of impact.
Create a template that captures the information that should be collected for each incident. A timeline and quantified impact are great for hard data, but I have found there is a large amount of value in creating space for a team to write a narrative about what happened through the eyes of the folks that worked on the incident. A “five whys” approach can be useful for unblocking thinking, but also has the risk of being applied too rigidly.
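If it helps, here is a sketch of the kind of structure such a template might capture; the field names are illustrative, not the template my team actually used.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEntry:
    at: datetime
    event: str                          # e.g. "alarm fired", "rollback started"


@dataclass
class IncidentReport:
    """Fields an incident template might ask every team to fill in."""
    title: str
    impact: str                         # quantified: failed requests, minutes of downtime
    timeline: list[TimelineEntry]       # the hard data
    narrative: str                      # the story, told by the folks who worked the incident
    five_whys: list[str] = field(default_factory=list)     # useful, but don't apply rigidly
    action_items: list[str] = field(default_factory=list)  # remediation work to plan
```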
Incident reviews should be blameless: the blame should not be laid at the feet of specific engineers or teams; rather, the review should strive to understand the human-computer interactions which led to the incident taking place (if applicable).
Lastly, it is important to recognize the work that goes into writing an incident report and facilitating the review around it.
At some future date, I will write a post about the operational and incident review process I developed for my organization at Stripe.
Conclusion
I’ve covered a lot of different topics in these two posts, so before wrapping it up, I would like to discuss a few overarching themes that I’ve alluded to.
Normalization of deviance
The greatest risk to any of the best practices I’ve mentioned is the normalization of deviance. In short, normalization of deviance occurs when the habit of skirting or not following best practices becomes culturally acceptable within an organization.
Some examples of normalizing deviance could be not creating or updating unit tests for a change, performing risky manual actions in production without following an organization’s change management processes, or creating detectors without runbooks. This is only a partial list; normalization of deviance from any practice is possible.
If there is one act that will torpedo any attempts at operational rigor, it is a culture of normalizing deviance.
The most effective antidote is to quickly identify cases where a process is not being followed and then either:
Explicitly and visibly prioritize diligence around wherever it is you’re slipping, to bring the organization’s habits back in line.
Edit the process so that what is written on paper more closely aligns with what folks believe is the appropriate level of rigor.
Slow is smooth; smooth is fast; safety begets efficiency
While the principal topic has been reducing the occurrence and severity of incidents, the suggestions in this post and the previous one are not just about safety and reliability. A company which has built the tools and processes to execute safely is also one where engineers can execute with more confidence.
This is most evident when it comes to testing: if an engineer can edit a code base to add/remove/change functionality and have confidence that the testing infrastructure will catch any unexpected side effects, that engineer can move their code through the pipeline faster and deploy with more confidence. Reviewers of the code can also focus their attention on the intended behavior of the change, knowing that tests largely have their back.
Similarly, mature change management practices can remove the toil and uncertainty around rolling out a big manual change. Even if the process can feel bureaucratic to teams, at least it is a known quantity. A predetermined approval process can also reduce churn and allow a team to move forward confidently.
All of this is to say that proper investment in the areas discussed in these last two posts can yield not only decreased downtime, but also greater developer productivity.
Further Reading
I want to take a moment to acknowledge that this post and the preceding one have only begun to touch on everything involved in building and maintaining resilient systems. I haven’t even touched on writing software and services to be more resilient in the first place! For that topic, there are many great books out there, of which I’ve read absolutely none, but I would probably start with the classic Designing Data-Intensive Applications.
For more on operating large scale systems, Google’s Site Reliability Engineering book remains a classic, though I will warn against considering it the final word on any topic.4
If you’ve made it this far, thank you for reading! I look forward to sharing more thoughts and tales from the trenches!
1. The rollback procedure itself must be safe enough that folks are comfortable executing it in production.
2. See my upcoming post, “Why SignalFX is actively harmful.”
3. This is not the time to make someone read an essay or architecture design document in order to understand your alert.
4. For many years, I have hoped that someone at AWS would write their own O’Reilly book on the same topic.