One of my passion projects at Stripe is working to improve the engineering culture of my organization. Last year, I had the honor of revamping the process by which my organization reviews the health of its systems on a weekly basis. What follows is an abbreviated retelling of my experience in this domain and the lessons I learned from building this infrastructure for my organization at Stripe.
Disclaimer: though informed by my experience as an employee of Amazon and Stripe, my thoughts and opinions are my own and may not be shared by my former or current employer.
Glossary
EM: Engineering[1] Manager. Someone who manages a team of around 5 to 8 ICs.
IC: Individual Contributor, generally a software developer who spends their time building and maintaining things.
MoM: Manager of Managers. An individual who manages a handful of EMs.
SLA: Service Level Agreement. A promise (often contractual) to customers or users about how your service will perform or how reliable it will be.
SLC: Service Level Commitment. An internal commitment to how services should perform or how reliable they should be. Less formal than an SLA. At Amazon, the term “internal SLA” was used in lieu of SLC.
SLO: Service Level Objective. A goal set by a team for how well it wants a service to perform, but not as formal or broadly communicated as an SLC.
Note that these terms may vary in their usage from company to company, or even within the same company. My goal is to define them as I am using them in this essay.
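To make the distinctions a bit more concrete, here is a minimal sketch (in Python, with hypothetical numbers) of the kind of availability calculation that sits behind an SLA, SLC, or SLO; the target and traffic figures are illustrative assumptions, not values from any real service.

```python
# Hypothetical illustration: the same availability math underlies an SLA,
# an SLC, and an SLO; what differs is how formally the target is promised.
TARGET_AVAILABILITY = 0.999  # e.g. "99.9% of requests succeed" (assumed target)

def availability(successful_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded over the measurement window."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing was violated
    return successful_requests / total_requests

weekly = availability(successful_requests=9_993_412, total_requests=10_000_000)
print(f"availability={weekly:.5f}, meets target: {weekly >= TARGET_AVAILABILITY}")
```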
What is an operational review?
An operational review (ops review, for short) is any process put into place to periodically review the health of the systems operated by a team. To keep things simple, I will scope this discussion to teams that operate software services and other computer infrastructure.
My journey through operational review cultures
Amazon Marketplace
I joined Amazon Marketplace (part of the retail side of Amazon) in 2010, fresh out of college. In my early career at Amazon, the organization I worked in had a very minimal process for oversight of operations. Teams had metrics, dashboards, and their own monitoring, but there was no concept of an SLA on these services (from the perspective of ICs at least). EMs would be asked to discuss outages or unexpected behavior during internal operational reviews, but ICs did not take part in this process beyond preparing notes, at the end of their on-call rotations, on issues that occurred during their shifts.
This lax approach could be attributed to the asynchronous nature of the systems my teams managed. The systems we owned either worked on asynchronous tasks as part of order fulfillment workflows, or provided information for merchant-facing dashboards and APIs, which were relatively low throughput. High latency and blips of unavailability could be tolerated as long as things eventually worked.
Operational Reviews at Amazon Web Services
In 2015, I moved to Amazon Web Services, where I would work for 5 years as a software developer on the Key Management Service. AWS was a completely different beast. AWS consists of customer-facing functionality and APIs. Large enterprise customers build their businesses on these APIs and services. As a result, some customers have very high expectations. Even for services which do not have an externally published SLA, teams are expected to set an internal SLA (i.e. SLC).
In general, most teams would review their core metrics (those most representative of customer experience) weekly during their on-call handoff, including any metrics they had an SLA on. My team in particular would review our core metrics, and then also randomly select another dashboard to audit, both for clarity and for issues illustrated by the metrics themselves. At a minimum, this meeting was attended by the outgoing and incoming on-calls and the EMs, but it was also well attended by other ICs.
At the director level there would typically be another organizational ops review. This meeting would involve a review of issues and major incidents affecting the org and would involve a deep dive into one of the products under that director each week. ICs were generally not expected at this meeting unless speaking to a specific issue.
The cornerstone of AWS operational review culture was the company-wide Wednesday ops review meeting. The basic agenda of this meeting was to celebrate operational wins, review upcoming changes folks needed to be aware of (code freezes, migrations, etc.), dive deep into critical incidents, and lastly dive deep into a randomly selected service’s dashboard.
This meeting was open to all in the company, and was regularly attended by senior leadership and the Principal Engineer community. Every team was expected to send a representative (whether this was an EM or IC was discretionary, as long as they were able to speak for the team and could present the team’s metrics if chosen).
Having had opportunities to present both incident reports and my team’s metrics at this meeting, I can attest that this was both an incredibly intense and rewarding experience. Intense, because you’re speaking to (and questioned by) some of the most senior engineers and leaders at the company. Rewarding, because the culture was blameless, and because of how clear it was in those moments how ruthlessly the business cared about getting the fundamentals right.
Stripe
In 2020, I moved to Stripe to work on systems related to the secure storage and processing of credit card data.
When I first joined Stripe, there was no operational review process visible to ICs. There were no SLAs at the service level. There was a healthy culture of dashboards and, at least within my organization, detectors that alerted us when things were amiss. We had an on-call handoff meeting, but it generally did not involve a retrospective of service health over the last week, aside from pages or incidents.
As Stripe grew, there was an increased focus on operational reviews. We started defining SLCs for our services and tracked whether we were meeting them. A number of other dashboards were created to help teams understand their posture with respect to service health and to Stripe-wide initiatives that expected action from teams.
In 2021, a weekly meeting was created within my organization to review these dashboards and commitments with EMs and some ICs. EM attendance was expected, and ICs were invited but not required to attend. The same meeting was also used to dive deep into incident reports for the organization, and generally served as the “review” expected of an incident, unless a higher level of review was deemed necessary due to the severity of the impact. The meeting was facilitated by volunteers (usually EMs).
Taking the reins
While I was pleased to see Stripe mature to the point that we were regularly tracking our service health, there were several areas where I felt the process fell short:
The process was overly focused on the previous week and was not set up to reveal long-term trends or potentially systemic issues.
The meeting didn’t have a clear agenda, and the audience was not well defined. Often the meeting did not have the right attendees to discuss more severe issues.
Folks in attendance from various teams were often not prepared to discuss operational incidents or abnormalities, and there was no mechanism to follow up after the meeting.
The meeting did not provide a forum for on-call engineers to share their perspective or concerns about system health.
In early 2022, my organization went through a restructuring and the result for me was a new leadership chain. As a part of this restructuring, we also lost the management sponsor for our operational review process. This presented an excellent opportunity to start fresh.
With full support from my EM and MoM chain, I took on the responsibility of rewriting our operational review process.
I started by interviewing folks who had expressed an interest in sharing an opinion on our ops review process, whether positive or negative. In these meetings, my goal was primarily to understand what folks hoped to get out of an ops review (that is, what would make it useful for them), and what folks liked and didn’t like about the previous format.
With this information in hand, as well as my own experience, I drafted a document which described the shortcomings of the previous format, our shared goals for the new format, and a clear statement of purpose. I ensured we all broadly agreed on these points. Next, using these previous points as an input, I drafted an outline for what the meeting would look like. After reviewing the planned format again with the critical stakeholders, and anyone who had a strong opinion, it was time to dive in and try it out. We’ve been iterating and making small improvements ever since then.
I will spend the remainder of this essay discussing what we developed.
Shared Meaning
The first challenge was developing a shared understanding of why we had this process:
Why are we having this meeting? Why should folks be here rather than doing other productive things? Why do we care about ops?
At AWS, where I walked into a functioning operational review process, this step was not necessary. I believe a meeting that folks find useful can be a justification in and of itself, even if you cannot put it into words. That meaning might even be partly symbolic. I think this was sometimes the case at AWS. The size and attendance of the Wednesday ops review meeting broadcast a very clear message: “operations are important here. Our customers care. We care. We will make sure you understand the health of your systems. If something goes wrong, we will be working with you to figure out a solution.”
That was a message that came from the top. With that signal, local ops reviews at different levels tended to fall into place.
At Stripe, that shared culture did not exist to the same degree.
And that meant defining a purpose for ourselves. The challenge was to craft a statement of purpose that resonated with ICs (whose participation we wanted to encourage) but was also pragmatic, reflecting the needs of EMs and MoMs to have a clear picture of how we’re doing so they can both communicate to their leaders and prioritize work.
The “purpose” was thus synthesized from input from both the EM and IC interviews.
Here are the goals we came up with for the meeting:
Verify our systems are behaving as expected and understand the ways in which we are not.
Review the health of the people-driven processes that support the operations of our services (e.g. pages, on-call workloads) and ensure workloads are sustainable.
Verify that for recent incidents we’ve learned the correct lessons, correctly assessed impact, and the appropriate artifacts have been generated.
Understand our long-term trends and ensure our services will deliver the expected functionality over the next 12 months.
When a more succinct summary is called for, I say the purpose of the meeting is “situational awareness.” Having a weekly meeting where we come up for air and confirm to ourselves that everything looks like it is running well allows us to be more heads-down and productive the rest of the week.
Facilitation
Even with an agreed-upon purpose in place, we decided a facilitator was useful for consistency.
The facilitator of this meeting is responsible for making sure:
The meeting is useful.
The time is well spent.
Ensuring the meeting is useful and efficiently run helps drive continued participation.
This, of course, involves the usual duties of time management, making sure notes are recorded, agendas updated, and so on.
To ensure the content is useful and the time is well spent, the facilitator is also responsible for following up with the owners of content in the meeting to make sure they are prepared. Folks should not be scrambling to figure out what they need to say, nor should they be surprised by what comes up during the meeting. This also means letting presenters know if we’re going to ask about an incident, for example.
During the meeting, the facilitator runs the presentation (sharing the visuals for most of it), cues others when it is their turn, and manages time.
We currently rotate facilitator duties within a group of interested volunteers. The goal is to balance redundancy (so that there is always a facilitator available) with making sure each facilitator has the experience and interest to fulfill the responsibilities.
Facilitator as Auditor
There is an additional role that the facilitator must play that is important and unique enough to warrant its own section.
The facilitator is responsible for ensuring we’re holding ourselves to a high bar for clarity and understanding.
Other contributors are responsible for reporting on their systems, incidents, etc., but the facilitator is responsible for ensuring that the information presented makes sense and is understood by everyone, and for asking the questions that other folks might not be asking.
Unfortunately, this is the most difficult aspect of the role to teach. In general, my advice to prospective facilitators is to trust your gut: if something doesn’t make sense, or you have a question, ask it. Don’t assume that everyone besides you understands it; be the backstop to misunderstanding.
If a question is important to our understanding of our systems or an incident, but no one is able to answer it during the meeting, ensure it is assigned as an action item at the end of the meeting, and make sure it is followed up on.
Structure and Format
Attendance
We request that all EMs and the organization’s leadership (including any technical leads) attend, if at all possible. Aside from that, each team is expected to have folks present who can speak to the team’s operations for the last week.
The attendee or attendees for each team are responsible for preparing slide content assigned to their team.
Anyone else interested is welcome to join.
Contents and Structure
The basic structure of the ops review format we’ve developed has four sections.
Each section has a clear owner, and the owner presents.
Across the four sections, there is a common goal of presenting information that shows deviations from the baseline. This is in line with making sure the time is well spent and the content is useful. If there is nothing to report, the meeting progresses quickly and we spend time on other sections (or end early).
Team/Service level metrics review
Each team is responsible for discussing anomalies in how their services usually operate. The goal is to give each team space to speak to how the last week went for them, but to use the time efficiently by only speaking to deviations from the norm. We ask each team to independently review their operational metrics (defining operational metrics could be its own document), any incidents that they were involved in (which does not necessarily mean caused), and lastly any FYIs they have for other teams in the organization.
In our case, we also selected some overall organization metrics and assigned responsibility to the Tech Lead for our organization to gather and present these each week.
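As one illustration of what “a deviation from the norm” can mean in practice, here is a minimal sketch that flags a weekly metric when it drifts well away from its trailing baseline. The threshold, data shape, and example values are assumptions for illustration only; in reality teams reviewed their dashboards by eye rather than with any particular script.

```python
from statistics import mean, stdev

def deviates_from_baseline(history: list[float], current: float,
                           threshold: float = 3.0) -> bool:
    """Return True if `current` is more than `threshold` standard deviations
    away from the trailing baseline formed by `history` (recent weekly values)."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    baseline, spread = mean(history), stdev(history)
    if spread == 0:
        return current != baseline
    return abs(current - baseline) / spread > threshold

# Hypothetical weekly p99 latencies (ms) for a service, plus this week's value.
print(deviates_from_baseline([120, 118, 125, 122, 119], current=180))  # True: worth a slide
```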
Projects and Migrations
Any team doing work that impacts teams in our organization, or aware of such work being done by teams external to the org, can use this space to share updates.
In our case, we use this most frequently to inform teams about datacenter work. We also notify folks of database migrations, or other changes to how services operate.
If any systems owned by the organization have been flagged (through an incident or otherwise) as an area of risk or concern, we ensure we reserve time to review trends for that specific system as well.
This information is presented by the teams involved or most adjacent to the work.
On-Call/Runner Metrics
We use this section to review the health and sustainability of our on-call.
Key metrics are the number of pages and tickets. We also ask outgoing on-calls to submit subjective scores for their run experience. This provides EMs and our MoM visibility into the sustainability of the human side of our operations, and helps ensure on-calls are not at risk of being overwhelmed and burned out.
As a practical matter, this section is prepared by the facilitator from the available data.
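As a sketch of what that preparation might look like, the following aggregates page counts and subjective run scores per team; the data shapes and field names are hypothetical, since the real data comes from our internal paging and survey tooling.

```python
from collections import Counter

# Hypothetical export of last week's pages from a paging tool.
pages = [
    {"team": "storage", "urgency": "high"},
    {"team": "storage", "urgency": "low"},
    {"team": "tokenization", "urgency": "high"},
]
# Hypothetical 1-5 subjective scores submitted by outgoing on-calls.
run_scores = {"storage": 4, "tokenization": 2}

pages_per_team = Counter(page["team"] for page in pages)
for team, count in sorted(pages_per_team.items()):
    score = run_scores.get(team, "no score submitted")
    print(f"{team}: {count} page(s), on-call experience score: {score}")
```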
Incidents
For each incident where our organization was implicated, the owner of the incident report prepares a summary of the incident. We include a summary of the event, the proximate causes of the event, and action items.
We do not do a deep dive into the incident, as we found that it is better to have a dedicated meeting for those purposes. That way, the incident review can have the appropriate invitee list, which may be different from what is appropriate for the ops review.
Format and Implementation
When we originally started running this meeting, participants were very strongly in favor of using a Google Doc template for each meeting, with the thought that this would be easiest to fill in.
After a couple of weeks, we decided as a group to move to a slide deck format. Using a presentation format made it easier to divide up the content and assign ownership per slide. The format also encouraged brevity. Lastly, it was easier for the presenter to share during meetings.
We maintain a slide deck that is cloned for each week’s meeting. The template is updated as we make improvements to the format. Speaker notes are used in the template to provide instructions for where to pull data from and suggestions for how to use each slide.
The slide deck and a video conference recording for each meeting are retained and linked to from a Confluence page.
Before the Meeting
The meeting takes place weekly, and covers events from the previous week.
On Thursday before the meeting, an automated Slack reminder goes out to the facilitator to remind them to clone the presentation template, fill in dates, and update links. This ensures it is available for folks to start filling out.
Before the meeting, an automated Slack reminder goes out to on-calls to remind them to fill out the deck. The facilitator is also responsible for identifying if there is an incident that needs to be discussed this week and reminding the responsible team.
The facilitator is responsible for gathering data for some common aspects, such as the on-call metrics.
Lastly, before the meeting, the facilitator is responsible for nudging any teams which have still not provided content for their sections.
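For anyone curious what this kind of automation can look like, here is a minimal sketch that schedules the facilitator reminder using Slack’s scheduled-message API via the slack_sdk library; the token variable, channel name, and timing are assumptions, and our actual reminders are configured through internal tooling rather than this exact code.

```python
import os
from datetime import datetime, timedelta, timezone

from slack_sdk import WebClient  # pip install slack_sdk

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])  # hypothetical bot token

# Schedule a reminder for (roughly) the Thursday before the meeting.
post_at = datetime.now(timezone.utc) + timedelta(days=3)  # placeholder scheduling logic
client.chat_scheduleMessage(
    channel="#ops-review",  # hypothetical channel
    text="Reminder: clone the ops review template, fill in the dates, and update the links.",
    post_at=int(post_at.timestamp()),
)
```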
Running the Meeting
At this point, running the meeting is fairly straightforward.
The slide deck acts as the agenda, and the meeting simply proceeds slide by slide, with the owner of each slide presenting their content when it is their turn.
The facilitator should ensure the meeting runs on time and politely cut off discussions that would prevent getting through the content.
Lastly, the facilitator should ask questions themselves (preferably after others have had a chance to ask) to ensure all information presented is clear.
Afterwards
The facilitator is responsible for sending an email with a link to the recording and slide deck. The email should also clearly summarize any action items that were assigned during the meeting, or actions that need to be taken by folks.
The same should also be recorded on any wiki pages for future reference.
Measuring Success
Given the amount of work that had gone into planning this format, there was a strong desire by our organization leaders to understand the impact this new structure was having and the sentiments of participants.
We initially attempted to do this by sending out small surveys to participants after each meeting. Even though we kept the surveys short (2-3 multiple choice questions), I observed that people quickly got burned out filling one out each week, and the data became sparse.
I found it was more useful to ask folks during our regular one on one meetings for feedback. This also provides the opportunity for folks to share more nuanced sentiments that might be lost in a multiple choice survey.
To this day, we continue to iterate on the format, how we collect and present data, and what items go on the agenda. The greatest signal I have that I’ve built something valued and enduring is that I am not the only contributor. Other folks in the organization bring their contributions to the format, and I overhear them evangelizing it to others.
I have been especially pleased to see the meeting continue to run itself while I am out on parental leave, and I am excited to jump back in later this year and see how things have grown and the improvements folks have made.
[1] I have a lot of thoughts on the use of the term “engineer” in my field. I (and most of my coworkers) do not have proper engineering degrees. I usually identify myself as a “software developer” unless the title “engineer” is explicitly forced on me.