Disclaimer: my opinions are informed by my time at Stripe and AWS, but my thoughts are my own and not necessarily shared by my current or former employers.
Consuming a large amount of information about how systems fail is one of the greatest privileges of working at Stripe (and formerly at Amazon Web Services). “The greatest teacher, failure is,” as Yoda says. Both my current and former employer hold an incredibly high bar for how they operate, and as a result, there is a great amount of information available on even the smallest failures: the little failures that happen every day but generally go unnoticed by our users.
This year, I set a goal for myself to sit in as many of these incident discussions as possible to try to extract some common threads to share back to the organization. As I head out on the second part of my parental leave, I wanted to pause to share some of the observations I’ve made so far that are more generalizable and appropriate to share externally.
The need for testing was a common theme, so I will devote this first post entirely to testing. Part 2 will cover the remaining learnings.
Testing a change to make sure it will work as expected is evergreen advice, and the impact of inadequate testing should not be surprising. During this most recent period in my organization, a low-double-digit percentage of incidents could be attributed at least in part to a lack of testing. I will highlight some of the specific types of issues I have observed.
Testing in the code base
The simplest form of testing failure pertains to the lack of tests in the first place. A service may have no tests at all, its tests may cover only a very limited number of code paths, or they may provide only minimal assertions about the correctness of the code.
What can be done?
“Write more tests” is an intuitive response to these issues. But it might also be the laziest possible remediation item, because it ignores why the tests are lacking in the first place. If engineering teams are not testing properly, there is likely an underlying cause, such as the friction involved in adding test cases, or the low perceived or actual utility of those tests.
For example, the following may add friction to proper testing:
There are no frameworks for running tests at the appropriate depth (if a pattern is established for unit tests, it may not be easily extendable to functional/integration1 tests).
There are so few existing tests that engineers feel it is an unacceptable cost to create them as part of their planned work.
The testing framework and overall testing strategy are not well documented.
There may also be deficiencies in the utility of tests in the code base. This may be because:
Tests exist, but their assertions are superficial: they may validate that the code executes, while lacking the critical assertions that would validate its behavior (see the sketch after this list).
Useful mocks for dependencies do not exist, making it difficult to have confidence in the correct behavior of code which interfaces with dependencies.
Failing tests are not used to gate merges or deploys.
Test failures do not correlate well with actual issues in the code (see the section on Testing observability below).
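To make the “superficial assertions” point concrete, here is a minimal pytest-style sketch. The charge_customer function and its payments client are hypothetical; the point is the difference between a test that merely proves the code runs and one that asserts on both the result and how a mocked dependency was called.

```python
# Hypothetical function under test and a mocked dependency, for illustration only.
from unittest.mock import Mock

def charge_customer(client, customer_id, amount_cents):
    """Charges a customer via a payments client and returns the charge id."""
    response = client.create_charge(customer_id=customer_id, amount=amount_cents)
    return response["id"]

def test_charge_customer_superficial():
    # Superficial: only proves the code runs without raising.
    client = Mock()
    client.create_charge.return_value = {"id": "ch_123"}
    charge_customer(client, "cus_42", 1000)

def test_charge_customer_meaningful():
    # Meaningful: asserts on the result and on how the dependency was called.
    client = Mock()
    client.create_charge.return_value = {"id": "ch_123"}

    charge_id = charge_customer(client, "cus_42", 1000)

    assert charge_id == "ch_123"
    client.create_charge.assert_called_once_with(customer_id="cus_42", amount=1000)
```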
Fixing testing deficiencies is not a trivial matter. For testing infrastructure to be useful (and used) it must be (and be recognized as) a tool that makes a task simpler.2 It requires investment in the same way we invest in deploy and orchestration tooling (and quite possibly much more investment). This is all the more difficult in environments, such as Stripe’s, that require synthetic or “mock” third-party components to test against, and even more so if it is unclear who owns creating those mocks.3
A few approaches I have seen be useful in improving test coverage over time:
Standardizing the testing frameworks and patterns used within a code base to reduce the cognitive load of writing new tests.
Requiring minimal test coverage for a change before merging code (one way to enforce this is sketched below).
Building tools which allow engineers to visually inspect test coverage for the code they’re working with. (This makes it more obvious where coverage is missing.)
I would avoid anything which makes individual engineers feel responsible for the overall code coverage for a service. This is a valuable metric for leadership to track, but is difficult for engineers to address while delivering other work. It is much more effective to show opportunities for improvement in the context of existing work.
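As a concrete illustration of the coverage-gate idea above, here is a hedged sketch of a CI step using Python’s coverage.py. It assumes coverage data has already been collected for the change (for example via `coverage run -m pytest`); the file list and threshold are illustrative, and in practice the file list would come from the diff.

```python
# Hedged sketch: fail the build if coverage on the files touched by a change
# falls below a minimum threshold. Not a complete CI integration.
import sys
import coverage

CHANGED_FILES = ["payments/charge.py"]  # illustrative; normally derived from the diff
MIN_COVERAGE = 80.0                     # illustrative threshold

def main():
    cov = coverage.Coverage()
    cov.load()  # reads the .coverage data file produced by the test run
    # report() prints a per-file report and returns the combined percentage
    percent = cov.report(include=CHANGED_FILES, show_missing=True)
    if percent < MIN_COVERAGE:
        print(f"Coverage {percent:.1f}% is below the required {MIN_COVERAGE:.0f}%")
        sys.exit(1)

if __name__ == "__main__":
    main()
```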
Testing at scale
Beyond basic functional testing, testing at scale is an additional challenge. It is particularly problematic for infrastructure teams and services that maintain stateful components.
Periodic load testing where a call path is pushed to its limits is a critical piece of engineering for reliability. However, this can be very challenging to do exhaustively. Call patterns in production may not be easy to replicate.4 Load can also come from surprising places, especially for infrastructure teams. For example, an infrastructure team might face a thundering herd against its orchestration systems when another team attempts to scale up to mitigate its own issues.
What can be done?
Even though load testing is not a panacea, it still plays a valuable role in the overall operational readiness of a service. It is all the more effective when paired with a diligent analysis of real-life use cases and pathologically bad, worst-case scenarios. That is to say, testing is not as simple as pointing a load test at an API and writing down the Requests Per Second (RPS). A thorough load test should take into account:
Current, actual customer behavior, so that the load test will accurately replicate the type of load currently placed on the system.
Worst-case use cases which are not commonly seen, but would stress the system. It is important to understand where rate limits and load shedding may be needed to protect the service.
Future business needs, in case the current customer behavior is subject to change.
Whether real dependencies of the service will be used in the test (and if not, an honest description of the limited usefulness of the result).
How components behave differently under synthetic traffic versus real-life usage patterns. For example, caching (whether used explicitly by the service or implicitly by dependencies such as databases) may behave very differently depending on how synthetic traffic is constructed.
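To illustrate that caching point, here is a small, self-contained simulation; the key space, cache size, and distribution are all made up. A load test that hammers a single key will see a near-perfect cache hit rate, while even a modestly varied key distribution behaves very differently.

```python
# Illustrative simulation of how synthetic traffic shape changes cache behavior.
import random

def simulate_cache_hit_rate(keys, cache_size=1000):
    cache, hits = set(), 0
    for key in keys:
        if key in cache:
            hits += 1
        elif len(cache) < cache_size:
            cache.add(key)  # simple fill-only cache, no eviction
    return hits / len(keys)

N = 100_000
single_key = ["customer_1"] * N  # naive synthetic traffic: one hot key
varied_keys = [f"customer_{random.randint(1, 50_000)}" for _ in range(N)]  # spread across many keys

print(f"single key hit rate:  {simulate_cache_hit_rate(single_key):.1%}")
print(f"varied keys hit rate: {simulate_cache_hit_rate(varied_keys):.1%}")
```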
A thorough load test should report on:
The maximum throughput a service can handle while maintaining latency and availability targets.
The component or system which limits performance, so that it is clear where future investments would need to be directed in order to improve performance.
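As a sketch of what that reporting might look like, here is a deliberately simple asyncio/aiohttp load driver. The endpoint, concurrency, and request count are illustrative, and it makes no attempt to replicate realistic call patterns; it only shows how throughput, latency percentiles, and error rate could be captured and reported.

```python
# Hedged sketch of a load driver that reports throughput, latency, and errors.
import asyncio
import time

import aiohttp

TARGET_URL = "https://service.internal.example/health"  # illustrative endpoint
CONCURRENCY = 50
TOTAL_REQUESTS = 5000

async def one_request(session, latencies, errors):
    start = time.monotonic()
    try:
        async with session.get(TARGET_URL) as resp:
            await resp.read()
            if resp.status >= 500:
                errors.append(resp.status)
    except aiohttp.ClientError:
        errors.append("exception")
    latencies.append(time.monotonic() - start)

async def run():
    latencies, errors = [], []
    sem = asyncio.Semaphore(CONCURRENCY)

    async def bounded(session):
        async with sem:
            await one_request(session, latencies, errors)

    async with aiohttp.ClientSession() as session:
        started = time.monotonic()
        await asyncio.gather(*(bounded(session) for _ in range(TOTAL_REQUESTS)))
        elapsed = time.monotonic() - started

    latencies.sort()
    p50 = latencies[len(latencies) // 2]
    p99 = latencies[int(len(latencies) * 0.99)]
    print(f"throughput: {TOTAL_REQUESTS / elapsed:.1f} rps")
    print(f"p50 latency: {p50 * 1000:.1f} ms, p99 latency: {p99 * 1000:.1f} ms")
    print(f"error rate: {len(errors) / TOTAL_REQUESTS:.2%}")

if __name__ == "__main__":
    asyncio.run(run())
```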
Lastly, teams should take action on the outputs from their load tests:
Teams should create alerts to notify them when the current load approaches the theoretical maximums. The greater the uncertainty about the theoretical maximum, or the greater the difficulty in raising the maximum, the lower the alerting threshold should be. Reviewing these metrics on a weekly basis can also provide further advance notice of potential scaling issues.
If appropriate, service owners should set rate limiters to protect services based on the known maximum supportable throughput (a sketch follows after this list).
Load tests should be performed on a regular cadence throughout the year (especially in anticipation of periods of high load), as well as after architectural changes.
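As one illustration of turning load-test output into protection, here is a minimal token-bucket sketch sized from a made-up measured maximum throughput. The numbers and the headroom factor are assumptions, not recommendations.

```python
# Minimal token-bucket rate limiter sketch; values are illustrative.
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec          # steady-state refill rate
        self.capacity = burst             # maximum burst size
        self.tokens = burst
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens based on elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Suppose a load test found the service degrades above ~1,200 RPS.
# Limiting below that maximum leaves headroom for measurement error.
MEASURED_MAX_RPS = 1200
limiter = TokenBucket(rate_per_sec=MEASURED_MAX_RPS * 0.8, burst=200)

def handle_request(request):
    if not limiter.allow():
        return "429 Too Many Requests"
    return "200 OK"
```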
Test in the right environments
Where to test can pose an interesting challenge. Commonly, a shared QA/Beta stage is established in which everyone is supposed to deploy code and validate changes before deploying to production. The result is an environment that is frequently broken, because it contains changes under test that do not yet work. This can make it difficult for an engineer to be confident about any single change. Breakages can become so frequent that engineering teams become accustomed to ignoring signals from errors in this environment.
This can be especially problematic for infrastructure teams. Other engineers rely on stable deployment tooling to be productive with their own tests in QA, and they may not appreciate being beta testers for new versions of tools that impact their productivity.
What can be done?
I have seen two solutions in practice.
The first is to have a series of testing environments, where the quality bar ratchets up as you go along. While the first environment in the pipeline may be a free-for-all, the final environment before production is expected to be more stable and provide a high signal-to-noise ratio. At AWS I saw as many as four stages before production.
The other approach I have seen (sometimes used in conjunction with the first) is to have separate testing environments per team or organization. An infrastructure team may have a QA environment, while other teams’ QA environments would use the production infrastructure and dependencies. This isolates teams from each other’s testing.5
In cases where it is not appropriate to use a production dependency for a test environment, consider a post-production environment (deployed to after production in the deployment pipelines) which exists for teams to test their candidate code against.
Another option (which I have not seen completely implemented but have always wanted to) is to provide engineers a way to spin up bespoke testing environments which are effectively a clone of the production state of the world with just the component to be tested replaced with a candidate.
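A hedged sketch of that last idea, with entirely hypothetical service names: clone the production service map and swap in a single candidate endpoint.

```python
# Hypothetical "clone production, swap one component" sketch.
import copy

PRODUCTION_SERVICES = {
    "api-gateway": "https://gateway.prod.internal",
    "payments": "https://payments.prod.internal",
    "ledger": "https://ledger.prod.internal",
}

def build_test_environment(candidate_service: str, candidate_endpoint: str) -> dict:
    """Return a service map identical to production except for one candidate."""
    env = copy.deepcopy(PRODUCTION_SERVICES)
    if candidate_service not in env:
        raise ValueError(f"unknown service: {candidate_service}")
    env[candidate_service] = candidate_endpoint
    return env

# Example: exercise a candidate build of the payments service against an
# otherwise production-shaped set of dependencies.
test_env = build_test_environment("payments", "https://payments.candidate.test.internal")
```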
Testing observability
Lastly, for testing to be valuable, there must be confidence in the signal provided by a testing environment. Unit and integration tests must not be so flaky that engineers become used to ignoring failures. If a pre-production or staging environment is used, alerts must be in place to notice that a potentially breaking change has been introduced before it reaches production. And those alerts must again have a high signal-to-noise ratio, so that the warnings they provide are not ignored.
What can be done?
I will say more in Part 2 about avoiding the normalization of deviance. For now, suffice it to say that any system used to gate code being deployed to production must be treated as production in terms of quality. This means having alerting similar to production’s, and an urgency to fix issues that is second only to an actual production issue.
Engineers must also be given adequate alternatives for testing, so that the final validation stage is not the first time their code is exercised. Otherwise, the constant need for urgent fixes to address recurring breakages will not be sustainable.
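One small, hedged example of keeping test signal trustworthy: track per-test results across runs and flag any test that has both passed and failed on the same commit, which is a simple working definition of flakiness. The data shapes here are illustrative.

```python
# Illustrative flakiness detector: a test is flaky if it produced both a pass
# and a fail on the same commit.
from collections import defaultdict

def find_flaky_tests(results):
    """results: iterable of (test_name, commit_sha, passed) tuples."""
    outcomes = defaultdict(set)
    for test_name, commit_sha, passed in results:
        outcomes[(test_name, commit_sha)].add(passed)
    # A test is flagged if any single commit produced both outcomes.
    return sorted({test for (test, _), seen in outcomes.items() if len(seen) == 2})

# Example: test_checkout passed and failed on the same commit, so it is flagged.
runs = [
    ("test_checkout", "abc123", True),
    ("test_checkout", "abc123", False),
    ("test_refund", "abc123", True),
]
print(find_flaky_tests(runs))  # ['test_checkout']
```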
Conclusion
This concludes my thoughts and observations on testing. While my recent observation of Stripe infrastructure incidents provided the inspiration for writing this essay, the experience I draw from is much greater and indeed encompasses my software development career to date.
Please join me for Part 2, where I will cover other common contributors to incidents, as well as some other recommended best practices.
Thank you to Yanjie Niu for editing this post and providing thoughtful feedback.
I am deliberately avoiding spending too much time on testing nomenclature because it appears to vary quite a bit between companies. For example, what were called Integration tests at Amazon are called Functional tests at Stripe (at least on the teams I have worked on), and Integration tests at Stripe are a different thing.
In Disney’s 1998 Mulan film, the army conscripts are asked to climb a pole to retrieve an arrow from the top. However, they are given two weights, representing discipline and strength. Many attempt to climb to the top while carrying the weights, but it is Mulan who realizes that these are not burdens but essential tools for completing the task. (Strange, given how on-the-nose the metaphor is.) In software engineering, our weights are testing and observability. You can either reluctantly drag them along, or use them as a powerful tool.
In my previous role at AWS, working on the Key Management Service, testing was a relatively simple prospect. No matter how complicated the infrastructure became, the correct behavior of the system was simple to define: cryptographic functions have a deterministic result, and it is relatively trivial to verify the behavior of the service. (I am simplifying a lot here. There are a lot of aspects that were not quite as simple as verifying our AES GCM API did its AES GCM things. Authentication, for example, adds some complexity. However, for AWS, all of this is at least handled in-house and we have the information to write useful assertions.)
The payments space, by comparison, is much more complicated. A payment processor provides value by causing money to move in the real world, and that money movement is done by reaching out to a diverse population of financial third parties. Not only is this done through APIs for which we don’t have iron-clad specifications and for which no mocks or simulators exist, but there is often a non-trivial amount of third-party-specific infrastructure in the mix, for which it is not always possible to create a testing version.
To summarize a bit: it is much more difficult to test changes where your code is supposed to have external side-effects, especially when those side-effects are related to money.
And in the event of something actively malicious, such as a denial-of-service attack, the call pattern may be specifically crafted to be pathologically harmful to your service.
For example, in AWS, teams building services which depend on S3 rarely use a non-production S3 endpoint to test their changes.