This is a story about how I shipped some really cool stuff at Amazon Web Services, my approach to executing on complex projects, and my advice to others.
Before my current position at Stripe, I was a software developer at Amazon Web Services (AWS), working on the Key Management Service (KMS) from 2015 through 2019.
The AWS Key Management Service (KMS) is, as the name implies, a service for managing cryptographic keys in the cloud. In short, the service allows you to create keys, use them to encrypt and sign data through an API, and grant or restrict access to the keys using a few different access control mechanisms. AWS KMS is used by other AWS services to protect customer data, and some customers also build applications directly on top of the KMS APIs.
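As a rough illustration of what building directly on top of the KMS APIs looks like, here is a minimal sketch using the AWS SDK for Python (boto3). The region, key description, and plaintext are placeholder values of my choosing, not anything specific to the projects described below.

```python
import boto3

# Standard AWS credential/region resolution applies; us-east-1 is just an example.
kms = boto3.client("kms", region_name="us-east-1")

# Create a customer managed key.
key = kms.create_key(Description="example application data key")
key_id = key["KeyMetadata"]["KeyId"]

# Encrypt a small payload with that key, then decrypt it again.
# Every call is subject to the key's access control policy.
ciphertext = kms.encrypt(KeyId=key_id, Plaintext=b"example secret")["CiphertextBlob"]
plaintext = kms.decrypt(CiphertextBlob=ciphertext)["Plaintext"]
assert plaintext == b"example secret"
```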
Disclaimer: though informed by my experience as an employee of Amazon, my thoughts and opinions are my own and may not be shared by my former or current employer.
While working on this team, I had the privilege of leading the development of two high-impact products:
The Bring Your Own Key (BYOK) feature allowed customers to import key material into AWS KMS from their own, on-premise HSM infrastructure.1 This provided an option for customers that wanted to take responsibility for generating keys themselves and/or did not want their keys to exist only within AWS.
Custom Key Stores (CKS) allowed customers to create a KMS key that was backed by a key in a CloudHSM cluster they controlled.2 Similar to BYOK, CKS gave customers a way to use KMS while meeting any requirements they might have to keep key material in devices more similar to traditional on-premise HSMs.
Both features were key in enabling strategic customers to move to AWS. It was truly an honor to be a part of project work which grew the cloud business by such a significant amount.3
However, what really excited me about this work was how we executed. And that is what I want to write about today.
Both projects were performed under time pressure from the AWS business and customers that were eager to utilize these features. As the lead engineer, I was responsible for scoping the work involved and arguing for the time and resources we would need. In both cases, I was given both less time and fewer people than I originally thought were necessary.
Otherwise, where would be the fun in it?
(From this point on, I will jump between discussing the two different projects because there were a lot of similarities, and also because I can’t remember exactly which memory was from which effort.)
Planning
I can pinpoint when the project planning really kicked into high gear by looking at my Amazon order history for “red string.”
I exaggerate, but not that much.
My goal at the beginning of both projects was to get a bird's-eye view of the work that would need to be completed, the effort required for each piece of work, and the dependencies between them. Having this laid out at the beginning would give me an approximation of how long the project would take to complete. Understanding the dependencies between tasks also made it possible to understand the impact of headcount on the delivery of the project.
My notecard and red string approach went like this:
Write a note card with the end goal, such as deliver feature X.
Think about what composes that end goal. In the case of an AWS product, this might be “API for X”, “console website for X”, “documentation for X”, etc. Attach each of these with a red string under the main goal.
Repeat for each subgoal. For example, “control plane API”, “data plane API”, etc. If at any point there is a dependency between two tasks, connect them with a red string and place the task that must be completed first lower.
Repeat, decomposing tasks into smaller tasks (design API, review API with committee, etc.), until you’re down to tasks which can be completed by a single person in a couple of days, and they are not blocked by other dependencies.
The end result is effectively a dependency graph. For each task, I would add an approximate time estimate. The leaves (tasks with no other tasks connected below them) should be tasks that developers can pick up and get started with. The longest path (adding up the estimates on all the tasks) from a leaf to the root is the shortest theoretical length of time the project could take, assuming human resources were not a constraint.4
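To make the mechanics concrete, here is a small sketch of that longest-path computation. The task names and day estimates are made up for illustration; they are not the actual BYOK or CKS breakdown.

```python
from functools import lru_cache

# Hypothetical tasks: name -> (estimate in days, tasks it depends on).
tasks = {
    "design API": (3, []),
    "review API with committee": (2, ["design API"]),
    "control plane API": (5, ["review API with committee"]),
    "data plane API": (8, ["review API with committee"]),
    "console website": (5, ["control plane API"]),
    "documentation": (3, ["control plane API", "data plane API"]),
    "deliver feature X": (0, ["console website", "documentation", "data plane API"]),
}

@lru_cache(maxsize=None)
def earliest_finish(task: str) -> int:
    """Length of the longest estimate-weighted path from any leaf up to this task."""
    estimate, deps = tasks[task]
    return estimate + max((earliest_finish(dep) for dep in deps), default=0)

# The root's value is the shortest theoretical project duration,
# assuming people are never the bottleneck.
print(earliest_finish("deliver feature X"))  # 16 days in this toy example
```

The tasks along that longest path are the critical path: if any of them slips, the whole project slips.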
As the project progressed, it would be easy to keep an eye on the critical path, add tasks that were uncovered, and identify work that was ready for someone to pick up.
This is more or less how I coordinated the work for the BYOK feature.
The downside of this approach was that it relied on me being close to the board with all the cards and string in order to answer questions about the project. So for the CKS project work I acquired a Microsoft Project license5 and repeated the exercise, except with Gantt charts.
The advantage of using Microsoft Project was that the information was more portable. It was also in a format that other stakeholders, such as project managers, could consume and understand. The downside was that Microsoft Project asks a lot of questions and puts an over-emphasis (in my opinion) on calendar dates when laying this out.6
However, the start-with-the-goal-and-work-backwards approach is the one I continue to use today, both to understand the work a project will require and to provide accurate timelines.
Consensus building
Both the BYOK and CKS features involved expanding the capabilities of the AWS Key Management Service, and therefore were under a lot of scrutiny from the security-minded folks of AWS Cryptography (of which there are many). An important part of delivering both projects without unexpected delays was ensuring there was consensus within the principal engineering community insofar as the security of the system was concerned.
An objection late in the project cycle, even if it were eventually withdrawn, would risk disrupting project timelines. Therefore, early on I made sure to meet with certain key Principal Engineers, one on one, to hear out their concerns about the feature, understand their recommendations, and explain my thinking about how we’d ensure the security of our system would not be compromised.
As the project progressed, and certain aspects of the design were refined or clarified, I continued to check in to ensure there would be no surprise concerns or objections raised close to the deadline.
In all cases where concerns were raised by principal engineers (not necessarily about security), I prioritized responding in writing with the mitigations we had planned and our thinking on the topic. Whatever I wrote would also become a permanent appendix to the design document portfolio for the project, so that we would not have to rehash the same issues over and over.
What we needed and why we needed it
For both projects, there was time pressure to deliver artifacts sooner than (at least on paper) the requirements and resources would allow for.
This pressure came from two different sources:
External: we had made or wanted to make commitments to strategic customers about when a feature would be available in order to grow business and win contracts.
Internal: both projects were seen as risky and pushing the capabilities of our technology stack; there was a desire to know sooner than later if there would be unexpected difficulties.
The solution to satisfy both concerns was an incremental approach to delivery. I’ll speak more about reducing the risk of project execution for internal stakeholders momentarily, but I want to speak to it from an external, customer facing perspective first.
For both projects, we knew of specific customers that were particularly interested in the new features we were building. Indeed, it was critical that the product we built would meet the requirements of these particular customers. However, large customers will rarely adopt a new AWS product into their production environment on the day they are given access. Just as we need time to build something, customers need time to validate that it meets their requirements and to understand how they will integrate it. This can take months (or even years for larger enterprises).
With this understanding, we prioritized the customer-facing aspects of the project work that would best allow us to collect feedback from customers. We built APIs with the backends stubbed out, as well as documentation, giving strategic customers the opportunity to start playing around with our product in a non-production environment, even though it was not yet fully functional.
Just as importantly, this approach provided us an opportunity to change direction earlier in the product life cycle in a way that would be much more difficult if the backend functionality had already been built out.
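As a hedged sketch of what “APIs with the backends stubbed out” can look like in practice (this is a hypothetical handler of my own construction, not the actual KMS code): validate requests exactly as the finished API would, but return a canned response until the real backend exists.

```python
# Hypothetical stubbed handler: real request validation, canned response.
BACKEND_READY = False  # flipped once the real backend is wired in

def create_custom_key_store(request: dict) -> dict:
    # Validate as the finished API would, so early customers see realistic
    # error behavior while experimenting in a non-production environment.
    if not request.get("CustomKeyStoreName"):
        raise ValueError("ValidationException: CustomKeyStoreName is required")

    if not BACKEND_READY:
        # Documented response shape, placeholder contents.
        return {"CustomKeyStoreId": "cks-00000000000000000"}

    raise NotImplementedError("real backend integration goes here")
```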
Validating early
“Build a skateboard before a car.”
I’ve heard this advice many times, but it can be difficult to apply. After all, a car is very little like a skateboard and the use cases are generally pretty different for each.
Reframed, the advice I would give to other engineers is to do the riskiest work up front. If you’re not sure you know how to make wheels that spin on an axle, make sure you can do that on a skateboard before you go about building the entire car.
In the realm of building novel key management products, this has most often meant validating early on that the new infrastructure can provide the required throughput (operations per second) at the desired latency (time to complete each operation). More generally, it means first doing the parts of the work that are least like what you’ve done before.
Ideally, after each project milestone, the behavior of the entire system should be revalidated. For example, if raw benchmarks against the hardware were satisfactory, does it still work when we add network hops? How about after we add authentication and authorization checks? How about after logging?
If there is a regression, it is useful to be able to narrow down where it happened. The worst case would be getting to the end and finding that none of the performance requirements are met.
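Here is a minimal sketch of that kind of layered re-validation, with sleep-based stand-ins (my own invention) in place of the real signing paths:

```python
import time
import statistics

def measure_latency_ms(operation, iterations: int = 200) -> float:
    """Median latency of `operation` in milliseconds."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)

def fake_operation(base_latency_s: float):
    """Stand-in for a real call; each layer would add a network hop, auth, logging, ..."""
    return lambda: time.sleep(base_latency_s)

# Re-run the same benchmark after every milestone; if a layer regresses,
# it shows up immediately instead of at the end of the project.
layers = [
    ("raw hardware", fake_operation(0.001)),
    ("+ network hop", fake_operation(0.002)),
    ("+ authn/authz", fake_operation(0.0025)),
    ("+ logging", fake_operation(0.003)),
]
for name, op in layers:
    print(f"{name:15s} {measure_latency_ms(op):.2f} ms median")
```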
Prioritizing for iterative development
Validating often is easier when validating is easy.
The Custom Key Store (CKS) project in particular required a lot of new infrastructure components to be built. Each performance and functional test required a lot of up-front orchestration work. For this reason, we spent a large portion of our early time building the test harness that would allow this testing to be performed automatically.
It was a risk to spend so much time on automated testing before the product was even at a minimum level of functionality, but the investment paid off greatly. Every change made by any developer on the project was automatically tested for functional regressions. This saved weeks of developer time over the course of the project on testing alone, and once the testing infrastructure was there, allowed people to proceed with a much higher level of confidence.
Similar to automated testing, ensuring the continuous integration and deployment infrastructure is working early on also allows a project to proceed at a faster pace.
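As an illustrative sketch (a pytest-style structure I am inventing here, not the project's actual harness or pipeline): make the expensive orchestration a shared, automated fixture, so that CI can run the full functional suite on every change without anyone setting it up by hand.

```python
import pytest

@pytest.fixture(scope="session")
def test_cluster():
    """Provision the test infrastructure once per run and tear it down after.

    In the real project this was the expensive orchestration step (clusters,
    networking, credentials); here it is only a placeholder dictionary.
    """
    cluster = {"endpoint": "https://example.invalid", "ready": True}
    yield cluster
    cluster["ready"] = False  # tear-down

def test_encrypt_decrypt_roundtrip(test_cluster):
    # Functional regression test that runs automatically on every change.
    assert test_cluster["ready"]
```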
After launch
Because of the constraints we were often working under, a product seldom shipped to customers with everything we initially wanted. Often features were excluded from the API or the web console, or there were issues on the backend that impacted operations for a short period.7
After completing both projects, I took some time to compile a roadmap document. This document outlined the known deficiencies we’d need to address, work we might need to do to scale up in the future, and opportunities for growth later on.
Writing this down was valuable for two reasons. It closed the loop on the project with my leadership, providing a sort of bookend to the work while still communicating that there was work to be done in the next planning cycle. And because it was written, it also allowed other engineers on the project to add their feedback and point out anything I missed.
Wrap up and takeaways
I would like to conclude by drawing attention to the fact that the title of this essay is “how we did it”, not “how I did it.” I worked with a fantastic team at AWS; truly one for the books. Nothing would have been possible without my excellent colleagues, my leaders, and the superb Principal Engineering community at AWS.
What I would like folks to take away from all of this:
Product management is critical. It is a gift to work with leadership who take the time to understand the needs of customers below the superficial level; to truly understand their motivations, concerns, and timelines.
Ensure you have the ability to validate early: validate with your customers and validate your own assumptions about how new technology will perform.
Prioritize automated deployment and testing infrastructure. This will lead to efficiency gains and allow developers to iterate and experiment more quickly and with more safety.
https://aws.amazon.com/blogs/aws/new-bring-your-own-keys-with-aws-key-management-service/
https://aws.amazon.com/blogs/security/are-kms-custom-key-stores-right-for-you/
This was always useful to know because for critical deliverables a common question I had to answer was “how fast could we get this done if we had everything we needed?” This was a clear way of illustrating the upper limit where more resources would not speed up delivery of the end goal.
TPMs love this one trick.
I also used SmartSheet at Stripe until they decided that $25/month was not worth it for me to be able to provide accurate projections of project status and completion dates.
I feel comfortable saying we never shipped with any compromises on security or data integrity.