<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Bits and Being]]></title><description><![CDATA[Thoughts on life, work, and the spaces in-between.

Written by Aleks Rudzitis, a Principal Engineer at AWS. Opinions are my own and may not be shared by my employer. ]]></description><link>https://www.bitsandbeing.com</link><image><url>https://substackcdn.com/image/fetch/$s_!TAJQ!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F775fce24-d00f-45db-a226-ca0ccf91ef1b_1124x1125.jpeg</url><title>Bits and Being</title><link>https://www.bitsandbeing.com</link></image><generator>Substack</generator><lastBuildDate>Thu, 07 May 2026 10:47:33 GMT</lastBuildDate><atom:link href="https://www.bitsandbeing.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Aleks]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[bitsandbeing@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[bitsandbeing@substack.com]]></itunes:email><itunes:name><![CDATA[Aleks]]></itunes:name></itunes:owner><itunes:author><![CDATA[Aleks]]></itunes:author><googleplay:owner><![CDATA[bitsandbeing@substack.com]]></googleplay:owner><googleplay:email><![CDATA[bitsandbeing@substack.com]]></googleplay:email><googleplay:author><![CDATA[Aleks]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Leaving Stripe: Parting Thoughts]]></title><description><![CDATA[Goodbye Bazel, Hello Brazil]]></description><link>https://www.bitsandbeing.com/p/leaving-stripe-parting-thoughts</link><guid isPermaLink="false">https://www.bitsandbeing.com/p/leaving-stripe-parting-thoughts</guid><dc:creator><![CDATA[Aleks]]></dc:creator><pubDate>Mon, 05 Aug 2024 17:01:25 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!n0JE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2059e4c7-00ca-4a50-a6d6-91a538bf01e0_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Last month I resigned from my post at Stripe to return to my previous role in AWS 
Cryptography. I am excited for the change, and for the opportunity to do work more aligned with my interests and domain expertise.&nbsp;</p><p>I will remember fondly my time at Stripe, and especially all of the engineers I had the opportunity to work with. Working at Stripe was a great learning and growth opportunity, and I have no regrets about my last four and a half years. However, I did have a few things I wanted to share before closing out this chapter of my life.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!n0JE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2059e4c7-00ca-4a50-a6d6-91a538bf01e0_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!n0JE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2059e4c7-00ca-4a50-a6d6-91a538bf01e0_1024x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!n0JE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2059e4c7-00ca-4a50-a6d6-91a538bf01e0_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!n0JE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2059e4c7-00ca-4a50-a6d6-91a538bf01e0_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!n0JE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2059e4c7-00ca-4a50-a6d6-91a538bf01e0_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!n0JE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2059e4c7-00ca-4a50-a6d6-91a538bf01e0_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2059e4c7-00ca-4a50-a6d6-91a538bf01e0_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2464262,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!n0JE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2059e4c7-00ca-4a50-a6d6-91a538bf01e0_1024x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!n0JE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2059e4c7-00ca-4a50-a6d6-91a538bf01e0_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!n0JE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2059e4c7-00ca-4a50-a6d6-91a538bf01e0_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!n0JE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2059e4c7-00ca-4a50-a6d6-91a538bf01e0_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a></figure></div><h1>Edit the company</h1><p>Every transition has elements of push and pull. I am leaving primarily because of the opportunity I have at AWS. However, I still want to take a moment to reflect on the secondary factors that made Stripe a less appealing place for me to stay. What I share here I share only in the hope of inspiring others to edit the company in ways I was ultimately unable to, so that we make a great employer even better.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>&nbsp;</p><h2>Process, not products</h2><p>Stripe has a very &#8220;ship&#8221;-based culture. The processes and norms feel optimized for delivering a big chunk of work where we can clearly say &#8220;we did it!&#8221; However, not every type of work fits into this paradigm.&nbsp;</p><p>The overemphasis on bundling everything into a &#8220;project&#8221; that we could declare victory over meant teams struggled to incorporate work that requires small-but-constant improvement. In my case, I&#8217;m specifically thinking of reliability and operations.</p><p>As an example, reliability is not something we can ship or declare victory over (though it often felt like we tried). Reliability requires constant vigilance, the flexibility to tackle problems as they occur, and the knowledge that problems will occur constantly.&nbsp;</p><p>Much of what made Stripe reliable and a generally productive place to work is the work of engineers who diligently sand away the rough edges that cause toil. Paraphrasing one of the lessons from the Toyota Production System, <em>the most critical work is the work that smooths the way for other work.</em></p><h2>Execution versus impact</h2><p>Stripe is also very &#8220;impact&#8221;-driven. 
This alone is not a problem, and I definitely appreciate how we got here. When you have a large number of high-agency engineers working on various problems, under budget and time constraints, you want folks to leverage their efforts to achieve the highest impact. And it follows that we want to close the loop by holding people accountable for the impact they deliver.&nbsp;</p><p>However, as Stripe grows and folks are more removed from the decision-making processes, I see the expectation that folks speak to the impact of their work causing more and more tension with engineers, especially those earlier in their careers. <em>Put another way, an employee who doesn&#8217;t have input into the project they are taking on should not have to speak to the impact of the work</em>. At a certain point (at least for junior engineers) we should be holding engineering managers accountable for impact and engineers accountable for execution.&nbsp;</p><h2>Overloaded engineers</h2><p>It was not uncommon during my time at Stripe for teams to be loaded with work such that there were two active work streams per engineer. This is a heavier load than I think is healthy, and it impacted us in a few ways.&nbsp;</p><p>First, there is very little slack in the system to pick up work on an as-needed basis. <em>A certain amount of slack in resources is needed to polish issues as blemishes are noticed.</em></p><p>Second, this type of loading negatively impacts the sense of camaraderie on the team. A team where everyone is working on a different thing is not really a team; it is just a group of engineers sharing a sprint board. For teams with a broad scope of ownership, this also limits the amount of context anyone will have relative to all the systems they&#8217;re expected to operate. 
This makes on-call much more stressful and risks burning out the team.&nbsp;</p><p>Lastly, a smaller set of goals that everyone on the team has context on makes it easier for teams to self-organize and allows engineers to experience a higher degree of agency within the team.&nbsp;</p><h2>Interfaces</h2><p>I saw a number of initiatives during my time at Stripe that focused on shoring up reliability and security for critical components. These generally took the form of evaluating each component of a system against a rubric.&nbsp;</p><p><em>Complexity, however, is at the edges. It is at the interfaces of our systems.</em> This turns out to be where the reliability risks are as well.&nbsp;</p><p>I&#8217;m not sure what my point is here. <em>But my advice is that if you ever find yourself asking about the reliability or safety of a service, you&#8217;re almost certainly asking the wrong question. </em>The focus should be on systems: the flows and the units of functionality provided to internal and external customers.&nbsp;</p><h2>Incomplete migrations</h2><p>Stripe struggled to complete migrations, and we did not properly account for the weight of the work not finished.&nbsp;</p><p>Sometimes the migrations were not finished because we did not budget for the work to move load to the new system after creating it. Often, these migrations were very expensive to actually do because of how untested the flows that run on them were; moving was too scary and stressful.</p><p>Other times we took a dependency on another team's &#8220;migration&#8221; which never completed.&nbsp;</p><p>We could reduce the context engineers need to maintain by a significant percentage by deprecating old systems before building new ones.&nbsp;</p><p>On my team, we had at least four incomplete migrations. This work-not-finished was a major drag on the productivity of the team. It made changes more difficult to reason about because of the multitude of code paths that need to be considered. 
Reliability was impacted for similar reasons: more to think about when making changes or performing mitigations during incidents.&nbsp;</p><p>I recommend folks adopt a budget for the number of systems or code paths their team will own. Space is created in the budget by turning off systems which are no longer needed.&nbsp;</p><h2>Shared context</h2><p>My largest personal struggle at Stripe was trying to get to where I felt like I was &#8220;in the room&#8221; where strategy was being formed and solidified. As a staff engineer wearing a tech-lead hat, I need shared context with my manager and their manager to correctly prioritize and do my work. This means knowing the criticism of plans and ideas from higher up. Receiving a filtered view of the world hinders my ability to effect change within the company.&nbsp;</p><p>This is actually my largest source of dissatisfaction with working at Stripe. At my previous employer, even though I was nominally operating one level below where I am at Stripe, I had more access to my leaders 2-3 levels up the organizational hierarchy. There was significantly more transparency around the reality of resource constraints and business needs (and, dare I say, internal politics).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> I felt I had a seat at the table when it came to roadmap building and decision making. Lastly, at my previous employer, I felt that I had leaders willing to lend me their authority in a way that I do not have here.&nbsp;</p><p>It is possible that, when it comes to Stripe, I am &#8220;holding it wrong&#8221; and that getting things done here requires a different skill set than the one I currently possess.&nbsp;</p><h1>Parting advice to engineers</h1><p>Finally, I want to share some parting advice to the engineers I worked with, especially those earlier in their careers. 
You&#8217;re all amazing people and you&#8217;re going to go places for sure. Hopefully the things I say below were things you heard from me while we worked together, but in case you didn&#8217;t (and for everyone else), here they are:</p><h2>You can do it</h2><p>The teams that I worked on during my time at Stripe dealt with a lot of gnarly interoperability issues with external companies and services. We often discovered issues in open-source packages or had to create our own libraries for obscure protocols. Issues were often found only after a system had been misbehaving for a while and large amounts of money were at stake. </p><p>This can be an intimidating and paralyzing environment to work in.&nbsp;</p><p><em>Believe that you can find the answer</em> by reading the code, by reading the RFC, by looking at raw packets. Everyone is capable of seeing the Matrix. It isn&#8217;t a quality folks are born with. It just takes practice. Gumption.&nbsp;</p><p>It does also often require muting the Zoom call and going heads down for some time. The job of your leaders is to play defense so that you have the room to do what you do best. Hold them accountable to this task. During an emergency the first priority is to mitigate. Let others focus on communication.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><h2>Sometimes you fail</h2><p>A senior engineer is just a junior engineer that has made a lot of mistakes and seen others make a lot of mistakes. Making mistakes, dragging teammates into incidents, and even occasionally breaking production comes with the territory.&nbsp;</p><p>It doesn&#8217;t feel good when it happens. It usually feels really shitty. But you are growing. And it sucks, but this is just what it feels like. You are pushing your own abilities. Failure is truly the best teacher.&nbsp;</p><p>You don&#8217;t owe the world perfection. 
Just do your best to make sure you have learned something from the experience.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p><p>One way to grow from a mistake is to involve yourself as much as possible in the incident review/RCA process. Anecdotally, I have found that a well-written analysis of an incident with a good list of remediation items has earned me more professional praise than causing the underlying incident earned me blame.&nbsp;</p><h2>Ask the hard questions</h2><p>Much of the value I provided at Stripe was being <em>that guy</em>. When engineers and leaders are sitting in a meeting reviewing a retrospective on an incident or current operational metrics, there is an ever-present risk of complacency. This is a mode that even the most disciplined, operationally rigorous teams can get into. And it requires someone to break from the social cohesion.&nbsp;</p><p>When you&#8217;re looking at a design or reading an incident report, notice when something gives you pause or causes you unease. When something doesn&#8217;t feel right, call it out.&nbsp;</p><p>Raise your hand.</p><p>Practice putting it into words.&nbsp;</p><h1>Onward</h1><p>As mentioned earlier, my new role is at AWS Cryptography. I am joining as a Principal Engineer and will work with my team to continue to innovate in the realm of applied cryptography. I&#8217;m excited to join an organization that is eager to take on large, moonshot projects and tackle some hard problems.&nbsp;</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>I also offer this feedback with the understanding that I am but one person with one set of opinions, and that my ideas may conflict with the ideas of many other experienced engineers who know better than I do. 
When reading, please weigh heavily your own experience and common sense, and decide accordingly.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Why is this transparency important? See </p><div id="youtube2-vtIzMaLkCaM" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;vtIzMaLkCaM&quot;,&quot;startTime&quot;:&quot;2101s&quot;,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/vtIzMaLkCaM?start=2101s&amp;rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>If I do not know my readers/leaders, if I do not know what they value and what they doubt, it is very difficult to create value.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>In aviation there is a rule that is drilled into you early on: <a href="https://medium.com/faa/aviate-navigate-communicate-47d043d99d69">aviate, navigate, communicate</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Though you may on occasion fail at that too.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[How we did it]]></title><description><![CDATA[My past life executing on crazy ideas for AWS Cryptography.]]></description><link>https://www.bitsandbeing.com/p/how-we-did-it</link><guid 
isPermaLink="false">https://www.bitsandbeing.com/p/how-we-did-it</guid><dc:creator><![CDATA[Aleks]]></dc:creator><pubDate>Mon, 07 Aug 2023 17:08:12 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lsYw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca730489-81f2-475a-b996-4e6a229d4c1b_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is a story about how I shipped some really cool stuff at Amazon Web Services, my approach to executing on complex projects, and my advice to others. </p><p>Prior to my current position at Stripe, I was a software developer at Amazon Web Services (AWS), working on the Key Management Service (KMS) from 2015 through 2019.&nbsp;</p><p>The AWS Key Management Service (KMS) is, as the name implies, a service for managing cryptographic keys in the cloud. In short, the service allows you to create keys, use them to encrypt and sign data through an API, and grant or restrict access to the keys using a few different access control mechanisms. 
AWS KMS is used by other AWS services to protect customer data, and some customers also build applications directly on top of the KMS APIs.&nbsp;</p><p><em>Disclaimer: though informed by my experience as an employee of Amazon, my thoughts and opinions are my own and may not be shared by my former or current employer.&nbsp;</em></p><p>While working on this team, I had the privilege of leading the development of two high-impact products:</p><ul><li><p>The Bring Your Own Key (BYOK) feature allowed customers to import key material into AWS KMS from their own, on-premise HSM infrastructure.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> This provided an option for customers that wanted to take responsibility for the generation of keys themselves and/or did not want their keys to exist only within AWS.&nbsp;</p></li><li><p>Custom Key Stores (CKS) allowed customers to create a KMS key that was backed by a key in a CloudHSM cluster they controlled.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> Similar to BYOK, CKS gave customers a way to use KMS while meeting requirements they may have to retain key material in devices that were more similar to traditional on-premise HSMs.</p></li></ul><p>Both features were key in enabling strategic customers to move to AWS. It was truly an honor to be a part of project work which grew the cloud business by such a significant amount.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p><em>However, what really excited me about this work was how we executed. And that is what I want to write about today.&nbsp;</em></p><p>Both projects were performed under time pressure from the AWS business and customers that were eager to utilize these features. 
As the lead engineer, I was responsible for scoping the work involved and arguing for the time and resources we would need. In both cases, I was given both less time and fewer people than I originally thought were necessary.&nbsp;</p><p>Otherwise, where would be the fun in it?&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lsYw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca730489-81f2-475a-b996-4e6a229d4c1b_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lsYw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca730489-81f2-475a-b996-4e6a229d4c1b_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!lsYw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca730489-81f2-475a-b996-4e6a229d4c1b_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!lsYw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca730489-81f2-475a-b996-4e6a229d4c1b_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!lsYw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca730489-81f2-475a-b996-4e6a229d4c1b_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lsYw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca730489-81f2-475a-b996-4e6a229d4c1b_1024x1024.png" width="560" height="560" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ca730489-81f2-475a-b996-4e6a229d4c1b_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:560,&quot;bytes&quot;:2072264,&quot;alt&quot;:&quot;Illustration of an engineer planning things. &quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Illustration of an engineer planning things. " title="Illustration of an engineer planning things. " srcset="https://substackcdn.com/image/fetch/$s_!lsYw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca730489-81f2-475a-b996-4e6a229d4c1b_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!lsYw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca730489-81f2-475a-b996-4e6a229d4c1b_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!lsYw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca730489-81f2-475a-b996-4e6a229d4c1b_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!lsYw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca730489-81f2-475a-b996-4e6a229d4c1b_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>(From this point on, I will jump between discussing the two different projects because there were a lot of similarities, and also because I can&#8217;t remember exactly which memory was from which effort.)</p><h1>Planning</h1><p>I figured out when the project planning really kicked into high gear by looking through my Amazon order history for &#8220;red string.&#8221;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!J_oD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd525476a-8b0c-4371-bce0-70f03f2c3962_540x405.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!J_oD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd525476a-8b0c-4371-bce0-70f03f2c3962_540x405.png 424w, https://substackcdn.com/image/fetch/$s_!J_oD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd525476a-8b0c-4371-bce0-70f03f2c3962_540x405.png 848w, https://substackcdn.com/image/fetch/$s_!J_oD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd525476a-8b0c-4371-bce0-70f03f2c3962_540x405.png 1272w, https://substackcdn.com/image/fetch/$s_!J_oD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd525476a-8b0c-4371-bce0-70f03f2c3962_540x405.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!J_oD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd525476a-8b0c-4371-bce0-70f03f2c3962_540x405.png" width="540" height="405" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d525476a-8b0c-4371-bce0-70f03f2c3962_540x405.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:405,&quot;width&quot;:540,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Charlie from It's Always Sunny conspiracy meme.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Charlie from It's Always Sunny conspiracy meme." title="Charlie from It's Always Sunny conspiracy meme." 
srcset="https://substackcdn.com/image/fetch/$s_!J_oD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd525476a-8b0c-4371-bce0-70f03f2c3962_540x405.png 424w, https://substackcdn.com/image/fetch/$s_!J_oD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd525476a-8b0c-4371-bce0-70f03f2c3962_540x405.png 848w, https://substackcdn.com/image/fetch/$s_!J_oD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd525476a-8b0c-4371-bce0-70f03f2c3962_540x405.png 1272w, https://substackcdn.com/image/fetch/$s_!J_oD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd525476a-8b0c-4371-bce0-70f03f2c3962_540x405.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">An actual image of me explaining the project status to my manager.&nbsp;</figcaption></figure></div><p>I exaggerate, but not <em>that</em> much.&nbsp;</p><p>My goal at the beginning of both projects was to get a bird&#8217;s-eye view of the work that would need to be completed, the effort required for each piece of work, and the dependencies between the pieces. Having this laid out at the beginning would let me approximate how long the project would take to complete. By understanding the dependencies between the tasks, it was also possible to understand how headcount would affect the delivery of the project.&nbsp;</p><p>My note card and red string approach went like this:&nbsp;</p><ol><li><p>Write a note card with the end goal, such as &#8220;deliver feature X.&#8221;&nbsp;</p></li><li><p>Think about what composes that end goal. In the case of an AWS product, this might be &#8220;API for X&#8221;, &#8220;console website for X&#8221;, &#8220;documentation for X&#8221;, etc. Attach each of these with a red string under the main goal.&nbsp;</p></li><li><p>Repeat for each subgoal. For example, &#8220;control plane API&#8221;, &#8220;data plane API&#8221;, etc. If at any point there is a dependency between two tasks, connect them with a red string and place the task that must be completed first lower.&nbsp;</p></li><li><p>Repeat, decomposing tasks into smaller tasks (design API, review API with committee, etc.), until you&#8217;re down to tasks that a single person can complete in a couple of days and that are not blocked by other dependencies.&nbsp;</p></li></ol><p>The end result is effectively a dependency graph. 
For each task, I would add an approximate time estimate. The leaves (tasks with no other tasks connected below them) should be tasks that developers can pick up and get started with. The longest path (adding up the estimates on all the tasks) from a leaf to the root is the shortest theoretical length of time the project could take, assuming human resources were not a constraint.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p><p>As the project progressed, it would be easy to keep an eye on the critical path, add newly uncovered tasks, and identify work that was ready for someone to pick up.&nbsp;</p><p>This is more or less how I coordinated the work for the BYOK feature.&nbsp;</p><p>The downside of this approach was that it relied on my being close to the board with all the cards and string in order to answer questions about the project, so for the CKS project work I acquired a Microsoft Project license<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> and repeated the exercise, except with Gantt charts.</p><p>The advantage of using Microsoft Project was that the information was more portable. It was also in a format that other stakeholders such as project managers could consume and understand. 
However, the downside is that Microsoft Project asks a lot of questions and puts an over-emphasis (in my opinion) on calendar dates when laying this out.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a></p><p>Even so, starting with the goal and working backwards is the approach I continue to use today, both to understand the work a project will require and to provide accurate timelines.</p><h1>Consensus building</h1><p>Both the BYOK and CKS features involved expanding the capabilities of the AWS Key Management Service, and therefore were under a lot of scrutiny by the security-minded folks of AWS Cryptography (of whom there are many). An important part of delivering both projects without unexpected delays was ensuring there was consensus within the principal engineering community insofar as the security of the system was concerned.&nbsp;</p><p>An objection later in the project cycle, even if it were to be eventually withdrawn, would risk disrupting project timelines. 
Therefore, early on I made sure to meet with certain key Principal Engineers, one on one, to hear out their concerns for the feature, understand their recommendations, and explain my thinking about how we&#8217;d ensure the security of our system would not be compromised.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!blgo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf4168ac-2da3-419c-83cc-2918edb910c4_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!blgo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf4168ac-2da3-419c-83cc-2918edb910c4_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!blgo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf4168ac-2da3-419c-83cc-2918edb910c4_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!blgo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf4168ac-2da3-419c-83cc-2918edb910c4_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!blgo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf4168ac-2da3-419c-83cc-2918edb910c4_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!blgo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf4168ac-2da3-419c-83cc-2918edb910c4_1024x1024.png" width="540" height="540" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cf4168ac-2da3-419c-83cc-2918edb910c4_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:540,&quot;bytes&quot;:2036146,&quot;alt&quot;:&quot;Illustration of engineers arguing over plans. &quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Illustration of engineers arguing over plans. " title="Illustration of engineers arguing over plans. " srcset="https://substackcdn.com/image/fetch/$s_!blgo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf4168ac-2da3-419c-83cc-2918edb910c4_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!blgo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf4168ac-2da3-419c-83cc-2918edb910c4_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!blgo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf4168ac-2da3-419c-83cc-2918edb910c4_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!blgo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf4168ac-2da3-419c-83cc-2918edb910c4_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As the project progressed, and certain aspects of the design were refined or clarified, I continued to check in to ensure there would be no surprise concerns or objections raised close to the deadline.&nbsp;</p><p>In all cases where concerns were raised by principal engineers (not necessarily about security), I prioritized responding in writing with the mitigations we had planned and our thinking on the topic. 
Whatever I wrote would also become a permanent appendix to the design document portfolio for the project, so that we would not have to rehash the same issues over and over.&nbsp;</p><h1>What we needed and why we needed it</h1><p>For both projects, there was time pressure to deliver artifacts sooner than the requirements and resources would allow (at least on paper).&nbsp;</p><p>This pressure came from two different sources:</p><ul><li><p>External: we had made or wanted to make commitments to strategic customers about when a feature would be available in order to grow the business and win contracts.&nbsp;</p></li><li><p>Internal: both projects were seen as risky and as pushing the capabilities of our technology stack; there was a desire to know sooner rather than later if there would be unexpected difficulties.&nbsp;</p></li></ul><p>The solution that satisfied both concerns was an incremental approach to delivery. I&#8217;ll speak more about reducing the risk of project execution for internal stakeholders momentarily, but I want to speak to it from an external, customer-facing perspective first.&nbsp;</p><p>For both projects, we knew of specific customers that were particularly interested in the new features we were building. Indeed, it was critical that the product we built would meet the requirements of these particular customers. However, large customers will rarely adopt a new AWS product into their production environment on the day they are given access. Just as we need time to build something, customers need time to validate that it meets their requirements and understand how they will integrate it. This can take months (or even years for larger enterprises).&nbsp;</p><p>With this understanding, we prioritized the customer-facing aspects of the project work which would best allow us to collect feedback from those customers. 
We built APIs with the backends stubbed out, as well as documentation, that would give strategic customers the opportunity to start playing around with our product in a non-production environment, even if it was not yet fully functional.&nbsp;</p><p>Just as importantly, this approach gave us the opportunity to change direction early in the product life cycle, which would have been much more difficult had the backend functionality already been built out.</p><h1>Validating early</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LWQ4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ce3c07-304c-453c-8f3b-2d6402ef6141_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LWQ4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ce3c07-304c-453c-8f3b-2d6402ef6141_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!LWQ4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ce3c07-304c-453c-8f3b-2d6402ef6141_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!LWQ4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ce3c07-304c-453c-8f3b-2d6402ef6141_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!LWQ4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ce3c07-304c-453c-8f3b-2d6402ef6141_1024x1024.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!LWQ4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ce3c07-304c-453c-8f3b-2d6402ef6141_1024x1024.png" width="546" height="546" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b0ce3c07-304c-453c-8f3b-2d6402ef6141_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:546,&quot;bytes&quot;:1925116,&quot;alt&quot;:&quot;A family using a skateboard like a car. &quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A family using a skateboard like a car. " title="A family using a skateboard like a car. 
" srcset="https://substackcdn.com/image/fetch/$s_!LWQ4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ce3c07-304c-453c-8f3b-2d6402ef6141_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!LWQ4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ce3c07-304c-453c-8f3b-2d6402ef6141_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!LWQ4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ce3c07-304c-453c-8f3b-2d6402ef6141_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!LWQ4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0ce3c07-304c-453c-8f3b-2d6402ef6141_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>&#8220;Build a skateboard before a car.&#8221;&nbsp;</p><p>I&#8217;ve heard this advice many times, but it can be difficult to apply. After all, a car is very little like a skateboard and the use cases are generally pretty different for each.&nbsp;</p><p>Reframed, the advice I would give to other engineers is to do the riskiest work up front. If you&#8217;re not sure you know how to make wheels that spin on an axle, make sure you can do that on a skateboard before you go about building the entire car.&nbsp;</p><p>In the realm of building novel key management products, this has most often meant validating early on that the new infrastructure can provide the required throughput (operations per second) at the desired latency (time to complete each operation). More generally, it means first doing the parts of the work which are least like what you&#8217;ve done before.&nbsp;</p><p>Ideally, after each project milestone, the behavior of the entire system should be revalidated. For example, if raw benchmarks against the hardware were satisfactory, does the system still perform when we add network hops? How about after we add authentication and authorization checks? How about after logging?&nbsp;</p><p>If there is a regression, it is useful to be able to narrow down where it happened. The worst case would be getting to the end and finding that none of the performance requirements are met.</p><h1>Prioritizing for iterative development</h1><p>Validating often is easier when validating is easy.&nbsp;</p><p>The Custom Key Store (CKS) project specifically required a lot of new infrastructure components to be built. 
Each performance and functional test required a lot of up front orchestration work. For this reason, we spent a large portion of our early time building the test harness which would allow this testing to be performed automatically.&nbsp;</p><p>It was a risk to spend so much time on automated testing before the product was even at a minimum level of functionality, but the investment paid off greatly. Every change made by any developer on the project was automatically tested for functional regressions. This saved weeks of developer time over the course of the project on testing alone, and once the testing infrastructure was there, allowed people to proceed with a much higher level of confidence.&nbsp;</p><p>Similar to automated testing, ensuring the continuous integration and deployment infrastructure is working early on also allows a project to proceed at a faster pace.&nbsp;&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!o0ca!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d0c91e2-612d-43a8-a7ed-c3a2d18963d5_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!o0ca!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d0c91e2-612d-43a8-a7ed-c3a2d18963d5_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!o0ca!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d0c91e2-612d-43a8-a7ed-c3a2d18963d5_1024x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!o0ca!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d0c91e2-612d-43a8-a7ed-c3a2d18963d5_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!o0ca!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d0c91e2-612d-43a8-a7ed-c3a2d18963d5_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!o0ca!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d0c91e2-612d-43a8-a7ed-c3a2d18963d5_1024x1024.png" width="490" height="490" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d0c91e2-612d-43a8-a7ed-c3a2d18963d5_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:490,&quot;bytes&quot;:1473358,&quot;alt&quot;:&quot;Abstract illustration of a software pipeline&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Abstract illustration of a software pipeline" title="Abstract illustration of a software pipeline" srcset="https://substackcdn.com/image/fetch/$s_!o0ca!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d0c91e2-612d-43a8-a7ed-c3a2d18963d5_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!o0ca!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d0c91e2-612d-43a8-a7ed-c3a2d18963d5_1024x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!o0ca!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d0c91e2-612d-43a8-a7ed-c3a2d18963d5_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!o0ca!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d0c91e2-612d-43a8-a7ed-c3a2d18963d5_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1>After launch</h1><p>Because of the constraints we were often working under, a product seldom shipped to customers with everything 
we initially had wanted. Often there were features excluded from the API, the web console, or issues on the backend that impacted operations for a short period.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a></p><p>After completing both projects, I took some time to compile a roadmap document. This document outlined the known deficiencies we&#8217;d need to address, work we might need to do to scale up in the future, and opportunities for growth later on.&nbsp;</p><p>Writing this down was valuable for two reasons: it closed the loop on the project with my leadership, providing a sort of bookend to the work while still communicating there was work to be done in the next planning cycle. Because it was written, it also allowed other engineers on the project to add their feedback and point out anything I missed.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TImD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff1b6bcf-7430-4662-b738-478a841e66d3_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TImD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff1b6bcf-7430-4662-b738-478a841e66d3_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!TImD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff1b6bcf-7430-4662-b738-478a841e66d3_1024x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!TImD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff1b6bcf-7430-4662-b738-478a841e66d3_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!TImD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff1b6bcf-7430-4662-b738-478a841e66d3_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TImD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff1b6bcf-7430-4662-b738-478a841e66d3_1024x1024.png" width="456" height="456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ff1b6bcf-7430-4662-b738-478a841e66d3_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:456,&quot;bytes&quot;:1952985,&quot;alt&quot;:&quot;Abstract illustration of software going on a shelf.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Abstract illustration of software going on a shelf." title="Abstract illustration of software going on a shelf." 
srcset="https://substackcdn.com/image/fetch/$s_!TImD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff1b6bcf-7430-4662-b738-478a841e66d3_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!TImD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff1b6bcf-7430-4662-b738-478a841e66d3_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!TImD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff1b6bcf-7430-4662-b738-478a841e66d3_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!TImD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff1b6bcf-7430-4662-b738-478a841e66d3_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h1>Wrap up and takeaways</h1><p>I would like to conclude by drawing attention to the fact that the title of this essay is &#8220;how <em>we</em> did it&#8221;, not &#8220;how <em>I </em>did it.&#8221; I worked with a fantastic team at AWS; truly one for the books. Nothing would have been possible without my excellent colleagues, my leaders, and the superb Principal Engineering community at AWS.</p><p>What I would like folks to take away from all of this:</p><ul><li><p>Product management is critical. It is a gift to work with leadership who take the time to understand the needs of customers below the superficial level; to truly understand their motivations, concerns, and timelines.&nbsp;</p></li><li><p>Ensure you have the ability to validate early: validate with your customers and validate your own assumptions about how new technology will perform.</p></li><li><p>Prioritize automated deployment and testing infrastructure. This will lead to efficiency gains and allow developers to iterate and experiment more quickly and with more safety.</p></li></ul><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>https://aws.amazon.com/blogs/aws/new-bring-your-own-keys-with-aws-key-management-service/</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>https://aws.amazon.com/blogs/security/are-kms-custom-key-stores-right-for-you/</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p><a href="https://youtu.be/1IxDLeFQKPk?t=4108">Goldman Sachs Managing Director spotlighting this feature as a key reason they chose AWS during the re:Invent keynote in 2017</a>, and<a href="https://youtu.be/7-31KgImGgU?t=2027"> the Goldman Sachs CEO calling it out again in 2019</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>This was always useful to know because for critical deliverables a common question I had to answer was &#8220;how fast could we get this done if we had everything we needed?&#8221; This was a 
clear way of illustrating the upper limit where more resources would not speed up delivery of the end goal.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>TPMs love this one trick.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>I also used SmartSheet at Stripe until they decided that $25/month was not worth it for me to be able to provide accurate projections of project status and completion dates.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>&nbsp;I feel comfortable saying we never shipped with any compromises on security or data integrity.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Operational Reviews]]></title><description><![CDATA[My journey building people-driven infrastructure]]></description><link>https://www.bitsandbeing.com/p/operational-reviews</link><guid isPermaLink="false">https://www.bitsandbeing.com/p/operational-reviews</guid><dc:creator><![CDATA[Aleks]]></dc:creator><pubDate>Wed, 05 Jul 2023 16:01:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xFYm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c123585-3c0a-4fb1-90bb-e05c3a616ee6_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>One of my passion projects at Stripe is working to improve the engineering culture of my organization. 
Last year, I had the honor of revamping the process by which my organization reviews the health of its systems on a weekly basis. What follows is an abbreviated retelling of my experience in this domain and the lessons I learned from building this infrastructure for my organization at Stripe.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xFYm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c123585-3c0a-4fb1-90bb-e05c3a616ee6_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xFYm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c123585-3c0a-4fb1-90bb-e05c3a616ee6_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!xFYm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c123585-3c0a-4fb1-90bb-e05c3a616ee6_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!xFYm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c123585-3c0a-4fb1-90bb-e05c3a616ee6_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!xFYm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c123585-3c0a-4fb1-90bb-e05c3a616ee6_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xFYm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c123585-3c0a-4fb1-90bb-e05c3a616ee6_1024x1024.png" width="442" height="442" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6c123585-3c0a-4fb1-90bb-e05c3a616ee6_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:442,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xFYm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c123585-3c0a-4fb1-90bb-e05c3a616ee6_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!xFYm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c123585-3c0a-4fb1-90bb-e05c3a616ee6_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!xFYm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c123585-3c0a-4fb1-90bb-e05c3a616ee6_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!xFYm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c123585-3c0a-4fb1-90bb-e05c3a616ee6_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Disclaimer: though informed by my experience as an employee of Amazon and Stripe, my thoughts and opinions are my own and may not be shared by my former or current employer.&nbsp;</em></p><h1>Glossary</h1><ul><li><p><strong>EM</strong>: Engineering<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> Manager. Someone who manages teams of around 5 to 8 ICs.</p></li><li><p><strong>IC</strong>: Individual Contributor. Generally a software developer who spends their time building and maintaining things.&nbsp;</p></li><li><p><strong>MoM</strong>: Manager of managers. An individual who manages a handful of EMs.</p></li><li><p><strong>SLA</strong>: Service Level Agreement. A promise (often contractual) to customers or users about how your service will perform or how reliable it will be.&nbsp;</p></li><li><p><strong>SLC</strong>: Service Level Commitment. 
An internal commitment to how services should perform or how reliable they should be. Less formal than an SLA. At Amazon, the term &#8220;internal SLA&#8221; was used in lieu of SLC.</p></li><li><p><strong>SLO</strong>: Service Level Objective. A goal set by a team for how well it wants a service to perform, but not as formal or broadly communicated as an SLC.</p></li></ul><p><em>Note that these terms may vary in their usage company to company, or even within the same company. My goal is to define them as I am using them in this essay. </em>&nbsp;</p><h1>What is an operational review?</h1><p>An operational review (ops review, for short) is any process put into place to periodically review the health of the systems operated by a team. To keep things simple, I will scope this discussion to teams that operate software services and other computer infrastructure.&nbsp;</p><h1>My journey through operational review cultures</h1><h2>Amazon Marketplace</h2><p>I joined Amazon Marketplace (part of the retail side of Amazon) in 2010, fresh out of college. In my early career at Amazon, the organization I worked in had a very minimal process for oversight of operations. Teams had metrics, dashboards, and their own monitoring, but there was no concept of an SLA on these services (from the perspective of ICs at least). EMs would be asked to discuss outages or unexpected behavior during internal operational reviews, but ICs did not take part in this process aside from the ICs ending their on-call rotations preparing notes on issues that occurred during their shift.</p><p>This lax approach could be attributed to the asynchronous nature of the systems my teams managed. The systems we owned either worked on asynchronous tasks as part of order fulfillment workflows, or provided information for merchant-facing dashboards and APIs, which were relatively low throughput. 
High latency and blips of unavailability could be tolerated as long as things eventually worked.</p><h2>Operational Reviews at Amazon Web Services</h2><p>In 2015, I moved to Amazon Web Services, where I would work for 5 years as a software developer on the Key Management Service. AWS was a completely different beast. AWS consists of customer-facing functionality and APIs. Large enterprise customers build their businesses on these APIs and services. As a result, some customers have very high expectations. Even for services which do not have an externally published SLA, teams are expected to set an internal SLA (i.e. SLC).&nbsp;</p><p>In general, most teams would review their core metrics (those most representative of customer experience, including any metrics they have an SLA on) weekly during their on-call handoff. My team in particular would review our most important core metrics, and then also randomly select another dashboard to audit both for clarity and for issues illustrated by the metrics themselves. At a minimum, this meeting was attended by the outgoing and incoming on-calls and EMs, but it was also well attended by other ICs.</p><p>At the director level there would typically be another organizational ops review. This meeting would involve a review of issues and major incidents affecting the org and would involve a deep dive into one of the products under that director each week. ICs were generally not expected at this meeting unless speaking to a specific issue.</p><p>The cornerstone of AWS operational review culture was the company-wide Wednesday ops review meeting. The basic agenda of this meeting was to celebrate operational wins, review upcoming changes folks needed to be aware of (code freezes, migrations, etc.), dive deep into critical incidents, and lastly dive deep into a random service&#8217;s dashboard.</p><p>This meeting was open to all in the company, and was regularly attended by senior leadership and the Principal Engineer community. 
Every team was expected to send a representative (whether this was an EM or IC was discretionary, as long as they were able to speak for the team and could present the team&#8217;s metrics if chosen).</p><p>Having had opportunities to present both incident reports and my team&#8217;s metrics at this meeting, I can attest that this was both an incredibly intense and rewarding experience. Intense, because you&#8217;re speaking to (and questioned by) some of the most senior engineers and leaders at the company. Rewarding, because the culture was blameless, and because of how clear it was in those moments how ruthlessly the business cared about getting the fundamentals right.</p><h2>Stripe</h2><p>In 2020, I moved to Stripe to work on systems related to the secure storage and processing of credit card data.</p><p>When I first joined Stripe, there was no operational review process as such that was visible to ICs. There were no SLAs at the service level. There was a healthy culture of dashboards, and at least within my organization, detectors that alerted us when things were amiss. We had an on-call handoff meeting, but it generally did not involve a retrospective of service health over the last week, aside from pages or incidents.</p><p>As Stripe grew, there was an increased focus on operational reviews. We started defining SLCs for our services and tracked if we were meeting the SLCs. A number of other dashboards were created to help teams understand their posture with respect to service health and initiatives at Stripe where action was expected to be taken.</p><p>In 2021, within my organization a weekly meeting was created to review these dashboards and commitments with EMs and some ICs. EM attendance was expected, and ICs were invited but not required to attend. 
The same meeting was also used to deep dive into incident reports for the organization, and was generally used to perform the &#8220;review&#8221; expected of an incident, unless a higher level of review was deemed required due to the severity of the impact. The meeting was facilitated by volunteers (usually EMs).</p><h2>Taking the reins</h2><p>While I was pleased to see Stripe mature to the point that we were regularly tracking our service health, there were several areas where I felt the process fell short:</p><ul><li><p>The process was overly focused on the previous week and was not set up to reveal long term trends or potentially systemic issues.</p></li><li><p>The meeting didn&#8217;t have a clear agenda, and the audience was not well defined. Often the meeting did not have the right attendees to discuss more severe issues.&nbsp;</p></li><li><p>Folks in attendance from various teams were often not prepared to discuss operational incidents or abnormalities, and there was no mechanism to follow up after the meeting.</p></li><li><p>The meeting did not provide a forum for on-call engineers to share their perspective or concerns about system health.&nbsp;</p></li></ul><p>In early 2022, my organization went through a restructuring and the result for me was a new leadership chain. As a part of this restructuring, we also lost the management sponsor for our operational review process. This presented an excellent opportunity to start fresh.</p><p>With full support from my EM and MoM chain, I took on the responsibility of rewriting our operational review process.</p><p>I started with interviewing folks that had expressed an interest in sharing an opinion on our ops review process, both positive and negative. 
In these meetings, my goal was primarily to understand what folks hoped to get out of an ops review (that is, what would make it useful for them), and what folks liked and didn&#8217;t like about the previous format.</p><p>With this information in hand, as well as my own experience, I drafted a document which described the shortcomings of the previous format, our shared goals for the new format, and a clear statement of purpose. I ensured we all broadly agreed on these points. Next, using these previous points as an input, I drafted an outline for what the meeting would look like. After reviewing the planned format again with the critical stakeholders, and anyone who had a strong opinion, it was time to dive in and try it out. We&#8217;ve been iterating and making small improvements ever since then.</p><p>I will spend the remainder of this essay discussing what we developed.</p><h1>Shared Meaning</h1><div class="pullquote"><p>Having a weekly meeting where we come up for air and confirm to ourselves that everything looks like it is running well allows us to be more heads-down and productive the rest of the week.</p></div><p>The first challenge was developing a shared understanding of why we had this process:</p><p>Why are we having this meeting? Why should folks be here rather than doing other productive things? Why do we care about ops?</p><p>At AWS, where I walked into a functioning operational review process, this step was not necessary. I believe a meeting that folks find useful can be a justification in and of itself, even if you cannot put it into words. That meaning might even be partially of symbolic value. I think this was sometimes the case at AWS. The size of the Wednesday ops review meeting and who attended broadcast a very clear message: &#8220;operations are important here. Our customers care. We care. We will make sure you understand the health of your systems. If something goes wrong, we will be working with you to figure out a solution.&#8221;&nbsp;</p><p>That was a message that came from the top. With that signal, local ops reviews at different levels tended to fall into place.</p><p>At Stripe, that shared culture did not exist to the same degree.&nbsp;</p><p>And that meant defining a purpose for ourselves. 
The challenge here was to define a statement of purpose that both resonated with ICs (whose participation we wanted to encourage) and was pragmatic, in that it reflected the needs of EMs and MoMs to have a clear picture in their heads of how we&#8217;re doing so they can both communicate to their leaders and prioritize work.</p><p>The &#8220;purpose&#8221; was thus synthesized from input from both the EM and IC interviews.&nbsp;</p><p>Here are the goals we came up with for the meeting:</p><ul><li><p>Verify our systems are behaving as expected and understand the ways in which they are not.</p></li><li><p>Review the health of the people-driven processes that support the operations of our services (e.g. pages, on-call workloads) and ensure workloads are sustainable.</p></li><li><p>Verify that for recent incidents we&#8217;ve learned the correct lessons, correctly assessed impact, and generated the appropriate artifacts.</p></li><li><p>Understand our long-term trends and ensure our services will deliver the expected functionality over the next 12 months.</p></li></ul><p>When a more succinct summary is called for, I say the purpose of the meeting is &#8220;situational awareness.&#8221; Having a weekly meeting where we come up for air and confirm to ourselves that everything looks like it is running well allows us to be more heads-down and productive the rest of the week.</p><h1>Facilitation</h1><p>Even with an agreed-upon purpose in place, we decided a facilitator is useful for consistency.&nbsp;</p><p>The facilitator of this meeting is responsible for making sure:</p><ul><li><p>The meeting is useful.</p></li><li><p>The time is well spent.&nbsp;</p></li></ul><p>Ensuring the meeting is useful and efficiently run helps drive continued participation.</p><p>This, of course, involves the usual duties of time management, making sure notes are recorded, agendas updated, and so on.&nbsp;</p><p>To ensure the content is useful, and the time is well spent, the 
facilitator is also responsible for following up with owners of content in the meeting to make sure they are prepared. Folks should not be scrambling to figure out what they need to say, nor should they be surprised about what might come up during the meeting. This also means letting presenters know if we&#8217;re going to ask about an incident, for example.</p><p>During the meeting, the facilitator runs the presentation, shares visuals for the most part, cues others when it is their turn, and manages time.&nbsp;</p><p>We currently rotate facilitator duties within a group of interested volunteers. The goal is to balance redundancy (so that there is always a facilitator available) while making sure the facilitator has the experience and interest in fulfilling the responsibilities.&nbsp;</p><h2>Facilitator as Auditor&nbsp;</h2><div class="pullquote"><p>Don&#8217;t assume that everyone besides you understands it; be the backstop to misunderstanding.&nbsp;</p></div><p>There is an additional role that the facilitator must play that is important and unique enough to warrant its own section.&nbsp;</p><p>The facilitator is responsible for ensuring we&#8217;re holding ourselves to a high bar for clarity and understanding.&nbsp;</p><p>Other contributors are responsible for reporting on their systems, incidents, etc., but the facilitator is responsible for ensuring that the information presented makes sense and is understood by everyone, and for asking the questions that other folks might not be asking.&nbsp;</p><p>Unfortunately, this is the most difficult aspect of the role to teach. 
In general, my advice to prospective facilitators is to trust your gut: <em>if something doesn&#8217;t make sense, or you have a question, ask it.</em> Don&#8217;t assume that everyone besides you understands it; be the backstop to misunderstanding.&nbsp;</p><p>If a question is important to our understanding of our systems or an incident, but no one is able to answer it during the meeting, ensure it is assigned as an action item at the end of the meeting. Make sure it is followed up on.&nbsp;</p><h1>Structure and Format</h1><h2>Attendance</h2><p>We request that all EMs and the organization leadership (including any technical leads) attend, if at all possible. Aside from that, each team is expected to have folks present who can speak to the team&#8217;s operations over the last week.&nbsp;</p><p>The attendee or attendees for each team are responsible for preparing slide content assigned to their team.&nbsp;</p><p>Anyone else interested is welcome to join.&nbsp;</p><h2>Contents and Structure</h2><p>The basic structure of the ops review format we&#8217;ve developed has four sections.&nbsp;</p><p>Each section has a clear owner, and the owner presents.&nbsp;</p><p>Across the four sections, there is the common goal of presenting information which shows deviations from the baseline. This is in line with making sure the time is well spent and the content is useful. If there is nothing to report, the meeting progresses quickly and we spend time on other sections (or end early).</p><h3>Team/Service level metrics review</h3><p>Each team is responsible for discussing <em>anomalies</em> from how their services usually operate. The goal is to give each team space to speak to how the last week went for them, but to use the time efficiently by only speaking to deviations from the norm. 
We ask each team to independently review their operational metrics (defining operational metrics could be its own document), any incidents that they were <em>involved </em>in (this does not necessarily mean caused), and lastly any FYIs they have for other teams in the organization.&nbsp;</p><p>In our case, we also selected some overall organization metrics and assigned responsibility to the Tech Lead for our organization to gather and present these each week.&nbsp;</p><h3>Projects and Migrations</h3><p>Any team doing work, or aware of work being done by teams external to the org, that impacts teams in our organization can use this space to share updates.&nbsp;</p><p>In our case, we use this most frequently to inform teams about datacenter work. We also notify folks of database migrations, or other changes to how services operate.&nbsp;</p><p>If any systems owned by the organization have been flagged (through an incident or otherwise) as an area of risk or concern, we ensure we reserve time to review trends for that specific system as well.&nbsp;</p><p>This information is presented by the teams involved or most adjacent to the work.&nbsp;</p><h3>On-Call/Runner Metrics</h3><p>We use this section to review the health and sustainability of our on-call.&nbsp;</p><p>Key metrics are the number of pages and tickets. We also ask outgoing on-calls to submit subjective scores for their run experience. This gives EMs and our MoM visibility into the sustainability of the human side of our operations, and helps ensure on-calls are not at risk of being overwhelmed and burned out.&nbsp;</p><p>As a practical matter, this section is prepared by the facilitator from the available data.&nbsp;</p><h3>Incidents</h3><p>For each incident where our organization was implicated, the owner of the incident report prepares a summary of the incident. 
We include a summary of the event, the proximate causes of the event, and action items.&nbsp;</p><p>We do not do a deep dive into the incident, as we found that it is better to have a dedicated meeting for those purposes. That way, the incident review can have the appropriate invitee list, which may be different from what is appropriate for the ops review.&nbsp;</p><h2>Format and Implementation</h2><p>When we originally started running this meeting, participants were very strongly in favor of using a Google Doc template for each meeting, with the thought that this would be easiest to fill in.&nbsp;</p><p>After a couple of weeks, we decided as a group to move to a slide deck format. Using a presentation format made it easier to divide up the content and assign ownership per slide. The format also encouraged brevity. Lastly, it was easier for the presenter to share during meetings.&nbsp;</p><p>We maintain a slide deck that is cloned for each week&#8217;s meeting. The template is updated as we make improvements to the format. Speaker notes are used in the template to provide instructions for where to pull data from and suggestions for how to use each slide.&nbsp;</p><p>The slide deck used for each meeting, as well as a video conference recording, are retained and linked to from a Confluence page.&nbsp;</p><h1>Before the Meeting</h1><p>The meeting takes place weekly, and covers events from the previous week.&nbsp;</p><p>On Thursday before the meeting, an automated Slack reminder goes out to the facilitator to remind them to clone the presentation template, fill in dates, and update links. This ensures it is available for folks to start filling out.</p><p>Before the meeting, an automated Slack reminder goes out to on-calls to remind them to fill out the deck. 
The facilitator is also responsible for identifying if there is an incident that needs to be discussed this week and reminding the responsible team.&nbsp;</p><p>The facilitator is responsible for gathering data for some common aspects, such as the on-call metrics.</p><p>Lastly, before the meeting, the facilitator is responsible for nudging any teams which have still not provided content for their sections.&nbsp;</p><h1>Running the Meeting</h1><p>At this point, running the meeting is fairly straightforward.&nbsp;</p><p>The slide deck acts as the agenda for the meeting and the meeting should just go through slide by slide, allowing owners of each slide to present content when it is their turn.</p><p>The facilitator should ensure the meeting runs on time and politely cut off discussions that may lead to not getting through the content for the meeting.&nbsp;</p><p>Lastly, the facilitator should ask questions themselves (preferably after others have had a chance to ask) to ensure all information presented is clear.&nbsp;</p><h1>Afterwards</h1><p>The facilitator is responsible for sending an email with a link to the recording and slide deck. The email should also clearly summarize any action items that were assigned during the meeting, or actions that need to be taken by folks.</p><p>The same should also be recorded on any wiki pages for future reference.&nbsp;</p><h1>Measuring Success</h1><p>Given the amount of work that had gone into planning this format, there was a strong desire by our organization leaders to understand the impact this new structure was having and the sentiments of participants.</p><p>We initially attempted to do this by sending out small surveys to participants after each meeting. 
Even though we kept the surveys short (2-3 multiple-choice questions), I observed that people quickly burned out on filling one out each week, and the data quickly became sparse.&nbsp;</p><p>I found it more useful to ask folks for feedback during our regular one-on-one meetings. This also provides the opportunity for folks to share more nuanced sentiments that might be lost in a multiple-choice survey.&nbsp;</p><p>To this day, we continue to iterate on the format, how we collect and present data, and what items go on the agenda. The greatest signal I have that I&#8217;ve built something valued and enduring is that I am not the only contributor. Other folks in the organization bring their contributions to the format, and I overhear them evangelizing it to others.&nbsp;</p><p>I have been especially pleased to see the meeting continue to run itself while I am out on parental leave, and I am excited to jump back in later this year and see how things have grown and the improvements folks have made.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h61z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c936a1-aa57-4e8c-8dff-9f0e745b998b_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h61z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c936a1-aa57-4e8c-8dff-9f0e745b998b_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!h61z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c936a1-aa57-4e8c-8dff-9f0e745b998b_1024x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!h61z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c936a1-aa57-4e8c-8dff-9f0e745b998b_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!h61z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c936a1-aa57-4e8c-8dff-9f0e745b998b_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h61z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c936a1-aa57-4e8c-8dff-9f0e745b998b_1024x1024.png" width="454" height="454" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26c936a1-aa57-4e8c-8dff-9f0e745b998b_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:454,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!h61z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c936a1-aa57-4e8c-8dff-9f0e745b998b_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!h61z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c936a1-aa57-4e8c-8dff-9f0e745b998b_1024x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!h61z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c936a1-aa57-4e8c-8dff-9f0e745b998b_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!h61z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c936a1-aa57-4e8c-8dff-9f0e745b998b_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" 
contenteditable="false" target="_self">1</a><div class="footnote-content"><p>I have a lot of thoughts on the use of the term &#8220;engineer&#8221; in my field. I (and most of my coworkers) do not have proper engineering degrees. I usually identify myself as a &#8220;software developer&#8221; unless the title &#8220;engineer&#8221; is explicitly forced on me.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Preventing Incidents, Part 2: Everything Else]]></title><description><![CDATA[Normalization of deviance and other things you should think about]]></description><link>https://www.bitsandbeing.com/p/preventing-incidents-part-2-everything</link><guid isPermaLink="false">https://www.bitsandbeing.com/p/preventing-incidents-part-2-everything</guid><dc:creator><![CDATA[Aleks]]></dc:creator><pubDate>Thu, 15 Jun 2023 17:03:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!qt3L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F089d8ab7-84e3-4639-965b-bec77a9bdffe_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Disclaimer: my opinions are informed by my time at Stripe and AWS, but my thoughts are my own and not necessarily shared by my current or former employers.&nbsp;</em></p><p>This is part two of my reflections on incidents so far in 2023. <a href="https://www.bitsandbeing.com/p/preventing-incidents-part-1-testing">You can find part one here</a>.</p><p>To recap: consuming a large amount of information about how systems fail is one of the greatest privileges of working at Stripe (and formerly at Amazon Web Services). This year, I set a goal for myself to sit in on as many of these incident discussions as possible to try to extract some common threads to share back to the organization. I&#8217;m taking the opportunity now to share some of the observations I&#8217;ve made so far that are more generalizable and appropriate to publish externally.&nbsp;</p><p>I devoted <a href="https://www.bitsandbeing.com/p/preventing-incidents-part-1-testing">part one</a> to testing. 
For part two, I will go into some other areas.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qt3L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F089d8ab7-84e3-4639-965b-bec77a9bdffe_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qt3L!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F089d8ab7-84e3-4639-965b-bec77a9bdffe_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!qt3L!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F089d8ab7-84e3-4639-965b-bec77a9bdffe_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!qt3L!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F089d8ab7-84e3-4639-965b-bec77a9bdffe_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!qt3L!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F089d8ab7-84e3-4639-965b-bec77a9bdffe_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qt3L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F089d8ab7-84e3-4639-965b-bec77a9bdffe_1024x1024.png" width="312" height="312" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/089d8ab7-84e3-4639-965b-bec77a9bdffe_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:312,&quot;bytes&quot;:284304,&quot;alt&quot;:&quot;Pixel art of a server on fire&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Pixel art of a server on fire" title="Pixel art of a server on fire" srcset="https://substackcdn.com/image/fetch/$s_!qt3L!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F089d8ab7-84e3-4639-965b-bec77a9bdffe_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!qt3L!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F089d8ab7-84e3-4639-965b-bec77a9bdffe_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!qt3L!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F089d8ab7-84e3-4639-965b-bec77a9bdffe_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!qt3L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F089d8ab7-84e3-4639-965b-bec77a9bdffe_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h1>Change management</h1><p>Change management often gets a bad rap. When I speak of change management, I am specifically talking about the policies, procedures, and tooling used to manage infrastructure changes and software deployments. 
This usually centers around a form that should be filled out to describe a planned change and track its execution, but it also encompasses the policies and practices around when such a form is used, who must approve changes, and so on.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HISJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F264f2041-3648-490a-8db5-65067b83749d_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HISJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F264f2041-3648-490a-8db5-65067b83749d_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!HISJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F264f2041-3648-490a-8db5-65067b83749d_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!HISJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F264f2041-3648-490a-8db5-65067b83749d_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!HISJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F264f2041-3648-490a-8db5-65067b83749d_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HISJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F264f2041-3648-490a-8db5-65067b83749d_1024x1024.png" width="320" height="320" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/264f2041-3648-490a-8db5-65067b83749d_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:320,&quot;bytes&quot;:1273680,&quot;alt&quot;:&quot;Pixel art of software engineers writing documents. &quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Pixel art of software engineers writing documents. " title="Pixel art of software engineers writing documents. " srcset="https://substackcdn.com/image/fetch/$s_!HISJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F264f2041-3648-490a-8db5-65067b83749d_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!HISJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F264f2041-3648-490a-8db5-65067b83749d_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!HISJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F264f2041-3648-490a-8db5-65067b83749d_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!HISJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F264f2041-3648-490a-8db5-65067b83749d_1024x1024.png 1456w" sizes="100vw"></picture></div></a></figure></div><p>Change management can have a bad reputation because of how it is perceived to be (or actually becomes) a bureaucratic process that gets in the way of productivity. 
However, change management, when applied appropriately, can be a powerful tool in de-risking work that can&#8217;t otherwise be de-risked with other mechanisms such as testing.</p><p>Lack of change management can lead to issues such as missed steps during rollout, an inability to detect when something has gone wrong, and an inability to roll back.</p><h2>What can be done?</h2><p>There are two separate points to discuss: when to use a more formal change management process, and what such a process should look like.&nbsp;</p><p>From my personal experience, a formal change management process is more effective and more respected by engineering teams when it is used sparingly and limited to one-off changes that are not a routine part of operations.&nbsp;</p><p>Routine changes, such as software deploys, are best handled by automated tooling which has safeguards built in and takes humans out of the loop. Automation should be used wherever possible, and manual change management processes reserved for activities that must be performed by humans.&nbsp;</p><p>There are much richer resources out there on developing a change management process, so I will just cover the major points.&nbsp;</p><p>A change management document minimally requires:</p><ul><li><p>A description of the change to be performed</p></li><li><p>A procedure to follow to safely complete that change</p></li><li><p>A review / approval process for the description and procedure</p></li><li><p>Tracking of the change while it is being executed</p></li></ul><p>This can be done in a dedicated tool, a Google Doc, or anything in between.&nbsp;</p><p>What is most critical, however, is getting the description and procedure right.&nbsp;</p><p>A description should of course discuss the change being made. However, it must also acknowledge the potential risks. It may go so far as to link to a pre-mortem exercise run by the team, where the team performing the change has brainstormed all the ways the change could go wrong. 
This is critical information both for the person writing the procedure and for anyone reviewing the procedure for soundness.&nbsp;</p><p>The procedure itself must include:</p><ul><li><p>An outline of the steps required to perform the change. Enough detail should be in the procedure such that one does not need to rely on outside resources in order to execute the change. If this is not possible, necessary outside links must be provided in the procedure. This allows everyone reviewing the document to be sure that their understanding of the steps to be performed matches what actually will be done. I also recommend breaking the steps into sections or &#8220;acts&#8221;, with clear breakpoints where the procedure can be safely paused (or must be paused).&nbsp;</p></li><li><p>Rollback instructions, preferably for each &#8220;act&#8221;. The rollback instructions provide details on the steps needed to move the system back to the previous state. This is a place where it is easy to get lazy, but the rollback instructions must be written with the same rigor as the procedure. (You&#8217;ll be glad this is the case if you&#8217;re actually executing the rollback instructions.)</p></li><li><p>Observability. For each section, it should be clear what signals will be monitored to check if the system is healthy or if a rollback must be performed. The procedure should include not only the signals to be monitored, but also what healthy and unhealthy signals look like.&nbsp;</p></li></ul><p>Lastly, if the procedure that is about to be executed is new (and especially if it involves any bespoke tooling), it should be tested in a pre-production environment. Not only should the happy path be tested, but any rollback procedures as well. 
The gold standard for a rollback procedure would be one that you&#8217;ve tested in production.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> If possible, run game-day failure scenarios against the rollout and ensure the rollback procedures handle them. A rollback option that hasn&#8217;t been tested may as well not exist.&nbsp;</p><h1>Observability</h1><p>Gaps in observability will rarely be the cause of an issue, but the ability to notice when something has gone wrong and quickly identify what that something <em>is</em> can certainly mean the difference between minor and major impact.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!j8eB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d8d3403-c986-4abc-95b9-341c94e62051_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!j8eB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d8d3403-c986-4abc-95b9-341c94e62051_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!j8eB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d8d3403-c986-4abc-95b9-341c94e62051_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!j8eB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d8d3403-c986-4abc-95b9-341c94e62051_1024x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!j8eB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d8d3403-c986-4abc-95b9-341c94e62051_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!j8eB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d8d3403-c986-4abc-95b9-341c94e62051_1024x1024.png" width="302" height="302" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4d8d3403-c986-4abc-95b9-341c94e62051_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:302,&quot;bytes&quot;:1200471,&quot;alt&quot;:&quot;Pixel art of software engineers looking at a graph&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Pixel art of software engineers looking at a graph" title="Pixel art of software engineers looking at a graph" srcset="https://substackcdn.com/image/fetch/$s_!j8eB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d8d3403-c986-4abc-95b9-341c94e62051_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!j8eB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d8d3403-c986-4abc-95b9-341c94e62051_1024x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!j8eB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d8d3403-c986-4abc-95b9-341c94e62051_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!j8eB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d8d3403-c986-4abc-95b9-341c94e62051_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>What can be done?</h2><p>Modern tools and libraries can make creating extensive dashboards and alarms easy, but it is important to 
build with intention, and continuously refine the signals you&#8217;re looking at.&nbsp;</p><p>Dashboards should &#8220;tell a story&#8221; about what is going on. The &#8220;so what&#8221; of every graph on a dashboard should be clear; if not self-evidently, then at least by a written description. Graphs that don&#8217;t provide a useful signal should be removed or at least demoted to a less visible location where they won&#8217;t confuse someone trying to understand the state of a service.&nbsp;</p><p>There should be a strong signal-to-noise ratio. Developers should not become accustomed to graphs that &#8220;look bad&#8221; or behave erratically even when the service is behaving as expected, because this significantly lowers the chance that someone notices a real issue in the future.&nbsp;</p><p>Ideally, your tooling should let you quickly dive into the data when a deviation is apparent. Error rates up? It should be possible to get to the logs of a failed request relatively quickly. Latency up? The faster you can get to a latency breakdown, the better.&nbsp;</p><p>Alarms that actively alert operators must also have a high signal-to-noise ratio. Folks should not become accustomed to ignoring alarms. At the same time, alarms must fire reliably when an actual issue occurs.&nbsp;</p><p>I also strongly believe the user experience of the tooling you use to implement observability is critical. If creating a dashboard or alarms is difficult or confusing, folks are much less likely to create the right observability signals or keep them up to date.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>Lastly, always have a plan B in case your monitoring itself goes down. 
In the worst case, this may look like &#8220;SSH to some critical services and tail logs,&#8221; but have <em>something</em> in mind, and most importantly, write down what you would do.&nbsp;</p><h1>Runbooks</h1><p>Runbooks in this context are procedures written for responding to alarms. Like observability, good runbooks may not prevent something from going wrong, but they can certainly help stop things from going from bad to worse.&nbsp;</p><h2>What can be done?</h2><p>All alerts to developers must be accompanied by a runbook.&nbsp;</p><p>The most important trait of a good runbook is that it provides clear, unambiguous guidance for what to do. Remember, your runbook may be followed by a relatively inexperienced developer at 1 a.m.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> This doesn&#8217;t mean a runbook can identify a remediation for every possible type of failure, but it should at the very least provide clear guidance on where to find logs and metrics. If your runbook can get you to the bottom of the stack trace or point you at the broken dependency and the contact for that dependency, that is pretty good.&nbsp;</p><p>Having a template can reduce the friction of writing runbooks while also providing some consistency. While at Stripe, I worked with another developer to create a template to use for runbooks going forward. We came up with the following sections:</p><ul><li><p>A 1-2 sentence description of the detector and what it is telling you.</p></li><li><p>A 1-2 sentence description of the potential impact, so the responding operator understands the severity.&nbsp;</p></li><li><p>Temporary alerts or callouts about ongoing work that may impact this detector. For example, &#8220;we&#8217;re currently rolling out feature x, which could trigger this alert. 
If this alert fires, start by turning off feature flag y.&#8221; These are not expected to be a permanent part of the runbook.&nbsp;</p></li><li><p>Troubleshooting steps. These are expected to be run in order. Each step is also written in a clear &#8220;if this, then that&#8221; format. We also used a visual language to clearly label links to logs and graphs, so that they&#8217;d stick out and folks would not be surprised at where a link is taking them.</p></li><li><p>Escalation instructions for when the runbook has not resolved the issue and next steps are not otherwise clear.</p></li></ul><p>Lastly, schedule time (at least annually) for the team to review its runbooks and update them so that all the information in them is still accurate.&nbsp;</p><h1>Incident reporting</h1><p>By &#8220;incident reporting&#8221; I mean the process of documenting, reviewing, and sharing lessons learned from incidents. Every incident is an opportunity to identify new safety mechanisms that need to be built, practices that need to be changed, and so on.&nbsp;</p><h2>What can be done?</h2><p>Establish a culture of following up on incidents, identifying causes, planning remediation work, and sharing lessons. Exactly what this looks like will depend on the existing institutions of a company. 
Establish a bar for what constitutes an incident, and what level of review is appropriate for that level of impact.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lFci!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bd277c-46ca-4cb5-85fd-2334e452f415_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lFci!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bd277c-46ca-4cb5-85fd-2334e452f415_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!lFci!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bd277c-46ca-4cb5-85fd-2334e452f415_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!lFci!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bd277c-46ca-4cb5-85fd-2334e452f415_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!lFci!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bd277c-46ca-4cb5-85fd-2334e452f415_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lFci!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bd277c-46ca-4cb5-85fd-2334e452f415_1024x1024.png" width="314" height="314" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b3bd277c-46ca-4cb5-85fd-2334e452f415_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:314,&quot;bytes&quot;:1276739,&quot;alt&quot;:&quot;Pixel art of software engineers wondering what went wrong&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Pixel art of software engineers wondering what went wrong" title="Pixel art of software engineers wondering what went wrong" srcset="https://substackcdn.com/image/fetch/$s_!lFci!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bd277c-46ca-4cb5-85fd-2334e452f415_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!lFci!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bd277c-46ca-4cb5-85fd-2334e452f415_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!lFci!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bd277c-46ca-4cb5-85fd-2334e452f415_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!lFci!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bd277c-46ca-4cb5-85fd-2334e452f415_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Create a template that captures the information that should be collected for each incident. A timeline and quantified impact are great for hard data, but I have found there is a large amount of value in creating space for a team to write a narrative about what happened through the eyes of the folks that worked on the incident. 
A &#8220;<a href="https://en.wikipedia.org/wiki/Five_whys">five whys</a>&#8221; approach can be useful for unblocking thinking, but also has the risk of being applied too rigidly.&nbsp;</p><p>Incident reviews should be <a href="https://sre.google/sre-book/postmortem-culture/">blameless</a>, in that the blame should not be laid at the feet of specific engineers or teams, but should rather strive to understand the human-computer interactions which led to the incident taking place (if applicable).&nbsp;</p><p>Lastly, it is important to recognize the work that goes into writing an incident report and facilitating the review around it.&nbsp;</p><p>At some future date, I will write a post about the operational and incident review process I developed for my organization at Stripe.&nbsp;</p><h1>Conclusion</h1><p>I&#8217;ve covered a lot of different topics in these two posts, so before wrapping it up, I would like to discuss a few overarching themes that I&#8217;ve alluded to.&nbsp;</p><h2>Normalization of deviance</h2><p>The greatest risk to any of the best practices I&#8217;ve mentioned is the <a href="https://en.wikipedia.org/wiki/Normalization_of_deviance">normalization of deviance</a>. In short, the normalization of deviance is when the habit of skirting or not following best practices becomes culturally acceptable within an organization.&nbsp;</p><p>Some examples of normalizing deviance could be not creating or updating unit tests for a change, performing risky manual actions in production without following an organization&#8217;s change management processes, or creating detectors without runbooks. 
This is only a partial list; normalization of deviance from any practice is possible.&nbsp;</p><p>If there is one thing that will torpedo any attempts at operational rigor, it is a culture of normalizing deviance.&nbsp;</p><p>The most effective antidote is to quickly identify cases where a process is not being followed and then either:</p><ol><li><p>Explicitly and visibly prioritize diligence wherever you&#8217;re slipping to bring the organization&#8217;s habits back in line.</p></li><li><p>Edit the process so that what is written on paper more closely aligns with what folks believe is the more appropriate level of rigor.&nbsp;</p></li></ol><p><a href="https://www.amazon.com/Challenger-Launch-Decision-Technology-Deviance/dp/022634682X/">Recommended further reading</a>.&nbsp;</p><h2>Slow is smooth; smooth is fast; safety begets efficiency&nbsp;</h2><p>While the principal topic has been reducing the occurrence and severity of incidents, the suggestions in this post and the previous one are not just about safety and reliability. A company which has built tools and processes to execute safely is also one where engineers can execute with more confidence.&nbsp;</p><p>This is most evident when it comes to testing: if an engineer can edit a code base to add/remove/change functionality and have confidence that the testing infrastructure will catch any unexpected side effects, that engineer can move their code through the pipeline faster and deploy with more confidence. Reviewers of the code can also focus their attention on the intended behavior of the change, knowing that tests largely have their back.&nbsp;</p><p>Similarly, mature change management practices can remove the toil and uncertainty around rolling out a big manual change. Even if the process can feel bureaucratic to teams, at least it is a known quantity. 
A predetermined approval process can also reduce churn and allow a team to move forward confidently.</p><p>All of this is to say that proper investment in the areas discussed in these last two posts can yield not only decreased downtime, but also greater developer productivity.&nbsp;</p><h1>Further Reading</h1><p>I want to take a moment to acknowledge that this post and the preceding one have only begun to touch on everything involved in building and maintaining resilient systems. I haven&#8217;t even touched on writing software and services to be more resilient in the first place! For that topic, there are many great books out there, of which I&#8217;ve read absolutely none, but I would probably start with the classic <a href="https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321">Designing Data-Intensive Applications</a>.</p><p>For more on operating large scale systems, <a href="https://sre.google/sre-book/table-of-contents/">Google&#8217;s Site Reliability Engineering book</a> remains a classic, though I will warn about considering it the final word on any topic.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p><p><em>If you&#8217;ve made it this far, thank you for reading! I look forward to sharing more thoughts and tales from the trenches!&nbsp;</em></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>The rollback procedure itself must be safe enough that folks are comfortable executing it in production.&nbsp;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>See my upcoming post, &#8220;Why SignalFX is actively harmful.&#8221;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>This is not the time to make someone read an essay or architecture design document in order to understand your alert.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>For many years, I have hoped that someone at AWS would write their own O&#8217;Reilly book on the same topic.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Preventing Incidents, Part 1: Testing]]></title><description><![CDATA[A reflection on incidents in 2023, so far.]]></description><link>https://www.bitsandbeing.com/p/preventing-incidents-part-1-testing</link><guid 
isPermaLink="false">https://www.bitsandbeing.com/p/preventing-incidents-part-1-testing</guid><dc:creator><![CDATA[Aleks]]></dc:creator><pubDate>Thu, 01 Jun 2023 17:00:11 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!TAJQ!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F775fce24-d00f-45db-a226-ca0ccf91ef1b_1124x1125.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Disclaimer: my opinions are informed by my time at Stripe and AWS, but my thoughts are my own and not necessarily shared by my current or former employers.</em></p><p>Consuming a large amount of information about how systems fail is one of the greatest privileges of working at Stripe (and formerly in Amazon Web Services). &#8220;The greatest teacher, failure is,&#8221; as Yoda says. Both my current and former employers hold an incredibly high bar for how they operate, and as a result, there is a great amount of information available on even the smallest failures; little failures that happen every day but generally go unnoticed by our users.&nbsp;</p><p>This year, I set a goal for myself to sit in on as many of these incident discussions as possible to try to extract some common threads to share back to the organization. As I head out on the second part of my parental leave, I wanted to pause to share some of the observations I&#8217;ve made so far that are more generalizable and appropriate to share externally.&nbsp;</p><p>The need for testing was a common theme, so I will devote a first post to just testing. Part 2 will cover the remaining learnings.&nbsp;</p><p>Testing a change to make sure it will work as expected is evergreen advice, and the impact that lack of testing has should not be surprising. During this most recent period in my organization, a low-double-digit percentage of incidents could be attributed in some part to a lack of testing. 
I will highlight some of the specific types of issues I have observed.&nbsp;</p><h1>Testing in the code base</h1><p>The simplest form of testing failure pertains to the lack of tests in the first place. A service may have no tests at all, tests may only cover a very limited number of code paths, or provide minimal assertions around the correctness of code.&nbsp;</p><h2>What can be done?</h2><p>&#8220;Write more tests&#8221; is an intuitive response to address these issues. But it might also be the laziest possible remediation item in these cases because it ignores why these tests are lacking in the first place. 
If testing is not being properly performed by engineering teams, there is likely an underlying cause, such as the friction involved in adding use cases, or the perceived or actual utility of these tests.</p><p>For example, the following may add friction to proper testing:</p><ul><li><p>There are no frameworks for running tests at the appropriate depth (if a pattern is established for unit tests, it may not be easily extendable to functional/integration<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> tests).</p></li><li><p>There are so few existing tests that engineers feel it is an unacceptable cost to create tests as a part of their planned work. </p></li><li><p>The testing framework and overall testing strategy are not well documented. </p></li></ul><p>There may also be deficiencies in the utility of tests in the code base. This may be because:</p><ul><li><p>Tests exist but assertions are superficial. They may validate that the code executes, but critical assertions may not exist to validate the behavior of the code.&nbsp;</p></li><li><p>Useful mocks for dependencies do not exist, making it difficult to have confidence in the correct behavior of code which interfaces with dependencies.&nbsp;</p></li><li><p>Failing tests are not used to gate merges or deploys.</p></li><li><p>Test failures do not correlate well with actual issues in the code (see the section on Testing Observability below).</p></li></ul><p>Fixing testing deficiencies is not a trivial matter. For testing infrastructure to be useful (and used) it must be (and recognized to be) a tool that makes a task simpler.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> It requires investment the same way we invest in deploy and orchestration tooling (and quite possibly much more investment). 
This is all the more difficult in environments, such as Stripe, that require synthetic or &#8220;mock&#8221; third-party components to test against, and even more so if the ownership is not clear around who should create those mocks.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>A few approaches I have seen be useful in improving test coverage over time:</p><ul><li><p>Standardizing the testing frameworks and patterns used within a code base to reduce the cognitive load of writing new tests.&nbsp;</p></li><li><p>Requiring minimal test coverage for a change before merging code.</p></li><li><p>Building tools which allow engineers to visually inspect test coverage for the code they&#8217;re working with. (This makes it more obvious where coverage is missing.)</p></li></ul><p>I would avoid anything which makes individual engineers feel responsible for the overall code coverage for a service. This is a valuable metric for leadership to track, but is difficult for engineers to address while delivering other work. It is much more effective to show opportunities for improvement in the context of existing work.&nbsp;</p><h1>Testing at scale</h1><p>Beyond basic functional testing, testing at scale is an additional challenge. It is particularly problematic for infrastructure teams and services that maintain stateful components.&nbsp;</p><p>Periodic load testing where a call path is pushed to its limits is a critical piece of engineering for reliability. However, this can be very challenging to do exhaustively. Call patterns in production may not be easy to replicate.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> Load can also come from surprising places, especially for infrastructure teams. 
For example, infrastructure teams might face a thundering herd against their orchestration systems when another team attempts to scale up to mitigate its own issues.</p><h2>What can be done?</h2><p>Even though load testing is not a panacea, it still plays a valuable role in the overall operational readiness of a service. It is all the more effective when paired with a diligent analysis of real-life use cases and pathologically bad, worst-case scenarios. That is to say, testing is not as simple as pointing a load test against an API and writing down the Requests Per Second (RPS). A thorough load test should take into account:</p><ul><li><p>Current, actual customer behavior, so that the load test will accurately replicate the type of load currently placed on the system.&nbsp;</p></li><li><p>Worst-case use cases which are not commonly seen, but would stress the system. It is important to understand where rate limits and load shedding may be needed to protect the service.&nbsp;</p></li><li><p>Future business needs, in case the current customer behavior is subject to change.&nbsp;</p></li><li><p>Whether real dependencies of the service will be used in the test (and if not, an honest description of the limited usefulness of the result).</p></li><li><p>How components behave differently under synthetic traffic versus real-life usage patterns. 
For example, caching (both explicitly used by the service or used implicitly by dependencies such as databases) may behave very differently depending on how synthetic traffic is constructed.&nbsp;</p></li></ul><p>A thorough load test should report on:</p><ul><li><p>The maximum throughput a service can handle while maintaining latency and availability targets.&nbsp;</p></li><li><p>The component or system which limits performance, so that it is clear where future investments would need to be directed in order to improve performance.&nbsp;</p></li></ul><p>Lastly, teams should take action on the outputs from their load tests:</p><ul><li><p>Teams should create alerts to notify them when the current load approaches the theoretical maximums. The greater the uncertainty about the theoretical maximum, or the greater the difficulty in raising the maximum, the lower the alerting threshold should be. Reviewing these metrics on a weekly basis can also provide further advance notice of potential scaling issues.&nbsp;</p></li><li><p>If appropriate, service owners should set rate limiters to protect services based on known maximum supportable throughput.&nbsp;</p></li></ul><p>Load tests should be performed on a regular cadence throughout the year (especially in anticipation of periods of high load), as well as after architectural changes. </p><h1>Test in the right environments</h1><p><em>Where</em> to test can pose an interesting challenge. Commonly, a shared QA/Beta stage is established in which everyone is supposed to deploy code and perform validation of changes before deploying to production. The result is an environment that is frequently broken because it contains changes being tested that do not work as such. This can make it difficult for an engineer to be confident about any single change. 
Breakages can become so frequent that engineering teams become accustomed to ignoring signals from errors in this environment.</p><p>This can be especially problematic for infrastructure teams. Engineers rely on stable deployment tooling to be productive with their own tests in QA, and they may not appreciate being beta testers for new versions of tools that impact their productivity.</p><h2>What can be done?</h2><p>I have seen two solutions in practice.&nbsp;</p><p>The first is to have a <em>series</em> of testing environments, where the quality ratchets up as you go along. While the first environment in the pipeline may be a free-for-all, the final environment before production is expected to be more stable and provide a high signal-to-noise ratio. In AWS I saw as many as four stages before production.&nbsp;</p><p>The other approach I have seen (sometimes used in conjunction with the first) is to have separate testing environments per team or organization. An infrastructure team may have a QA environment, while other teams&#8217; QA environments would use the production infrastructure and dependencies. 
This isolates teams from each other&#8217;s testing.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><p>In cases where it is not appropriate to use a production dependency for a test environment, consider a post-production environment (deployed to <em>after</em> production in the deployment pipelines) which exists for teams to test their candidate code against.&nbsp;</p><p>Another option (which I have not seen completely implemented but have always wanted to) is to provide engineers a way to spin up bespoke testing environments which are effectively a clone of the production state of the world with just the component to be tested replaced with a candidate.</p><h1>Testing observability</h1><p>Lastly, for testing to be valuable, there must be confidence in the signal provided by a testing environment. Unit and integration tests must not be so flaky that engineers become used to ignoring failures. If a pre-production or staging environment is used, alerts must be in place to notice that a potentially breaking change has been introduced before it goes into production. And those alerts must again have a high signal-to-noise ratio such that the warnings they provide are not ignored.&nbsp;</p><h2>What can be done?</h2><p>I will say more in Part 2 about avoiding the normalization of deviance. For now, suffice it to say that any system which is used to gate code being deployed to production must be treated as production in terms of quality. This means having similar alerting as production, and an urgency to fix issues that is second only to an actual production issue.&nbsp;</p><p>Engineers must also be given adequate alternatives for testing such that the final validation stage is not the first time their code is being exercised. 
Otherwise, the constant need for urgent fixes to address recurring breakages will not be sustainable.&nbsp;</p><h1>Conclusion</h1><p>This concludes my thoughts and observations on testing. While my recent observation of Stripe infrastructure incidents provided the inspiration for writing this essay, the experience I draw from is much greater and indeed encompasses my software development career to date.&nbsp;</p><p>Please join me for Part 2, where I will cover other common contributors to incidents, as well as some other recommended best practices.&nbsp;</p><p><em>Thank you to <a href="https://www.linkedin.com/in/yanjieniu/">Yanjie Niu</a> for editing this post and providing thoughtful feedback. </em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.bitsandbeing.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.bitsandbeing.com/subscribe?"><span>Subscribe now</span></a></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>I am deliberately avoiding spending too much time on testing nomenclature because it appears to vary quite a bit between companies. For example, what were called Integration tests at Amazon are called Functional tests (at least in the teams I have worked on) at Stripe and Integration tests are a different thing.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>In Disney&#8217;s 1998 Mulan film, the army conscripts are asked to climb a pole to retrieve an arrow from the top. However, they are given two weights, representing discipline and strength. 
Many attempt to climb to the top while carrying the weights, but it is Mulan who realizes that these are not burdens but essential tools for completing the task. (Strange, given how on-the-nose the metaphor is.) In software engineering, our weights are testing and observability. You can either reluctantly drag them along, or use them as a powerful tool.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>In my previous role at AWS, working on the Key Management Service, testing was a relatively simple prospect. No matter how complicated the infrastructure became, the correct behavior of the system was simple to define: cryptographic functions have a deterministic result, and it is relatively trivial to verify the behavior of the service. (I am simplifying a lot here. There are a lot of aspects that were not <em>quite</em> as simple as verifying our AES GCM API did its AES GCM things. Authentication, for example, adds some complexity. However, for AWS, all of this is at least handled in-house and we have the information to write useful assertions.)</p><p>The payments space, by comparison, is much more complicated. A payment processor provides value by causing money to move in the real world, and that money movement is done by reaching out to a diverse population of financial third-parties. 
Not only is this done through APIs for which we don&#8217;t have iron-clad specifications and for which there don&#8217;t exist mocks or simulators, there is often a non-trivial amount of third-party-specific infrastructure in the mix, which is not always possible to create a testing version of.&nbsp;</p><p>To summarize a bit: it is much more difficult to test changes where your code is supposed to have external side-effects, especially when those side-effects are related to money.&nbsp;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>And in the event of something actively malicious, such as a denial-of-service attack, the call pattern may be specifically crafted to be pathologically harmful to your service. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>For example, in AWS, teams building services which depend on S3 rarely use a non-production S3 endpoint to test their changes.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Story Time]]></title><description><![CDATA[The time I became the IT help desk for Amazon's Beijing office]]></description><link>https://www.bitsandbeing.com/p/story-time</link><guid isPermaLink="false">https://www.bitsandbeing.com/p/story-time</guid><dc:creator><![CDATA[Aleks]]></dc:creator><pubDate>Thu, 25 May 2023 19:34:28 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!NFo2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96294d50-9daf-4ea6-83ed-99a101517b1c_500x500.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>I previously published this story in my personal newsletter, but this 
is a more proper home for it, so I thought I would reshare this post while something else I&#8217;m working on is in editing. Enjoy.</em></p><div><hr></div><p>For those of you who have been in the industry for a while, this probably won&#8217;t be very interesting. However, if you&#8217;ve joined since things like Slack and Zoom became commonplace, it might be a little more novel.</p><p>I joined Amazon in 2010 after graduating. Back then we didn&#8217;t have Zoom or Slack. We didn&#8217;t have Microsoft Teams. We didn&#8217;t even have Skype for Business. We had Microsoft Office Communicator. It looked a bit like the old MSN Messenger, and no one really cared for it.</p><p>But this isn&#8217;t a story about that. This is a story about:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NFo2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96294d50-9daf-4ea6-83ed-99a101517b1c_500x500.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NFo2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96294d50-9daf-4ea6-83ed-99a101517b1c_500x500.jpeg 424w, https://substackcdn.com/image/fetch/$s_!NFo2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96294d50-9daf-4ea6-83ed-99a101517b1c_500x500.jpeg 848w, https://substackcdn.com/image/fetch/$s_!NFo2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96294d50-9daf-4ea6-83ed-99a101517b1c_500x500.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!NFo2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96294d50-9daf-4ea6-83ed-99a101517b1c_500x500.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NFo2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96294d50-9daf-4ea6-83ed-99a101517b1c_500x500.jpeg" width="500" height="500" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/96294d50-9daf-4ea6-83ed-99a101517b1c_500x500.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:500,&quot;width&quot;:500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Polycom Phone&quot;,&quot;title&quot;:&quot;Polycom Phone&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Polycom Phone" title="Polycom Phone" srcset="https://substackcdn.com/image/fetch/$s_!NFo2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96294d50-9daf-4ea6-83ed-99a101517b1c_500x500.jpeg 424w, https://substackcdn.com/image/fetch/$s_!NFo2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96294d50-9daf-4ea6-83ed-99a101517b1c_500x500.jpeg 848w, https://substackcdn.com/image/fetch/$s_!NFo2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96294d50-9daf-4ea6-83ed-99a101517b1c_500x500.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!NFo2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96294d50-9daf-4ea6-83ed-99a101517b1c_500x500.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Back when I joined Amazon, everyone was issued one of these.</p><p>Since we didn&#8217;t have Zoom (or whatever), when you wanted to have a synchronous conversation with an engineer in a different location, it wasn&#8217;t uncommon to do this:</p><ol><li><p>Find their name on an internal website called the phone tool</p></li><li><p>Find a 5 
digit extension on their page</p></li><li><p>Push physical buttons one at a time</p></li><li><p>Pick up a physical handset and proceed to hold it against your ear like a cave person for the next 30 mins or whatever.</p></li></ol><p>Our phone system also supported voicemail, so sometimes you&#8217;d come back to your desk and there would be a flashing red light indicating you had a voicemail.</p><p>To listen to it, you&#8217;d again have to push a series of buttons, enter your PIN, and LISTEN TO THE MESSAGES ONE AT A TIME just like in medieval times.</p><p><strong>But all of that is just background</strong></p><p>This isn&#8217;t a story about phones. This is a story about my phone. Or specifically, my extension.</p><p>Every employee was allocated an extension out of what was then a 4-digit namespace, I believe randomly.</p><p>My extension was: 0...0...0...1</p><p>That&#8217;s right. Somehow they gave me, a lowly SDE 1, extension number 1.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> Again, I believe this was completely random.</p><p>It is also important to know that all Amazon extensions mapped to a full 10-digit number, which was done by prefixing the area code (206) plus the first 3 digits (let&#8217;s say 266 in this case, because that&#8217;s what it was). That made my phone number 206-266-0001.</p><p>Surprisingly, this did not lead to me getting lots of spam calls from spammers crawling the Amazon corporate phone namespace.</p><p>It did, however, lead to some confusion with Amazon&#8217;s main corporate phone number: 206-266-1000.</p><p>Me: &#8220;Hello?&#8221;</p><p>Them: &#8220;Uh&#8230;hello. Is this&#8230;Amazon?&#8221;</p><p>Me: &#8220;Yes, I am Amazon. Speaking?&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>I did have people call me trying to return things. Or sue me. 
However, all in all, it was surprisingly uncommon.</p><p><strong>But wait, there&#8217;s more&#8230;</strong></p><p>It wasn&#8217;t just the public that would be confused. My extension also led to some internal confusion.</p><p>On one occasion, my extension was hard-coded as the test phone number for some automated test suite, because its authors assumed extension 1 would never actually be allocated to anyone.</p><p>More interestingly, for a few weeks, every time I came to the office my phone had several voicemails in Mandarin, from other Amazon extensions. This one took a while to sort out because I do not speak Mandarin. It turns out some internal wires were crossed and for a short period I had become the IT help desk for one of the offices in Beijing.</p><p><strong>Conclusion</strong></p><p>One of the hard parts of leaving Amazon to join Stripe was knowing that I was giving up this extension. I actually had the same desk phone for my almost 10 years at the company, and continued to use it periodically. (Most often, to call candidates during phone interviews.)</p><p>Because it was a VoIP phone, the phone (and my extension) followed me to wherever there was a PoE Ethernet jack available.</p><p>Unfortunately, I was not allowed to take it with me to Stripe.</p><p>I&#8217;ll miss you, little Polycom.</p>
<div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Amazonians will be quick to point out that this is technically incorrect, because all extensions started with 6. Technically, mine was 60001, but only the last 4 digits were variable.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>No, not really. Usually I would just say they had the wrong number and tell them to dial 206-266-1000.</p></div></div>]]></content:encoded></item><item><title><![CDATA[A Tale of Three Grinches]]></title><description><![CDATA[An essay that no one asked for.]]></description><link>https://www.bitsandbeing.com/p/a-tale-of-three-grinches</link><guid isPermaLink="false">https://www.bitsandbeing.com/p/a-tale-of-three-grinches</guid><dc:creator><![CDATA[Aleks]]></dc:creator><pubDate>Thu, 18 May 2023 06:24:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!TAJQ!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F775fce24-d00f-45db-a226-ca0ccf91ef1b_1124x1125.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Hello everyone! 
I promise this isn&#8217;t a preview of the typical content I plan to share on this blog, but I needed something to prime the pump, so please enjoy while more tech content makes it through the pipeline. </em></p><p>Dr. Seuss's book, <em>How the Grinch Stole Christmas!</em>, is a cornerstone of both children's literature and Christmas lore. The book was originally published in 1957, but has since then spawned three different movies, each with their own interpretation of the original story. In this essay, I will walk you through a comparative analysis of the three different interpretations. I hope in the end I will convince you that the 2018 film, <em>The Grinch</em> by Illumination, is the superior Grinch adaptation. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q9oF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F243b769a-3415-4f49-ba31-ae4416f0ebc2_264x377.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q9oF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F243b769a-3415-4f49-ba31-ae4416f0ebc2_264x377.png 424w, https://substackcdn.com/image/fetch/$s_!Q9oF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F243b769a-3415-4f49-ba31-ae4416f0ebc2_264x377.png 848w, https://substackcdn.com/image/fetch/$s_!Q9oF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F243b769a-3415-4f49-ba31-ae4416f0ebc2_264x377.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Q9oF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F243b769a-3415-4f49-ba31-ae4416f0ebc2_264x377.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q9oF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F243b769a-3415-4f49-ba31-ae4416f0ebc2_264x377.png" width="264" height="377" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/243b769a-3415-4f49-ba31-ae4416f0ebc2_264x377.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:377,&quot;width&quot;:264,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:146287,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Q9oF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F243b769a-3415-4f49-ba31-ae4416f0ebc2_264x377.png 424w, https://substackcdn.com/image/fetch/$s_!Q9oF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F243b769a-3415-4f49-ba31-ae4416f0ebc2_264x377.png 848w, https://substackcdn.com/image/fetch/$s_!Q9oF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F243b769a-3415-4f49-ba31-ae4416f0ebc2_264x377.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Q9oF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F243b769a-3415-4f49-ba31-ae4416f0ebc2_264x377.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>My daughter Amelia has been a huge Grinch fan since she first read an abbreviated board-book version of the story. I have since then had ample interaction with that adaptation, the original picture-book version, and several different movie interpretations. 
I have also since then formed some strong opinions on the merits of each adaptation, specifically how each tackles the Grinch's backstory and the resolution of the conflict in the main plot. Excluded from this analysis is any consideration of how well the movies are executed <em>as movies</em>. That's a job best left to the critics, and reasonable persons may disagree on this point. </p><p>At a high level, all three movies follow the same basic structure. We learn of a people called the Whos, who live in Whoville. Christmas is culturally a very important holiday for them and great effort is expended in celebrating it. We are also introduced to a green, hairy creature called the Grinch who lives on Mount Crumpet, just north of Whoville. The Grinch does not like Christmas and despises the festivities. He resolves to stop Christmas from happening by donning a Santa Claus outfit and stealing all the gifts and decorations while the Whos sleep. On Christmas morning, as he is about to destroy his loot by pushing it off his mountain, he becomes aware of the Whos singing (and celebrating) nonetheless. Christmas comes all the same. This causes the Grinch to have a change of heart, understand the true meaning of Christmas, and return to Whoville to join in the festivities. </p><p>Diving in, the first movie was a 1966 direct-to-television adaptation of the book by CBS. This adaptation is very true to the original source material, both in terms of the animation style and plot (with only some minor, inconsequential deviations and room made for some musical numbers). For that reason, it also serves as a good baseline for my analysis. 
[1] </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RXh_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6754021d-88a0-491d-9000-cb02664de961_370x270.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RXh_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6754021d-88a0-491d-9000-cb02664de961_370x270.png 424w, https://substackcdn.com/image/fetch/$s_!RXh_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6754021d-88a0-491d-9000-cb02664de961_370x270.png 848w, https://substackcdn.com/image/fetch/$s_!RXh_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6754021d-88a0-491d-9000-cb02664de961_370x270.png 1272w, https://substackcdn.com/image/fetch/$s_!RXh_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6754021d-88a0-491d-9000-cb02664de961_370x270.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RXh_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6754021d-88a0-491d-9000-cb02664de961_370x270.png" width="370" height="270" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6754021d-88a0-491d-9000-cb02664de961_370x270.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:270,&quot;width&quot;:370,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:210212,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RXh_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6754021d-88a0-491d-9000-cb02664de961_370x270.png 424w, https://substackcdn.com/image/fetch/$s_!RXh_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6754021d-88a0-491d-9000-cb02664de961_370x270.png 848w, https://substackcdn.com/image/fetch/$s_!RXh_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6754021d-88a0-491d-9000-cb02664de961_370x270.png 1272w, https://substackcdn.com/image/fetch/$s_!RXh_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6754021d-88a0-491d-9000-cb02664de961_370x270.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Like the book, the 1966 movie provides scant details on the origins of the Grinch. All that we know is he lives on Mount Crumpet, he hates the Whos, and most of all he hates Christmas. The narrator admits the reasons for this are unknown, and suggests only that the most likely reason might be that "his heart is two sizes too small." The movie makes no effort to explain further. The two movie adaptations that followed were clearly not satisfied to leave it at that. </p><p>The 2000 film <em>How the Grinch Stole Christmas</em> by Universal Pictures does lead with the fact that the Grinch's heart is incorrectly sized, but the movie establishes a much more complex backstory for the character. The Grinch arrived in Whoville as an orphan. He was adopted and lovingly cared for by two women. However, in school he is repeatedly bullied for his different appearance. While he is attempting to use Christmas to win the affections of his young love interest, further bullying by his classmates pushes him over the edge. 
He declares his hatred for Christmas and runs away from town to live as a recluse on Mount Crumpet, where he has been presumably living up to the present day. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CVG1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de117f4-f479-47e5-a8f9-8eb7ada900c0_220x326.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CVG1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de117f4-f479-47e5-a8f9-8eb7ada900c0_220x326.png 424w, https://substackcdn.com/image/fetch/$s_!CVG1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de117f4-f479-47e5-a8f9-8eb7ada900c0_220x326.png 848w, https://substackcdn.com/image/fetch/$s_!CVG1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de117f4-f479-47e5-a8f9-8eb7ada900c0_220x326.png 1272w, https://substackcdn.com/image/fetch/$s_!CVG1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de117f4-f479-47e5-a8f9-8eb7ada900c0_220x326.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CVG1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de117f4-f479-47e5-a8f9-8eb7ada900c0_220x326.png" width="220" height="326" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7de117f4-f479-47e5-a8f9-8eb7ada900c0_220x326.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:326,&quot;width&quot;:220,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:181078,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CVG1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de117f4-f479-47e5-a8f9-8eb7ada900c0_220x326.png 424w, https://substackcdn.com/image/fetch/$s_!CVG1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de117f4-f479-47e5-a8f9-8eb7ada900c0_220x326.png 848w, https://substackcdn.com/image/fetch/$s_!CVG1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de117f4-f479-47e5-a8f9-8eb7ada900c0_220x326.png 1272w, https://substackcdn.com/image/fetch/$s_!CVG1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de117f4-f479-47e5-a8f9-8eb7ada900c0_220x326.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Arguably the 2000 version provides the most consistent backstory for the Grinch (with the caveat that no adaptation offers an explanation for where the Grinch, adult or baby, really came from). The backstory explains his disdain for the Whos, his loathing of the Christmas season, and his feelings toward presents. </p><p>The 2018 adaptation, <em>The Grinch</em>, by Illumination takes a similar approach to the 2000 version, with a slightly lighter touch. References are made to the Grinch living in an orphanage, feeling alone, and being left out of the Christmas activities. (He actually appears to be the only child in a very large orphanage, which is both confusing and sad.) No further details are provided on when or how he made it to his mountain retreat; and as for why, we're left to assume he never felt welcome in Who society. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nob-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6560924a-8863-4b3b-a191-7c1cd202ec2e_220x348.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nob-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6560924a-8863-4b3b-a191-7c1cd202ec2e_220x348.png 424w, https://substackcdn.com/image/fetch/$s_!nob-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6560924a-8863-4b3b-a191-7c1cd202ec2e_220x348.png 848w, https://substackcdn.com/image/fetch/$s_!nob-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6560924a-8863-4b3b-a191-7c1cd202ec2e_220x348.png 1272w, https://substackcdn.com/image/fetch/$s_!nob-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6560924a-8863-4b3b-a191-7c1cd202ec2e_220x348.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nob-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6560924a-8863-4b3b-a191-7c1cd202ec2e_220x348.png" width="220" height="348" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6560924a-8863-4b3b-a191-7c1cd202ec2e_220x348.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:348,&quot;width&quot;:220,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:174679,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nob-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6560924a-8863-4b3b-a191-7c1cd202ec2e_220x348.png 424w, https://substackcdn.com/image/fetch/$s_!nob-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6560924a-8863-4b3b-a191-7c1cd202ec2e_220x348.png 848w, https://substackcdn.com/image/fetch/$s_!nob-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6560924a-8863-4b3b-a191-7c1cd202ec2e_220x348.png 1272w, https://substackcdn.com/image/fetch/$s_!nob-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6560924a-8863-4b3b-a191-7c1cd202ec2e_220x348.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The movies differ more in their depiction of the present day. The original 1966 version is again scant on details. We never actually see Who society outside of the Grinch's nighttime escapades and his eventual reconciliation with them at the end. The Grinch has some visions of how they celebrate the holiday, but it is not a very holistic view, and it isn't clear if this is reality or how he imagines it. What we do see are brief snapshots of festivities: children opening gifts, eating a feast, and lastly (and least tolerated by the Grinch) singing carols as a group. They sing, and sing, and sing, and the more the Grinch thinks about it, the more he resolves to stop the whole thing. Finally, it isn't clear how the Whos see the Grinch, or if they are even aware of him, until the end. </p><p>The 2000 film elaborates significantly on Whoville in the present day. The Who culture (at least as it pertains to Christmas) is deeply materialistic. 
Anticipation of Christmas revolves around shopping and decorating, with a strong flavor of competition between Who families. The Whos are anxious about purchasing the right gifts and about how their outward appearances are perceived by other Whos. </p><p>In the 2000 film, we also see the relationship between the Grinch and the present-day Whos. He is generally feared and reviled by the Whos. Teenagers dare each other to visit his cave (where he is more than happy to play up his own infamy). He visits Whoville in disguise, and when he is recognized by a young girl, she reacts in fear. He does have a mailbox at the Whoville post office, but Whos remark with disdain that there are never any Christmas cards coming and going. When little Cindy Lou Who suggests that the Grinch be made the Holiday Cheermeister (because he most of all is in need of cheer), she knows it is a drastic act, and it is not without controversy. (The Grinch eventually agrees to fulfill this role and is briefly won over by the celebrations, but he is ultimately reminded of past grievances, which is what motivates him to "steal" Christmas in this film.)</p><p>The 2018 film has a very different take, one that is much more optimistic. The 2018 Whoville is still infatuated with the Christmas season and all the decorations and gifts it brings. It does, perhaps, at times even suffer from glimmers of the same tendency toward excess (e.g. the tree must be ten times bigger than last year). However, on the whole, the movie depicts a town which is much more balanced. The Whos are kind and cooperative. There is no competition. There is no one attempting to outdo their neighbors.</p><p>As in the 2000 film, in the 2018 film the Grinch is shown interacting with the Whos. However, in the more recent film, the Whos do not appear to be afraid of the Grinch. 
He is greeted warmly and invited to participate in the festivities; it is only by his own continual choice that he is excluded. The 2018 film goes so far as to depict a Who, living on the outskirts of town, who considers the Grinch to be his best friend (although the feeling does not appear to be mutual). </p><p>To summarize, in the 2000 film (regardless of who is to blame) there is <em>mutual</em> animosity between the Grinch and the citizens of Whoville, while in the 2018 film, the animosity is only in the Grinch's head. </p><p>Up to this point, both movies still present a coherent view (though one more negative and one more positive). The difference in how the Grinch and Whoville interact in the two later films, however, is crucial when evaluating how the story resolves in each. </p><p>This brings us to the main plot. For reasons which vary slightly between the films, the Grinch resolves to "steal" Christmas. In every movie, he dresses as Santa, sneaks into town at night, and takes all the gifts and decorations. </p><p>The 1966 and 2000 versions have the Grinch about to destroy his loot when he hears the Whos singing. He realizes that Christmas came without "ribbons, tags, packages, boxes, or bags" and that therefore Christmas must mean something a little bit more. Upon this revelation, his heart grows three times in size, and he returns to Whoville with everything he stole. He is forgiven and joins in the festivities. </p><p>My main objection to the two early versions (and the gripe that motivated me to write this essay) is that his revelation is a complete non sequitur. At no point does either of the two early versions hint that the Grinch was upset or bitter because he thought Christmas was a purely materialistic holiday. 
Even in the 2000 film, where the Grinch is vocally critical of the materialistic nature of Who society, that materialism was hardly his reason for becoming a recluse in the first place (which has more to do with how he was treated by his peers), so it isn't clear why this is the revelation that softens his heart. </p><p>The 2018 version of the film, however, provides a much more satisfying version of the ending, and it is preceded by an element that is unique to this film.</p><p>In the two earlier films, as in the book, the Grinch's efforts to steal Christmas are almost foiled when he is interrupted by a young Who, little Cindy Lou Who (who, in the book and first movie, is no more than two but is a bit older in the later two films). She asks why Santa is stealing the tree and he explains he is taking it back to his workshop to repair it. </p><p>In the more recent film, the pair have a more in-depth interaction, where Cindy asks the Grinch (whom she again believes to be Santa) if he can offer some reprieve to her overworked mother. Cindy also explains how happy and at peace she feels when the Whos all sing together on Christmas morning. </p><blockquote><p>"Everyone should be happy, right?...I wish you could celebrate with us tomorrow. We all get together and sing. It's so beautiful that if you close your eyes and listen, all of your sadness just goes away."</p></blockquote><p>This has no immediate impact on the Grinch's plot, but plants a seed for later. </p><p>In the 2018 film, the Grinch isn't swayed because he realizes stealing the gifts had no effect. His heart grows because he finally lets go of the animosity that lived inside him. </p><p>As in the previous films, he climbs to the top of Mount Crumpet to dispose of the gifts and hears the Whos singing. He looks through a spyglass and sees little Cindy singing with the rest of the Whos. However, it isn't some realization as to the meaning of Christmas that sways him. 
Rather, it is his decision in that moment to take a leap of faith in the little Who's advice:</p><blockquote><p>As he watched the small girl, he thought he might melt.</p><p>If he did what she did, would he feel what she felt? </p><p>The luscious sound swelled, reaching up to the skies.</p><p>And the Grinch heard with his heart, and it tripled in size. [2]</p></blockquote><p>As in the other films, he returns the gifts. But in this film, rather than immediately joining in the holiday celebrations, he meekly returns to his lair. The final resolution comes when Cindy herself explains he has been forgiven and extends the invitation for him to celebrate with them, further confirming he was never really excluded. </p><p>In this version the materialism of the Whos and the attack upon it was tangential. The Grinch saw that, despite his efforts to ruin the material aspects of Christmas, the Whos were still happy. His salvation came not from recognizing the true spirit of Christmas, but from accepting an invitation to a moment of vulnerability. The Grinch experiences that which has been there and available to him all along. </p><p>This is a much better ending. In this latest film by Illumination, we are finally given a Grinch story with both a consistent backstory and an ending that is satisfying both in the traditional sense, and also in how it reconciles with the conflict introduced earlier in the story. </p><p>In conclusion, the 2018 film <em>The Grinch</em> by Illumination is the only retelling of the traditional Grinchian lore that offers a consistent, intact plot, and a satisfactory ending. The original story, and 1966 movie adaptation, while no doubt classics, are riddled with inconsistencies. Both the 2000 and 2018 films get credit for providing credible backstories which at least provide a hint of character motivation. 
But it is only the 2018 film, over 50 years later, which finally gives us the fleshed-out Grinch and Whoville universe that we have deserved all this time. </p><div><hr></div><p>[1] It's also worth saying that if your personal criterion for "best" is "true to the original source," you can stop reading here because there is absolutely no contest on that point and I won't try to convince you otherwise. </p><p>[2] I've avoided remarking on the aesthetics of these films, but this is by far my favorite part of any Grinch film. I also appreciate Illumination's commitment in the latest film to writing new content in the Seussian style. </p>]]></content:encoded></item></channel></rss>