Resilience Engineering and Error Budgets

This post on error budgets should be considered fluid, ideas worked in and out as any good beliefs should. My experiences with error budgets are not universal and should not be assumed as decrying anyone who has had success using them. I strongly welcome thoughtful, critical feedback and assume best intent from anyone who disagrees.

I’m not a fan of error budgets. I’ve never seen them implemented particularly well up close, though I know lots of folks who say it works for them. I’m not ready to declare bankruptcy on the practice, though I’d like to highlight some of my concerns with respect to human factors, safety, and resilience engineering.

My understanding has mostly been built from conference talks about implementation (“Here’s how to figure out your SLO’s/SLI’s/SLA’s”) and casual conversation with friends who are more ardent supporters. It’s harder to reference them with any specificity – that feels more like hearsay, or at least is easier to dismiss as “well, that’s not really how to do error budgets”. Instead, I’ll reference the online Google SRE Book, as that’s available to anyone reading this and feels fairly canonical.

Briefly, I will say that if it is working for you – great! I don’t want to dismiss anyone who feels their org is successful with error budgets. They share a similar space in my head with “best practices” as an expression. I don’t agree with it entirely and would typically advise other avenues, but if you’re particularly in favor of their effectiveness, then keep doing good.

Building an Error Budget

I’ll almost certainly do a disservice by trying to fully define error budgets, so I would suggest reading up on them in the Google SRE book. For the impatient, the process involves:

Developing Service Level Objectives (SLO) by gathering data and talking to your users.
Generating Service Level Indicators (SLI), a measure of core signals for your service.
Forming a Service Level Agreement (SLA), a contract to maintain the SLO’s for your service, with an actionable response if not.

The error budget is then the difference between the SLA and the existing availability – sometimes measured uptime, sometimes number of errors in a given period, or any way that you’ve decided upon as your SLI.

The above is an unordered list as I’ve heard philosophies of developing these in various orders, depending upon your service needs, approaches towards production development, etc. They’re a model, though, and while no model maps 100% to real world experiences, they’ll only benefit from greater due diligence. The core concern I have with error budgets is that we’re losing the value through this filtering process.

I get SLO’s. You should know what your service does and there’s value in that. SLI’s get a little fuzzier, as you’re trying to map discrete numbers (failed requests, site response time, etc.) onto metrics that quantify these SLO’s, and by the very nature of models there is information lost. This blurs even further as we map once more, from an SLI to an SLA, which is a number we think about for a while and decide “yes, this is how good it should be”, then push back and forth on it as needed, typically with a contract associated if the SLA is not met.

If we’re going to use all of these, we need to understand that this isn’t a science, but rough directions on where to go. Some folks are very honest about this in their development cycle and that’s appreciated. Others make promises on SLA’s, like SaaS companies to clients, and at that point it’s just marketing.
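For what it’s worth, the arithmetic behind the budget itself is simple. Here is a rough sketch, assuming a request-based SLI and a made-up 99.9% target; the function names and numbers are mine for illustration and don’t come from the SRE book.

```python
def allowed_failures(target: float, total_requests: int) -> float:
    """Total budget: how many requests may fail in the window under the target."""
    return (1.0 - target) * total_requests


def budget_remaining(target: float, total_requests: int, failed_requests: int) -> float:
    """Budget left in the window; a negative value means it is overspent."""
    return allowed_failures(target, total_requests) - failed_requests


def allowed_downtime_minutes(target: float, window_days: int) -> float:
    """The same idea expressed as time instead of request counts."""
    return (1.0 - target) * window_days * 24 * 60


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% target, one million requests, 400 failures.
    print(round(budget_remaining(0.999, 1_000_000, 400)))   # 600
    print(round(allowed_downtime_minutes(0.999, 30), 1))    # 43.2
```

Note that the request-based and time-based versions can tell different stories about the same window, depending on which one you picked as your SLI.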

A Control on Velocity

“If SLO violations occur frequently enough to expend the error budget, releases are temporarily halted while additional resources are invested in system testing and development to make the system more resilient”
Google SRE, Chapter 3 – Embracing Risk, “Benefits”

The book continues with the above quote, giving as an example developers wishing to go faster by skimping on testing. So long as their error budget isn’t expended, they are allowed to do so. In nearing or exceeding the predetermined values, they will be incentivized to slow down deploys to make the system “more resilient” (I’m not feeling a “Resilience Engineering” definition of resilient here but assuming best intent).
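Read literally, the policy in that quote boils down to a threshold check. A minimal sketch, assuming a request-count budget; `releases_allowed` and its parameters are hypothetical names of mine, and a real policy would presumably carry more nuance than this.

```python
def releases_allowed(target: float, total_requests: int, failed_requests: int) -> bool:
    """Return True while the window's failures are still under the budget."""
    allowed = (1.0 - target) * total_requests  # budget implied by the target
    return failed_requests < allowed           # expended budget halts releases


if __name__ == "__main__":
    # Hypothetical window: 99.9% target over one million requests.
    print(releases_allowed(0.999, 1_000_000, 400))    # True
    print(releases_allowed(0.999, 1_000_000, 1500))   # False
```

Everything about a given deploy, including who is making it, why, and what else is changing, gets collapsed into that one boolean.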

I strongly disagree with this as a method of adding a control for risk. To summarize, the boundaries of safe engineering don’t exist on a single vector. It’s reasonable to want a simple process to define risk, one that has a foundation of “neutrality” and “fairness”. These are understandable goals and the temptation to find shortcuts is strong. Unfortunately, not everything can be so simple. This thinking is intertwined with linear causal thinking (essentially, root cause), where simplistic solutions are expected to resolve issues. The Google SRE book also cites, in several places, finding root cause as an end goal for investigation.

Many of the colleagues I’ve talked with adopt several SLA’s to get a more comprehensive idea of system health. Measuring your temperature as higher than 98.6 degrees F indicates “something abnormal” but does not immediately determine how serious the issue is. Do you pop some medicine or immediately rush to the ER? You don’t know yet! It could be a small cold or it could be life threatening. It requires more investigation first, as should our socio-technical systems.

The reductionist viewpoint in understanding failure also sees more change as equivalent to an increased chance of failure. If we deploy less often, and spend that time instead on testing/shoring up our infra, then our likelihood of failure will certainly go down. This, much like chasing 9’s, has diminishing returns. Preparation is essential to building safer environments in which we can deploy, but practicing deploying and having the ability to deploy allows us to achieve more desirable environments. In short, we deploy because we want to make things better, so we better be able to do it often and do it well. Similarly, by deploying so often, we’re building the skillsets to recognize anomalous conditions and respond to them. We’re better engineers from practicing what we do.

The Uncertainty of Risk

The key advantage of this framing is that it unlocks explicit, thoughtful risktaking.
Google SRE, Chapter 3 – Embracing Risk, “Managing Risk”

We don’t know with certainty just how risky an action is until it’s taken because the “risk” in a system is all around us – to deploy or not to deploy, or how to reason about making a change. We can give extreme examples of what might often be considered unsafe (“Our servers are public facing and have passwordless ssh!”) but failure has a strange way of routing around our “best practices” as well. The gray area between what we deem (after the fact) “safe” and “unsafe” has no well defined boundary. Instead, we construct a timeline of events afterwards and make judgment calls about our actions with the benefit of hindsight. With experience, practitioners can gain insights into development that mirror prognostication – e.g. “I felt like something was wrong, so I acted”, but the stories behind many near misses, or incidents that could have fared worse, begin with the belief that things were “normal”.

How does this relate to error budgets? As mentioned earlier, error budgets give product engineers the ability to deploy in what may be reasoned as a higher threshold for risk in exchange for velocity. This isn’t uncommon. As an example, think about a deploy that absolutely must be in by the weekend, so it’s deployed on a Friday just before folks end their work week. Most engineers’ eyes might twitch at that. We justify that risk by saying the cost of not deploying is too high and, should anything go wrong, we’re willing to pay the potential cost of staying later or responding to a page. Perhaps there’s some extended communication about the change. Error budgets are a way of qualifying that by availability over time instead of individually debating each higher risk deploy as it goes out. This reduction in friction by way of a simple delimiter (i.e., you’re either under or over budget) eliminates the important part: understanding the decision making. Should we justify every deploy to all channels? No, it’s a judgment call, but basing a decision on error budgets also eliminates nuance. It oversimplifies a complex decision.

Forgetting high risk changes for a moment, let’s talk about boring changes. Have you or someone you know broken production from a “safe” deploy before? Normal, every day changes – things we believe to be perfectly safe – are involved in some particularly creative outages. Our expectation is that high risk deployments produce failures, but that’s not always the case. We tend to look for simplistic answers to complex system failures, and the idea of any change potentially wreaking havoc is scary. Often, it’s many small changes, seemingly innocuous on their own, that then in confluence bear unexpected results. When we use error budgets as a means of managing risk, we overlook this fact. Testing may help catch some unforeseen issues, but not all of them. In that same context, we don’t know when we’ll be hit by failures, be it within or outside the existing budget.

The question of velocity vs. reliability makes an assumption that all failures are explicitly made through this trade off. If our availability drops beneath an agreed upon limit, then we slow down to examine our systems and to patch the holes, including shoring up testing. Yes, adding guard rails and investigating the decisions involved are important to a well functioning engineering team. Dark debt, however, suggests that connected parts of a system and their interactions also produce failures without this conscious choice. We’re not choosing to move faster and act “riskier” in a change, but a failure may happen all the same. A decrease in deploy velocity due to an expended error budget doesn’t account for slow, steady changes done with great deliberation. Critical failures can happen not because we’ve chosen to take a risk, but because we don’t even know the risk is there to begin with.

Shared Incentive or Hidden Punishment?

The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow.
Google SRE, Chapter 3 – Embracing Risk, “Forming Your Error Budget”

When did we decide product developers don’t like reliability in their change sets? I’m unclear as to why, when developing the idea behind error budgets, it was agreed that we needed to create a shared incentive between SREs and Product Dev that didn’t previously exist. Devs don’t want their code to cause bugs or break in production. Deploying new features is a primary focus of many product devs, but if a feature doesn’t work then it’s not much of a feature. “Velocity above all else” is a poor characterization of developers and their intent to do good work – and, I’d venture to guess, a sign of a deeper systemic problem. Simultaneously, it places a “gatekeeper” role on the SREs, not to mention glossing over the features and improvements their own teams want to produce quickly. Error budgets are in theory supposed to remove that trope – now it’s up to the current budget if you can make a risky deploy! I’d much rather see holistic ways of communicating the burdens of production failures typically associated with SREs, Sysadmins, and Ops. Sitting in on chaos experiments or post-incident reviews, early product planning, embedded team work or team rotations – these are just a few ways to build those empathetic bonds and break down the siloing of team features vs. team reliability.

We’re looking for fairness and neutrality here, devoid of politics between teams. Seeking local maxima without negatively impacting neighboring teams can be tricky, and even within teams it’s not easy. “We get rewarded when we deliver great new products” and “we need to make production run smoothly” don’t have to be mutually exclusive. Simplifying down to a shared goal would be great, but I don’t believe the two are in conflict to begin with. Even in best intent scenarios, I can’t help but imagine folks wanting to either game the system or fudge numbers until they fit the story they want to tell. “Ok, we blew the error budget, but we really need to get these new features out for the launch date we announced!” Our code, our algorithms, and our systems aren’t empty of politics. To say so is ignoring the socio-technical implications of our work.

I hate using numbers to generate team cooperation. I don’t believe a DevOps vs. SRE conflict exists in reality, and artificially creating a numeric goal, anomalous and negotiated as it may be, as a joint incentive between teams isn’t healthy. Empathy is critical to teams coordinating, and while having a shared goal is good, I don’t see the use of error budgets as a means of engendering that. SLI’s can be the start of a conversation for that, but if you need to have a dividing line on when to care about another team’s problem statement, I think there’s a deeper problem to be solved.

So where do we go from here?

I understand the desire for error budgets. We’re looking for understanding as to when we can best take stock of our systems to know when they need shoring up. If we’re looking for direction, I think we can find that in SLO’s to define what we ideally want out of our services and SLI’s to give us rough estimates of how well we’re doing. When we’re assigning contracts to SLA’s and determining risk through error budgets, we’re losing a lot of the critical parts of solid resilience engineering. My hope is that as SRE practices continue to develop, we can find some common ground to avoid these pitfalls.