We often talk about blame-aware culture. Our teams are continuously working towards, among many goals, a safe and reliable system. Incidents happen when we’re surprised. Since we’re working towards safety, and these incidents are by definition surprises, shaming folks for failure is counterproductive; instead we should celebrate the opportunity to learn more.
It’s relatively new in software engineering compared to other fields (air traffic control, medical science, etc.), but it’s become familiar territory over the last handful of years. “Blameless post mortems!*” we shout in conference talks and blog posts, our incident analysis meetings beginning with that mantra. It’s a big step, one that took a lot of effort to reach, and yet we have miles to go before we sleep.
Likewise, we’ve made some decent headway with gamedays and chaos engineering experiments (notably in security – see the recently released Security Chaos Engineering book). We’ve relaxed the flawed assumption of “we can’t fail here” and instead made understanding failure the focus.
Failures, naturally occurring or artificial, are frequently noticeable and notable in their impact. They grab your attention, pull you by the collar, and force you to address them. “Ok, Ok, I’ll debug this thing!” and then we talk about it afterwards because “whew, can you believe everything we learned!?”.
This is good. We can do better still. Near misses are what’s next.
Near misses might range from noticing a typo in a config change just before deploying, to extrapolating from graphs an upward trend in memory allocation without the corresponding relief, to awareness of an unpatched server with a vulnerability begging to be exploited. What’s the difference between these events and a classic incident? Timing: when someone noticed.
I’ll say that again, because it’s critical. Data from events that would likely have produced failure were observed, informed decision making, and thereby prevented failure. John Allspaw is noted (or at least that’s as far as my pointer dereferencing goes) for saying that “Resilience is the story of the outage that didn’t happen”. Near misses, then, are semi-tactile realizations of that. The “semi” is because, yes, literally we can’t touch them, but also because we often let them pass through our fingers.
How did we make this leap to success with an existing system that is currently stable but, without effort, will drift into failure? Experts. Experience from previous efforts and an assessment of the current situation generated insights that saved us from failure. We might drop into our chat an exclamation of how close we were, a tip of the coffee mug and a thanks – but then what?
Just because nothing blew up doesn’t mean there isn’t something to learn. The best of us still fall into the outcome bias of assuming greater learning correlates with greater failure. So why don’t we put more effort into analysis of near misses?
- They’re regularly occurring as part of a functional safety system, becoming “noise” and “invisible”.
- We barely have time to do analysis on the incidents – now we have to do analysis on the non-incidents too!?
- It’s hard to fully appreciate the impact of something that didn’t happen.
- They’re “expected”, justified as non-heroic. If we don’t know the impact of what was avoided, it’s hard to gauge how successful the save was.
- Resources tend to be constrained against this work and are only allocated after failures occur.
- Creating Safety is Dangerous Work
- A pat on the back isn’t as useful as a raise. If we’re not rewarded as part of our work, we’ll expend fewer resources in these efforts.
- It may come off as self-congratulatory.
It’s understandable. The work is invisible, and we’re finite beings, so we have to make trade-offs in where we expend energy. That’s great we saved ourselves from an outage, but it doesn’t push back the deadline on my current project – or I may assume others would have made the same save.
I’d highly encourage taking time to explore this more often, even if only once a quarter, limited as that may be. Some questions that would fit a retro on near misses:
- Who else on the team would make the same connection?
- Would those folks have different approaches towards a solution?
- Would they have used the same tools to gain the same insights?
- What led folks to look at said graph, review the pull request in just such a way, or explore the installed packages? We want others to build that same muscle memory.
- Why were our systems left vulnerable to failure previously? Note: this can sound blameful. It is not. Understanding why people did what they did, and how their efforts made sense at the time, is required.
We’ve made headway in expending energy towards learning from incidents. We’ll be even better off when learning from successes becomes part of our regular work as well.
* Ok, some of us use alternatives like “blame aware” and “retrospectives”.