Imagine you’re heading down a road – biking, driving in your car, taking a walk – and you notice a trash can roll into the street. You quickly realize that any large obstruction rolling around is a potential hazard, for you and others. It’s not your trash can. It’s not your street. In fact, it’s likely that you can safely avoid the can each time you pass and have no direct negative consequences, though there’s still a slight chance of an incident on a long enough timeline.
If you fetch the can instead, there are at least two direct consequences:
- You present a greater hazard for what is likely a short time, as vehicles must now avoid a pedestrian, the trash can, and your nearby vehicle if you have one.
- Once finished, you’ve removed a hazard, making the environment safer.
We could consider an ethical basis for this (“Is it right to leave a hazard for someone else if we take on a riskier course of action?”), but let’s put that aside and, for simplicity, take a purely practical approach. We’re looking to judge whether it’s better to trade a low but non-zero risk spread across an indeterminate amount of time for a temporary but higher risk that eliminates it long term.
Of course, from just the information given, we can’t make a decision; we’re missing too many inputs. We can shrink the obstruction to a small soda can and put road flares around it, in which case there isn’t much to worry about and it’s probably not worth the trouble. In the opposite direction, we can say there’s a school crossing nearby, with heavy snowfall obscuring visibility and icing the roads, which tips the balance toward taking action.
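To make the trade-off concrete, here’s a toy model of the scenario. All of the probabilities are invented for illustration, and real incidents are not independent coin flips, but the shape of the comparison holds: a small per-pass risk compounds over time, while removal is a one-time cost.

```python
# Toy model of the trash-can trade-off. All probabilities are invented
# for illustration; real-world risks are not independent coin flips.

def cumulative_risk(p_per_pass: float, passes: int) -> float:
    """Chance of at least one incident if the hazard stays in place."""
    return 1 - (1 - p_per_pass) ** passes

def should_remove(p_per_pass: float, passes: int, p_removal: float) -> bool:
    """Remove the hazard when the one-time risk of removing it is lower
    than the accumulated risk of leaving it alone."""
    return p_removal < cumulative_risk(p_per_pass, passes)

# A 0.1% chance per pass looks negligible, but over 1,000 passes the
# odds of at least one incident climb past 63%.
print(cumulative_risk(0.001, 1000))      # ~0.632
print(should_remove(0.001, 1000, 0.05))  # True: 5% once beats 63% over time
print(should_remove(0.001, 2, 0.05))     # False: not worth it for 2 passes
```

The point isn’t the specific numbers; it’s that “low risk” and “safe” diverge once you account for exposure over time.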
We first have to decide whether it’s worth the effort, which requires a best guess at how risky it is.
How do we calculate risk?
Risk calculation is so commonplace throughout our lives that we often don’t notice it. You see a glass at the edge of a table and move it away, knowing it could be bumped off. We don’t have to think through every possible scenario, or even the worst cases. In fact, we can’t, as it would send us into an infinitely deep task for the most mundane day-to-day actions. We take shortcuts, assumptions built on past experience, and then act quickly.
In the above example, the first thing we say is, “Well, that’s not enough information to figure out the problem.” We’re constantly perceiving, taking in data whether we realize it or not, to make decisions of varying consequence. When presented with a crisis (a trash can in the road), we may not have the luxury of time, as we are continuously and quickly put into positions where decisions are required of us. Ask anyone paged at 3 AM if they were ready to diagnose an issue.
It’s also important to remember: failure is not binary, as we work in systems that are in constant flux between varying states of success (complex systems run in degraded mode). Our ability to assess and reassess, to take in new data and reformulate our ideas, is what allows us to reroute around these major and minor failures and create success. You might think that, for as often as we do it, we’d be better at assessing risk. It’s brought us this far, after all.
Our judgments are refined through exposure to hazards, and to failure. We take on the inherent risk of attempting to diagnose a situation because systems require course corrections to maintain their steady state. Two things are then required of us: to continually assess risk, and then to take action, even when our assumptions turn out to be wrong and further action is needed.
We may not know the full extent of what our actions will bring, but we are required to act. Our systems cannot stay up without our intervention, which means we must put ourselves in situations of ever-changing degrees of risk to avoid what we assess as a potentially greater threat.
Taking action can mean failure
Tech workspaces are not immune. Ask any engineer how much confidence they have in editing legacy code and you’re likely to get an unenthusiastic response. Refactoring clunky codebases or paying down tech debt is costly, after all. By its nature, technical debt carries a sense of uneasiness: the trade-off of getting things completed (accomplishing a larger or higher-priority goal) in lieu of what you hope a system could be, something more robust and understandable.
It’s also likely a cost not evenly shared. Say your frontend team maintains a framework long since deprecated in the tech industry at large. Your website is running, it’s “fast enough” by most working definitions, and you haven’t seen any major bugs lately. In fact, your team managed to push through and build out a whole new feature on top of it. You point out how much faster it would be on a new system, but the rewrite would take months, with little appreciable difference for anyone but the team maintaining it. There’s also a chance that, during the migration, your existing product could be impacted. Not every organization hands out raises or promotions for rewrites, either.
We are unable to gauge with complete clarity how risky any task is. All practitioner actions are gambles, and therefore uncertainty (read: danger) exists regardless of our choices. We don’t know whether taking action will make things better or worse. It may not even be a case of avoiding failure, but of minimizing it (think shutting down an attack vector by turning off a major service when a vulnerability is discovered). Often, we’re faced with accepting the least bad decision.
Some of our suggestions in the face of failure can seem counterproductive. Uncertain of what failure modes lie within our system? Let’s inject failure and see what happens, through chaos experiments. Fires in our forests are out of control? Let’s set more fires to stop them (and cut off their fuel supply). System resource availability in a critical state? Let’s start yanking services from production to avoid total collapse. To those on the blunt end, these moves may seem catastrophic, but on the sharp end they’re the tools we reach for, because we understand the challenges in this decision making.
What’s the Recourse?
This can feel like an impossible situation. We’re either eternally in danger or approaching danger, and our only option is to steer into one dangerous situation or another. That’s not exactly a sentiment to fill your team with confidence. It feels downright fatalistic.
Fortunately, humans are amazing and resilient components in our systems, and as such we can learn. In the absence of immediate danger, we can play out game days to limit the impact of this dangerous work. In the midst of hazardous conditions, we can confer with others to pool our collective knowledge, broadening our understanding through the observability patterns we develop as caretakers of our systems. And when the inevitable does strike, we can have the presence of mind to examine our misconceptions, in the hope that the continuous effort we put into our systems makes the everyday danger lurking within them just that: ordinary.
A glass-half-empty mindset might say we’re doomed to continually experience failure, that no matter what we choose our systems will collapse. It’s pretty bleak, though as an optimist I’m inclined to see the silver lining. Instead, I’d prefer to say we’re all doomed to learn, should we choose to take up the opportunity.