No, seriously. Root Cause is a Fallacy.

I’m just back from attending SREcon ’18 Americas in Santa Clara last week. It’s an incredible conference; I presented a tutorial at the Dublin edition in 2016, but had never spoken at the U.S. one. You can find some blog posts written about specifics (Day 1, Day 2, Day 3), but I wouldn’t be able to do it justice myself, so read those! Kudos to everyone involved in the hard work of making it run so smoothly; I can only imagine everything that went into doing so. The presenters were warm and welcoming, with deep insights to share, and the attendees were full of great questions, adding their own experiences to the topics at hand. I was also able to meet up with some old friends and make a bunch of new ones. Everything you can hope for when attending a conference.

I wanted to expand a bit on a particular point from my talk that seemed to elicit questions from folks afterwards, both in person and on Twitter: Root Cause is a Fallacy. We’ve used root cause as a shortcut for explaining away problems for a long time, typically as part of RCA (Root Cause Analysis). I’m not the first to write about this. I doubt I’m even the hundredth, and I probably won’t be the last. But we’re still lazily falling back to using it, so it’s good to reinforce.

Background on root cause thinking

Let’s start with some understanding behind the appeal of root cause. The thinking is that you want to get to the underlying problem, starting at where it begins, rather than treating the downstream effects. I can appreciate resolving deeper underlying issues rather than “treating the symptoms” when problems large or small crop up. Our systems are complex. It’s very tempting to look at a singular part in an effort to simplify our understanding and achieve resolution. We’re wired to take shortcuts and to be efficient in what we do. This is not inherently bad or even flawed. This kind of investigation is traditionally referred to as Root Cause Analysis (RCA), and it’s often a part of the Five Whys strategy for investigation.

So with all of this appreciation of a simplified resolution, what’s the problem? Isn’t simpler better? Doesn’t breaking things down into smaller parts help us better understand these complex systems?

The investigative work of digging through the influences surrounding an event becomes shallow. When we say root cause, we’re reducing the scope significantly and throwing away data. It’s lossy: we’re “leaving things on the table” that could have been discovered and learned from. One root cause implies one problem with one answer. If you believe you’ve already found the cause, why would you look for more solutions? Yet there are always more reasons a situation failed, and if there are multiple paths to failure (there always are), then you don’t have a singular root cause. Hence, shallow investigations and shallow learning.

That’s when a tendency arises to modify the definition, shoehorning it into a model that fits our use case. We could instead refer to it as root causes, plural. Redefined that way, though, the term invalidates itself: a root that is one of many is no longer the root.

Each of those individual events or actions has, for all practical purposes, countless influencing actions of its own. We’re finite creatures and we only have so much time and energy to devote to investigating all of them. The failure in this line of thinking is similar: we’re ignoring how these individual events or atomic ideas are interrelated. None of them exists in its own microcosm, except in our heads. Our simplification leads us down erroneous paths, making the world fit a model that doesn’t map to our environment.

What’s the root cause of success?

Another great way to look at it is to approach the same line of thinking with success. When building a successful project, there’s never just one thing that goes right for it to succeed. Was it your business team’s planning around the new product that helped increase your customer impact? Was it the designer whose critical thinking allowed you to build a functional UX layout to reduce friction for your downstream consumers? Did the QA team’s thorough inspection to catch edge cases make it more robust for a wider audience? How about your infrastructure team, which built a monitoring system that gives quick insight into potentially hazardous situations so they can be handled without significant impact? Or your delivery team, building a toolset that allows you to make incremental changes and fixes to reduce the scope? Or..?

I could go on, making clear how each of these teams contributes to succeeding, or dig into the specifics of each team member’s contributions, the decisions along the way, the education that gave them a background to do all this work. This is no different from failure modes. Interrelated work from parts within and external to your organization influences one another to some end.

The Five Whys

A common question I received after my talk was, roughly:

“But with the Five Whys, we’ve been able to successfully use RCA! You can’t deny that we’ve made progress because of the Five Whys, so root cause must have value.”

The Five Whys investigative practice itself is problematic. Success using it doesn’t mean its parts are appropriate for your needs. This is analogous to saying I can use my shoe to hammer in a nail. It may work, but is it as effective a tool as you want?

With the Five Whys, you ask a question about why an event happened. When you get that answer, you question what preceded that, to continue investigating deeper. You continue this several times (five is a ballpark of how deep you should go). The theory is that in doing so, you’re going beyond one shallow explanation.

While it’s true that you’re not stopping at one question, you’re also thinking along a very linear path. Each one of these questions is a depth-first search, without also including a breadth-first search in the investigation. It also, again, assumes that these branches of our investigation don’t relate to one another.

From the Wikipedia entry on Five Whys, an example about a vehicle failing to start (the problem):

Why? – The battery is dead. (First why)
Why? – The alternator is not functioning. (Second why)
Why? – The alternator belt has broken. (Third why)
Why? – The alternator belt was well beyond its useful service life and not replaced. (Fourth why)
Why? – The vehicle was not maintained according to the recommended service schedule. (Fifth why, a root cause)

The failure in this investigative approach is that it assumes each step has one and only one influencing factor. The alternator could be broken, but was there a significant use of battery power that prevented the driver from getting to the mechanic to fix it sooner? Is there a backup battery available and, if not, what was the decision process around whether to have one, if there was an active decision made? Was this a used car recently purchased, with the new owner failing to recognize the need for a check-up? Or perhaps the driver was on a long car trip, having determined that they could reach their destination without servicing? Were there environmental differences in how the car was used, leading the belt to receive more wear and tear than it did in testing? The questions go well beyond this simplistic thread.
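To make the depth-first versus breadth-first point concrete, here is a minimal Python sketch. It assumes we can write the vehicle example down as a small graph of contributing factors. The extra branches are hypothetical, paraphrased from the questions above, and the function names (five_whys, all_factors) are invented for illustration. It is not a real investigation tool, and a plain graph like this still understates how the factors interrelate.

```python
from collections import deque

# Hypothetical contributing-factor graph for the vehicle example.
# Each key maps to the factors that influenced it. The side branches are
# illustrative assumptions based on the questions above, not established facts.
FACTORS = {
    "vehicle will not start": ["battery is dead"],
    "battery is dead": [
        "alternator not functioning",
        "heavy battery use",
        "no backup battery",
    ],
    "alternator not functioning": ["alternator belt broken"],
    "alternator belt broken": [
        "belt past service life",
        "harsher environment than in testing",
    ],
    "belt past service life": ["service schedule not followed"],
    "service schedule not followed": [
        "recently purchased used car",
        "long trip, servicing deferred",
    ],
}


def five_whys(graph, start, depth=5):
    """Follow only the first answer at each step, as a linear Five Whys does."""
    path = [start]
    node = start
    for _ in range(depth):
        causes = graph.get(node, [])
        if not causes:
            break
        node = causes[0]
        path.append(node)
    return path


def all_factors(graph, start):
    """Breadth-first walk that records every factor reachable from the event."""
    seen = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for cause in graph.get(node, []):
            if cause not in seen:
                seen.append(cause)
                queue.append(cause)
    return seen


linear = five_whys(FACTORS, "vehicle will not start")
broad = all_factors(FACTORS, "vehicle will not start")
skipped = [factor for factor in broad if factor not in linear]

print("Five Whys path:", " <- ".join(linear))
print("Factors the linear path never visited:", skipped)
```

In this toy graph the linear path ends at the service schedule and never visits the five other factors. Those are the things left on the table when the investigation stops at a single thread.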

I was lucky enough to spend some time at SREcon Americas 2018 with J. Paul Reed, who you may know from his work in human factors as well. Speaking to him about this, he had the following to say:

My fun experiment to do with teams doing Five Whys is to do it in isolation with each team member. 5W tends to diverge after the second or third question.

A lot of interesting issues with that.

— J. Paul Reed (@jpaulreed) March 30, 2018

Adopting new language is hard

Eliminating root cause as a descriptor in your incident reviews isn’t easy. It’s ingrained in us, within our practice and in the world at large. We want our answers to be simple. The concept of root cause is prevalent, and breaking away from it isn’t something that happens overnight. I’ve been fortunate enough to work with smart people who recognize the necessity of moving away from this language and are also able to do so without it feeling like an attack. “Gotcha! You said root cause!” should not be in our parlance either, as we’re not trying to one-up each other. If you find yourself saying it or notice someone else doing so, ask to clarify! If you’re looking for ways to respond when someone mentions root cause, here are some questions I’ve found helpful:

Can you clarify if there were any preceding events?
Why would they believe acting in this way was the best course of action to deliver the desired outcome?
Is there another failure mode that could present here?
What decisions or events prior to this made this work before?
Why stop there – are there places to dig deeper that could shine a light more on this?
Did others step in to help, to advise, or to intercede?

Anything that can show we have further to dig, should we choose to, dispels the notion of root cause.

If you’re looking for a substitute for the phrase “root cause”, one that lets you illustrate an important point without conceding that there’s a place to stop (beyond our finite capacity to dig into everything), I use the following:

Contributing factors
Surrounding events
One of many components
Influencing ideas
Conditions
States

Additional info

As I mentioned, I’m not the only person to have broached this topic. If you’re interested in other opinions surrounding this, I highly suggest the following:

Dr. Richard Cook: “How Complex Systems Fail” (pdf) and Velocity 2013 NY, “Resilience in Complex Adaptive Systems” (Video)
Baron Schwartz: “The Root Cause Fallacy”
The IT Skeptic: “No such thing as Root Cause”