Big Enough to Fail

I’ve been playing with an idea based mostly on anecdotal evidence: at some point, the external dependencies our systems rely on become so tightly coupled, large, and fundamental that when those foundations inevitably fail, blame can actually go down in response to an incident.

A blanket statement applicable to any system feels egregious, but I’m confident enough to apply this to the sociotechnical systems of software development. This might seem odd – large failure modes are typically frowned upon, wide-reaching and deeply felt, with costs pushed to customers. Why would we give these highly critical services a pass?

The unbelievable happens every day

Take the recent GCP failure: a major provider of internet services has a major outage, one that is in all likelihood recoverable but will take some time. It even coincides with an outage at Cloudflare (the rumor mill says the third-party vendor mentioned was GCP). Who knows how many working hours were devoted to localized investigations, or how much money was lost in sales and invested time, by customers of these services who themselves provide further services downstream to others? Yet when something as widespread as a cloud provider has issues, when someone as big as Alphabet has a disruption, it somehow becomes much more understandable than a mid-tier SaaS having an outage.

Salesforce also had a major outage across many of its services recently. Slack, only one of the many products impacted, has proliferated to the point of being near omnipresent in organizations, so whenever it goes down there’s a large ripple effect across the industry in how we communicate. Even if it isn’t a service within your org, it’s highly likely that you’re reliant upon another company that has at least one Salesforce product as a dependency.

There’s likely some grumbling online and internally, sure, but are we shopping around for replacements? Often no, with exceptions of course – see Disney ditches Slack. I don’t think there’s some magic threshold past which no amount of wrongdoing can dislodge customers. On the whole, though, we’re pretty forgiving of big tech when things go wrong, despite the cost downstream.

Rationalizing choices at scale

So why are we more tolerant of outages of this scale in such core pieces of our infrastructure? I have a few hunches that could use more scientific rigor:

  • It’s so exceptional (or feels that way). This is less about frequency and more that when a company becomes so big, you just assume it’s impervious to failure – shock and awe at the impossible happening.
  • The lack of choices in services informs your response. Are there other providers? Sure, but with the continuous consolidation of businesses, we have fewer options every day.
  • You’re locked into your choices. Are you going to knock on Google’s door and complain, take three years to move out of one virtual data center and into another, while retraining your staff, updating your internal documents, and updating your code? No, you’re likely not.
  • Failover is costly. Similarly, those at the sharp end know that the level of effort in building failover for something like this is frequently impractical. It would cost too much to set up and to maintain as developers, it would divert effort that could be put towards new features, and the financial cost of backing it might be considered infeasible.
  • The brittleness is everywhere. The level of complexity and the highly coupled nature of interconnected services mean we’ve become brittle to failures. Doubly so when those services are the underpinnings of what we build on. “The internet is down today,” as the saying goes, despite the internet having no principal nucleus. This is considered acceptable.
  • We’re all in it together. When a service as large as these goes down, there’s a good chance we’re seeing so many failures in so many places that it becomes reasonable to also be down. Your competitors are likely down, your customers might be – there’s too much failure going around to cast blame in any one direction.

This may also be predicated upon the misconception that if you’re of sufficient size you should never have downtime, the assumption that failure is always a choice. Being such a fundamental building block of other downstream services, you should prepare for every eventuality and therefore never fail. But even Alphabet, Amazon, Meta, Apple, and others of a similar size have finite resources. There isn’t enough money, time, or staffing to make sure a company never has failures. Even well before that point, companies are more likely to underestimate future failures, underprepare, and assume past performance indicates future success. We’re currently seeing the unbelievable – the supposedly farcical idea that anyone can be a victim of failure – and that shocking revelation has a strange calming effect when blame is to be cast.

And maybe saying it’s blame-free is incorrect – there are certainly individuals who will still happily point fingers in such a case, because it’s free and makes them feel better. Maybe they squeeze a better deal out of an upcoming contract because of it. “Well, you chose the wrong service” is very convenient in that it assumes other services have never failed and never will, neglects the costs associated with them, and treats this as something foreseeable. The inconvenience of an outage can be very convenient to localized goals.

How do we create resilience with this in mind?

One of the objections to Resilience Engineering is often “this all sounds fine, but what do I do with this information?”. It’s the challenge with invisible work and glue work: the value of preventing and preparing isn’t immediately visible – unless you go looking for it.

It’s good to think about what parallels we can draw from other systems and environments. If you have a power outage, you compensate by keeping refrigerators closed more often or, should desperation befall you, by starting to make your way through the ice cream in the freezer. A burst pipe or water contamination may force the issue of a boil water advisory. Severe flooding can force responses ranging from trivial measures like tying down or bringing in outdoor furniture all the way to a full-on evacuation. Much like living entirely free of environmental disasters, avoiding these failures may be impractical or impossible. Atypical failures warrant some level of prep to help support the adaptation.

If we’re giving ourselves an opportunity to look past simplistic framings of failure (“Pick a service that never goes down”), that’s a chance to consider options elsewhere in our ongoing work towards reliability. A core part of Resilience Engineering is reframing how we look at incidents – their ability to highlight gaps in the system and produce insights. For example, incident response can be more than fixing an issue. Are alerts meaningful, indicating failure modes that are actionable? Do folks feel confident adding to a status page? When we look for error modes, do we flail about with uncertainty while we gather information, or is there systemic tooling that helps gather it? Think about all the questions you can ask yourself about your system in response to the extraordinary.
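As a concrete (and deliberately small) example of what “systemic tooling that helps gather information” could look like at its most modest, here’s a minimal sketch in Python that pulls the public status of a few upstream dependencies into one place during an incident. The endpoints are hypothetical placeholders, and the JSON shape assumed (a top-level “status” object with a “description” field, as many hosted status pages expose) is an assumption for illustration, not any particular provider’s API.

    # A minimal sketch: gather upstream status in one place during an incident.
    # The URLs below are hypothetical placeholders; the JSON shape assumed here
    # (a top-level "status" object with a "description") follows a common hosted
    # status-page format, but verify it against your actual dependencies.
    import json
    import urllib.request

    DEPENDENCIES = {
        "example-cloud": "https://status.example-cloud.test/api/v2/status.json",
        "example-saas": "https://status.example-saas.test/api/v2/status.json",
    }

    def check_dependency(name: str, url: str, timeout: float = 5.0) -> str:
        """Fetch a status endpoint and return a one-line summary for responders."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                payload = json.load(resp)
            # Fall back to the raw payload if the expected field isn't there.
            description = payload.get("status", {}).get("description", str(payload))
            return f"{name}: {description}"
        except Exception as exc:  # network errors, timeouts, unexpected JSON
            return f"{name}: could not fetch status ({exc})"

    if __name__ == "__main__":
        for name, url in DEPENDENCIES.items():
            print(check_dependency(name, url))

Even something this small nudges an incident away from individual flailing and towards a shared, repeatable way of answering “is it us or them?” – and the gaps it doesn’t cover are themselves useful insights.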

This could be a whole talk, but it’s clear that “well, it’s not our bug” isn’t a sufficient answer, even if we’re not pointing fingers. If we’re giving ourselves the benefit of being blame-free, or however we frame it, we can also give ourselves the chance to put time and energy into parts of our organization that are often overlooked. And maybe give some of those smaller companies a break too. After all, no one’s perfect.

Photo: https://www.geograph.org.uk/photo/4966439
