As with most of my blog posts, this should be considered a living document. The ideas offered here are malleable, just as I would hope the document it references is open to new ideas. Conversations surrounding this are welcomed and encouraged as we all continue to learn.
I recently came across PagerDuty’s documentation surrounding their philosophy on the Incident Commander (here regarded as IC, not to be confused with “Individual Contributor”). I’m glad that as an industry we’re investing more into this work, as we can’t overstate the criticality of the hidden truths in our systems that investigations surface, both during an incident and afterward, when we have the benefit of (more) time to invest without the pressure of immediate resolution. The spirit of improvement is here and I applaud PagerDuty for shedding light on a subject needing more attention. I also think it’s easy to disagree with an article that clearly has a lot of forethought and effort behind it, and I don’t want to diminish their hard work with a snarky blog post. All that said, I have strong opinions that run counter to this document.
Let’s start with their definition of an IC:
Take whatever actions are necessary to protect PagerDuty systems and customers.
The branding here is fine, and I understand why they would include that, as well as an attempt at a succinct definition to build from. Let’s start with the back half: “protect PagerDuty systems and customers”. We have two goals that need to be met, protecting the systems and protecting the customers. What happens when those goals are opposed? More specifically, the customers are a part of the system, as much as the testing suite, dev environments, etc. are.
As a hypothetical, let’s say you deploy new code changes to an alerting system that runs a health check against your production environment, a query that runs against a primary database. It takes some time to run the checks and deploy this change to your fleet in prod, say a few minutes, but it finally completes. When the first check fires, you see that the query is rather slow, running a full table scan instead of relying upon an index. The performance of your site degrades because of this and some of your customers are experiencing timeouts. You can pull the plug on your alerting here, immediately relieving the pressure on your databases and, in turn, on your customers, or you can wait several minutes to revert this code change while your end users suffer. Two or more subsystems can very often have goals that oppose one another when a need for resolution occurs.
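To make the slow health check concrete, here is a minimal sketch using Python’s built-in sqlite3 module. SQLite is only standing in for whatever primary database you run, and the orders table, the idx_orders_status index, and the health check query are all hypothetical names I made up for illustration. It asks the database for its query plan before and after an index exists, which is the difference between the full table scan and the index lookup described above.

```python
import sqlite3

# Hypothetical schema and data, chosen only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders (status, total) VALUES (?, ?)",
    [("failed" if i % 50 == 0 else "ok", float(i)) for i in range(10_000)],
)

# Hypothetical health check query fired by the alerting system.
HEALTH_CHECK = "SELECT COUNT(*) FROM orders WHERE status = 'failed'"


def query_plan(connection, sql):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in rows]


# Without an index on status, SQLite has to scan the whole table.
plan_before = query_plan(conn, HEALTH_CHECK)

# With the index in place, the same query becomes an index search.
conn.execute("CREATE INDEX idx_orders_status ON orders (status)")
plan_after = query_plan(conn, HEALTH_CHECK)

print("before index:", plan_before)
print("after index: ", plan_after)
```

The exact wording of the plan varies between SQLite versions, but the first plan should describe a scan of the table and the second a search that names the index. Other databases expose the same idea through their own EXPLAIN output.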
I might be picking on alerting a bit, but it could easily be a number of subsystems (perhaps more than 2 in opposition) that run up against one another:
- A configuration change to your Varnish cluster reduces the effectiveness of caching for various endpoints, forcing more requests through to your web servers, where they back up.
- An asynchronous job cluster that emails out messages to your end users is flooding your network with traffic.
- The index for your logging subsystem falls behind, but you have a scheduled release for a product, leaving you potentially flying blind as you deploy changes.
This brings me back to the first half of the definition: “Whatever it takes”. Depending upon which subsystem is your primary focus, this understanding of resolution will differ and priorities will conflict. The open-ended definition allows room to move around, but it also becomes a non-statement. You could easily replace this with “Just fix it”. Your team may say “Whatever it takes”, but the cost that implies may be one that others, internal and external to your org, cannot accept.
The PagerDuty doc continues as follows (highlighting my own except for the third point, which is theirs):
The purpose of the Incident Commander is to be the decision maker during a major incident; Delegating tasks and listening to input from subject matter experts in order to bring the incident to resolution.
The Incident Commander becomes the highest ranking individual on any major incident call, regardless of their day-to-day rank. Their decisions made as commander are final.
Your job as an IC is to listen to the call and to watch the incident Slack room in order to provide clear coordination, recruiting others to gather context/details. You should not be performing any actions or remediations, checking graphs, or investigating logs. Those tasks should be delegated.
I appreciate the role here of taking a bird’s eye view of the situation and I agree. If your focus is trying to bring up a server and coordinating with others, you have competing goals. Relying upon those delegated to be your eyes and ears to investigate frees you up and empowers them, giving others the direction they need. This can be a difficult position to be in as an engineer. Our natural instinct is to hop into the thick of it and start plugging away.
Maintaining a “my decision is final” attitude, though, discourages pushback and input from these experts when the IC’s decision runs counter to what those running the remediation believe. The tendency is therefore for an IC to give top-down directions when it’s preferable for them to be more of a centralized hub of several nodes, relaying information between said experts. I would prefer a description in which the IC facilitates the transfer of information rather than synthesizing it for others to then act upon. The push and pull between democratic agreement among all parties and an authoritarian approach can vary wildly between organizations, within an organization, and with the makeup of a specific incident.
This also raises the question: should the experts acting upon an incident confirm each atomic piece of work? Constantly asking quickly becomes laborious. If they ask too often, they may end up waiting for the IC to give them the green light (bottlenecking resolution); if they don’t ask often enough, they may swing completely in the opposite direction and act without the full knowledge of those around them, for fear that they are distracting or adding too much noise. I don’t believe in a hard and fast rule that could apply to all situations, and so adaptation should be applied to situations as they arise. This is illustrated under “Gaining Consensus”. Saying “Does anyone strongly agree?” biases toward those who are willing to speak up, which can be a vicious cycle (those who don’t speak up are talked over, reducing their confidence to speak up in future events / those who do speak up are seen as “natural leaders” and listened to more often regardless of their expertise). Everyone should feel empowered to contribute to a conversation and that is the point made. Finding a way to encourage that is the difficulty.
Finally, they conclude with the caveat that the IC should not perform any actions. Reaffirming my agreement from above, drawing that line can be useful for understanding individual roles and setting up boundaries, but I’m wary of absolutes. If the IC is more of an expert in the domain than those performing the remediation (a senior engineer working with a junior engineer to help them train, for example), the IC will need to impart knowledge, which inches them closer to doing the work themselves. Suppose a senior DBA is running the incident as an IC with a junior DBA running the commands. The senior DBA has made similar fixes before and knows the particular incantation that will assuage the ghosts in the machine. Do they adhere to the rule that they should not perform any action and relay this information to the junior DBA, or do they let the junior DBA’s actions proceed while end users are impacted?
My gut says it’s an attempt to separate the decision making from the system upon which they’re acting. I’m projecting a bit here, so that could be a mischaracterization of their intended goal. I’d prefer an emphasis on the complex nature of people as part of the system and their interacting role (see below in “Fallacy of Root Cause”).
To sum up, these rules can be useful, but should not be set in stone.
PagerDuty is coming from a position that anyone can and should be an IC at some point. I fully agree with this. It’s a great position to be in to learn, and we as an industry don’t do enough to encourage folks who are eager to learn more to hop in. Separating responders from an incident for “lack of expertise” neither educates them nor encourages a sense of confidence when it is their time to act. The importance of modifying plans on the fly in response to feedback is A+. I would add to that a knowledge of how to agree upon a course of action when experts give differing approaches to said plans. I would cast doubt on their assertion that two major incidents make for an IC, though. Why not 3, 4, or 5? Is one sufficient if they were *really* paying attention? Later, under “Graduation”, they mention a lack of ceremony or official blessing. I’m encouraged by that, as having some sort of line of demarcation of who can and can’t is often problematic. This is slightly undercut further on when noting who is and isn’t qualified to be an IC in the “Deputy” section.
I also believe “having gravitas” and “willing to kick people off a call” are false indicators of effective authority (see Dunning-Kruger). I have known plenty of individuals who substitute confidence for knowledge, and people who choose themselves for this position, which tends towards a self-selection bias. It’s also a bias toward specific individuals in tech, most often straight, white, cisgender men (a blog post that puts this into better words than I can is here: https://jennbbinis.com/uncategorized/overconfident-men/). If we’re going to advocate for something like this, it has to come with a lot of additional info and caveats, which are lacking.
Fallacy of Root Cause
There’s a general usage of words throughout the doc that can either be ineffective at illustrating the complexities of the system the team is a part of or hamper investigation during or after the incident. In particular, they refer to the “root cause” or “primary cause” of an event, with the belief that it was the only thing holding the system back from functioning as intended. A recurring theme in Human Factors research is the repudiation of the concept of a single person or atomic event that is the keystone for the current state of the world. As John Allspaw has said multiple times: “Root Cause” is simply the place you stop looking. The systems we work in are constantly changing, the determining factors for their perceived stability are in flux, and it is their adaptation to internal and external stimuli that allows them to achieve the resilience we experience. By focusing on a single instance as the sole cause, we remove attention from the surrounding interactions that contribute, leaving hidden the critical information that desperately needs surfacing. Why RCA lacks the depth needed for incident review is a much broader topic and goes beyond the scope of this one blog post.
The singular IC and their “removal from the system”
Highlighted in the STELLA report is Dr. David Woods’ theorem on our systems:
As the complexity of a system increases, the accuracy of any single agent’s own model of the system decreases
There’s also a theme of the IC being in charge and in control throughout, with deputies to help facilitate any work that needs further delegation from their role or experts to perform the work. I would like to see more emphasis on asking the experts to weigh in, beyond “Who disagrees?” or “Give me a status update”, rather than treating them as relays for information gathering. While I agree that an IC attempting to jump into every investigative branch can lead to problems such as duplication of effort, confusion on ownership of tasks, and a lack of cohesion as to what to accomplish, I worry that this document frames the role as a removal of the self from the system, which is another fallacy. Highlighting again from the STELLA report, John Allspaw notes the differences of Above the Line vs. Below the Line: the IC is still part of the complex interactions. Inserting experts along the way doesn’t remove the IC from execution any more than having a script run a query against a database removes you from interacting with it. The PagerDuty doc does not explicitly state this, but I want to emphasize this as part of the IC role.
The section on “Handling Incidents” can be summed up as a loop of Size-Up -> Verify -> Stabilize -> Update -> Size-Up…etc. In general I’m a fan here too, taking steps to understand and gain consensus before acting while maintaining an approach to change when the needs require it, making sure to poll and update frequently. That’s the part of the IC I would love to see this focus more on and move away from the arbiter that other parts seem to emphasize. Three quibbles I might have with the “Stabilize” step are as follows:
- Asking “How risky is this?” would be better phrased as “What’s your level of confidence in this proposed solution?”. The difference is highlighting your expectations of what will happen as opposed to the “risk” which exists external to your decision making. Your expectation of the riskiness could be way off, but your confidence level surfaces your approach and your understanding of the situation.
- “Making the wrong decision is better than no decision”. Not making a change is still a decision and there are plenty of times where our actions make things worse.
- It’s never illustrated what stability is or where it’s decided. That will change from situation to situation of course, but it would help to focus on a step to confirm what the shared understanding of “stable” should be (Do we fix the underlying problem or do we wire it off to avoid any further, heavier loss of data/money/corruption/etc.?). Stability itself is a moving target, one that forces us to update our mental model of what it is and to change our system to keep course with it.
I also glanced at their write-up on post mortems, which I have further thoughts on, but I’ll save that for another blog post so as to avoid getting too longwinded here. I’ll leave with a few other topics I’d love for us to expand upon from all of this:
- What happens when you don’t have new updates on progress during an incident? Do you still post externally/internally “This is ongoing”, even if it’s a long running incident with few new changes?
- How do you minimize the impact of aggressive actions (kicking people off calls, even the CEO) during high stress situations? The emotional toll of conflict can have long lasting effects that can be much worse than an outage.
- How do we keep track of running theories on solutions? Is it one large doc, a spreadsheet, etc., and is it available for reading or writing by other participants? My gut says no if the IC is calling the shots, but I would love to hear strategies around this.
- The last parts of the doc reference the movie Apollo 13 as an ideal scenario. This entire scene is apocryphal and has only a passing basis in reality. “Failure is not an option” is both a fallacy (failures are happening all the time, it’s how we respond to the systems we’re a part of that is important) and was never said by Gene Kranz, though he did adopt it for his book after the movie popularized it. This model of good incident response is manufactured.
- How often are “rules of thumb”, runbooks/wikis/documentation, cargo culting to restore systems, etc. observed and acted on? We can’t fit everything into our heads for the complex situations we’re in, so some insight into when to reach for this knowledge and when to abandon it would be significant.
I don’t have all the answers to the questions I’m posing, but hopefully asking them is a step in that direction.