At Jeli, we’re building the world’s best incident analysis tools. With decades of combined knowledge from our time at tech’s leading companies, we’ve developed deep insights into how users can better understand their shared experiences. We’ve seen these problems directly in our time as engineers throughout the stack. We’ve delved into countless incidents after the fact to make sense of the murkiness surrounding them. We know that we can make things a little better every time by listening to each other’s stories.
Jeli is the toolset I’ve been wanting to build for a long time, an opportunity I couldn’t pass up. With books and PDFs, conference talks and dissertations, chats, interviews, and much more to pore over, gaining this deep understanding of the socio-technical systems we traverse every day has been a long road and can seem quite daunting. Resilience Engineering, this practice of learning what makes our daily systems adaptable to the unknown, can also feel like a superpower once you tap into it. So how do you get started?
Fortunately, there’s a secret to getting the same running start I did:
- Start your career in the 2000s as a software engineer bouncing between a few gigs, learning from slips and trips what good software development is (and isn’t).
- Join Etsy in 2011 as an infra dev on the Core team.
- Sit across from John Allspaw for several years as he developed the Etsy Debriefing Guide and in doing so absorb through osmosis everything he’s synthesized from a continuous stream of PDFs.
This may be less practical until we develop time machines and can clone Allspaw to scale. It also bears saying that I’ve been incredibly lucky to land where I have. When you’re so close to a movement in tech, it’s easy to see clearly how important it is and to jump in feet first. On call doesn’t have to be painful. Incidents don’t have to feel like a game of hot potato to see who is left stuck with the blame. We can use failures as a path to success. The question is then how do we get on that path?
Questions to Answer
Tech has recognized for a while now the value in examining the timelines of events that come together to produce what we loosely define as an incident. Thoughtful retrospectives aimed at giving folks time to learn often fall short, though, in favor of quick fixes. We’ll bump our SLO, add redundancy to the cluster, and so on to guarantee this will definitely never happen again – until it does. We look towards the machines and what their needs are when we should instead be looking to the folks at the sharp end.
There’s more to the incident analysis process than just throwing folks into a room and hoping they produce insights. Sometimes we get lucky and people pick up a few ideas, but more often than not we’re lost in the drudgery of walking over the events of an incident or coworkers who want to band aid over systemic issues in favor of moving on. I know because I’ve been there, stuck in the same pitfalls as many eager folks now.
Eventually, you develop a rhythm. There are core questions to ask, and inflection points in timelines where you learn to dive deeper:
- How do we focus on the people part of the socio-technical system? Too often we focus on patching code to add robustness while leaving out the reason we do all of this – to improve people’s lives.
- When does the learning happen? We may interview folks or highlight some notable changes that help to make up an incident, but have people taken any real knowledge away from it?
- How can we streamline the toilsome parts of the incident analysis? It’s tough enough to think about how to approach emotionally charged situations. It shouldn’t be difficult to gather information.
- Where are the tools that remind us of where the learning happened from previous incidents? It’s easy to file away a document, never to look at it again. Good tools will leave breadcrumb trails to easily source other artifacts.
- How do we communicate shared knowledge? Folks are unsure of where to start. We need guides to bootstrap the learning process without resorting to stale questions.
Shallow work leads to shallow answers, but there’s a reason why we’re not doing more. With time constraints, missing guide posts, and a lack of tooling, how can we expect folks to do more?
“There’s got to be a better way”
Here’s the tough part. We’ve been talking about blameless postmortems, incident analysis, retrospectives, and other synonymous terms for gaining understanding when things go wrong for quite some time. While there are lots of good guides out there on how to push forward in these areas, we are seriously lacking in practical, real-world tooling to make it happen. This has been nearly a decade in the making for me, something I’ve wanted for the countless retrospectives I’ve run. There’s a better way of sharing information that doesn’t fall into the traps of dead-end shortcuts. We can do the hard work of teasing out the threads from our collective understanding of incidents, with tools to make it so much smoother.
We’ve been laying the groundwork for years, giving talks and writing blog posts as well as instilling these ideas in local workspaces to help them spread. I came to Jeli because I want to build the software that’s going to answer the above questions. I also want to work with folks who are just as eager to solve these same problems, possessing both the talent and the drive to make it happen. I get to work with Dr. Laura Maguire as she explores cognitive work in high-tempo, high-stress environments. Beth Adele Long writes case studies with Dr. Richard Cook on adaptive capacity sharing, and Randall Koutnik explores resilience adoption in UX. All are kind, caring, and brilliant folks (in that order!) and are led by Nora Jones, who quite literally wrote the book on Chaos Engineering.
This is the team to solve people problems. I’m exceptionally proud to be a part of Jeli, as I know we’re going to change things for the better. I got into software development because I wanted practical, real-world solutions to the everyday problems we have. Jeli is going to do just that.