SREcon24 Americas Recap

I wasn’t originally planning on a write-up of SREcon24 Americas, but some ideas bubbled in my head, along with some themes, and there were a ton of solid talks (the speaker lineup was wall-to-wall heavy hitters). I’m also curious how I’ll feel about my talk in the future and wanted to capture that to look back on. And then I ended up with a pretty long blog post! So it goes.

My talk

I don’t think it’s unreasonable to be transparent about what goes into writing a talk. It’s hard. I hadn’t written one in several years, so I had the double problem of a glut of ideas to talk about and a lack of recent experience forming them into a coherent message (one that was not in written form, anyway). I took a deep dive into catching up on most of the videos and papers, and reviewing the books, that I’ve had at the top of my catch-up-and-revisit pile. Great for chipping away at that! It also meant sorting through 18 pages of notes. A 45-minute talk minus Q&A is both way too much time for a simple idea and not nearly enough time for everything I need to say.

I ran across some ideas by Allspaw, who said, in reference to myself, “Past me was either a genius…or an asshole.” I countered that inclusive “or” is always a possibility.

It was an ambitious talk. To put it in perspective, I think it was a solid talk, if flawed in a few places. Time, energy, and the like being what they are, it’s somewhat focused, though the examples could be stronger and the definitions tighter; the idea didn’t crystallize into the “flow” until Monday night and a few solo rehearsals. The talk is there, for sure – we don’t look at resilience in day-to-day work enough, but we also have a dearth of ways to approach that outside of incidents and their corresponding retros. I could see how removing a concept for a “less is more” approach might have improved things – trying to fit “Graceful Extensibility” into a single slide probably did it a disservice, and other ideas could have been strengthened by connecting them to more everyday work.

This is all to say it went great. Eventually. The central divider between rooms failing to close meant delays and shuffling of people, along with a laptop A/V failure (thanks again to Alex Elman for the backup!). There were solid questions after, and I feel like there are the beginnings of what we could do to keep making inroads into getting folks to understand this work, beyond just the incidents and without the need to read through all the literature to get to the things that apply to their day-to-day work. Hoping to post video from the talk in a few weeks when USENIX has it up.

It’s nice to be back at it! And still enjoying it, as well. More space for ideas to breathe in talks feels like a solid takeaway for next time.

Other fantastic talks I was able to sit in on

At best you’re only able to attend slightly fewer than half of the talks, and taking notes is difficult, but the few notable talks I attended were a fantastic reminder of why SREcon puts together such a great conference. I didn’t take notes in the moment during the talks, so this is mostly from memory; excuse any gaps. I highly recommend catching them when USENIX puts their videos up as well.
Note: There were other great talks that I wasn’t able to gather notes on or see because of the annoyance of only being able to occupy one place in space-time. Go seek them out too!

20 Years of SRE: Highs and Lows – Niall Murphy
It’s hard to say things like “Niall knows his stuff” – if you’ve been doing anything with SRE, he’s almost certainly somewhere in the chain having influenced it. It’s a solid reminder that the outside of an org typically looks a lot cleaner than the inside, even for Google, and that it wasn’t always the smoothest road to developing something as complex as the concept of Site Reliability Engineering.

The Ticking Time Bomb of Observability Expectations – David Caudill
Apparently this was David’s first talk, which surprised the hell out of me. Solid presentation, an excellent job presenting deep references without it feeling weighty or costing attention to process. I feel like we were supposed to get over the “monitor everything!” problem five years ago, but it’s true, we still need to push back on that. Weaving in resilience concepts with observability feels like an area we’ve touched on, but could use so much more attention, so I’m glad to see David taking a swing at it.

99.99% of Your Traces Are (Probably) Trash – Paige Cruz
You know when you see a talk and get that panicked sweat of “oh…oh I’m getting so much wrong”, followed by the quick understanding of how to fix it? That was me here. A (seemingly!) effortless presentation of core topics that a lot of us who are managing the Observability in our stacks could use refreshers on.

Meeting the Challenge of Burnout – Dr. Christina Maslach
I had Dr. Maslach’s talk queued up far ahead of the conference. I remember catching her talk several years ago (I want to say it was Velocity 2015) and these are lessons we’re still not learning. Yes, there’s the pandemic and worldwide challenges, and we’re tired without enough sleep – all important, but burnout is more than just fatigue. It’s the constant ask to do more with less, to push faster and accomplish more, but “well, we just don’t have the budget for your requests”. That probably hits pretty hard for most (all?) of us! For my purposes, this dovetails neatly into a lot of Resilience Engineering topics, so I’ll be citing this early and often.

What We Want Is 90% the Same: Using Your Relationship with Security for Fun and Profit – Lea Kissner
Lea’s right! We’re in this weird conflict so often with Security and Ops/Core/Infra/Platform/SRE/etc. on what our priorities are, when we should be on the same team. Why are we conflicted? Security is part of reliability and vice versa. I especially liked how we can negotiate the overlap to make them happen as part of our ongoing roadmaps.

Thawing the Great Code Slush – Maude Lemaire
Maude’s talk was just before mine, and it’s always nice to have a distraction rather than sitting in nerves before going on stage. The downside – Maude’s talk was so good that following it was a challenge. It had a similar vibe to Niall’s: even within an org like Slack, where you’d expect the everyday work to “just” be successful, it takes people refining the process (in this case, what code reviews look like). That’s a huge success story – acknowledging a process that is clunky and hurting development (for the sake of safety!) and then workshopping ideas to spread that out (basically – educate everyone into Code Advisory Board or CAB stuff!).

What is Incident Severity but a Lie Agreed Upon? – Em Ruppe
(Full disclosure – Em and I work at the same company.) Em has such a talent for sharing concepts in a way that makes it interesting for folks to get engaged. And this is one of those rare talents that I do love so much – how do you 1) share an idea that is likely going to be a challenge to folks’ core beliefs (in this case, that severity is essential to incidents), 2) do it without feeling the need to lecture at folks, notably with a lot of academic references, and 3) make it entertaining as hell so those ideas really stick? This feels like a talk that’ll be referenced for a long while.

Hard Choices, Tight Timelines: A Closer Look at Skip-level Tradeoff Decisions during Incidents – Dr. Laura Maguire and Courtney Nash
Speaking of tying thought-provoking ideas to an enjoyable talk, Dr. Laura and Courtney were able to focus down some ideas and bring in both academics (Dr. Laura’s research) and real-world examples (Courtney’s from The Void). That’s what we need to do more of when we’re bringing ideas from other industries – making them feel real and concrete.

Teaching SRE – Mikey Dickerson
This talk caught me flat-footed (in the best way possible), as I was expecting a step-by-step on how to go about teaching folks how to do SRE. Instead, he picked up the topic of “here’s my course on teaching junior/senior undergrads” and, again, made it entertaining as hell. It was just expertly done, with a thread of humor about what we take for granted all the time – things like “installing software is hard when you’re first setting out” and “our services need oncall even during vacation!?”. I’m amazed he’s also dropping papers from Allspaw, Woods, and Cook and deep topics like adaptive capacity and resilience on them as well – maybe that’s a sign we should integrate those ideas earlier in learning. In one later week of the course, they write up a postmortem doc (!), which raises the question – why don’t we have folks do that early in their careers more often?

Storytelling as an Incident Management Skill – Laura de Vesine, Datadog, Inc
Narratives! I’m 100% on board with the title alone. So many of our incident reviews are just reading out the chat messages that went through, and maybe some challenges that exist outside them, so that we can then generate action items. That’s not how ideas stick for people, though! You don’t remember “the incident where I ran such-and-such command”, it’s “the time I had to get into that one PII server I don’t normally access because…”. It’s about building a logical progression of the incident, so that we can see parts of it while we’re in the incident as well.

Real Talk: What We Think We Know — That Just Ain’t So – John Allspaw
Productive skepticism is the big phrase that I wanted to pull from this. He takes on things like lines of code, MTTR, and various other ideas around how we work in software engineering, and he calls into question “are we doing ‘scientific work’ as part of it?”, answering in the affirmative in that we are informing the work with our incidents in deep academic studies (“working with the big leagues” as he said). Things look clear in describing these linear models because we’re contorting how we’re looking at the systems to fit the model. I would hate to spoil all the ideas busted during this, so I’ll just say it’s a must-watch.

What Can You See from Here? – Tanya Reilly
I’m glad Tanya was last because I don’t know how anyone could’ve followed it. I take back what I just said – Tanya’s is the one “must watch”. Notable quotes include “How do we notice our own bubble without people throwing beer bottles at us?”, “it’s normal we get into arguments, because it’s zoomed right up on us”, and “we’re bikeshedding all the time, getting into the nuance of things in long threads” while people outside that context ask “Why do you care?”. The answer is “Because we’re up close!”, and being up close means seeing the beauty, the important parts of it.

It’s interesting because Tanya is so right about us feeling so passionate about an idea that we fight over it – and we’re wrong (sometimes?) about it, because it’s hard to see past what is right up in our face (which is something we’re putting there). “When you’re zoomed up on a culture, you can miss what’s wrong about it”. Tanya describes how the world becomes the office (your relationships, what you’re focusing on) and how you miss out on things outside it. It’s humorous and thought-provoking and, if I can say, a little heartbreaking when reflecting on conflict in our work.

Themes

I jotted down a few themes as well that seemed to run through many of the talks, not explicitly laid out but stuff to chew on all the same.

Reliability as a term feels much more stable than Resilience.
People better understand Reliability. In my talk I found myself making sure folks knew what I meant by Resilience and a few other key terms, which took up a significant portion of said talk.

SLOs, too, seem well defined
I saw fewer talks on SLOs, which might also mean that the concept has more or less solidified in people’s minds, even if they need new tooling or a refresher on it. It’s less in question.

This also means, IMHO, at least a few things:

  1. We can still shape our ideas on Resilience. It’s very new to tech (~20 years or so as a practice, maybe ~10 in software engineering). There’s a ton of work left, but there are also open fields to explore ideas and we should be receptive to that.
  2. The biggest question I heard was “Yeah, I want to do this, but how do I sell this up the chain to my manager/VP/CTO?”. Other industries have been more successful at this, so it’s worth seeing what they’ve done to make it stick. We need better descriptions if we’re not going to rely on metrics (and we can’t!).
  3. There is a ton of opportunity for folks who are still new to these concepts, and it’s up to those of us currently working in this space to make it inclusive for them to join and to share their expertise and abilities. Also to let them make mistakes without slapping them on the hand.

That malleability in topics on Resilience, though, also means the speakers on this topic were much more “Yes! And…”-ing each other with their ideas, dovetailing with one another in a coordination you’d think was scripted.

Migrations within complex systems are still great talk fodder
People want to hear about how things collapsed and came back, or how, within our complex systems, teams were able to change technologies, infrastructure, and data patterns while still running. I think folks understand most of the ways to do it, but we’re searching for a “secret sauce” that makes it work better, and that sauce isn’t on the technical side. There’s a reason talks like these are so frequent and have been for many years.

Less infighting about SRE as a title
Several folks may have joked about it, but it didn’t have the same tone of “no, this is what we mean, and other folks using that title are wrong” in the talks. Are people tired of the back-and-forth, instead ready to just do the thing? Maybe it’s just the online infighting that makes it seem like there’s more discrepancy than there really is.

SRE as a concept still exists, even if it is changing on the edges
Platform Engineering seemed to be a thing for a minute but I don’t know if it caught fire. DevOps has its area as well, though perhaps not as consistently anymore. I think SRE may be deprecated in the future (who can say if it’ll be here forever?) but I don’t see SRE concepts going away in the next year or two.

AI didn’t show up
There were one or two talks centered on it, but it wasn’t as omnipresent as I’d expected. The exhibitors’ hall had plenty of signs selling AI-powered tooling, but even that didn’t seem quite as in-your-face. I have theories why, but they don’t feel like they hold up to much scrutiny, so treat these as mostly speculative.

  • The tools are still being worked on. They’re for sure being sold by companies, but they’re still too immature to use regularly, or to have been practiced on and turned into a talk.
  • Folks don’t know how to use them. Anything demonstrated is inconsistent or unreliable, such that folks are not as invested as those who are trying to sell it.
  • There’s a lack of interest. I don’t have a good enough pulse on the Yea/Nay for software engineers on this. Social media is oddly fractured in ways I can’t quite grasp. Folks may want to, but for now they’re cautious or can see through so much hype that it’s turned them off.
  • I’m biased. I’m not a big fan of it in any practice and have simply missed out on the conversations or am intentionally downplaying them. This is totally my hot take here! The folks interested and having those conversations may have done it outside my earshot (almost certainly) and I self-selected out.

Social media is lying to us. Perhaps partially true (this is evergreen), but the hype cycle runs higher than what’s happening in practice. Companies all want to assume they’re in, and the gold rush keeps stuff churning. Will we see that in six months’ time, a year, two years? I would expect it to be around, but I also don’t think the hooks are securely in place yet for the folks “at the sharp end” using AI-powered tools every day.

Photo: https://www.pexels.com/photo/the-sun-is-setting-over-a-building-with-a-large-white-structure-20354065/
