SE-Radio Episode 284: John Allspaw on System Failures: Preventing, Responding, and Learning From

Episode Highlights
Outage Contexts
John Allspaw, CTO of Etsy, explores the complex nature of outages and failures within organizations. He emphasizes that these events are context-specific: how much an outage matters varies greatly between businesses, depending on which technologies they rely on. For instance, a web outage might be critical for one company but negligible for another that primarily uses native apps.
I like to just refer to them as untoward events. And then that way it's a little bit more, I don't know, context specific.
---
Allspaw highlights that even well-prepared services such as AWS and Gmail experience unexpected outages, underscoring the inherent unpredictability of these events.
Human Factors
Allspaw challenges the traditional notion of human error in system failures, arguing that it oversimplifies the complexity of these events. He believes that labeling mistakes as human error ignores the broader context in which decisions are made, such as design flaws or inadequate information.
People do what makes sense to them at the time, given their goals, their familiarity with this scenario.
---
He supports the idea that both success and failure stem from the same cognitive processes, suggesting that searching for a single root cause is often misleading.
Related Episodes
SE-Radio Episode 301: Jason Hand Handling Outages
SE-Radio Episode 325: Tammy Butow on Chaos Engineering
SE Radio 637: Steve Smith on Software Quality
SE Radio 572: Gregory Kapfhammer on Flaky Tests
SE-Radio Episode 242: Dave Thomas on Innovating Legacy Systems
SE-Radio Episode 256: Jay Fields on Working Effectively with Unit Tests
SE-Radio Episode 276: Björn Rabenstein on Site Reliability Engineering
SE-Radio Episode 280: Gerald Weinberg on Bugs, Errors, and Software Quality
SE-Radio Episode 344: Pat Helland on Web Scale
SE-Radio Episode 332: John Doran on Fixing a Broken Development Process
SE-Radio Episode 295: Michael Feathers on Legacy Code
SE-Radio Episode 357: Adam Barr on Code Quality
SE-Radio Episode 247: Andrew Phillips on DevOps