Published Sep 3, 2019

SE-Radio Episode 284: John Allspaw on System Failures: Preventing, Responding, and Learning From

John Allspaw explores the complexities of system failures, emphasizing the significance of designing resilient systems and learning from post-mortem evaluations. He challenges conventional notions of human error, advocating for a nuanced understanding and the essential role of testing within production environments to preempt potential failures.

Questions from this episode

Why do people make mistakes, and why do I attach myself to them?
Asked by 28 people
What can be learned from breakdowns?
Asked by 2 people

Episode Highlights

Outage Contexts

John Allspaw, CTO of Etsy, explores the complex nature of outages and failures within organizations. He emphasizes that these events are context-specific, varying greatly between businesses depending on which technologies they rely on. For instance, a web outage might be critical for one company but negligible for another that primarily uses native apps 1.

I like to just refer to them as untoward events. And then that way it's a little bit more, I don't know, context specific.

---

Allspaw highlights that even well-prepared companies like AWS and Gmail experience unexpected outages, underscoring the inherent unpredictability of these events 2.

Human Factors

Allspaw challenges the traditional notion of human error in system failures, arguing that it oversimplifies the complexity of these events. He believes that labeling mistakes as human error ignores the broader context in which decisions are made, such as design flaws or inadequate information 3.

People do what makes sense to them at the time, given their goals, their familiarity with this scenario.

---

He supports the idea that both success and failure stem from the same cognitive processes, suggesting that searching for a single root cause is often misleading 4.

Related Episodes

SE-Radio Episode 301: Jason Hand Handling Outages
SE-Radio Episode 325: Tammy Butow on Chaos Engineering
SE Radio 637: Steve Smith on Software Quality
SE Radio 572: Gregory Kapfhammer on Flaky Tests
SE-Radio Episode 242: Dave Thomas on Innovating Legacy Systems
SE-Radio Episode 256: Jay Fields on Working Effectively with Unit Tests
SE-Radio Episode 276: Björn Rabenstein on Site Reliability Engineering
SE-Radio Episode 280: Gerald Weinberg on Bugs, Errors, and Software Quality
SE-Radio Episode 344: Pat Helland on Web Scale
SE-Radio Episode 332: John Doran on Fixing a Broken Development Process
SE-Radio Episode 295: Michael Feathers on Legacy Code
SE-Radio Episode 250: Jürgen Laartz and Alexander Budzier on Why Large IT Projects Fail
SE-Radio Episode 357: Adam Barr on Code Quality
SE-Radio Episode 237: Software Engineering Radio: Go Behind the Scenes and Meet the Team
SE-Radio Episode 247: Andrew Phillips on DevOps

Failure Responses

John Allspaw discusses the importance of designing systems for recovery and learning from failures. He highlights strategies for building resilient systems and the value of post-mortem evaluations in understanding and improving system performance.

Testing Challenges

John Allspaw discusses the necessity of testing in production environments to ensure system reliability. He shares insights on the challenges of infrastructure testing and the importance of learning from diverse perspectives to anticipate potential failures.