Published Sep 3, 2019

SE-Radio Episode 284: John Allspaw on System Failures: Preventing, Responding, and Learning From

John Allspaw explores the complexities of system failures, emphasizing the significance of designing resilient systems and learning from post-mortem evaluations. He challenges conventional notions of human error, advocating for a nuanced understanding and the essential role of testing within production environments to preempt potential failures.

Episode Highlights

  • Outage Contexts

    John Allspaw, CTO of Etsy, explores the complex nature of outages and failures within organizations. He emphasizes that these events are context-specific and vary greatly between businesses depending on which technologies they rely on. For instance, a web outage might be critical for one company but negligible for another that primarily uses native apps 1.

    I like to just refer them to as untoward events. And then that way it's a little bit more, I don't know, context specific.

    ---

    Allspaw highlights that even well-prepared services like AWS and Gmail experience unexpected outages, underscoring the inherent unpredictability of these events 2.

       

  • Human Factors

    Allspaw challenges the traditional notion of human error in system failures, arguing that it oversimplifies the complexity of these events. He believes that labeling mistakes as human error ignores the broader context in which decisions are made, such as design flaws or inadequate information 3.

    People do what makes sense to them at the time, given their goals, their familiarity with this scenario.

    ---

    He supports the idea that both success and failure stem from the same cognitive processes, suggesting that searching for a single root cause is often misleading 4.
