Published Sep 3, 2019

SE-Radio Episode 284: John Allspaw on System Failures: Preventing, Responding, and Learning From

John Allspaw explores the complexities of system failures, emphasizing the importance of designing resilient systems and learning from post-mortems. He challenges conventional notions of human error, arguing for a more nuanced understanding of it, and makes the case for testing in production to surface potential failures early.

Episode Highlights

  • Production Testing

    Testing in production means running tests against live systems so that reliability is evaluated under real conditions. Allspaw explains that this approach, similar to Netflix's Chaos Monkey, involves purposefully injecting faults into production environments to test system resilience [1]. It encourages engineers to anticipate and prepare for real-world failures, fostering a culture of proactive problem-solving. He highlights the importance of making it safe for teams to experiment and learn from these tests, because doing so builds anticipation skills and confidence in handling unexpected issues [2].

    If you're not comfortable enough to do it in production, it just means you're not done yet.

    ---

    This mindset shift allows organizations to better prepare for and manage potential system failures.
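To make the idea of purposeful fault injection concrete, here is a minimal sketch in Python. It is not how Chaos Monkey itself works (that tool terminates running instances) and it does not come from the episode. It is a simplified in-process analogue in which a wrapper makes a configurable fraction of calls fail, so the calling code's fallback path gets exercised. All names (`InjectedFault`, `with_fault_injection`, `fetch_profile`, `fetch_profile_with_fallback`) are hypothetical and exist only for illustration.

```python
import random


class InjectedFault(Exception):
    """Raised instead of making the real call when a fault is injected (hypothetical name)."""


def with_fault_injection(func, failure_rate, enabled=True, rng=None):
    """Wrap func so that roughly failure_rate of its calls fail on purpose."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        # 'enabled' acts as a kill switch so the experiment can be stopped safely.
        if enabled and rng.random() < failure_rate:
            raise InjectedFault(f"injected fault in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def fetch_profile(user_id):
    # Stand-in for a real downstream dependency (assumption for this sketch).
    return {"id": user_id, "name": "example"}


def fetch_profile_with_fallback(fetch, user_id):
    # The code under test: it should degrade gracefully when the dependency fails.
    try:
        return fetch(user_id)
    except InjectedFault:
        return {"id": user_id, "name": None, "degraded": True}


if __name__ == "__main__":
    # Fail about 10% of calls and count how often the fallback path is taken.
    flaky_fetch = with_fault_injection(fetch_profile, failure_rate=0.1)
    results = [fetch_profile_with_fallback(flaky_fetch, i) for i in range(1000)]
    degraded = sum(1 for r in results if r.get("degraded"))
    print(f"{degraded} of {len(results)} requests used the fallback path")
```

The kill switch and the explicit failure rate reflect the point about making experiments safe: the scope of the injected failure is bounded, and it can be turned off without a deploy. A real setup would more likely inject faults at the network or instance level than inside application code.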

       

  • Infrastructure Challenges

    Infrastructure testing presents unique challenges because of the complex interactions between system components. Allspaw notes that while there are many resources for testing code, infrastructure testing is a less mature discipline [3]. He emphasizes the importance of gathering diverse perspectives to understand potential failure points, since unexpected interactions can lead to system issues [4]. He also mentions the value of writing tests even when they cannot capture every possible scenario, as they provide learning opportunities and insights into system behavior.

    There are always things that we don't even know to test for.

    ---

    This underscores the need for continuous learning and adaptation in testing practices.
