SE-Radio Episode 284: John Allspaw on System Failures: Preventing, Responding, and Learning From

Episode Highlights
Production Testing
Testing in production means running tests against live systems so that reliability is evaluated under real conditions. The episode explains that this approach, like Netflix's Chaos Monkey, purposefully injects faults into production environments to test system resilience. It encourages engineers to anticipate and prepare for real-world failures, fostering a culture of proactive problem-solving. The discussion also highlights the importance of making it safe for teams to experiment and learn from these tests, which builds anticipation skills and confidence in handling unexpected issues.
If you're not comfortable enough to do it in production, it just means you're not done yet.
---
This mindset shift allows organizations to better prepare for and manage potential system failures.
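As a rough illustration of the fault-injection idea in this section, the sketch below wraps a call so that it fails with a configurable probability, then retries around the injected failures. The names `FaultInjector`, `InjectedFault`, `fetch_profile`, and `call_with_retry` are hypothetical and come from neither the episode nor Chaos Monkey. A real production experiment would also need safeguards this sketch only hints at, such as a limited blast radius and a quick way to stop the experiment.

```python
import random


class InjectedFault(Exception):
    """Raised by the injector to simulate a failing dependency."""


class FaultInjector:
    """Hypothetical helper: fails a wrapped call with probability failure_rate."""

    def __init__(self, failure_rate, enabled=True, seed=None):
        self.failure_rate = failure_rate
        self.enabled = enabled  # off switch so the experiment can be stopped
        self._rng = random.Random(seed)
        self.injected = 0  # how many faults have been injected so far

    def call(self, func, *args, **kwargs):
        # Inject a fault before the real call, with the configured probability.
        if self.enabled and self._rng.random() < self.failure_rate:
            self.injected += 1
            raise InjectedFault(f"injected fault before {func.__name__}")
        return func(*args, **kwargs)


def fetch_profile(user_id):
    """Stand-in for a real dependency call (an assumption for this sketch)."""
    return {"id": user_id, "name": f"user-{user_id}"}


def call_with_retry(injector, func, *args, attempts=5):
    """Retry on injected faults; re-raise the last fault if every attempt fails.

    Assumes attempts >= 1.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return injector.call(func, *args)
        except InjectedFault as err:
            last_error = err
    raise last_error
```

Running `call_with_retry(FaultInjector(0.3), fetch_profile, 42)` exercises the retry path some of the time. The `injected` counter shows how many faults the caller had to absorb, the kind of signal a team can review after an experiment.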
Infrastructure Challenges
Infrastructure testing presents unique challenges because of the complexity of interactions between system components. The episode notes that while there are many resources for testing code, infrastructure testing is less advanced. It emphasizes gathering diverse perspectives to understand potential failure points, since unexpected interactions can lead to system issues. It also points to the value of writing tests even when they cannot capture every possible scenario, because they provide learning opportunities and insight into system behavior.
There are always things that we don't even know to test for.
---
This underscores the need for continuous learning and adaptation in testing practices.
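One way to picture a small infrastructure test aimed at component interactions is a check that timeouts are consistent across layers. The example below is invented for illustration and was not discussed in the episode. The layer names, the `timeout_s` key, and the rule that each outer layer should wait longer than the layer it calls are all assumptions. Such a check covers only one known interaction and says nothing about the ones nobody has thought to test.

```python
def check_timeout_ordering(config, layers):
    """Return a list of problems where an outer layer gives up before an inner one.

    `layers` is ordered outermost first; `config` maps each layer name to a
    dict holding a hypothetical `timeout_s` setting in seconds.
    """
    problems = []
    for outer, inner in zip(layers, layers[1:]):
        outer_t = config[outer]["timeout_s"]
        inner_t = config[inner]["timeout_s"]
        if outer_t <= inner_t:
            problems.append(
                f"{outer} timeout ({outer_t}s) should exceed "
                f"{inner} timeout ({inner_t}s)"
            )
    return problems


# Hypothetical configuration for three layers of a request path.
LAYERS = ["load_balancer", "app_server", "database_client"]
EXAMPLE_CONFIG = {
    "load_balancer": {"timeout_s": 30},
    "app_server": {"timeout_s": 60},  # longer than the load balancer: flagged
    "database_client": {"timeout_s": 10},
}
```

Calling `check_timeout_ordering(EXAMPLE_CONFIG, LAYERS)` reports the load balancer and app server mismatch. Even a narrow check like this makes one assumption about the system explicit and reviewable.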
Related Episodes
SE-Radio Episode 301: Jason Hand Handling Outages
SE-Radio Episode 325: Tammy Butow on Chaos Engineering
SE Radio 637: Steve Smith on Software Quality
SE Radio 572: Gregory Kapfhammer on Flaky Tests
SE-Radio Episode 242: Dave Thomas on Innovating Legacy Systems
SE-Radio Episode 256: Jay Fields on Working Effectively with Unit Tests
SE-Radio Episode 276: Björn Rabenstein on Site Reliability Engineering
SE-Radio Episode 280: Gerald Weinberg on Bugs, Errors, and Software Quality
SE-Radio Episode 344: Pat Helland on Web Scale
SE-Radio Episode 332: John Doran on Fixing a Broken Development Process
SE-Radio Episode 295: Michael Feathers on Legacy Code
SE-Radio Episode 357: Adam Barr on Code Quality
SE-Radio Episode 247: Andrew Phillips on DevOps