Published Sep 3, 2019

SE-Radio Episode 301: Jason Hand on Handling Outages

Jason Hand delves into the significance of blameless reviews and strategic monitoring in incident management, emphasizing how actionable alerts and robust data collection can enhance response to outages. The episode also explores strategies to prevent team burnout, encouraging sustainable work practices and empowering IT engineers for balanced team dynamics.

Episode Highlights

Alerting

Effective alerting is crucial in minimizing alert fatigue among engineers. Jason Hand emphasizes the importance of setting actionable alerts, suggesting that each alert should be accompanied by a runbook that can guide even the most junior team member through the steps needed to address the issue. He notes that alert fatigue often arises from non-actionable alerts, which can desensitize engineers to critical issues.

An alert by itself isn't often all that helpful.

---

Hand adds that alert thresholds should be reviewed regularly so that they are set appropriately and do not cause unnecessary disruptions.
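The idea of pairing every alert with a runbook and a reviewed threshold can be sketched in a few lines. This is only an illustration, not something described in the episode: the `Alert` class, the `non_actionable` helper, the alert names, and the runbook URL are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert definition used only for illustration."""
    name: str
    threshold: float                    # value at or above which the alert fires
    runbook_url: Optional[str] = None   # assumed link to step-by-step instructions

    def is_actionable(self) -> bool:
        # In this sketch, an alert counts as actionable only if it has a runbook.
        return bool(self.runbook_url)

    def should_fire(self, value: float) -> bool:
        return value >= self.threshold


def non_actionable(alerts: List[Alert]) -> List[str]:
    """Return the names of alerts that have no runbook attached."""
    return [a.name for a in alerts if not a.is_actionable()]


if __name__ == "__main__":
    alerts = [
        Alert("disk_full", 0.90, "https://wiki.example.internal/runbooks/disk-full"),
        Alert("cpu_spike", 0.95),  # no runbook: a candidate for review
    ]
    print(non_actionable(alerts))
```

A check like `non_actionable` could run as part of a periodic alert review, flagging alerts that would page someone without telling them what to do next.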

Monitoring

Monitoring best practices involve capturing comprehensive data to improve incident response. Hand explains that while over-monitoring is rare, over-alerting can be a problem, and that the true value of monitoring lies in collecting data for post-incident analysis. He stresses the importance of having the right information and access to systems when an alert is triggered, so that responders can act quickly and effectively.

Reducing that time to detect and that time to know that there's a problem is definitely one of the early challenges.

---

Hand also highlights the role of monitoring tools in detecting issues such as spikes in CPU usage or disk space shortages, which are critical for maintaining system health.
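The distinction between collecting data broadly and alerting sparingly can be illustrated with a small disk-space check. This is a minimal sketch under stated assumptions: the function names, the in-memory `history` list, and the 90% default threshold are hypothetical choices, not values from the episode. Only the standard-library call `shutil.disk_usage` is real.

```python
import shutil
import time
from typing import List, Optional, Tuple

Sample = Tuple[float, str, float]  # (timestamp, path, fraction of disk used)


def disk_used_fraction(path: str = ".") -> float:
    """Return the fraction of the disk holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # avoid dividing by zero on unusual filesystems
    return usage.used / usage.total


def sample(path: str, history: List[Sample], alert_at: float = 0.90,
           now: Optional[float] = None) -> bool:
    """Record a measurement every time; report True only above the threshold."""
    fraction = disk_used_fraction(path)
    timestamp = time.time() if now is None else now
    history.append((timestamp, path, fraction))  # kept for post-incident analysis
    return fraction >= alert_at


if __name__ == "__main__":
    history: List[Sample] = []
    if sample(".", history):
        print("disk usage above threshold; consult the runbook")
    print(f"{len(history)} sample(s) recorded")
```

Every call stores a sample, whether or not it crosses the threshold, so the data is available for later review while the alert itself stays quiet most of the time.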

Related Episodes

SE-Radio Episode 284: John Allspaw on System Failures: Preventing, Responding, and Learning From
SE-Radio Episode 325: Tammy Butow on Chaos Engineering
SE-Radio Episode 276: Björn Rabenstein on Site Reliability Engineering
SE-Radio Episode 313: Conor Delanbanque on Hiring and Retaining DevOps
SE-Radio Episode 288: DevSecOps
SE-Radio Episode 264: James Phillips on Service Discovery
Episode 7: Error Handling
SE Radio 599: Jason C. McDonald on Quantified Tasks
Episode 134: Release It with Michael Nygard
SE-Radio Episode 355: Randy Shoup Scaling Technology and Organization
SE Radio 555: On Freund on Upskilling
SE Radio 585: Adam Frank on Continuous Delivery vs Continuous Deployment
SE-Radio Episode 247: Andrew Phillips on DevOps
SE-Radio Episode 309: Zane Lackey on Application Security
Episode 78: Fault Tolerance with Bob Hanmer Pt. 2

Dexa/Software Engineering Radio - the podcast for professional software developers

Popular Clips

Engaging Conversations

API Outages and Transparency

Embracing Mistakes

On-Call Responsibility

Alert Management Strategies

Final Thoughts

Managing Disruptions

Diagnostic Techniques

Engaging with Listeners

Disk Space Monitoring

Incident Analysis Insights

Monitoring System Health

Learning from Failure

Incident Management Essentials

Episode Highlights

Learning from Failure

The discussion shifts to the importance of blameless reviews and failure analysis in incident management. Jason Hand explains how these practices can enhance learning and improve future responses by focusing on team dynamics and business metrics rather than assigning blame.

Monitoring and Alerting

Jason Hand discusses strategies for effective alerting and monitoring to handle outages efficiently. He highlights the importance of actionable alerts and comprehensive data collection to improve incident response.

Team Organization

The discussion shifts to preventing burnout and empowering engineers in IT environments. Jason Hand and Bryan Reinero explore strategies for sustainable work practices and balanced team dynamics.