Published Sep 3, 2019

SE-Radio Episode 301: Jason Hand Handling Outages

Jason Hand delves into the significance of blameless reviews and strategic monitoring in incident management, emphasizing how actionable alerts and robust data collection can enhance response to outages. The episode also explores strategies to prevent team burnout, encouraging sustainable work practices and empowering IT engineers for balanced team dynamics.

Episode Highlights

  • Alerting

    Effective alerting is crucial in minimizing alert fatigue among engineers. Jason Hand emphasizes the importance of setting actionable alerts, suggesting that each alert be accompanied by a runbook that guides even the most junior team member through the steps needed to address the issue [1]. He notes that alert fatigue often arises from non-actionable alerts, which can desensitize engineers to critical issues [2].

    An alert by itself isn't often all that helpful.

    ---

    Hand adds that reviewing alert thresholds is essential to ensure they are set appropriately and do not cause unnecessary disruptions [1].

       

  • Monitoring

    Monitoring best practices involve capturing comprehensive data to improve incident response. Hand explains that while over-monitoring is rare, over-alerting can be problematic, and that the true value of monitoring lies in collecting data for post-incident analysis [3]. He stresses the importance of having the right information and access to systems when an alert is triggered, to ensure a timely and effective response.

    Reducing that time to detect and that time to know that there's a problem is definitely one of the early challenges.

    ---

    Hand also highlights the role of monitoring tools in detecting issues such as spikes in CPU usage or disk space shortages; catching these is critical for maintaining system health [4].
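The ideas in these highlights (thresholds that get reviewed, alerts that always carry a runbook, and checks on signals such as disk space) can be sketched in a few lines of Python. This is an illustrative sketch only, not something described in the episode: the `Rule` and `Alert` types, the `evaluate` function, the 90% threshold, and the runbook URL are all hypothetical names and values chosen for the example.

```python
import shutil
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """Hypothetical alert rule: a metric name, a threshold, and a runbook."""
    metric: str
    threshold: float   # fire when the sampled value is >= this
    runbook_url: str   # steps a junior on-call engineer can follow


@dataclass(frozen=True)
class Alert:
    """What the on-call engineer receives: the reading plus the runbook."""
    metric: str
    value: float
    threshold: float
    runbook_url: str


def disk_used_percent(path: str = "/") -> float:
    """Return the percentage of disk space in use at `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def evaluate(rule: Rule, value: float) -> Optional[Alert]:
    """Return an Alert if `value` crosses the rule's threshold, else None."""
    if not rule.runbook_url:
        # Assumption for this sketch: a rule without a runbook is treated
        # as non-actionable and rejected rather than allowed to page anyone.
        raise ValueError(f"rule for {rule.metric!r} has no runbook")
    if value >= rule.threshold:
        return Alert(rule.metric, value, rule.threshold, rule.runbook_url)
    return None


# Illustrative values only; the threshold and URL are made up.
disk_rule = Rule("disk_used_percent", 90.0, "https://runbooks.example/disk-full")
```

A caller might run `evaluate(disk_rule, disk_used_percent("/"))` on a schedule and page only when the result is not `None`. Keeping the threshold in the rule, rather than in the check itself, makes it easy to review and adjust when an alert proves too noisy.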
