SE-Radio Episode 301: Jason Hand Handling Outages

Episode Highlights
Alerting
Effective alerting is crucial in minimizing alert fatigue among engineers. Jason Hand emphasizes the importance of making alerts actionable: each alert should be accompanied by a runbook that guides even the most junior team member through the steps needed to address the issue 1. He notes that alert fatigue often arises from non-actionable alerts, which can desensitize engineers to critical issues 2.
An alert by itself isn't often all that helpful.
---
Hand adds that alert thresholds should be reviewed regularly to confirm they are set appropriately and do not cause unnecessary disruptions 1.
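The two ideas above (every alert carries a runbook, and thresholds are reviewed) can be sketched in a few lines of Python. This is a minimal illustration, not something described in the episode: the `AlertRule` class, its field names, the `fire_rate` helper, and the example URL are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical alert definition; field names are illustrative only."""
    name: str
    metric: str
    threshold: float
    runbook_url: str  # link to step-by-step instructions for the responder

    def __post_init__(self) -> None:
        # Treat an alert without a runbook as non-actionable and refuse it.
        if not self.runbook_url.strip():
            raise ValueError(f"alert {self.name!r} has no runbook")


def should_fire(rule: AlertRule, observed: float) -> bool:
    """Fire only when the observed value exceeds the rule's threshold."""
    return observed > rule.threshold


def fire_rate(rule: AlertRule, samples: Sequence[float]) -> float:
    """Fraction of historical samples that would have triggered the alert.

    A very high rate suggests the threshold is too sensitive and worth reviewing.
    """
    if not samples:
        return 0.0
    return sum(should_fire(rule, s) for s in samples) / len(samples)


# Example usage with made-up values (the URL is a placeholder).
rule = AlertRule(
    name="disk-nearly-full",
    metric="disk_used_percent",
    threshold=90.0,
    runbook_url="https://example.invalid/runbooks/disk-nearly-full",
)
recent = [50.0, 95.0, 99.0, 10.0]
print(fire_rate(rule, recent))
```

Rejecting a rule at construction time is one possible design choice; a team could just as well flag runbook-less alerts in a periodic review instead.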
Monitoring
Monitoring best practices involve capturing comprehensive data to improve incident response. Hand explains that while over-monitoring is rare, over-alerting can be problematic, and that the true value of monitoring lies in collecting data for post-incident analysis 3. He stresses the importance of having the right information and access to systems when an alert is triggered, so that the response is timely and effective.
Reducing that time to detect and that time to know that there's a problem is definitely one of the early challenges.
---
Hand also highlights the role of monitoring tools in detecting issues such as spikes in CPU usage or disk space shortages; catching these early is critical for maintaining system health 4.
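As a rough illustration of the kind of check a monitoring tool performs, the sketch below samples disk usage with Python's standard library. It is an assumption-laden example rather than anything from the episode: the function names and the 90% default limit are hypothetical. The sample is always returned so it can be stored for later analysis, while the `breached` flag alone would decide whether anyone gets alerted.

```python
import shutil


def disk_used_percent(path: str = "/") -> float:
    """Return the percentage of disk space in use on the volume holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total


def check_disk(path: str = "/", limit_percent: float = 90.0) -> dict:
    """Collect one sample and note whether it breaches the (hypothetical) limit."""
    used = disk_used_percent(path)
    return {
        "path": path,
        "used_percent": round(used, 1),
        "breached": used > limit_percent,  # only a breach should lead to an alert
    }


# Example usage: sample the volume that holds the current directory.
sample = check_disk(".")
print(sample)
```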
Related Episodes


SE-Radio Episode 325: Tammy Butow on Chaos Engineering
SE-Radio Episode 276: Björn Rabenstein on Site Reliability Engineering
SE-Radio Episode 313: Conor Delanbanque on Hiring and Retaining DevOps
SE-Radio Episode 288: DevSecOps
SE-Radio Episode 264: James Phillips on Service Discovery
Episode 7: Error Handling
SE Radio 599: Jason C. McDonald on Quantified Tasks
Episode 134: Release It with Michael Nygard
SE-Radio Episode 355: Randy Shoup Scaling Technology and Organization
SE Radio 555: On Freund on Upskilling
SE Radio 585: Adam Frank on Continuous Delivery vs Continuous Deployment
SE-Radio Episode 247: Andrew Phillips on DevOps
SE-Radio Episode 309: Zane Lackey on Application Security
Episode 78: Fault Tolerance with Bob Hanmer Pt. 2