Published May 9, 2022

Site Reliability Engineering - Eliminating Toil

    Allen Underwood, Joe Zack, and Michael Outlaw discuss eliminating toil in Site Reliability Engineering. They cover tools such as Python's pandas and project management solutions that boost productivity and engineer satisfaction, and they examine the distinctions and collaboration between SRE and DevOps roles.

    Episode Highlights

    • Defining Toil

      In Site Reliability Engineering (SRE), toil is not simply work that one dislikes; it is work that is manual, repetitive, and automatable. Allen Underwood explains that toil grows with the service and provides no enduring value, with examples such as manually updating WordPress sites or handling repetitive on-call duties 1 2. Left unmanaged, these tasks can lead to career stagnation and low morale because they keep engineers from engaging in meaningful projects 3.

      Toil is work that is often manual, repetitive, can be automated, has no real value, and grows as the service does.

      --- Allen Underwood

      Understanding and identifying toil is crucial for maintaining productivity and job satisfaction.

         

    • Eliminating Toil

      Eliminating toil means automating repetitive tasks and improving processes. Joe Zack highlights that automation reduces the likelihood of errors and frees up time for more strategic work 4. At Google, the aim is to keep toil below 50% of an SRE's workload, so that the remaining time goes to developing solutions that improve service reliability and performance 5.

      If you spend more than 50% of your time on toil, it takes away from developers' time for more valuable work.

      --- Joe Zack

      By focusing on engineering solutions, SREs can scale services more efficiently and avoid being bogged down by mundane tasks 6.
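
      As a rough illustration of the 50% guideline mentioned above, the arithmetic can be sketched in a few lines of Python. The names here (WorkLog, toil_fraction, over_toil_cap) and the sample hours are hypothetical and chosen only for this example; the episode does not describe any particular tooling for tracking toil.

```python
from dataclasses import dataclass

# The 50% ceiling on toil cited in the summary above.
TOIL_CAP = 0.50


@dataclass
class WorkLog:
    """Hypothetical record of one engineer's hours for a period."""
    toil_hours: float
    engineering_hours: float


def toil_fraction(log: WorkLog) -> float:
    """Return the share of logged time that was spent on toil."""
    total = log.toil_hours + log.engineering_hours
    if total <= 0:
        raise ValueError("no hours logged")
    return log.toil_hours / total


def over_toil_cap(log: WorkLog, cap: float = TOIL_CAP) -> bool:
    """True when the toil share is strictly above the cap."""
    return toil_fraction(log) > cap


# Made-up sample week: 24 hours of toil out of 40 logged hours.
week = WorkLog(toil_hours=24, engineering_hours=16)
print(f"toil share: {toil_fraction(week):.0%}, over cap: {over_toil_cap(week)}")
```

      A week like the sample one would sit above the cap, which is the situation the quote describes: time spent on toil is taken from more valuable engineering work.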

    Related Episodes