Published Sep 3, 2019

SE-Radio Episode 285: James Cowling on Dropbox’s Distributed Storage System

James Cowling discusses Dropbox's migration from Amazon S3 to Magic Pocket, its in-house distributed storage system, covering the architecture, the logistical hurdles of the move, and the redundancy techniques that made a seamless data migration and high durability possible.
Episode Highlights

  • Zone Redundancy

    James Cowling explains how Dropbox uses geographic zones for data redundancy. Each zone operates autonomously, and all data is stored in at least two zones to protect against large-scale failures such as natural disasters or operational errors [1]. Within each zone, Dropbox uses erasure coding similar to Reed-Solomon to keep data redundant with minimal storage overhead, so durability does not come at the cost of excessive storage [1]. Cowling emphasizes the importance of 100% correctness in data storage: every piece of data must be valid and uncorrupted [2].

    "We make sure that all data is in at least two zones. The two zones run different versions of code."

    ---

    This approach lets Dropbox handle numerous hard drive failures efficiently without compromising data integrity.
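
To make the storage-overhead point concrete, here is a minimal sketch that uses a single XOR parity fragment. This is a deliberately simplified stand-in, not Reed-Solomon and not Dropbox's actual encoding: it tolerates only one lost fragment, and the fragment contents, fragment counts, and function names are all assumptions made for illustration.

```python
from functools import reduce


def storage_overhead(data_fragments: int, parity_fragments: int) -> float:
    """Raw bytes stored per byte of user data for a (k data + m parity) layout."""
    return (data_fragments + parity_fragments) / data_fragments


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(fragments: list[bytes]) -> bytes:
    """Return one parity fragment: the XOR of all equal-length data fragments."""
    return reduce(xor_bytes, fragments)


def reconstruct(survivors: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing data fragment from the survivors and the parity."""
    return reduce(xor_bytes, survivors, parity)


# Hypothetical example data: three 4-byte fragments plus one parity fragment.
fragments = [b"drop", b"box ", b"data"]
parity = encode(fragments)

# Pretend the middle fragment was on a disk that failed.
rebuilt = reconstruct([fragments[0], fragments[2]], parity)

# Three full copies cost 3x raw storage; 3 data + 1 parity costs about 1.33x.
replication_cost = storage_overhead(1, 2)
parity_cost = storage_overhead(3, 1)
```

Reed-Solomon-style codes generalize this idea to several parity fragments, so more than one simultaneous loss can be survived while the overhead stays well below that of full replication.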

       

  • Automated Management

    Automated disk management is a cornerstone of Dropbox's disaster recovery strategy. Cowling describes how the system automatically detects disk failures and re-replicates the affected data without manual intervention [3]. Because the system manages itself, durability stays high and recovery is quick. A failed disk is marked as gone only after its data has been fully re-replicated, so the system never depends on faulty hardware [3].

    "If we lose a disk on a long weekend, we don't want to be vulnerable to data loss because we're waiting for an operator to come around and swap it out."

    ---

    This automation lets Dropbox keep operating without risking data loss, even when a failed disk goes unreplaced for an extended period.
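
The ordering described above (re-replicate first, mark the disk as gone only afterwards) can be sketched roughly as follows. This is an illustrative model under stated assumptions, not Dropbox's implementation: the `Disk` class, the state names, the `min_copies` parameter, and the in-memory sets standing in for stored blocks are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Disk:
    name: str
    state: str = "healthy"  # hypothetical states: healthy -> failed -> gone
    blocks: set = field(default_factory=set)


def handle_disk_failure(failed: Disk, disks: list, min_copies: int = 2) -> bool:
    """Re-replicate every block held by a failed disk, then mark the disk gone.

    Returns True once all blocks again have `min_copies` healthy copies.
    If that cannot be achieved, the disk stays in the "failed" state.
    """
    failed.state = "failed"
    healthy = [d for d in disks if d.state == "healthy"]
    for block in sorted(failed.blocks):
        holders = [d for d in healthy if block in d.blocks]
        if not holders:
            return False  # no healthy copy to read from in this simplified model
        # Prefer the emptiest healthy disks that do not already hold the block.
        spares = sorted(
            (d for d in healthy if block not in d.blocks),
            key=lambda d: len(d.blocks),
        )
        needed = max(0, min_copies - len(holders))
        if needed > len(spares):
            return False  # not enough healthy disks to restore redundancy
        for target in spares[:needed]:
            target.blocks.add(block)  # stands in for copying from a holder
    # Only now is it safe to stop tracking the failed disk.
    failed.state = "gone"
    return True


# Hypothetical cluster: each block starts with two copies.
a = Disk("a", blocks={"x", "y"})
b = Disk("b", blocks={"x"})
c = Disk("c", blocks={"y"})
recovered = handle_disk_failure(a, [a, b, c])
```

Keeping the disk in an intermediate "failed" state until re-replication completes means no operator has to be present for redundancy to be restored, which matches the long-weekend scenario in the quote.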

       

  • Disaster Resilience

    Dropbox's distributed storage system is designed to withstand large-scale disasters through zone-based redundancy and automated recovery. Cowling highlights that each zone handles failures independently, so data remains accessible even if one zone is compromised [1]. Quick re-replication of data across zones minimizes downtime and improves reliability, which is crucial for maintaining data integrity and availability in the face of unexpected events [3].

    "Failure is not an unusual event. When you have hundreds of thousands of disks, they fail every day."

    ---

    Together, zone-based redundancy and automated recovery keep Dropbox's storage system resilient and efficient.
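
As a rough illustration of why storing data in at least two zones keeps it accessible, the sketch below writes each object to every zone and falls back to the next zone when a read fails. The `Zone` class, the `put_object` and `get_object` helpers, the zone names, and the in-memory dictionaries are hypothetical simplifications, not Magic Pocket's API.

```python
from dataclasses import dataclass, field


class ZoneUnavailable(Exception):
    """Raised when a whole zone cannot serve requests."""


@dataclass
class Zone:
    name: str
    available: bool = True
    store: dict = field(default_factory=dict)

    def read(self, key: str) -> bytes:
        if not self.available:
            raise ZoneUnavailable(self.name)
        return self.store[key]


def put_object(zones: list, key: str, value: bytes) -> None:
    """Write the object to every zone (simplified: no partial-failure handling)."""
    for zone in zones:
        zone.store[key] = value


def get_object(zones: list, key: str) -> bytes:
    """Return the object from the first zone that can serve it."""
    for zone in zones:
        try:
            return zone.read(key)
        except (ZoneUnavailable, KeyError):
            continue  # try the next zone
    raise KeyError(key)


# Hypothetical two-zone deployment.
zones = [Zone("zone-a"), Zone("zone-b")]
put_object(zones, "photo.jpg", b"\xff\xd8")

# Simulate losing an entire zone; the read is served by the other one.
zones[0].available = False
served = get_object(zones, "photo.jpg")
```

Because each zone holds a full copy in this model, losing one zone affects where a read is served from but not whether it succeeds.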
