Published Sep 3, 2019

SE-Radio Episode 285: James Cowling on Dropbox’s Distributed Storage System

James Cowling discusses Dropbox's infrastructure shift from Amazon S3 to its own distributed storage system, Magic Pocket, covering the architecture, the logistical hurdles, and the redundancy techniques that enabled the data migration and high durability.

Episode Highlights

Zone Redundancy

James Cowling explains Dropbox's approach to data redundancy through the use of geographic zones. Each zone operates autonomously, and all data is stored in at least two zones to protect against large-scale disasters such as natural calamities or operational errors [1]. Within each zone, Dropbox uses erasure-coding algorithms similar to Reed-Solomon to maintain redundancy with low storage overhead, so durability does not require excessive storage [1]. Cowling emphasizes the importance of 100% correctness in data storage: every piece of data must be valid and uncorrupted [2].

We make sure that all data is in at least two zones. The two zones run different versions of code.

---

This meticulous approach to data management allows Dropbox to handle numerous hard drive failures efficiently while maintaining data integrity.
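
To make the storage-overhead argument above concrete, here is a minimal sketch comparing plain replication with a Reed-Solomon-style erasure code inside one zone, then multiplying by the two-zone requirement. The `RedundancyScheme` class, the `total_overhead` helper, and the 6+3 fragment layout are illustrative assumptions; the episode summary does not state the parameters Dropbox actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedundancyScheme:
    """Hypothetical description of how one zone stores a block."""
    name: str
    data_fragments: int    # fragments that hold the original data
    parity_fragments: int  # extra copies or parity fragments

    @property
    def overhead(self) -> float:
        """Raw bytes stored per byte of user data within one zone."""
        return (self.data_fragments + self.parity_fragments) / self.data_fragments

    @property
    def tolerated_losses(self) -> int:
        """Fragments a zone can lose before the block is unrecoverable."""
        return self.parity_fragments


def total_overhead(scheme: RedundancyScheme, zones: int = 2) -> float:
    """Overhead when every block is stored in `zones` independent zones."""
    return scheme.overhead * zones


# Assumed parameters, chosen only for illustration.
replication = RedundancyScheme("3x replication", data_fragments=1, parity_fragments=2)
erasure = RedundancyScheme("RS-style 6+3", data_fragments=6, parity_fragments=3)

for scheme in (replication, erasure):
    print(
        f"{scheme.name}: {scheme.overhead:.1f}x per zone, "
        f"{total_overhead(scheme):.1f}x across two zones, "
        f"survives {scheme.tolerated_losses} lost fragments per zone"
    )
```

Under these assumed numbers the erasure-coded layout tolerates more lost fragments per zone than triple replication while storing half as many raw bytes, which is the trade-off the summary describes.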

Automated Management

Automated disk management is a cornerstone of Dropbox's disaster recovery strategy. Cowling describes how the system automatically detects disk failures and re-replicates the affected data, eliminating the need for manual intervention [3]. This hands-off approach lets the system manage itself, which keeps durability high and recovery fast. A failed disk is marked as gone only after its data has been fully re-replicated, so the system never depends on faulty hardware [3].

If we lose a disk on a long weekend, we don't want to be vulnerable to data loss because we're waiting for an operator to come around and swap it out.

---

This automated system lets Dropbox keep operating without risking data loss, even when a failed disk goes unreplaced for an extended period.
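
The repair behavior described above can be sketched as a small state machine. This is an assumed, simplified model and not Dropbox's implementation: the `Cluster` class, the `DiskState` names, and the "pick the first healthy disk" placement policy are all hypothetical. The point it illustrates is that a failed disk moves to `GONE` only once none of its blocks depend on it.

```python
from enum import Enum
from typing import Dict, List, Set


class DiskState(Enum):
    HEALTHY = "healthy"
    FAILED = "failed"  # failure detected, re-replication still pending
    GONE = "gone"      # nothing depends on this disk; safe to swap later


class Cluster:
    """Hypothetical in-memory model of block placement on disks."""

    def __init__(self, disk_ids: List[str], replicas: int = 2) -> None:
        self.state: Dict[str, DiskState] = {d: DiskState.HEALTHY for d in disk_ids}
        self.placement: Dict[str, Set[str]] = {}  # block id -> disks holding it
        self.replicas = replicas

    def _healthy(self) -> List[str]:
        return sorted(d for d, s in self.state.items() if s is DiskState.HEALTHY)

    def put(self, block_id: str) -> None:
        targets = self._healthy()[: self.replicas]
        if len(targets) < self.replicas:
            raise RuntimeError("not enough healthy disks")
        self.placement[block_id] = set(targets)

    def report_failure(self, disk_id: str) -> None:
        self.state[disk_id] = DiskState.FAILED

    def repair(self) -> None:
        """One pass of the repair loop; no operator involved."""
        failed = [d for d, s in self.state.items() if s is DiskState.FAILED]
        for disk_id in failed:
            fully_repaired = True
            for holders in self.placement.values():
                if disk_id not in holders:
                    continue
                spare = next((d for d in self._healthy() if d not in holders), None)
                if spare is None:
                    fully_repaired = False  # keep the disk FAILED and retry later
                    continue
                holders.remove(disk_id)
                holders.add(spare)
            if fully_repaired:
                self.state[disk_id] = DiskState.GONE


cluster = Cluster(["d1", "d2", "d3"])
cluster.put("block-a")
cluster.put("block-b")
cluster.report_failure("d1")
cluster.repair()
print(cluster.state["d1"], cluster.placement)
```

In a real system the copy itself takes time and can fail, so the transition to `GONE` would follow a verified copy rather than a set update; the sketch only shows the ordering of the two steps.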

Disaster Resilience

Dropbox's distributed storage system is designed to withstand large-scale disasters through its zone-based redundancy and automated recovery processes. Cowling highlights that each zone handles failures independently, so data remains accessible even if one zone is compromised [1]. The system's ability to re-replicate data quickly minimizes downtime and improves reliability, which is crucial for maintaining data integrity and availability in the face of unexpected events [3].

Failure is not an unusual event. When you have hundreds of thousands of disks, they fail every day.

---
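
A quick back-of-the-envelope calculation shows why daily failures are expected at this scale. The fleet size and the 2% annual failure rate below are assumptions picked for illustration; neither figure comes from the episode summary.

```python
def expected_daily_failures(disk_count: int, annual_failure_rate: float) -> float:
    """Average number of disks expected to fail per day."""
    return disk_count * annual_failure_rate / 365


def prob_at_least_one_failure(disk_count: int, annual_failure_rate: float) -> float:
    """Chance that at least one disk fails on a given day,
    assuming independent failures spread evenly over the year."""
    daily_rate = annual_failure_rate / 365
    return 1 - (1 - daily_rate) ** disk_count


fleet_size = 200_000  # hypothetical "hundreds of thousands of disks"
afr = 0.02            # assumed annual failure rate per disk

print(f"expected failures per day: {expected_daily_failures(fleet_size, afr):.1f}")
print(f"P(at least one failure today): {prob_at_least_one_failure(fleet_size, afr):.5f}")
```

With these assumed inputs the fleet loses roughly ten disks on an average day, so a day without any failure would be the unusual event.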

By leveraging these strategies, Dropbox ensures that its storage system remains resilient and efficient.
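
The combination of zone independence and the correctness requirement can be illustrated with a read path that falls back to a second zone when the first is offline or returns corrupted bytes. This is a sketch under stated assumptions: the `Zone` class, `write_block`, `read_block`, and the use of a SHA-256 hash as the block key are hypothetical choices, not details confirmed by the episode summary.

```python
import hashlib
from typing import Dict, List


class ZoneUnavailable(Exception):
    pass


class Zone:
    """Hypothetical stand-in for one autonomous storage zone."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.online = True
        self._blocks: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blocks[key] = data

    def get(self, key: str) -> bytes:
        if not self.online:
            raise ZoneUnavailable(self.name)
        return self._blocks[key]


def block_key(data: bytes) -> str:
    # Assumption: blocks are addressed by the hash of their content.
    return hashlib.sha256(data).hexdigest()


def write_block(zones: List[Zone], data: bytes) -> str:
    """Store the block in every zone and return its key."""
    key = block_key(data)
    for zone in zones:
        zone.put(key, data)
    return key


def read_block(zones: List[Zone], key: str) -> bytes:
    """Return the first copy that is reachable and passes the hash check."""
    for zone in zones:
        try:
            data = zone.get(key)
        except (ZoneUnavailable, KeyError):
            continue  # zone down or block missing: try the next zone
        if block_key(data) == key:
            return data
    raise LookupError(f"no valid copy of {key} in any zone")


west, east = Zone("west"), Zone("east")
key = write_block([west, east], b"user file chunk")
west.online = False  # simulate losing an entire zone
print(read_block([west, east], key))
```

Verifying the hash on every read means a corrupted copy in one zone is treated the same way as an unavailable zone, so the caller only ever sees valid data or an explicit error.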

Related Episodes

SE Radio 619: James Strong on Kubernetes Networking
SE Radio 592: Jaxon Repp on Distributed Data Infrastructure
Episode 217: James Turnbull on Docker
SE-Radio Episode 264: James Phillips on Service Discovery
SE Radio 571: Jeroen Mulder on Multi-Cloud Governance
SE-Radio Episode 259: John Purrier on OpenStack
Episode 369: Derek Collison on Messaging Systems and NATS
SE Radio 560: Sugu Sougoumarane on Distributed SQL Databases
Episode 498: James Socol on Continuous Integration and Continuous Delivery (CICD)
SE-Radio Episode 354: Avi Kivity on ScyllaDB
Episode 216: Adrian Cockcroft on the Modern Cloud-based Platform
SE Radio 631: Abhay Paroha on Cloud Migration for Oil and Gas Operations
SE-Radio Episode 334: David Calavera on Zero-downtime Migrations and Rollbacks with Kubernetes
SE-Radio Episode 314: Scott Piper on Cloud Security
SE-Radio Episode 261: David Heinemeier Hansson on the State of Rails, Monoliths, and More

Topics covered

Scaling Infrastructure

James Cowling of Dropbox shares insights into their transition from Amazon's S3 to a proprietary distributed storage system. He details the logistical, network, and architectural challenges faced and the strategies employed to overcome them.

Migration to Magic Pocket

James Cowling discusses Dropbox's strategic migration from Amazon S3 to their proprietary storage system, Magic Pocket. He highlights the importance of seamless transition strategies, operational challenges, and data integrity measures in ensuring a successful migration.
