I just read this article and paper on Facebook’s new storage system. The key details are that the new “[f4] storage systems uses Reed-Solomon coding and lays blocks out on different racks to ensure resilience to disk, machine, and rack failures within a single data center. It uses XOR coding in the wide-area to ensure resilience to data center failures.”
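To make the wide-area XOR idea concrete, here is a minimal sketch in Python. It is an illustration of XOR parity in general, not Facebook's implementation: the function name `xor_blocks`, the three-data-center layout, and the sample byte strings are all assumptions made for the example.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))


# Hypothetical layout: one block in each of two data centers,
# and their XOR stored in a third.
block_dc1 = b"photo-bytes-A..."
block_dc2 = b"photo-bytes-B..."
parity_dc3 = xor_blocks(block_dc1, block_dc2)

# If data center 1 is lost, its block is rebuilt from the other two.
recovered_dc1 = xor_blocks(block_dc2, parity_dc3)
```

Note that the rebuild has to read from both surviving sites, which is why wide-area bandwidth matters so much for this kind of scheme.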
The article goes on to state that f4 has been running in production at Facebook for over 19 months: “f4 currently stores over 65PB of logical data and saves over 53PB of storage.” That raises a question about the amount of time and network bandwidth required to do this. Can most companies afford it?
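A rough back-of-the-envelope calculation shows the scale of that question. The sketch below is only an illustration: the link speeds are hypothetical, the function `transfer_days` is made up for this example, and using the full 65PB is an upper bound chosen for simplicity, not the amount a single data center failure would actually require moving.

```python
def transfer_days(petabytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Days needed to move `petabytes` (decimal PB) over a link of `link_gbps`.

    `efficiency` is an assumed fraction of the link that is usable (1.0 = ideal).
    """
    bits = petabytes * 1e15 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86_400


# Hypothetical wide-area link speeds, in gigabits per second.
for gbps in (1, 10, 100):
    print(f"{gbps:>4} Gbps: {transfer_days(65, gbps):,.0f} days")
```

Even on an ideal, fully dedicated 100 Gbps link, moving 65PB takes roughly 60 days, and a tenfold slower link takes about ten times as long.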
I commend Facebook for coming up with this idea. But I am not sure it will work for most companies, given the network bandwidth and latency required to replicate the data in case of a data center failure. It is also not surprising that Facebook did not go with a commercial solution for its problem, even though there are multiple commercial solutions for Reed-Solomon-encoded data replicated across data centers. It’s likely Facebook has special-case requirements and the engineers to do the work, which is not the case for most of the rest of the world.
The question is: will cloud replication methods drive the storage industry and the large companies that currently dominate it? Or will the industry step up and provide something that meets the future requirements of most companies?
Facebook’s network bandwidth and the cloud industry’s network bandwidth are far greater than the average company’s, but those average companies want and need similar reliability. Disk drive AFR has improved dramatically over the years, but the hard error rate has not changed in almost 10 years. And the silent data corruption rate of the channels connecting the disk drives has not changed in even longer. Storage companies, Facebook just issued you a challenge.