Distributed Object Storage
For high-scale systems, there is tremendous value to be derived from the ability to distribute data. Distributing data can make storage scalable beyond the boundaries of a single site or location.
Moreover, distributing data with the right architecture can, counterintuitively, enhance data availability. Scattering data across the Internet seems as if it would expose data access to more risk of outage, and with the wrong architecture, it can. But with the right architecture, data distribution can make data more available through increased redundancy and more paths for access. It thereby offsets the increased risk and complexity that arise with extreme scale and dependencies on multiple locations.
We’ve taken to calling storage systems that handle this distribution the right way “Distributed Object Storage.”
Currently, we generally see two approaches among object-based storage systems: full object replication and an approach rooted in “Information Dispersal Algorithms” (IDA).
IDA was a theoretical approach to storing information first proposed in 1989. It's based on erasure coding technology that far predated information dispersal. Whether discussed as information dispersal or erasure coding, the term implies that data can be distributed in sub-file chunks across many systems or system components, and then reassembled from a subset of those chunks: perhaps 7 out of 8, 11 out of 15, or 9 out of 12. The number of chunks and the level of protection can be varied by application.
Erasure coding has been used in many other ways in the data storage industry (from DVDs to BitTorrent to RAID). In high-performance and low-latency environments, standardized approaches have suffered from computational overhead, and they remain a focal point for continued research. But in the environment of the Internet and cloud solutions, these standardized approaches excel.
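To make the "any subset of chunks" idea concrete, here is a minimal sketch of the simplest possible case: k data chunks plus one XOR parity chunk, so that any k of the k+1 chunks can rebuild the object (the 7-of-8 case when k is 7). This is an illustration only, not any vendor's algorithm. Schemes such as 11 of 15 tolerate more than one lost chunk and need a stronger code (typically Reed-Solomon), and the function names `disperse` and `reassemble` are our own.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def disperse(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal-size chunks and append one XOR parity chunk."""
    size = -(-len(data) // k)                # ceiling division
    padded = data.ljust(size * k, b"\0")     # pad so the chunks are equal length
    chunks = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = reduce(xor_bytes, chunks)
    return chunks + [parity]                 # index k holds the parity chunk


def reassemble(chunks: dict[int, bytes], k: int, length: int) -> bytes:
    """Rebuild the original bytes from any k of the k+1 chunks.

    `chunks` maps chunk index -> chunk bytes for the chunks that survived.
    `length` is the original object length, used to strip the padding.
    """
    missing = [i for i in range(k) if i not in chunks]
    if len(missing) > 1 or (missing and k not in chunks):
        raise ValueError("need at least k of the k+1 chunks")
    if missing:
        # XOR of every surviving chunk (parity included) equals the lost chunk.
        chunks = dict(chunks)
        chunks[missing[0]] = reduce(xor_bytes, chunks.values())
    return b"".join(chunks[i] for i in range(k))[:length]


if __name__ == "__main__":
    obj = b"Distributed Object Storage"
    pieces = disperse(obj, 7)                                   # 8 chunks total
    survivors = {i: c for i, c in enumerate(pieces) if i != 3}  # lose chunk 3
    print(reassemble(survivors, 7, len(obj)) == obj)
```

In a real deployment each chunk would be written to a different system or site, and the original length would travel with the object's metadata.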
Today there are several product vendors using information dispersal approaches with fairly standard algorithms, including Amplidata, Caringo, Cleversafe, EMC, HDS, Scality, and Symform.
For the purposes of distribution, erasure coding has tremendous efficiency implications in both storing and retrieving data. With erasure coding, single transactions can fully store data in a protected manner, without the extra transactional steps typically involved in replication-based models. Once stored, data can be retrieved with optimal performance—by selecting the bits from repositories with the lowest latency—and is simultaneously protected against service outage from individual storage repositories or locations.
Taneja Group has formally labeled information dispersal approaches as "Distributed Object Storage technologies." And we have labeled products that work with full replicas of objects as "Replicated Object Storage technologies."
The key capabilities of Distributed Object Storage solutions based on dispersal and erasure coding stand to have high impact in cloud application and storage systems, and they merit a closer look. Specifically, Distributed Object Storage architectures (versus Replicated Object Storage) stand to create significant differences in the most fundamental capabilities that drive customers to cloudy object storage solutions in the first place: enhanced data access, availability, and data distribution.
Let’s take a look:
Efficient distribution of data. Distributed Object Storage delivers more efficient distribution of data. It can scatter sub-file chunks across multiple systems and create the parity required for reassembly from a subset of those chunks, without the overhead of the full object or file system replicas that replicated object storage requires. Dispersal may require only 30% to 60% overhead, versus the 200% or 300% consumed by completely replicating objects. Moreover, the chunks and their parity are created at the time of file creation, usually by an operation driven by the storage system or a gateway, rather than by a slower and more vulnerable post-write replication job.
Enhanced availability. Distributed Object Storage protects availability by serving up data from a subset of chunks. As a result, organizations require only a portion of the original storage systems to be available at any given time. Across the Internet, an organization with many locations can get global data availability that may surpass that of even Tier 1 mission-critical storage arrays within any single site.
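The overhead comparison can be checked with simple arithmetic. The sketch below assumes "overhead" means extra raw capacity beyond the original data: for a scheme that rebuilds from k of n chunks, that is (n - k) / k, and for replication it is the number of copies minus one. The function names are ours, and the k-of-n pairs are the examples mentioned earlier.

```python
def dispersal_overhead(k: int, n: int) -> float:
    """Extra capacity, as a fraction of the data size, for a k-of-n scheme."""
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    return (n - k) / k


def replication_overhead(copies: int) -> float:
    """Extra capacity, as a fraction of the data size, for full replicas."""
    if copies < 1:
        raise ValueError("need at least one copy")
    return float(copies - 1)


if __name__ == "__main__":
    for k, n in [(7, 8), (11, 15), (9, 12)]:
        print(f"{k} of {n}: {dispersal_overhead(k, n):.0%} overhead")
    for copies in (3, 4):
        print(f"{copies} full copies: {replication_overhead(copies):.0%} overhead")
```

Under this definition, the 11-of-15 and 9-of-12 examples come in at roughly a third of the data size, while three or four full copies cost 200% or 300%.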
In addition, a single site or region failure may be inconsequential to the other 90% of global users with access to other data centers, and this can reduce the single points of risk. For many customers, this heightened availability justifies their pursuit of distributed object storage behind key applications.
Replicated approaches can deliver similar data availability, but with much greater transmission overhead, greater latency before reaching a protected state, and more dire consequences for accessibility during a single system failure. Typically, all customers of a single system experience an outage and have few means of automated redirection.
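The availability argument above can be put in rough numbers. The sketch below makes two simplifying assumptions that real deployments will not meet exactly: each chunk sits at a different site, and sites fail independently with the same availability p. Given that, the chance that at least k of n sites are reachable is a binomial tail sum. The 99% per-site figure and the function names are illustrative, not measurements.

```python
from math import comb


def k_of_n_availability(k: int, n: int, p: float) -> float:
    """Probability that at least k of n independent sites are reachable."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def replica_availability(copies: int, p: float) -> float:
    """Probability that at least one of `copies` independent replicas is reachable."""
    return 1 - (1 - p) ** copies


if __name__ == "__main__":
    p = 0.99  # assumed per-site availability
    print(f"single site:     {p:.7f}")
    print(f"11 of 15 chunks: {k_of_n_availability(11, 15, p):.7f}")
    print(f"3 full replicas: {replica_availability(3, p):.7f}")
```

This models only whether enough data is reachable somewhere. It says nothing about whether a given user is automatically redirected to it, which is where the two approaches differ in practice.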
Enhanced and efficient access. Distributed object architectures also supply a superb answer to the challenges of accessing data across the latency-prone global Internet. By requiring only partial data fragments for complete data reassembly, Distributed Object Storage solutions can pick and choose storage locations with the best response time. This can approximate some of the capabilities of Content Delivery Networks by making sure data access is provided as close as possible to the end node requesting it.
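As a rough illustration of "pick and choose," the sketch below selects the k chunk locations with the lowest measured latency for one read. It assumes the client already has a recent latency measurement per location, with `None` marking a location that is currently unreachable. The site names and the `pick_fastest` function are hypothetical, and a production system would also weigh load, cost, and chunk health.

```python
import heapq
from typing import Optional


def pick_fastest(latency_ms: dict[str, Optional[float]], k: int) -> list[str]:
    """Return the k reachable locations with the lowest latency, fastest first."""
    reachable = {site: ms for site, ms in latency_ms.items() if ms is not None}
    if len(reachable) < k:
        raise RuntimeError("fewer than k chunk locations are reachable")
    return heapq.nsmallest(k, reachable, key=reachable.get)


if __name__ == "__main__":
    # Hypothetical measurements for a 3-of-5 layout; one site is down.
    measured = {
        "site-a": 12.0,
        "site-b": 85.0,
        "site-c": None,
        "site-d": 31.0,
        "site-e": 140.0,
    }
    print(pick_fastest(measured, 3))
```

Because any k chunks suffice, the read completes as soon as the slowest of the chosen k responds, and an outage at one site simply removes it from the candidates.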
The best such architectures also accelerate data access by intelligently placing data chunking and reassembly functions in optimal places for the best performance—often gateway type devices. They may also streamline chunking and reassembly functionality so that it can be used natively in software code and on lightweight devices. Once again, replicated approaches create dependencies on single data locales where system failures or connectivity outages can easily take data offline for sets of customers.
Finally, as an added benefit, erasure coding can enhance security, and thereby make platforms even more suitable for multi-tenancy use cases behind the cloud.
Among the solutions on the market best suited for distribution, there is still much variation. In our view, a few key areas stand out as the most critical dimensions to assess for any high scale and/or cloudy storage architecture that will cross sites and offer a long lifespan. Look to your vendors to provide sufficient depth and validation in each of these areas:
- Reliability: How reliable is the vendor’s solution for petabyte-plus data volumes? Does the vendor allow flexibility in setting reliability levels? What features does the vendor offer to ensure reliability (integrity checks, error correction, etc.)?
- Scalability: How large can the vendor’s solution grow, and is there a limit? How easy is it to scale capacity and performance?
- Security: Security features vary by vendor. What type of encryption is offered, is key management required, and so on?
- TCO: How do vendor costs compare, not only for hardware and software but also for power, cooling, floor space, admin staffing, etc.?
Finally, many of these solutions have been in the market long enough to have a track record. Efficient Distributed Object Storage requires architectural underpinnings that best show up at scale and are hard if not impossible to evaluate in a test lab. Track records are more important than ever.