There are a multitude of big data storage products on the market. Which ones are best? Clearly, there is no simple answer. The variables in choosing a big data storage tool include the existing environment, the current storage platform, growth expectations, the size and type of files, and the database and application mix, among others.
Although this is far from a complete list, here are some top big data storage options to consider.
Big Data Storage: Top Contenders
Hitachi offers several routes to big data storage: Big Data Analytics with Pentaho Software, the Hitachi Hyper Scale-Out Platform (HSP), the HSP Technical Architecture and the Hitachi Video Management Platform (VMP). The last of these targets the growing subset of big data known as big video, addressing video surveillance and other video-intensive storage applications.
Similarly, DataDirect Networks (DDN) offers a collection of big data storage solutions. For example, its high-performance SFA7700X file storage can be automatically tiered to its WOS object storage archive to support rapid collection, simultaneous analysis and cost-effective retention of big data.
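Automated tiering of this kind is typically policy-driven. The sketch below shows one simple age-based policy in Python; it is an illustration of the general idea only, not DDN's mechanism, and the function and directory names are invented for the example.

```python
import os
import shutil
import time

def tier_old_files(fast_dir, archive_dir, max_age_seconds):
    """Move files older than max_age_seconds from the fast tier to the
    archive tier. Illustrative age-based policy, not DDN's implementation."""
    moved = []
    now = time.time()
    for name in os.listdir(fast_dir):
        path = os.path.join(fast_dir, name)
        # Only files whose last modification is older than the threshold move.
        if os.path.isfile(path) and now - os.path.getmtime(path) > max_age_seconds:
            shutil.move(path, os.path.join(archive_dir, name))
            moved.append(name)
    return moved
```

A real tiering engine would also consider access frequency and capacity thresholds, but the shape of the policy loop is the same.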
“The Scripps Research Institute uses this for its cryo-electron microscopy (Cryo-EM), which is collecting more than 30 TB of data each week in the search for cures for HIV, Ebola, Zika and major neurological diseases,” said Michael King, Senior Director of Marketing Strategy and Operations, DDN. “In the past, it could take a year or more to analyze protein structure and develop antibodies. Cryo-EM completes that discovery process in weeks.”
Spectra Logic’s BlackPearl Deep Storage Gateway provides an object storage interface to SAS-based disk, SMR spin down disk or tape. Any or all of these technologies can be put behind BlackPearl in a storage environment.
An alternative platform for big data storage is offered by Kaminario. While it does not offer a classic big data appliance, its all-flash arrays are finding a home beside many big data applications.
“As developers incorporate real-time analytics into their applications, storage infrastructure strategies must be able to manage big data analytics workloads alongside traditional transaction processing workloads,” said Shachar Fienblit, Chief Technology Officer, Kaminario. “The Kaminario K2 all-flash array is built to support this dynamic workload environment.”
Caringo was founded in 2005 with the goal of unlocking the value of data and solving issues with data protection, management, organization and search at massive scale. Its flagship product, Swarm, eliminates the need to migrate data into disparate solutions for long-term preservation, delivery and analysis, thereby lowering total cost of ownership. It is already used by 400+ organizations worldwide (such as the Department of Defense, the Brazilian Federal Court System, City of Austin, Telefónica, British Telecom, Ask.com, and Johns Hopkins University).
“To simplify the ingestion of data to Swarm, we have FileFly (for Windows File Servers and NetApp Servers) and SwarmNFS (providing a fully functional NFSv4 infrastructure),” said Tony Barbagallo, Vice President of Product, Caringo.
The Infogix Enterprise Data Analysis Platform is based on five core capabilities – data quality, transaction monitoring, balancing and reconciliation, identity matching and behavior profiling, and predictive models. These capabilities are said to help companies improve operational efficiency, drive new revenue, ensure compliance and gain competitive advantages. The platform can detect data errors in real time, where they occur, and apply automated end-to-end analysis to optimize the performance of big data projects.
Avere Hybrid Cloud
Yet another approach to big data storage comes from Avere, whose Avere Hybrid Cloud is deployed across several use cases. Its physical FXT clusters address a NAS optimization use case, placing an all-flash performance tier in front of existing disk-based NAS systems. FXT clusters use caching to automatically accelerate active data, cluster to scale performance (adding CPUs and DRAM) and capacity (adding SSDs), and hide the latency of the core storage, which is sometimes deployed over a WAN. Users find it a good way to accelerate rendering, genomic analysis, financial simulations, software tools and binary repositories.
In the File Storage for Private Object use case, users are looking to move from NAS to private object storage. They tend to like private object for its efficiency, simplicity, and resiliency, but don’t like its performance or its object-based API interface. In this use case, the FXT cluster accelerates the performance of the private object storage in the same way as in the NAS optimization use case.
“In addition, the FXT cluster provides familiar NAS protocols with translation to object APIs on the storage side, so users can harness object storage without rewriting their applications or changing their data access methods,” said Jeff Tabor, Senior Director of Product Management and Marketing at Avere Systems.
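The gateway pattern Tabor describes, presenting familiar file semantics in front of a flat object namespace, can be sketched in a few lines. The classes below are a toy illustration of the translation idea, not Avere's implementation; all names are invented for the example.

```python
class ObjectStore:
    """Toy object store: a flat namespace of keys accessed via PUT/GET."""
    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]


class FileGateway:
    """Presents a file-path interface and translates each call into
    object-API operations, so applications keep their existing data
    access methods. A sketch of the concept only."""
    def __init__(self, store):
        self.store = store

    def write_file(self, path, data):
        # Map the hierarchical file path onto a flat object key.
        self.store.put(path.strip("/"), data)

    def read_file(self, path):
        return self.store.get(path.strip("/"))
```

A production gateway would also translate directory listings, permissions and locking, which is where most of the real complexity lives.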
Finally, the Cloud Storage Gateway use case is similar to the File Storage for Private Object use case with the added benefit that enterprises can begin to build fewer data centers and move their data to the cloud. Latency is one of the challenges to be overcome in this use case and that’s what the physical FXT cluster addresses. On access, data is cached on premises on the FXT cluster so all subsequent accesses occur at low latency. An FXT cluster can have as much as 480TB of total caching capacity, so large volumes of data can be stored on premises to avoid the latency of the cloud.
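The on-premises caching described above is essentially a read-through cache: the first access pays the cloud round trip, and every later access is served locally. The minimal sketch below shows the idea with illustrative names; it is not Avere's code.

```python
class ReadThroughCache:
    """On first access, fetch from the (high-latency) backing store and
    keep a local copy; subsequent reads are served from the cache.
    A minimal sketch of the read-through caching idea."""
    def __init__(self, backing_fetch):
        self.backing_fetch = backing_fetch  # e.g., a cloud GET call
        self.cache = {}
        self.misses = 0  # each miss represents one high-latency round trip

    def read(self, key):
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.backing_fetch(key)
        return self.cache[key]
```

A real cache of this kind also needs an eviction policy (the 480 TB figure above is the bound on how much can stay resident) and invalidation when the backing data changes.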
Big data is generally stored on local disk, which means the logical connection between compute and storage must be maintained in order to achieve efficiency and scaling as big data cluster sizes continue to grow. The question then is: how do you disaggregate disks from the server yet continue to provide the same logical relationship between the CPU/memory composite and the drives? How do you achieve the cost, scale and manageability efficiencies of a shared storage pool while still providing the benefits of locality? This is the problem DriveScale is said to solve for Hadoop data stores.
However, storage professionals looking to install and manage resources for big data applications are primarily constrained by the Hadoop architecture, which is inherently optimized for local drives on servers. As data volumes increase, the only recourse is to purchase ever more servers, not only to meet compute requirements but also to provide higher storage capacity. DriveScale allows users to procure storage capacity independently of compute capacity, enabling right-sizing at each level.
“There is no reason why the advantages of the proprietary scale-up infrastructure environments everyone is accustomed to in the data center cannot be brought to the commodity scale-out world,” said S.K. Vinod, Vice President of Product Management, DriveScale. “We give IT administrators the tools to build and run an elastic big data infrastructure where server and disk subsystems are disaggregated and re-composed on the fly as needed. Individual drives are provisioned to servers from a shared pool of JBOD attached disks, thus eliminating the cost disproportions.”
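The disaggregation Vinod describes, composing drives onto servers from a shared JBOD pool and returning them when no longer needed, can be modeled very simply. The class below is an illustrative sketch of that pooling idea, with invented names, not DriveScale's product logic.

```python
class DrivePool:
    """Shared pool of JBOD drives composed onto servers on demand, so
    storage capacity scales independently of compute. Illustrative
    sketch of the disaggregation idea only."""
    def __init__(self, drive_ids):
        self.free = list(drive_ids)
        self.assigned = {}  # server name -> list of drive ids

    def provision(self, server, count):
        """Attach `count` drives from the shared pool to a server."""
        if count > len(self.free):
            raise ValueError("not enough free drives in the pool")
        drives = [self.free.pop() for _ in range(count)]
        self.assigned.setdefault(server, []).extend(drives)
        return drives

    def release(self, server):
        """Return a server's drives to the shared pool for reuse."""
        self.free.extend(self.assigned.pop(server, []))
```

The point of the model is the lifecycle: drives move between the pool and servers without any physical re-cabling, which is what makes right-sizing compute and storage independently possible.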
The Hedvig Distributed Storage Platform provides a unified solution that lets you tailor high-performance storage built on low-cost commodity hardware to support any application, hypervisor, container, or cloud. It is said to provide storage for any compute at any scale for block, file and object storage, with programmability and support for any OS, hypervisor or container. In addition, hybrid multi-site replication protects each application with a unique disaster recovery policy and delivers high availability with a storage cluster that spans multiple data centers or clouds. Finally, advanced data services let users customize storage with a range of enterprise services that are selectable per volume.
“For Hadoop this is critical if you may want some features to be handled by HDFS, and other features to be handled by the storage platform,” said Avinash Lakshman, CEO and Founder, Hedvig.
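Lakshman's point about splitting responsibilities between HDFS and the storage platform comes down to per-volume policy selection. The sketch below illustrates the concept with invented names and parameters; it is not Hedvig's API.

```python
class Volume:
    """A storage volume with per-volume data services, illustrating the
    idea that each volume selects its own policies. Names and defaults
    are invented for this example."""
    def __init__(self, name, replication_factor=3, dedup=False, dr_site=None):
        self.name = name
        self.replication_factor = replication_factor
        self.dedup = dedup
        self.dr_site = dr_site  # optional remote site for disaster recovery

# A Hadoop volume can leave replication to HDFS (factor 1 at the storage
# layer), while a VM volume uses platform replication and remote DR.
hdfs_vol = Volume("hadoop-data", replication_factor=1, dedup=False)
vm_vol = Volume("vm-images", replication_factor=3, dedup=True, dr_site="dc-west")
```

The design choice being illustrated: rather than one cluster-wide policy, each workload picks the combination of services that avoids duplicating what the application already does.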
The Nimble Storage Predictive Flash Platform is said to dramatically improve the performance of analytic applications and big data workloads. It achieves this by combining flash performance with predictive analytics to remove barriers to data velocity caused by IT complexity.