Hadoop is immensely popular today because it makes big data analysis cheap and simple: you get a cluster of commodity servers and use their processors as compute nodes to do the number crunching, while their internal direct-attached storage (DAS) operates as very low-cost storage nodes.
The benefit of this is that the storage is close to the processing so your data doesn't need to be shunted around much, and you can add more storage (and more compute power) just by adding more cheap server nodes with low cost SATA drives.
The Hadoop Distributed File System (HDFS) takes responsibility for organizing and marshalling all this storage. To add an element of data protection, fault tolerance and cluster resilience to your Hadoop cluster, HDFS triplicates all the data: in addition to the original, one copy is stored on a different storage node in the same rack, and the other is sent to a node in another rack.
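As a rough illustration of the placement rule described above, the Python sketch below picks three nodes for one block of data. The topology dictionary, the node names and the place_replicas function are hypothetical and are not part of any Hadoop API; the real HDFS placement policy is pluggable and its details vary between versions.

```python
import random


def place_replicas(topology, writer_node, rng=None):
    """Choose three nodes for one block, following the rule described above.

    topology maps a rack name to the list of node names in that rack.
    Assumes at least two racks and at least two nodes in the writer's rack.
    """
    rng = rng or random.Random()
    # Find the rack that contains the node writing the block.
    local_rack = next(rack for rack, nodes in topology.items() if writer_node in nodes)
    # Second copy: a different node in the same rack.
    second = rng.choice([n for n in topology[local_rack] if n != writer_node])
    # Third copy: any node in a different rack.
    remote_rack = rng.choice([rack for rack in topology if rack != local_rack])
    third = rng.choice(topology[remote_rack])
    return [writer_node, second, third]


# Hypothetical two-rack cluster.
topology = {
    "rack1": ["node1", "node2", "node3"],
    "rack2": ["node4", "node5", "node6"],
}
replicas = place_replicas(topology, "node1", random.Random(0))
print(replicas)
```

Losing a single disk, a single node or even a whole rack still leaves at least one copy of the block reachable, which is the point of spreading the copies this way.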
This approach has benefits and drawbacks. Triplicating data is simple and effective, but when you are using large data sets this can result in huge volumes of data. The good news is that if a disk does fail you simply stick a new one in and you are back in business – there is no RAID rebuild time and little in the way of processing overhead to provide this data protection.
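To put a number on "huge volumes of data", here is a back-of-the-envelope sketch in Python. The drive count and drive size per node are illustrative assumptions, not figures from any vendor, and the nodes_needed function is made up for this example.

```python
import math


def nodes_needed(dataset_tb, replication=3, drives_per_node=12, drive_tb=4):
    """Return (raw TB required, number of DAS nodes required).

    drives_per_node and drive_tb are assumed values for a commodity server.
    """
    raw_tb = dataset_tb * replication          # every block is stored `replication` times
    node_tb = drives_per_node * drive_tb       # raw capacity of one server
    return raw_tb, math.ceil(raw_tb / node_tb)


for dataset_tb in (100, 500, 1000):
    raw_tb, nodes = nodes_needed(dataset_tb)
    print(f"{dataset_tb} TB of data -> {raw_tb} TB of raw disk, about {nodes} nodes")
```

Under these assumptions, 100 TB of data needs 300 TB of raw disk, or seven 48 TB servers, before leaving any headroom for temporary files or growth.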
HDFS was designed to support batch-style MapReduce queries (the original processing model Hadoop used to do its magic). But with the release of Hadoop 2 and YARN (Yet Another Resource Negotiator), Hadoop became more than just a system for running the MapReduce algorithm. Suddenly it was possible to run other processing systems like Tez, HBase, Storm, Giraph, Spark and more.
"What's happened is that Hadoop has become a task scheduler, doing MapReduce and other things that can take advantage of the Hadoop cluster," says Mike Matchett, a senior analyst at Taneja Group. "So now we are seeing people say, 'Why not use Hadoop to store all your data - a data lake, or what IBM calls a data refinery?'"
But there's a problem with this: DAS may be cheap, but there's a reason that most enterprises have moved away from it for "conventional" storage, adopting instead more sophisticated enterprise storage systems. DAS lacks several vital enterprise storage capabilities. The most important, Matchett says, include:
· Compliance and regulatory controls
· Security, access and audit controls
· Multi-site data protection
· Disaster recovery / business continuity
· Workflow integration and multi-user sharing
· System manageability at scale
· Consistent performance and multiple workloads
"What can manage data like that? Enterprise data storage!" he says.
Of course there are problems associated with using Hadoop with external (non-DAS) storage. Most obviously, moving the data away from the compute nodes introduces potential latency issues and bandwidth constraints, and adding (for example) Fibre Channel HBAs to a large number of compute nodes could be prohibitively expensive.
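The bandwidth constraint can be sketched with some simple arithmetic. In the Python example below, every figure (disk throughput, link speeds, array bandwidth, cluster size) is an assumption chosen only for illustration, and the function names are made up for this sketch.

```python
def das_throughput_mb_s(nodes, disks_per_node, disk_mb_s):
    """Aggregate read rate when every node reads from its own local disks."""
    return nodes * disks_per_node * disk_mb_s


def shared_throughput_mb_s(nodes, nic_gbps, array_gbps):
    """Aggregate read rate when all nodes read from one external array.

    The cluster is limited by the smaller of the combined node links
    and the array's own front-end bandwidth. 1 Gbps is taken as 125 MB/s.
    """
    return min(nodes * nic_gbps, array_gbps) * 1000 / 8


def scan_minutes(dataset_tb, throughput_mb_s):
    """Time to read the whole data set once (1 TB = 1,000,000 MB)."""
    return dataset_tb * 1_000_000 / throughput_mb_s / 60


das = das_throughput_mb_s(nodes=20, disks_per_node=12, disk_mb_s=100)
shared = shared_throughput_mb_s(nodes=20, nic_gbps=10, array_gbps=80)
print(f"DAS:    {das:,.0f} MB/s, full scan of 100 TB in {scan_minutes(100, das):.0f} min")
print(f"Shared: {shared:,.0f} MB/s, full scan of 100 TB in {scan_minutes(100, shared):.0f} min")
```

With these made-up numbers the local disks deliver more than twice the aggregate bandwidth of the shared array. A faster array or network would change the outcome, which is why sizing matters so much with external storage.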
There are other issues, too. Enterprise storage arrays are expensive, they may have scaling limitations, and unless they are dedicated to the Hadoop cluster your other workloads could compete or interfere with them, Matchett points out.
So what are the storage options for Hadoop? Matchett identifies several:
1. DAS - the original Hadoop architecture. This can be enhanced by adding flash devices and tiering software to improve performance and potentially allow each node to answer more queries, but this does nothing to address the lack of enterprise storage services.
2. Enhanced DAS - Hadoop with a proprietary data storage software layer in place of HDFS.
3. SAN/NAS/appliance - an enterprise storage system or appliance.
4. Virtualization - Hadoop running on virtual machines.
5. Cloud - cloud "clustered storage" like Red Hat Storage.
Let's take a look at these latter four.
One storage approach is to continue using DAS in the way originally envisaged by Hadoop, but to replace the HDFS data storage layer with something more sophisticated.
That's effectively what is achieved if you use MapR's distribution of Hadoop. San Jose-based MapR's distribution discards HDFS - which, remember, is optimized for big block reading and append writing (not rewriting) - and instead uses a storage system that offers point in time snapshots, mirroring and other data services to provide high availability, disaster recovery, security and full data protection.
It also enables NFS reading and writing, providing fine-grained access to your data and allowing other applications (such as SQL applications) to use the data as well as MapReduce (or whatever else you may be using for Big Data analysis). Effectively it allows Hadoop data to be accessed as network attached storage (NAS) with read-write capabilities.
"MapR isn't selling hardware or storage – they are just adding storage capabilities to Hadoop in software," says Matchett. But you are losing open source purity – not that that will necessarily put many companies off, he adds.
The SAN/NAS/appliance option involves hardware such as EMC's Isilon scale-out NAS storage, NetApp FAS storage, DDN's hScaler array, or Cleversafe's Slicestor appliances.
The approach taken with EMC's Isilon is to move HDFS out of the Hadoop cluster so that Hadoop compute nodes talk to the Isilon storage array – which has native HDFS integration. That means no replicating your existing data to the relative wild west of Hadoop clusters before it can be analyzed, and you can even expose your data to multiple instances of Apache Hadoop distributions from different vendors simultaneously, according to EMC.
"You get all the benefits of a scale out system, with capacity optimizations like RAID, deduplication and so on," says Matchett. "But, you still have to use a network connection (to the compute nodes) so there may be a sizing sweet spot, and the storage is not commodity priced," he says.
The NetApp Open Solution for Hadoop approach uses high-end FAS storage arrays to provide storage for Hadoop namenodes (which hold metadata) and dedicated SAS drives for each slave HDFS data node. This provides high reliability, and high performance IO to each node (which means you may get away with fewer data nodes). "NetApp are saying, 'Instead of running DAS, use cheap RAID storage on every other node, but run high-end storage with RAID controllers and flash on namenodes,'" says Matchett.
Again, the disadvantage is the cost. "This approach has merit," says Matchett, "but it could be the worst of all worlds too if you have to add NetApp arrays all the time."
Cleversafe's Hadoop solution uses its dispersed storage technology on Slicestor nodes as Hadoop storage nodes with an HDFS-compatible interface. Using dispersed storage technology reduces the amount of data required for redundancy (improving storage efficiency) and means that existing customers that use Cleversafe for general archiving can expose this data to Hadoop as well.
But Matchett warns that Cleversafe's dispersed storage system will slow Hadoop down when writing and reconstructing data.
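The efficiency gain from dispersed storage comes down to simple arithmetic. The Python sketch below compares plain replication with a generic "any k of n slices" dispersal scheme; the 10-of-16 setting is an illustrative assumption and not a statement about how Cleversafe systems are actually configured.

```python
def replication_overhead(copies):
    """Raw bytes stored per byte of data with plain replication."""
    return float(copies)


def dispersal_overhead(total_slices, threshold):
    """Raw bytes stored per byte of data when the data is cut into
    `total_slices` coded slices and any `threshold` of them can rebuild it."""
    if not 0 < threshold <= total_slices:
        raise ValueError("threshold must be between 1 and total_slices")
    return total_slices / threshold


data_tb = 100
print(f"3x replication:  {data_tb * replication_overhead(3):.0f} TB raw, survives 2 lost copies")
print(f"10-of-16 slices: {data_tb * dispersal_overhead(16, 10):.0f} TB raw, survives 6 lost slices")
```

In this example, 100 TB of data occupies 160 TB instead of 300 TB. The trade-off is the one Matchett describes: slices have to be computed on every write and recombined when data is reconstructed, which costs processing and network time.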
By virtualizing your Hadoop infrastructure it’s possible to create separate virtual machines for compute nodes and storage nodes. Then you can spin up more compute nodes when you need them, while keeping the storage nodes running, sharing data with different compute clusters.
Virtualized nodes can also access external storage systems, including cheap storage provided by the likes of VMware's Virtual SAN (VSAN).
One benefit of virtualizing Hadoop is that when network traffic between nodes is local (i.e. going from one VM to another within the same physical host), latency is very low because the traffic goes through the hypervisor's virtual switch rather than out to a physical one. That means that virtualized Hadoop clusters can, in some cases, actually outperform physical ones.
A drawback of this approach is that instead of cheap hardware you are likely to need $100,000 servers to host these VMs, and if Hadoop needs the whole box that may work out to be expensive, Matchett warns.
An interesting innovation for virtualized Hadoop is the recently launched Bluedata EPIC platform. This provides a way to run virtualized Hadoop compute clusters, but a layer of software called DataTap then makes any external storage look like it has an HDFS interface, so in effect you can use any shared storage with Hadoop. "You could set up small clusters and let them access any data wherever it is," says Matchett. "They could go and look at a (EMC) VNX or Isilon or whatever."
There's also the possibility of using cloud storage for storage nodes when Hadoop is being run in the cloud – using something like Amazon Elastic MapReduce, or Microsoft's Azure HDInsight Hadoop service.
The problem is that you have to get your Hadoop data into the cloud storage somehow – unless it is already there (perhaps because it is created or collected there in the first place) – and Hadoop data sets tend to be very big.
One way of getting the data there would be to use a service such as NetApp Private Storage for AWS or Private Storage for Azure, which offers a high-capacity pipe from a nearby facility straight into an Amazon or Azure cloud data center using AWS Direct Connect or Azure ExpressRoute. This gives Hadoop in the cloud rapid access to your data, and allows you to keep that data under your control.
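To see why the size of the pipe matters, here is a rough transfer-time estimate in Python. The link speeds and the 80 percent efficiency figure are assumptions for illustration only; real throughput depends on the link, the protocol and the storage at each end.

```python
def transfer_days(dataset_tb, link_gbps, efficiency=0.8):
    """Days needed to push dataset_tb terabytes over a link of link_gbps.

    efficiency is the assumed fraction of the nominal line rate that is
    actually achieved. Uses decimal units (1 TB = 10**12 bytes).
    """
    bits = dataset_tb * 10**12 * 8
    seconds = bits / (link_gbps * 10**9 * efficiency)
    return seconds / 86_400


for gbps in (1, 10):
    print(f"100 TB over {gbps} Gbps: about {transfer_days(100, gbps):.1f} days")
```

Under these assumptions, moving 100 TB takes roughly 11.6 days on a 1 Gbps link and a little over one day at 10 Gbps, which is why data that already lives in or next to the cloud is so much easier to work with.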
Another possibility, more suited to cloud providers than enterprises, could be to use Red Hat Storage in the cloud to turn a cluster of server nodes into a scale-out, enterprise-featured cloud storage array. The same server nodes could be running virtual Hadoop compute nodes as well, and the benefit would be that you gain storage features like data protection and high availability just by replacing HDFS data nodes with other commodity DAS.
Hadoop Going Forward
There's no doubt that Hadoop as originally designed provides a low cost way to store and analyze vast amounts of data. But it lacks vital controls for compliance, security, availability and manageability, and makes it hard for other applications to access the data stored in the cluster's nodes.
Using enterprise storage systems, virtualization or cloud storage, it is possible to overcome these problems – but at a cost. The question is whether the cost is worth it.
"Storage economics will always be important, and companies want inexpensive storage (for Big Data)," says Richard Fichera, a principal analyst at Forrester Research. "Companies need to be sophisticated to analyze these storage services, and sometimes it will be worth going with them even if they are not the cheapest systems."
"Enterprise storage for Big Data may be more expensive," Taneja Group's Mike Matchett concurs, "but often it will be worth it."