Big Data Storage: Top Ten Tips for Scaling Hadoop

Posted on October 06, 2016 By Drew Robb


In the 1990s, each application server tended to have direct attached storage (DAS). SANs were created to provide shared, pooled storage for greater scale and efficiency. Hadoop has reversed that trend back towards DAS. Each Hadoop cluster has its own direct-attached storage, albeit of the scale-out kind. That helps Hadoop manage data locality, but it trades off the scale and efficiency of shared storage. If you have multiple instances or distributions of Hadoop, therefore, you’ll have several of these scale-out islands of storage.

“The biggest challenge we come across is balancing data locality with scale and efficiency,” said Avinash Lakshman, CEO and Founder, Hedvig.

Data locality is about making sure a big data set is stored near the compute that performs the analytics. For Hadoop, that means managing DataNodes that provide storage for MapReduce to perform adequately. It works effectively, but leads to the separate operational issue of islands of big data storage. Here are some tips on how to manage big data storage in a Hadoop environment.

1. Decentralize Storage 

Centralized storage has been the traditional approach for some time now. But big data is not really suited to a centralized storage architecture. Hadoop was designed to move computing closer to data while making use of the massive scale-out capabilities of the HDFS file system, advised Senthil Rajamanickam, FSI Strategy and Operations Manager at Infogix.

The common approach to solving the inefficiencies of Hadoop managing its own data, however, has been to store Hadoop data on a SAN. But that creates its own performance and scale bottlenecks. Now all of your data is processed through centralized SAN controllers, which defeats the distributed, parallelized nature of Hadoop. You either need to manage multiple SANs for various DataNodes, or connect all DataNodes to one SAN.

“As Hadoop is a distributed application, it should run on distributed storage so your storage retains the same elastic nature as Hadoop itself,” said Lakshman. “It requires that you embrace a software-defined storage approach, running atop commodity servers, but it’s far more effective than bottlenecking Hadoop by putting it on traditional SAN or NAS technologies.”

2. Hyperconverged vs. Distributed

Be careful, though, not to confuse hyperconverged with distributed. Certain hyperconverged approaches are distributed, but typically the term means your application and storage will be co-resident on the same compute node. That’s a tempting way to solve the data locality issue, but it can create too much resource contention. The Hadoop application and storage platform will be contending for the same memory and CPU. It’s better to run Hadoop on a dedicated application tier and run your distributed storage in a dedicated storage tier, taking advantage of caching and tiering to solve data locality and network performance penalties, said Lakshman.

3. Avoid Controller Choke Points

Lakshman stressed an important aspect of achieving this: avoid processing data through a single (or maybe dual) point such as a traditional controller. By instead making sure the storage platform is parallelized, performance can be dramatically improved.

In addition, this approach offers incremental scalability. Adding capacity to the data lake is as easy as adding a few x86 servers with flash or spinning disks in them. A distributed storage platform will automatically add the capacity and rebalance the data as necessary.
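One common way a distributed platform can rebalance incrementally is consistent hashing. The sketch below is only an illustration of that general technique, not a description of how Hedvig or HDFS place data; the `HashRing` class, the node names and the block IDs are all hypothetical.

```python
import bisect
import hashlib


def _position(key: str) -> int:
    """Map a string to a stable integer position on the hash ring."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring that maps block IDs to storage nodes."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes   # virtual nodes per server, smooths the distribution
        self._positions = []   # sorted ring positions
        self._owners = []      # node owning the position at the same index
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        """Place the node's virtual nodes on the ring, keeping it sorted."""
        for i in range(self.vnodes):
            pos = _position(f"{node}#{i}")
            idx = bisect.bisect(self._positions, pos)
            self._positions.insert(idx, pos)
            self._owners.insert(idx, node)

    def node_for(self, block_id):
        """Return the node at the first ring position after the block's hash."""
        idx = bisect.bisect(self._positions, _position(block_id))
        return self._owners[idx % len(self._owners)]


def moved_blocks(ring, blocks, new_node):
    """Add new_node and report which blocks changed owner: {block: (old, new)}."""
    before = {b: ring.node_for(b) for b in blocks}
    ring.add_node(new_node)
    return {
        b: (before[b], ring.node_for(b))
        for b in blocks
        if before[b] != ring.node_for(b)
    }


if __name__ == "__main__":
    ring = HashRing(["node-1", "node-2", "node-3", "node-4"])
    blocks = [f"block-{i}" for i in range(10_000)]
    moved = moved_blocks(ring, blocks, "node-5")
    print(f"{len(moved) / len(blocks):.0%} of blocks moved, all to node-5")
```

In this scheme, adding a fifth server relocates only the share of blocks that the new server takes over (roughly a fifth), and every relocated block lands on the new server. Existing servers do not shuffle data among themselves.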

4. Deduplication and Compression

A key part of staying on top of big data is deduplication and compression. Hedvig is seeing 70% to 90% data reduction for common big data sets. At petabyte scale, this can mean saving tens of thousands of dollars in disk costs.

“Modern platforms provide inline (as opposed to post-processing) deduplication and compression,” said Lakshman. “That means the data never hits disk without first being reduced in some way, greatly decreasing the capacity needed to store data.”
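To make the inline idea concrete, here is a minimal sketch, assuming fixed-size chunks, SHA-256 fingerprints and zlib compression. The `DedupStore` class and its parameters are hypothetical and do not represent any vendor's implementation; production systems often use variable-size chunking and persistent indexes.

```python
import hashlib
import zlib


class DedupStore:
    """Toy content-addressed store: data is deduplicated and compressed on write."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self._chunks = {}        # fingerprint -> compressed chunk bytes
        self.logical_bytes = 0   # bytes that writers asked the store to keep

    def write(self, data: bytes):
        """Reduce data inline; return the list of fingerprints needed to rebuild it."""
        recipe = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            # Only chunks never seen before are compressed and stored.
            if fingerprint not in self._chunks:
                self._chunks[fingerprint] = zlib.compress(chunk)
            recipe.append(fingerprint)
        self.logical_bytes += len(data)
        return recipe

    def read(self, recipe):
        """Rebuild the original bytes from a recipe returned by write()."""
        return b"".join(zlib.decompress(self._chunks[fp]) for fp in recipe)

    @property
    def physical_bytes(self):
        """Bytes actually held after deduplication and compression."""
        return sum(len(c) for c in self._chunks.values())

    def reduction(self):
        """Fraction of logical bytes saved, e.g. 0.8 means 80% data reduction."""
        if self.logical_bytes == 0:
            return 0.0
        return 1 - self.physical_bytes / self.logical_bytes


if __name__ == "__main__":
    store = DedupStore()
    log = b"2016-10-06,sensor-42,OK\n" * 1000   # made-up, highly repetitive sample
    recipes = [store.write(log) for _ in range(3)]
    assert store.read(recipes[0]) == log
    print(f"data reduction: {store.reduction():.1%}")
```

The reduction achieved depends entirely on how repetitive the data is. As a rough capacity check using the figures quoted above, 1 PB of logical data at 70% to 90% reduction would need about 0.3 PB to 0.1 PB of physical capacity.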

5. Consolidate Hadoop Distributions

Many large organizations have multiple Hadoop distributions. It may be that developers need access to multiple “flavors,” or that business units have adopted different versions over time. Regardless, IT often ends up owning the ongoing maintenance and operations of these clusters. When big data volumes really begin to impact a business, the presence of multiple Hadoop distributions introduces inefficiency.

“You can gain data efficiencies by creating a single, deduplicated, compressed data lake that can then serve data up to multiple instances of Hadoop,” said Lakshman.

6. Virtualize Hadoop

Virtualization has taken the enterprise world by storm. Somewhere in excess of 80% of physical servers in many areas are now virtualized. Yet many have avoided virtualizing Hadoop due to performance and data locality issues.

“You can virtualize Hadoop or Spark,” said Lakshman.

7. Build an Elastic Data Lake

It isn’t easy to build a data lake, but the demands of big data storage will probably require it. There are many ways to go about it, but which is the right way? The right architecture should lead to the creation of an active and elastic data lake that can store data from all sources and in multiple formats (structured, unstructured, semi-structured). More importantly, it must support the execution of applications right at the data source, and not from a remote source requiring data movement.

Unfortunately, traditional architectures and applications (i.e., non-distributed) have not been satisfactory. As data sets get larger, it’s imperative to move applications to the data, and not the other way around, because moving data adds too much latency. And with the introduction of Hadoop/Spark, analytics workflows are becoming even more disruptive, as data and applications are executed from different silos, forcing data to be moved and stored on multiple platforms.

“The ideal data lake infrastructure will enable the storage of a single copy of data, and have applications execute on the single data source without having to move data or make copies (for example, between Linux, VMs and Hadoop),” said Fred Oh, Senior Product Marketing Manager, Big Data Analytics, Hitachi.

8. Integrate Analytics

Analytics is not a new capability, having existed in traditional RDBMS environments for many years. What is different is the advent of open source-based applications and the ability to integrate database tables with social media and unstructured data sources (e.g., Wikipedia). The key is the ability to integrate the multiple data types and formats into one standard so that visualization and reporting can be done more easily and consistently. Having the right tool set to accomplish this is vital to the success of any analytics/business intelligence project.

“When it comes to analytics, it’s important to understand that the real challenge is not in visualization, but in data integration, especially data from multiple sources and in multiple formats,” said Oh. “A comprehensive library of data integration tools and a GUI-based integration console can solve enterprise challenges with big data.”

9. Big Data Meets Big Video

Big data is bad enough. But an emerging strain of this phenomenon is big video. For example, enterprises increasingly use video monitoring for not only security, but also operational and industrial efficiencies, streamlining traffic management, supporting regulatory compliance and several other use cases. Very soon, these sources will generate ridiculous amounts of content. Those having to deal with it had better make sure they establish the right kind of data store for it, Hadoop-based or otherwise. 

“These applications are driving a flood of big video data that, without the right specialized storage solutions, can lead to issues such as data loss, and video degradation,” said Oh.

10. No Winner 

Hadoop has certainly gained a lot of ground of late. So will it be the ultimate winner, besting all other approaches as big data storage volumes mushroom? Not likely.

Traditional SAN-based architectures, for example, will not be replaceable in the near term due to their inherent strengths with OLTP and 100% availability needs. But when analytics and data integration with unstructured data are required (e.g., social media), there can be a compelling argument to evaluate hyper-converged platforms which incorporate server compute, distributed file systems, Hadoop/Spark, and newer database applications with open source-based analytics tools.

The best approaches, therefore, combine hyper-converged platforms with a distributed file system and integrate them with analytics software. Traditional Linux-based RDBMS applications (DWO, Data Marts, etc.) serve their purpose, Hadoop/Spark/MapReduce serve new social media challenges, and the use of server virtualization provides flexibility and efficiency. But each of these environments may create separate data silos. The ideal approach will support all three simultaneously, add the ability to execute applications at the data source and reduce data movement in an analytics workflow.

“The keys to success are implementations that factor in scalability, analytics integration and expertise,” said Oh. “Ultimately, storage professionals need to anticipate future needs and think beyond just storage.”

