Big Data storage creates its own growing challenges – and they’re only going to get worse.
It’s hurricane season right now in North America, and storage professionals who think they have weathered the big data storage storm had better watch out. Courtesy of unstructured data storage technologies such as Hadoop, they are beginning to get comfortable in the face of rampant data growth year after year. They ain’t seen nothing yet. Every facet of the storage world (on-prem, private cloud and public cloud) is about to be assailed by a data hurricane that will make the last few years seem like a gentle breeze.
“While big data and the Internet of Things (IoT) comprise a tiny fraction of public cloud workloads today, both are growing rapidly,” said Bert Latamore, an analyst at Wikibon. “By 2020, these two domains will feature large in the growth and dynamics of the public cloud market.”
Here are some key tips to help you cope with the onslaught of big data.
1) Big Data Storage, Big Data Problems
One of the biggest challenges with big data storage is that big data comes in many different types, said Greg Schulz, an analyst at StorageIO Group. Some of it is big, fast streaming data such as video and security surveillance footage; some is log, event and other telemetry data; and then there are large volumes of traditional unstructured files and objects. The common themes are that there is more data (volume), that some of it is larger (size) and that it is unstructured. It is therefore important to understand what type of big data you are dealing with so that it can be stored appropriately.
“Challenges include how to cope with and scale management without increasing cost and complexity, while at the same time addressing performance, availability, capacity and economics concerns,” said Schulz. “What this means is rethinking how and where the data gets stored, which also ties to where the applications are located (on premise or cloud) along with how it is accessed (block, file, object).”
2) Application Location
In the old days you could get away with centralizing all data and having applications feed off it. But that approach tends to introduce too many bottlenecks.
“Put the data close to where the applications using it are located; if those applications are in the cloud, then put the data in the cloud and vice versa if local,” said Schulz. “The key is to understand the applications, where they are located, how they use data and then aligning the various technologies to their needs. Also, understand if your applications need object and which API for access, or, if they function with scale-out NAS.”
For example, some apps might be best served by HDFS or another file-sharing platform, while others should gravitate to Amazon S3, Swift or another form of object storage. Also keep in mind how you will store and manage metadata to support big data applications, he added.
3) Bifurcated Storage Strategy
451 Research analyst Simon Robinson suggests a future where fast data storage requirements are managed at a flash tier (performance) and everything else scales out into cost-optimized tiers supported by object storage (capacity). There are a variety of storage tiering scenarios that can map to specific enterprise requirements. The key to this is seamless, automated movement between the tiers such that the end-user does not even know the tiering is going on.
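The automated movement between tiers can be pictured with a minimal sketch. It assumes a simple policy in which data untouched for 30 days leaves the flash tier; the threshold, the StoredObject record and the tier names are hypothetical and not taken from Robinson's description.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical policy: data not accessed within this window moves to the capacity tier.
HOT_WINDOW = timedelta(days=30)


@dataclass
class StoredObject:
    name: str
    last_access: datetime


def choose_tier(obj: StoredObject, now: datetime) -> str:
    """Return 'flash' for recently used data and 'object' for everything else."""
    return "flash" if now - obj.last_access <= HOT_WINDOW else "object"


def plan_moves(objects, current_tiers, now):
    """List (name, from_tier, to_tier) for every object whose tier should change."""
    moves = []
    for obj in objects:
        target = choose_tier(obj, now)
        current = current_tiers.get(obj.name)
        if current != target:
            moves.append((obj.name, current, target))
    return moves
```

A real tiering engine would run a plan like this in the background and keep a single namespace in front of both tiers, so that users never see the move.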
4) Think Big Enough For Big Data
When it comes to effectively managing growing volumes of big data, it’s important to take the time to develop a strategy that not only meets your near-term needs but can also scale to support you over time. Otherwise, you end up with software and hardware components that can no longer scale effectively. Check carefully how well a technology will scale before buying. In a big data world, it had better scale enough to deal with the huge influx of data.
“You can tell when existing software and hardware components have reached a point where they no longer effectively scale: when each additional storage volume added seems to take increasingly more time to manage and when the result of adding it does not seem to add the expected volume and performance,” said Michael King, Senior Director of Marketing Strategy and Operations, DataDirect Networks (DDN).
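One rough way to watch for the symptom King describes is to compare what each added volume actually delivers. The sketch below assumes you record aggregate throughput after each volume is added; the function names and the 50 percent cutoff are illustrative assumptions, not a DDN method.

```python
def marginal_gains(throughput_by_volume_count):
    """Throughput gained by each added volume (difference between consecutive samples)."""
    samples = throughput_by_volume_count
    return [after - before for before, after in zip(samples, samples[1:])]


def scaling_has_stalled(throughput_by_volume_count, min_ratio=0.5):
    """True if the latest added volume delivered less than min_ratio of the first gain."""
    gains = marginal_gains(throughput_by_volume_count)
    if len(gains) < 2 or gains[0] <= 0:
        return False  # not enough data to judge
    return gains[-1] < min_ratio * gains[0]
```

The same comparison works for usable capacity or for the administrator hours spent per volume, the other two signals in the quote.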
5) Categorize Metadata
Categorizing data is wise because it lets you know what the data is and search the metadata to find it. Long file names may have worked in the past, but they no longer do when growth rates run as high as 100 percent year over year.
“Categorizing data is one of the best approaches for dealing with exponential data growth,” said Matt Starr, CTO, Spectra Logic. “Collect metadata at the time of creation, and store at least two copies on different media such as one on tape and one on disk.”
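As a minimal sketch of Starr's advice, the code below captures a record when a file is created and appends it to a catalog in each of several locations. The record fields, the catalog.jsonl file name and the use of plain directories to stand in for different media are all assumptions made for illustration.

```python
import json
import os
from datetime import datetime, timezone


def capture_metadata(path, category, owner):
    """Build a metadata record for a newly created file."""
    return {
        "path": path,
        "category": category,
        "owner": owner,
        "size_bytes": os.path.getsize(path),
        "captured_at": datetime.now(timezone.utc).isoformat(),
    }


def store_copies(record, catalog_dirs):
    """Append the record as one JSON line to a catalog file in each directory."""
    written = []
    for directory in catalog_dirs:
        os.makedirs(directory, exist_ok=True)
        catalog = os.path.join(directory, "catalog.jsonl")
        with open(catalog, "a", encoding="utf-8") as handle:
            handle.write(json.dumps(record) + "\n")
        written.append(catalog)
    return written


def find_by_category(catalog_path, category):
    """Return every record in a catalog whose category matches."""
    with open(catalog_path, encoding="utf-8") as handle:
        records = [json.loads(line) for line in handle if line.strip()]
    return [record for record in records if record["category"] == category]
```

In practice the two directories would be mount points backed by different media, for example a disk array and a tape file system, so that losing one still leaves a searchable catalog.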
6) Decouple Capacity And Compute
Another tip is to build scale-out storage that decouples capacity from compute. As data volumes grow, it is crucial to build an IT infrastructure that scales and fits actual needs without overprovisioning resources.
“A way to accomplish this is to invest in storage infrastructures that can scale capacity and compute independently,” said Shachar Fienblit, Chief Technology Officer, Kaminario.
A storage solution for big data should support multiple protocols and simplify the way data is processed. Real-time analytics makes storage workloads less and less predictable, which is why flash is the favored medium for storing and processing big data workloads. As the cost of flash media continues to decline rapidly, the industry will see more and more big data workloads running on all-flash arrays.
7) Commodity Hardware
Scale-out object storage is one of the most effective ways to deal with these issues because data is continuously protected without the need for backups. But how do you keep the hardware costs down?
“Running on commodity x86 servers, object storage allows you to upgrade hardware seamlessly, as these devices function as modular units that can be aggregated without diminishing efficiency,” said Tony Barbagallo, Vice President of Product, Caringo.
8) Long-Term View
When it comes to big data projections, it’s clear that storage managers had better plan correctly for growth. Most people, though, don’t look far enough ahead: they are used to thinking only one, two or three years out. That’s not nearly far enough.
“Think 5, 10, even 20 years ahead,” said Barbagallo. “Make sure you pick a solution that can evolve with your needs and that does not lock you into proprietary hardware.”
9) Don’t Rely Only On Disk
Gartner says we have created more data in the past two years than in the entire prior history of humankind. Yet storage architecture changes are not keeping up with the data demand.
According to Kryder’s law, the areal density of magnetic disk storage doubles roughly every thirteen months.
“If the storage density changes are in line with Kryder’s law, by 2020 a two platter 2.5 inch drive would have capacity of 40 TB and cost $40,” said Senthil Rajamanickam, FSI Strategy and Operations Manager at Infogix.
That’s impressive enough on its own, but it’s not going to be enough to cope with all big data. SSD, tape and the cloud will all be needed to keep up with big data growth.
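The arithmetic behind a Kryder-style projection is plain compound doubling, sketched below. The thirteen-month doubling period comes from the law as stated above; the function names are hypothetical and any starting density you plug in is your own assumption.

```python
import math


def projected_density(start_density, months, doubling_months=13.0):
    """Density after `months`, assuming it doubles every `doubling_months`."""
    return start_density * 2 ** (months / doubling_months)


def months_to_reach(start_density, target_density, doubling_months=13.0):
    """Months needed to grow from start_density to target_density at that rate."""
    return doubling_months * math.log2(target_density / start_density)
```

Running the same functions with a longer doubling period shows how quickly a projection like this falls behind if density growth slows.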
10) Dark Data
Operational data that is not being used is known as dark data. Gartner describes it as “information assets that organizations collect, process and store in the course of their regular business activity, but generally fail to use for other purposes.”
And there is an awful lot of it around.
“Preventing dark data in a big data environment requires data controls to review/monitor instream data during ingestion and capturing metrics to build an inventory of a big data environment,” said Rajamanickam.
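The kind of inventory Rajamanickam describes can be sketched as a counter that is updated at ingestion and again whenever data is read. The class and method names below are hypothetical; a production system would hook these calls into its ingestion pipeline and query layer.

```python
from dataclasses import dataclass


@dataclass
class DatasetStats:
    bytes_ingested: int = 0
    records_ingested: int = 0
    reads: int = 0


class Inventory:
    """Tracks ingestion and read metrics per dataset."""

    def __init__(self):
        self.datasets = {}

    def record_ingest(self, name, num_bytes, num_records=1):
        stats = self.datasets.setdefault(name, DatasetStats())
        stats.bytes_ingested += num_bytes
        stats.records_ingested += num_records

    def record_read(self, name):
        self.datasets.setdefault(name, DatasetStats()).reads += 1

    def dark_datasets(self):
        """Names of datasets that were ingested but never read, in sorted order."""
        return sorted(
            name
            for name, stats in self.datasets.items()
            if stats.bytes_ingested > 0 and stats.reads == 0
        )
```

Reviewing the dark list on a regular schedule gives a starting point for deciding what to analyze, archive or stop collecting.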
11) Capacity Plus Velocity
Most discussion about big data focuses on having enough capacity. But the velocity of the data, meaning how fast it arrives and how quickly it must be served, can be just as much of an issue. Consider it before architecting your storage design.
“Supporting event streams that are highly real time is a much different storage demand than dealing with constantly growing log data,” said Rajamanickam.
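A back-of-the-envelope calculation helps separate the two demands: sustained write bandwidth for the stream versus capacity growth for the accumulating data. The sketch below uses decimal units (1 MB = 1,000,000 bytes); the function names and the peak_factor allowance for bursts are assumptions for illustration.

```python
def required_write_mb_per_s(events_per_second, avg_event_bytes, peak_factor=1.0):
    """Write bandwidth in MB/s needed to absorb an event stream."""
    return events_per_second * avg_event_bytes * peak_factor / 1_000_000


def daily_growth_gb(events_per_second, avg_event_bytes):
    """Capacity in GB added per day if the stream runs at a steady rate."""
    seconds_per_day = 86_400
    return events_per_second * avg_event_bytes * seconds_per_day / 1_000_000_000
```

The first figure drives the choice of media and network for the ingest path, while the second drives how much capacity tier you need to add each month.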
12) All Cloud Or Part Cloud?
Some will attempt to deal with big data by keeping data in house. But others may prefer to dump it all into the cloud and ensure they manage the data efficiently to control costs. Most, though, are likely to find a middle ground.
“A hybrid cloud approach allows you to continue to operate your system on premises in your data centers and in parallel move some operations to the cloud,” said Jeff Tabor, Senior Director of Product Management and Marketing at Avere Systems. “If storage is your main problem, a first step is to use a storage gateway to move older data to the cloud. If compute is your main challenge, cloud bursting technology lets you leave your data in place in your on-premises data center and begin to process the data in the public compute cloud.”