Most enterprise-level storage systems tier primary workloads to optimize performance and capacity across their flash and disk tiers. However, there is a whole world of storage tiering that takes data from the moment of its creation to the end of its lifecycle.
First let’s look at the basic taxonomy of storage tiering from primary to cold.
· Primary/Production tiering. This is the top performance tier. It’s not strictly a single tier, since some arrays tier internally: between a flash Tier 0 and a flash/HDD Tier 1, and/or between Tier 1 and a back-end Tier 2 on SATA disk. Representative array/storage pool tiering products include Dell Storage Center OS and SC arrays, HDS, EMC, and IBM arrays, and Oracle FS1.
· Secondary/Nearline tiering. “Secondary storage” used to mean a data protection/backup tier. Although it is still an accepted term for backup storage, it also means on-premises storage that holds data cheaply while preserving user access. Examples include tape libraries from companies like SpectraLogic and high-capacity NAS. The cloud can also serve as an active secondary tier given sufficient on-premises integration and bandwidth: Microsoft StorSimple with Azure is an example of fully integrating the cloud into the storage infrastructure, and Cohesity Cloud Spill automatically tiers active data onto Amazon S3 as a cluster extension.
· Cold tiering. Also called the deep archive, this is the largest and fastest-growing storage tier in the world. Some taxonomies break this tier into three archival tiers, reflecting the need to store data long-term while still providing visibility, reporting, and access. A few hyperscale operators like Facebook locate massive cold tiers in their own data centers. (Facebook operates several data centers that use dense Blu-ray disc storage for archived user photos and videos.) However, cold storage is quickly moving to the cloud, with its near-infinite scalability and low prices for cold data. Amazon Glacier and Google Nearline are popular cloud-based cold storage tiers; they offer similarly low pricing, although Google Nearline delivers significantly faster restore times (minutes as opposed to hours). A brief sketch of moving aging objects into such a cold tier follows this list.
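To make that concrete, here is a minimal sketch in Python using boto3 of one common way to push aging objects into a cloud cold tier: an S3 bucket lifecycle rule that transitions objects to the Glacier storage class after a retention window. The bucket name, prefix, and 90-day threshold are illustrative assumptions, not recommendations.

# Sketch: transition aging objects to a cold tier with an S3 lifecycle rule.
# The bucket name, prefix, and 90-day threshold are illustrative assumptions.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-archive-bucket",                     # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-aging-data-to-glacier",
                "Filter": {"Prefix": "projects/closed/"},    # hypothetical prefix
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER"}  # move after 90 days
                ],
            }
        ]
    },
)

Once a rule like this is in place, objects under the prefix age into the cold tier automatically; the same pattern applies to other clouds that expose lifecycle policies.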
Past the Primary: Storage Tiering Drivers
At some point you will want to move aging data off production systems to lower-priced storage tiers. Below are the primary decision factors for introducing multiple tiers into your storage infrastructure: controlling costs and keeping data accessible. Not everyone will weigh these factors in exactly the same proportion, but everyone has them.
Control Costs
Storage is one of the largest single cost centers in the data center. About five years ago, storage represented about 20% of the cost of a fully loaded computing stack. Last year, in 2015, it represented about 40% of the computing stack, thanks to factors like machine-generated and mobile data.
Tiering storage is not the only storage management option you need, of course; you also want to control data copies and manage capacity with thin provisioning, dedupe and compression, and delta-level backup. But you will need to keep a big portion of that data for a variety of use cases, including litigation and regulatory compliance, digital asset reuse, and historical analytics. Tiering this aging but still-needed data to less expensive storage tiers saves on storage costs.
This is especially true with your top-of-the-line primary storage arrays. Although these systems usually auto-tier data between flash and disk tiers, eventually you will need to migrate data off production storage to preserve primary performance and capacity. Reaching end-of-life on these expensive systems is painful enough, but hitting capacity limits before retirement is worse.
Granted, you are less likely to hit those limits with a scale-out architecture, but even then you must pay for new nodes, bandwidth, and licenses. Whether you invest in scale-out or not, you will want to tier off aging data. You have many choices for where, including in-system: some storage systems like EMC Data Domain are built with a secondary tier for backup and/or archive, and hyperconverged systems like Nutanix include tiering functionality. Still, at some point aging data should tier from a production system to a less expensive storage location, whether on-site or off.
Keep Data Accessible
Another decision factor is the level of access you need to data on lower storage tiers. Some data only requires proof of existence, and possibly of location, for compliance reporting. This data can be moved to the cheapest possible storage tier, up to and including off-site tape vaults, as long as you retain access to compliance reports.
In contrast, some business processes need reasonable access to cold data for eDiscovery, investigations, digital asset reuse, and even business analytics. IT needs to provide that level of access: for example, enabling attorneys to search through a large volume of email on cold storage without having to download the entire repository first. You will also want to control recovery costs for large cold data downloads.
Amazon Glacier is an excellent example of this. It is very cheap storage (that Amazon swears is disk-based) that averages hours to recover, and keeping data there is economical if you do not need to get it back quickly. But not all cold storage should take hours or more to recover: some cold data needs to remain easily recoverable for analytics, forensics, and eDiscovery actions. Google Nearline is the better choice for this level of access, as it provides recovery in minutes rather than hours.
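As a rough illustration of that tradeoff, here is a sketch in Python with boto3, continuing the lifecycle example above, that requests a temporary restore of an object that has aged into the Glacier storage class. The retrieval tier parameter trades recovery time against cost; the bucket, key, retention window, and tier choice are all illustrative assumptions.

# Sketch: request a temporary restore of an object that has aged into Glacier
# via a lifecycle rule. Bucket, key, and tier choice are illustrative assumptions.
import boto3

s3 = boto3.client("s3")

s3.restore_object(
    Bucket="example-archive-bucket",            # hypothetical bucket
    Key="projects/closed/case-2014/mail.pst",   # hypothetical object
    RestoreRequest={
        "Days": 7,  # keep the restored copy available for a week
        "GlacierJobParameters": {
            # "Bulk" is the cheapest and slowest retrieval tier; "Standard"
            # and "Expedited" come back faster at a higher per-GB cost.
            "Tier": "Bulk"
        },
    },
)

Choosing the retrieval tier per request is one way to keep large cold-data recoveries from becoming a surprise line item.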
Another cold storage method is migrating bulk data from secondary on-premises storage to object storage in the cloud. SpectraLogic’s Black Pearl/Spectra S3 is an example of a REST API that enables data object transfer between on-premises and cloud storage tiers.
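As a generic sketch of that kind of bulk migration, the Python/boto3 snippet below streams objects from an assumed on-premises S3-compatible endpoint into a cloud bucket. The endpoint URL, bucket names, and prefix are hypothetical, and a real Black Pearl deployment would normally use Spectra’s own client/SDK rather than this generic approach.

# Sketch: stream objects from an on-premises S3-compatible endpoint to a cloud
# object store. Endpoint URL, bucket names, and prefix are hypothetical.
import boto3

onprem = boto3.client("s3", endpoint_url="https://blackpearl.example.local")  # assumed endpoint
cloud = boto3.client("s3")  # public cloud object storage

SRC_BUCKET, DST_BUCKET = "nearline-archive", "cloud-cold-archive"

paginator = onprem.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SRC_BUCKET, Prefix="2015/"):
    for obj in page.get("Contents", []):
        body = onprem.get_object(Bucket=SRC_BUCKET, Key=obj["Key"])["Body"]
        # Stream each object to the cloud bucket without staging it on local disk.
        cloud.upload_fileobj(body, DST_BUCKET, obj["Key"])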
There are a lot of options out there for storage tiering. You need to know your own environment and needs in order to ask the right questions and get the right answers.
These questions include:
· How long will it take to search a cold repository, and how do I report the results?
· How fast can I access and recover the data, and how much will it cost me?
· How can I access reports proving that I have compliantly stored this data?
You do not have to put infrastructure-wide tiering in place all at once, but understanding your needs and available technology will help you make the best decisions.