Evolving use cases for file-based storage

By Noemi Greyzdorf

It is common to find the word "dynamic" used in marketing literature to describe a product or a desired infrastructure. We often equate dynamic with what we strive to achieve. Ironically, dynamic is defined as something marked by usually continuous and productive activity or change.

Recently, the demands on storage have been changing as a result of how information is created, stored, and accessed. More and more, IT organizations are storing unstructured, file-based data. Unstructured data already consumes more than 50% of all storage and is expected to continue to outpace the growth of structured data. The challenge facing storage managers in this evolving climate is to align storage resources with the demands of the data and the applications that are creating it. The goal they are striving to achieve is to create a dynamic storage environment.

File-based data demands file-based storage, which may come in a number of formats: a file server, NAS, NAS appliance, or a combination that is aggregated using file virtualization. Prior to deciding on the format and type of solution to deploy, managers must identify the use cases that exist in their environment and the requirements associated with them.

The most common use cases include storage for virtualized infrastructure, file services, data protection, archiving and content repositories, and high performance computing (e.g., analytics, rendering, and sequencing). Each of these use cases has a unique set of performance, retention, access, security, and capacity requirements.

Storage for virtual servers

Even before the economic downturn, organizations sought ways to reduce waste by increasing resource utilization, simplifying management, and improving their responsiveness to the changing market. In many environments, the processing capacity was greater than the demand by applications. To reduce waste, many turned to server virtualization as a way to optimize their existing resources. The result of virtualization has been not only improved utilization, but also a more flexible and responsive environment with new options for system and application recovery. In this new environment, storage had to become more responsive and elastic.

The main storage challenges in virtual server environments are provisioning appropriate storage capacity; managing storage for growth, performance, and availability; and protecting data in accordance with the requirements of the application.

When virtualization is deployed, each virtual machine is stored as its own file. This file can contain the machine image, the machine image and data, or just data. As the number of virtual machines increases, the management of storage in support of these virtual machines becomes more complicated. LUN management, machine placement for performance and migration, capacity management, and even troubleshooting become time-consuming and complex.

Many organizations have discovered that deploying virtualization on file-based storage eliminates a lot of the complexities associated with storage. Now, each file, whether it is a machine image or data, can be managed separately. Using a scale-out or scale-up file-based storage system can further increase the system's ability to provide timely provisioning, greater resource utilization, appropriate performance, and higher levels of data and system availability (see sidebar, "Scale-out vs. scale-up NAS").

Today, not all virtualization platforms can run on NFS or CIFS; some only run on block-based storage. In these situations, a clustered file system may be deployed across the server cluster to facilitate machine migration and high availability.

File services

In an effort to achieve economies of scale in management and improve resource utilization, many enterprises have initiated consolidation projects for file and print services and for network shares in general. The objective is to simplify the management of storage, manage data more intelligently, and provide value-added services such as timely archiving, data protection, and security. Some of the key requirements for such consolidation projects include management tools that enable administrators to manage increasing amounts of data and storage without having to add head count; timely and accurate provisioning and reclamation of capacity; seamless, live updates and upgrades; and the flexibility to add capacity and performance where they are needed. Depending on the unique requirements of an organization, this can be achieved in a number of ways.

  • Deploy file virtualization. File virtualization technology aggregates existing file-based storage devices into a single name space, allowing the back-end storage to be managed independently of the directory and folder structure. File virtualization also enables capacity to be added without having to migrate users; capacity can come from any storage device abstracted by the file virtualization system.
  • Deploy a scale-up system that can address a theoretically infinite capacity and can support a large number of files. This is typically a single server or dual servers in a high-availability configuration with block-based storage on the back end. Some systems may support multiple types of storage, allowing for tiering. Deploying such a system would require replacing everything currently in the environment. Scale-up systems also depend on the performance configuration of the server and can scale only as far as the performance capacity of the processors in the system scales.
  • Deploy a scale-out system. This can take the form of scale-out NAS, which is a cluster of nodes that present a global file system name space allowing for capacity and performance to scale as needed. The other way to deploy scale-out systems is to deploy a distributed file system on top of the file servers already in place. The distributed file system serves as an abstraction layer, allowing capacity to be deployed as needed.
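To make the file virtualization option above more concrete, here is a minimal Python sketch of a single name space layered over several back-end devices. It is illustrative only: the class GlobalNamespace, its methods, and the device names filer-a and filer-b are hypothetical and do not correspond to any particular product. The point it shows is that a file can move between devices while the path users see stays the same.

```python
class GlobalNamespace:
    """Toy model of file virtualization: users see one logical tree,
    while each file physically lives on some back-end device.
    All names here are hypothetical, for illustration only."""

    def __init__(self):
        self._location = {}  # logical path -> back-end device name

    def place(self, path, device):
        """Record that a logical path is stored on the given device."""
        self._location[path] = device

    def resolve(self, path):
        """Return the device currently holding a logical path."""
        return self._location[path]

    def migrate(self, path, new_device):
        """Move a file to another device; the logical path is unchanged."""
        if path not in self._location:
            raise KeyError(path)
        self._location[path] = new_device


ns = GlobalNamespace()
ns.place("/shares/finance/q1.xls", "filer-a")
before = ns.resolve("/shares/finance/q1.xls")

# Capacity is added on a second device and the file is moved there;
# users keep using the same path throughout.
ns.migrate("/shares/finance/q1.xls", "filer-b")
after = ns.resolve("/shares/finance/q1.xls")

print(before, "->", after)
```

A real system would also have to copy the data, handle open files, and keep the mapping highly available; the sketch covers only the name space indirection.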

The key to selecting the right solution is to understand what is most important to your organization. Solutions vary in complexity, performance, scalability, and support services.

Data protection

Traditional data protection systems and best practices dictate that data is copied from the production system to a secondary system, so if the production system fails there is another copy of the data. The process of copying data has been around for a long time, but the media used to store the secondary copy has changed over the years. The data protection paradigm dictates that a backup of changed data is made every day and a backup of the whole data set is made periodically just in case the whole system has to be restored. This approach creates many copies of the same data over time, which consumes capacity, bandwidth, and performance. Some data protection software has become more intelligent, copying only data that has changed at the block level, thus reducing how much is being written to secondary media. Others have addressed the issue of redundancy by enabling the secondary storage systems with capacity optimization such as single instancing, compression, and deduplication.
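The capacity optimization described above can be sketched in a few lines of Python. This is a simplified illustration, not how any specific backup product works: it assumes fixed-size blocks and SHA-256 fingerprints (real systems often use variable-size chunking), and the names DedupStore, write, and read are hypothetical. It shows two nightly backups that share most of their content consuming far less than their combined logical size.

```python
import hashlib

BLOCK_SIZE = 4  # bytes; deliberately tiny so the example is easy to follow


def split_blocks(data, size=BLOCK_SIZE):
    """Cut a byte string into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class DedupStore:
    """Hypothetical secondary store that keeps each unique block once."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> block contents

    def write(self, data):
        """Store a backup; return the list of fingerprints that rebuilds it."""
        recipe = []
        for block in split_blocks(data):
            fingerprint = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(fingerprint, block)  # skip known blocks
            recipe.append(fingerprint)
        return recipe

    def read(self, recipe):
        """Reassemble a backup from its fingerprints."""
        return b"".join(self.blocks[f] for f in recipe)

    def stored_bytes(self):
        """Physical capacity consumed by unique blocks."""
        return sum(len(b) for b in self.blocks.values())


store = DedupStore()
monday = store.write(b"AAAABBBBCCCC")
tuesday = store.write(b"AAAABBBBDDDD")  # only the last block changed

logical = len(b"AAAABBBBCCCC") + len(b"AAAABBBBDDDD")
print("logical bytes:", logical, "stored bytes:", store.stored_bytes())
```

Both backups remain fully restorable from their recipes, which is what distinguishes deduplication from simply discarding older copies.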

The use of file-based storage for data protection purposes has been on the rise for two primary reasons. First, it is easy to deploy, uses standard protocols, supports native replication, and can be used with a variety of drive technologies. Second, it solves some of the main challenges organizations faced when using block-based storage systems. The key challenges included utilization of storage resources, sharing of resources across media servers, and provisioning of storage to media servers in a timely manner.

Archiving and content repositories

Archiving is the most natural use of file-based storage. The amount of data being created continues to grow exponentially. Much more data is being retained to address compliance and governance requirements, as well as the need to support business initiatives and provide support to customers and partners. Since much of this data is unstructured, file-based systems that are cost effective, persistent, and that provide seamless scalability and ease of use are a great solution. Scalability in these cases is measured not in terabytes but, in some instances, in petabytes.

High performance computing

This is the traditional use case for scale-out and scale-up file-based systems. Most HPC users have a performance requirement, which means that the system has to scale with the needs of the application. Typically, though, not all data is processed at once, so many HPC users benefit from systems that have dynamic storage tiering capability. With dynamic storage tiering, the system understands usage patterns and moves data across different tiers of storage based on performance characteristics. Data moves from high performance disk media to lower performing disk media, and vice versa, and the movement is transparent to the application and user.
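A rough Python sketch of the tiering idea follows. It is an assumption-laden illustration rather than a description of any vendor's algorithm: the FileStat record, the retier function, the two tier names, and the promote/demote thresholds are all hypothetical. It shows frequently read data being promoted to the fast tier and idle data being demoted, with a gap between the two thresholds so that files do not bounce back and forth.

```python
from dataclasses import dataclass


@dataclass
class FileStat:
    """Hypothetical per-file record kept by the storage system."""
    name: str
    tier: str          # "fast" or "slow"
    recent_reads: int  # reads observed in the last monitoring window


def retier(files, promote_at=10, demote_at=2):
    """Apply a simple usage-based policy and return the moves performed.

    Thresholds are illustrative assumptions, not recommended values.
    """
    moves = []
    for f in files:
        if f.tier == "slow" and f.recent_reads >= promote_at:
            moves.append((f.name, "slow", "fast"))
            f.tier = "fast"
        elif f.tier == "fast" and f.recent_reads <= demote_at:
            moves.append((f.name, "fast", "slow"))
            f.tier = "slow"
    return moves


files = [
    FileStat("genome_run_42.dat", "slow", recent_reads=25),
    FileStat("render_2009_final.mov", "fast", recent_reads=0),
    FileStat("scratch.tmp", "fast", recent_reads=5),
]

moves = retier(files)
for name, src, dst in moves:
    print(f"{name}: {src} -> {dst}")
```

In a real system the file's path would not change when it moves, which is what keeps the process transparent to applications and users.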

Technologies such as scale-out and scale-up file-based storage systems with intelligent tiering, capacity optimization, thin provisioning, knowledge-based data management, and standard components offer a way to address the needs of the primary use cases discussed above. Depending on the solution and its architecture, one may be a better option for a given use case than another.

We started this article by defining "dynamic." If the demands are continuously changing, then so should the environment to support them. Whether the resources are deployed inside the enterprise or subscribed to from a service provider, the key is to have these resources available when they are needed in the form they are needed.

NOEMI GREYZDORF is a research manager with IDC, www.idc.com

File-based storage in the cloud  

The acquisition of storage for the use cases discussed in this article can be achieved in two main ways: purchase it for internal consumption (private cloud storage) or subscribe to it as a service (public cloud storage).

In a public cloud scenario, storage resources are paid for based on capacity and service levels agreed to in the contract. The physical storage resources are housed in a data center managed by the provider and are accessed by subscribers via a network. Some providers deploy a gateway to their cloud inside the data center, such as a NAS gateway that allows applications to access the cloud via standard protocols such as NFS or CIFS; some allow access via NFS or CIFS over the network; and others have proprietary protocols that allow users to connect to their resources. Storage in these instances can also be delivered as part of an application. An example would be data protection, replication, or archiving. In these instances software is used to move the data from the enterprise to the cloud.

Private storage clouds are a bit more complicated. Let's assume that cloud providers can offer storage at a lower per-gigabyte cost than if provisioned internally because of economies of scale and scope. The providers can deploy, provision, and manage storage more efficiently. The lower cost of acquisition and operations, though, is something every enterprise wants.

Now let's assume that an enterprise can replicate economies of scale and scope internally. This means that the storage that is deployed can be provisioned as needed, based on the requirements of each business unit. In this scenario, the IT department becomes the service provider to the business unit. In a sense, it becomes a private storage cloud vendor.

Now the question is how an enterprise storage manager can achieve these results. The answer is not simple. A private storage cloud can be file-based or block-based. The majority of storage capacity is being consumed by files, and the applications and users creating these files have variable requirements. Ideally, it would be advantageous to deploy one type of file-based storage solution that can be designated for all the use cases and their performance, security, capacity, and access requirements. This would be the ultimate dynamic data center, but that might be a stretch in today's market.

Another option is to deploy as few storage solutions for as many use cases as possible, where the storage system has the following characteristics and features:

  • Seamless scalability, and the ability to increase capacity and/or performance with zero downtime
  • 100% availability, 100% of the time, including during scheduled maintenance
  • Optimal capacity utilization
  • Manageability that enables as few administrators as possible to support an increasing amount of storage
  • "Green" technology; e.g., high density, low power consumption, and intelligent cooling design to reduce operating costs
  • Interoperability, with standard components and protocols to allow timely upgrades and adoption of next-generation technology

Scale-out vs. scale-up NAS 

Scale-out refers to the ability to scale a file-based system by adding nodes to the cluster, the same way a fast food restaurant would increase its ability to take orders by adding more cashiers. In general, scale-out NAS has a name space that can span multiple nodes while allowing access to data through any of the nodes in the name space.

Scale-up refers to the ability to scale a file-based system by replacing its hardware with faster components, the same way a fast food restaurant might increase its ability to take orders by replacing the existing cashier with a faster cashier and a faster register. In general, the name space can span only one node or two nodes clustered for high availability.


This article was originally published on March 01, 2010