By Terri McClure
The information we store today is very different from the information we stored 30 years ago. In the past few years, every electronic device has become a content capture and sharing device. The Web has changed everything, and new ways of using the Web, considered Web 2.0 applications, have driven a new information economy.
We have entered the Internet Era of computing. Commercial enterprises are adapting the way they use technology to interact with their customers, partners, and even employees, leveraging the Web and creating a tremendous amount of unstructured, file-based rich data. Within a relatively short time, the majority of capacity under management in the commercial sector will be born as file-based data. Just as the small random-access file data generated in the distributed computing era quickly dwarfed the small random-access block data of the transactional era, the large-file, collaborative data of the Internet Era will do the same within organizations. And where large orders are still measured in terabytes in transactional or distributed environments, they are measured in multiple petabytes in Internet computing environments.
File storage encompasses a wide range of documents, as well as rich digital content such as video, audio, blogs, and wikis. These types of files are often referred to as unstructured data, and ESG research indicates that data growth in this area is exceeding that of other data types, reaching an estimated 62 exabytes of archived file data by 2012 and dwarfing database- and e-mail-based archive data (see figure).
It’s not just Web 2.0 driving this data growth; new business models such as cloud computing and service-oriented architecture (SOA) are driving unstructured data growth, too, and that content will co-exist in commercial enterprises with transactional and distributed content, requiring a mix of price, performance, and functionality. Most traditional storage players thus far cannot support the performance requirements of high-bandwidth file-based data, as their systems are optimized for small block/file transaction processing, which has very different performance characteristics. Just as the arrival of distributed computing brought a separate, but additive, type of data to contend with (file-based versus pure block-based transactional data), data generated in the Internet Era will also exhibit brand-new characteristics. And just as file-optimized storage devices found their way into mainstream commercial markets to co-exist with traditional core system devices, scale-out NAS arrays capable of addressing the specific attributes and requirements of today’s Web 2.0-generated data will also be required.
Scale-up vs. scale-out
Scale-out is not a new concept: It was long ago applied at the server level and is a key component of blade-based systems. Scale-out is a core requirement of the new generation of NAS architectures. Systems with the ability to independently scale and tune bandwidth, processing, and storage capacity on-the-fly—all while managing the file system in a single global namespace—will be the new backbone of NAS storage. Scale-out NAS is significantly different from the monolithic, scale-up NAS that developed with distributed computing. Scale-out NAS has been around for a while, but it has been tucked away in niche markets such as scientific computing and media/entertainment. But the advent of new models such as Web 2.0, SaaS, and SOA introduces the requirement for scale-out in commercial enterprises.
Scale-up NAS architectures are monolithic, with lots of storage sitting behind one or two NAS heads and scaling into the multi-terabyte range. Once the limit on capacity is hit, a new monolithic system is installed, with a new file system to manage. There is no way to share the workload between the systems, and migrating directories or files between systems means remapping and remounting for every client with access. Those who have been through it know the pain of the process; it can be excruciating and expensive in a large enterprise environment with lots of clients and zero tolerance for downtime.
Scale-out NAS meets the need for independent scale of storage capacity, processors, and bandwidth. Adding capacity and bandwidth, as well as file system expansion, is done online with minimal system performance impact. This granular scaling capability provides a price/performance advantage, as it allows users to start small and scale where needed. Scale-out NAS meets a real market requirement for efficiently dealing with large files typical of unstructured content. Recent ESG research indicates that scale-out NAS will be the fastest-growing segment of the file storage market (in both revenue and capacity) between 2007 and 2012, reaching 6.7 exabytes in 2012.
Enterprise-class features required today, such as remote mirroring, snapshots, and redundant components for high availability and DR, will be a core component of mainstream scale-out NAS systems. But time has taught many lessons regarding manageability, scalability, and efficiency. Those lessons, combined with the enormous quantity of file-based data that exists and will continue to explode, mean that scale-out file storage systems will need to incorporate most, if not all, of the following:
Clustering: A clustered file system runs concurrently on multiple physical nodes and is managed as a single entity. This removes the limitations of individual devices, erasing the boundaries of the boxes and enabling efficient management of multiple file servers; in short, it offers scale and ease of use. Scale-out systems can start with as few as two nodes but can scale well beyond that: users can start small and then grow to a massively parallel system. The performance ceiling is raised by adding more processors, and the capacity ceiling by adding more storage, for “just-in-time” scalability. And clusters are easy to manage because the entire cluster is managed as a single entity; IT managers simply cannot afford to manage hundreds of file systems individually.
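The economics of “just-in-time” scaling can be sketched with a back-of-the-envelope model: aggregate capacity and bandwidth grow roughly linearly as nodes join the cluster. The per-node figures below are hypothetical, chosen only for illustration; they do not describe any particular vendor’s hardware.

```python
# Back-of-the-envelope model of "just-in-time" cluster scaling.
# Per-node capacity and bandwidth figures are assumptions for
# illustration, not any real product's specifications.

NODE_CAPACITY_TB = 24     # usable capacity added per node (assumed)
NODE_BANDWIDTH_MBS = 400  # streaming bandwidth added per node (assumed)

def cluster_totals(nodes: int) -> tuple[int, int]:
    """Aggregate capacity (TB) and bandwidth (MB/s) of an n-node cluster,
    assuming near-linear scaling as nodes are added online."""
    return nodes * NODE_CAPACITY_TB, nodes * NODE_BANDWIDTH_MBS

# Start small with two nodes, then grow the same file system online.
for n in (2, 8, 32):
    cap, bw = cluster_totals(n)
    print(f"{n:>2} nodes: {cap} TB, {bw} MB/s")
```

The point of the sketch is the contrast with scale-up: growing from 2 to 32 nodes raises both ceilings in one file system, rather than forcing a forklift upgrade to a second, separately managed box.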
Global namespace: This is a simple concept that is extremely difficult to achieve. Basically, a global namespace is a virtual representation of a group of disparate physical file systems. It is the secret sauce that enables a single point of management and advanced features such as non-disruptive data migration and load balancing.
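The idea can be illustrated with a minimal sketch: clients see one virtual tree, while a mapping table routes each virtual path to a physical file system. All class, path, and node names below are hypothetical; real global namespace implementations are far more involved.

```python
# Minimal sketch of a global namespace: one virtual tree for clients,
# with a mapping table routing virtual paths to physical file systems.
# Names, paths, and node identifiers are hypothetical.

class GlobalNamespace:
    def __init__(self) -> None:
        # virtual directory prefix -> physical file system location
        self._mounts: dict[str, str] = {}

    def map(self, virtual_prefix: str, physical_fs: str) -> None:
        self._mounts[virtual_prefix] = physical_fs

    def resolve(self, virtual_path: str) -> str:
        # Longest-prefix match, so nested mappings win over parents.
        for prefix in sorted(self._mounts, key=len, reverse=True):
            if virtual_path.startswith(prefix):
                return virtual_path.replace(prefix, self._mounts[prefix], 1)
        raise FileNotFoundError(virtual_path)

ns = GlobalNamespace()
ns.map("/corp/media", "nas-node3:/vol/media")
print(ns.resolve("/corp/media/clip.mov"))  # nas-node3:/vol/media/clip.mov

# Non-disruptive migration: move the data, update one table entry.
# Clients keep using the same virtual path, with no remount.
ns.map("/corp/media", "nas-node7:/vol/media")
print(ns.resolve("/corp/media/clip.mov"))  # nas-node7:/vol/media/clip.mov
```

The second `map` call is the payoff: migration and load balancing become a table update behind the namespace rather than a remap-and-remount exercise on every client.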
Power efficiency: Scale-out NAS is inherently power efficient due mainly to its granular scalability.
Self-managing and self-healing: The infrastructure will need to withstand failures and automatically adjust and heal itself. It will absorb new processor, bandwidth, and storage capacity, then automatically re-balance and optimize across the newly added resources, with little or no human intervention.
Transparent data mobility: Transparent data mobility is an important feature; it allows file-based storage to be consolidated without enterprise-wide downtime and helps balance load across processors and disks.
Tiered storage support: Tiered storage support is an advanced feature that will become prevalent in scale-out systems as the market and systems continue to mature. It is an important feature because all data is not created equal. Quite simply, to run a cost-effective IT organization, data needs to be managed and stored according to what stage of its lifecycle it is in.
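A lifecycle-based placement policy can be sketched in a few lines: files migrate to cheaper tiers as they cool. The age thresholds and tier names below are assumptions for illustration, not any vendor’s actual policy.

```python
# Illustrative tiered-storage policy: place a file on a tier based on
# how recently it was accessed. Thresholds and tier names are assumed.

def choose_tier(days_since_access: int) -> str:
    if days_since_access <= 30:
        return "fast"      # e.g., high-performance disk for active data
    if days_since_access <= 365:
        return "capacity"  # dense, lower-cost disk for cooling data
    return "archive"       # cheapest tier for cold, rarely read data

for age in (3, 90, 900):
    print(f"{age} days since access -> {choose_tier(age)} tier")
```

In a mature scale-out system, a policy like this would run continuously and move files between tiers transparently, with the global namespace keeping client paths unchanged.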
A number of systems incorporating some or all of the above features are on the market today from lesser-known NAS vendors, and the better-known NAS vendors either have solutions available or have announced their intention to offer scale-out solutions. But there is not yet a clear market leader, creating an opportunity for smaller vendors to establish a beachhead. Who knows, maybe one will be the next EMC or NetApp. For now, it’s anyone’s game.
Terri McClure is an analyst with the Enterprise Strategy Group (www.enterprisestrategygroup.com).