Users should consider archiving, data compression, and data de-duplication.
By Greg Schulz
August 15, 2007—Organizations of all sizes are generating and depending on larger amounts of data that must be readily available and easily accessible. This growth in data results in an ever-increasing data footprint: More data is being generated, copied, and stored for longer periods of time.
Consequently, IT organizations have to effectively manage more infrastructure resources, including servers, networks, and storage, to ensure data is protected in a timely manner while at the same time providing adequate performance and capacity and securing data for access when needed.
Your data footprint is the total storage capacity needed to support your various business applications and information needs. Your data footprint may, in fact, be larger than the amount of actual storage you have, or, as in the following example, you may have more aggregate data storage capacity than actual data.
As an example, say you have 2TB of Oracle database instances and associated data, 1TB of Microsoft SQL Server data, 2TB of Exchange e-mail data, and 4TB of shared NFS and CIFS file-sharing storage, resulting in 9TB of data; however, your actual data footprint could be much larger. The 9TB simply represents the known data or how storage is allocated to different applications and functions. If the databases are sparsely populated at 50%, for example, only 1TB of Oracle data actually exists while occupying 2TB of storage capacity.
Assuming, for now, that in the above example the capacity sizes mentioned are fairly accurate to the actual data size based on how much data is being backed up during a full backup, your data footprint would include the 9TB of data as well as the online (primary), nearline (secondary), and offline (tertiary) storage configured to your specific data protection and availability service requirements. For example, if you are using RAID-1 mirroring for data availability and accessibility, in addition to sending your data asynchronously to a second site where the data is protected on a RAID-5-based volume with write cache, as well as a weekly full backup, then your data footprint would be at least 37TB (9 x 2 for RAID 1) + (9+1 for RAID 5) + (9 for full backup).
Your data footprint could be even higher than 37TB in this example if you also assume that daily incremental or periodic snapshots are performed, in addition to the extra storage required to support application software, temporary work space, and operating system files, etc.
As can be seen from this example, 9TB of actual or assumed data can rapidly expand into a much larger data footprint. Note that the above scenario is rather simplistic and does not factor in how many duplicate copies of data are being made, backup retention, the size of snapshots, free-space requirements, and other items that contribute to the expansion of your data footprint.
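The arithmetic behind the example above can be sketched as a quick calculation. This is only an illustration of the example's numbers (9TB of data, RAID-1 mirroring at the primary site, RAID-5 at the secondary site, one weekly full backup); the function names are for clarity, not any particular product.

```python
def raid1_capacity(data_tb):
    """RAID-1 mirroring stores two full copies of the data."""
    return data_tb * 2

def raid5_capacity(data_tb, drives=10):
    """RAID-5 adds roughly one drive's worth of parity; with the
    example's 9 data drives plus 1 parity drive, 9TB needs 10TB raw."""
    return data_tb * drives / (drives - 1)

primary = 9  # TB of known application data

footprint = (
    raid1_capacity(primary)    # 18TB mirrored at the primary site
    + raid5_capacity(primary)  # 10TB on the RAID-5 secondary volume
    + primary                  # 9TB for one weekly full backup
)
print(f"{footprint:.0f}TB")  # 37TB, before snapshots and incrementals
```

Adding daily incrementals, snapshot reserves, and free-space headroom to this calculation only pushes the total higher, which is the point of the example.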
Reducing the data footprint
While storage capacity has, in fact, become less expensive, as your data footprint expands, more storage capacity and storage management—including software tools and IT staff time—are required to manage and protect your business information. By more-effectively managing your data footprint across different applications and tiers of storage, you can enhance application service delivery and responsiveness, as well as facilitate more timely data protection to meet compliance and business objectives.
Reducing your data footprint can help reduce costs and allow you to defer upgrades to expand server, storage and network capacity, along with associated software license and maintenance fees. Maximizing what you already have by using data footprint reduction techniques can extend the effectiveness of your existing IT resources, including power, cooling, capacity, network bandwidth, replication, backup, archiving, and software license resources.
From a network perspective, by reducing your data footprint or its impact, you can also positively impact SAN, LAN, MAN, and WAN bandwidth for data replication and remote backup or data access, as well as move more data with existing bandwidth. Additional benefits of maximizing the usage of your existing IT resources include
- Deferring hardware and software upgrades;
- Enabling free space to facilitate consolidation and data migration to energy-efficient platforms;
- Decreasing the time required for data protection, including file system scans and data movement;
- Reducing your power and cooling requirements by increasing utilization of existing storage;
- Expediting data recovery and application restart for disaster-recovery scenarios;
- Lowering the impact from file system scans for backup and other overhead functions; and
- Reducing exposure during RAID rebuilds due to faster copy times and denser data.
IT organizations have taken different approaches to address the challenges associated with a growing data footprint, balancing service delivery (performance, availability, capacity, compliance) with cost, including operating expense (OPEX) and capital expense (CAPEX), while ensuring business continuance (BC) and disaster-recovery (DR) requirements are met. While DR and compliance have been in the news recently, along with data security, another topic that is gaining attention is "green" storage and IT infrastructure—specifically, reducing power and cooling costs.
For some organizations, the solution to reducing data footprint involves restricting the use of storage. Examples include limiting database size and/or placing restrictions on e-mail box size and user disk space quotas. While limits and quotas can have their place, their implementation should not hinder users' productivity.
Another approach is to simply add more hardware. After all, disk prices continue to drop rapidly. However, bear in mind that while disk hardware can be relatively inexpensive, it still requires software and management, including backup and other functions, which result in personnel and other "soft" costs.
Three approaches to reducing the data footprint are archiving, data de-duplication, and data compression.
Archiving unused data
Data archiving can have one of the greatest impacts on reducing your data footprint for storage in general, but particularly for online and primary storage. For example, if you can identify in a timely manner what data can be removed after a project is completed or what data can be purged from a primary database or older data migrated out of active e-mail databases, you should realize a net improvement in application performance as well as available storage capacity.
A challenge with archiving is having the time and tools available to identify what data should be archived and what data can be securely destroyed when no longer needed. Further complicating archiving is that knowledge of the data value may also be needed; this may well include legal issues as to who is responsible for making decisions on what data to keep or discard. If you can invest in the time and software tools, as well as identify which data to archive to support an effective archive strategy, then the returns can be very positive toward reducing your data footprint without limiting the amount of information available to your business.
SIS and data de-dupe
Single instance storage (SIS), or data de-duplication, assumes that duplicate files exist on a server or storage system being backed up, and that over time the same unchanged files get repeatedly backed up. SIS works by normalizing the data being backed up and subsequently stored; that is, instead of storing each file containing the same data, keep one copy of the actual data and maintain multiple pointers to the data representing the various files being backed up.
As with pointer-based point-in-time (PIT) snapshots, the benefits are faster data protection and less storage capacity consumed, while still enabling rapid retrieval of data. SIS approaches trade processing time to ingest and eliminate duplicate data for savings on the storage capacity needed to hold backed-up data. This assumes there is a high degree of commonality and repeating data among the files being backed up. Consequently, SIS and data de-duplication solutions perform best when deployed in support of backup operations, and to a lesser degree for archiving. Data de-duplication may not be practical for online applications today. Some SIS-enabled solutions, such as virtual tape libraries (VTLs), also combine data compression to further reduce data footprint requirements.
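The single-instance idea can be illustrated with a short sketch. This is a hypothetical file-level de-dup store using content hashes; commercial products typically de-duplicate at the sub-file (block or chunk) level, but the pointer mechanism is the same in principle.

```python
import hashlib

# Toy single-instance store: files with identical contents are stored
# once; each filename becomes a pointer (a hash) to that one copy.
store = {}     # content hash -> actual data, stored once
pointers = {}  # filename -> content hash, one pointer per file

def backup(filename, data):
    digest = hashlib.sha256(data).hexdigest()
    if digest not in store:      # new content: keep one real copy
        store[digest] = data
    pointers[filename] = digest  # duplicate content: just a pointer

def restore(filename):
    return store[pointers[filename]]

# Three files backed up, two of which are identical copies.
backup("report_v1.doc", b"quarterly results")
backup("copy_of_report.doc", b"quarterly results")
backup("notes.txt", b"meeting notes")

print(len(pointers), "files backed up,", len(store), "unique copies stored")
```

Restoring any file simply follows its pointer back to the single stored copy, which is why large-scale bulk restores can take longer than with conventional backup formats.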
Data compression
Compression can be used not only for backup and archive, but also for primary storage, and is widely used in IT as well as in consumer electronics. It is implemented in hardware and/or software to reduce the size of data, creating a corresponding reduction in network bandwidth and storage capacity requirements. Compression is complementary to archiving, backup, and other functions, including supporting primary storage and data applications. For example, compression is commonly implemented in several locations, including databases, e-mail, operating systems, tape drives, network routers, and compression appliances.
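A minimal software example shows how lossless compression finds redundancy within a single stream of data, which is why it yields some reduction even when no two files are identical. This sketch uses Python's standard zlib module; actual ratios depend heavily on the data being compressed.

```python
import zlib

# Repetitive structured data, typical of logs or exported records.
text = b"customer,region,amount\n" * 1000

compressed = zlib.compress(text)
ratio = len(text) / len(compressed)
print(f"{len(text)} bytes -> {len(compressed)} bytes "
      f"({ratio:.0f}:1 reduction)")

# Decompression recovers the original data exactly (lossless).
assert zlib.decompress(compressed) == text
```

The same principle underlies the compression built into tape drives, databases, and WAN appliances, even though those implementations run in hardware or firmware rather than application code.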
Some data de-duplication solutions boast spectacular ratios for data reduction, given specific scenarios such as backup of repetitive files, while providing little value over a broader range of applications. This is in contrast to data compression approaches that provide lower, yet more-predictable and consistent data-reduction ratios, over more types of data and applications, including primary storage. For example, in environments where there is little or no common or repetitive data files, data de-duplication will have little to no impact while data compression generally will yield some amount of data footprint reduction across almost all types of data. Some data de-duplication vendors have either already added, or have announced plans to add, compression techniques.
Data footprint reduction
Many vendors' sales pitches lead with messages focused on reducing OPEX or CAPEX costs or doing more with less, which for many environments is a good thing. Doing more with what you already have can be interpreted as "doing more with less;" however, it also has different meanings (e.g., increasing capacity utilization of existing disk and tape systems, leveraging virtualization techniques to consolidate server workloads, maximizing available power and cooling capacity, etc.).
There are many different attributes to consider when evaluating data footprint reduction technologies. Which features are the most important for you will depend on your environment and requirements.
In the storage market, there has been a vertical- or product-centric focus on how to reduce data footprints; for example, reducing the amount of stored data and associated storage capacity to improve backup objectives. Another focus has been on promoting fixed-content archiving, e-discovery content search, and data indexing for legal, regulatory, or other compliance purposes. There are many opportunities to reduce your data footprint to improve overall service delivery, enhance management, and reduce spending on hardware and software.
One issue to consider is how much delay or additional resource consumption you can tolerate to achieve a given level of data footprint reduction. For example, as you move from coarse (traditional compression) to granular (data de-duplication) technologies, more intelligence, processing power, or offline post-processing techniques are needed to look at larger patterns of data to eliminate duplication. Similarly, understand what delays may occur as a result of using SIS-based data footprint reduction techniques during large-scale bulk data restorations.
You may want to consider a data footprint reduction strategy that combines various technologies to address specific applications as well as your overall environment, including online, nearline backup, and offline archiving. Following are some general recommendations and suggestions to help address your growing data footprint, all of which depend on the size and scope of your particular environment, applications, and service requirements.
- If you are evaluating data footprint reduction technologies for future use, including archiving with data discovery (indexing, e-discovery), consider leveraging appliance-based compression technology to maximize the capacity of existing storage resources for online, backup, and archiving, in conjunction with other data footprint reduction capabilities;
- Maximize use of your existing IT resources without introducing complexity and costs associated with added management and interoperability headaches. Look for solutions that complement your environment and are transparent across different tiers of storage, business applications, and other functions (backup, archive, replication, etc.);
- Data archiving should be an ongoing process that is integrated into your business and IT resource management functions, as opposed to being an intermittent event to free up IT resources; and
- Get a handle on your data footprint and its impact on your environment using analysis tools and/or assessment services. Develop a holistic approach to managing your growing data footprint. Look beyond storage hardware costs, and factor in software license and maintenance costs, as well as power, cooling, and staff management time.
There are several different techniques that can be used individually to address specific data footprint reduction issues, or those techniques can be used in various combinations to implement a more comprehensive and effective data footprint reduction strategy. The benefit of a broader, more-holistic, data footprint reduction strategy is to address your overall environment, including all applications that generate and use data as well as overhead functions that impact your data footprint.
Reducing your data footprint has many benefits, including maximizing the usage of your IT infrastructure resources such as power and cooling, storage capacity, and network bandwidth, while enhancing application service delivery in the form of timely backup, BC/DR, performance, and availability.
Look to combine technologies and techniques to address your various data footprint challenges, and to maximize your IT resources while reducing management costs and complexity.
Greg Schulz is founder and senior analyst of the StorageIO Group and author of the book, Resilient Storage Networks—Designing Flexible Scalable Data Infrastructures (Elsevier Digital Press).