Storage Service Management Guidelines
Service level agreements and the right software facilitate management and availability planning for backup and recovery.
By Juergen Ketterer and Martin Haworth
Users often perceive data backup as an irritation. Without this necessary evil, however, continued availability of services can be compromised in the event of operational error or software or hardware failure. For this reason, data backup and recovery are critical elements in IT service delivery and management.
As IT departments evolve and become internal service providers, proactive management and monitoring of quality of services will become that much more critical. IT management can use service level measures and reports to demonstrate the value that is delivered to organizations and to help maintain competitive cost structures in light of trends toward selective outsourcing.
Another key tool is backup and recovery management software, which gives IT service managers the data they need to proactively monitor and plan backup and recovery operations. This information can be leveraged in service level agreements (SLAs) and can be used to implement cost management and chargeback models for financial management.
This article provides an overview of storage service management and shows how it can be applied to availability, cost management, and data assurance planning and operations. Three scenarios are presented.
Service Level Agreements
Most storage service level agreements cover three categories: availability, time to recover, and cost recovery (or chargeback).
- Availability is defined as the percentage of time service is accessible and usable by users within agreed time constraints.
- Time to recover defines how quickly a service is restored in the event of an outage. This SLA also defines the maximum number of outages that are acceptable within an agreed time frame.
- Cost recovery is an optional activity that is a natural extension of IT's position as a service provider. The cost of delivering the service is recovered directly from users based on a model that reflects users' consumption of the service. In the case of backup operations, you might choose to bill users based on the number of gigabytes of data backed up during each operation. This approach is much more egalitarian than simply distributing the cost of the backup and recovery infrastructure equally across all users--although that approach also has merits (primarily, simplicity).
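The availability definition above reduces to simple arithmetic. The following is a minimal sketch, assuming that outages are measured in minutes and counted only within the agreed service hours; the function name and the sample figures are illustrative and do not come from any particular product.

```python
def availability_percent(agreed_minutes: float, outage_minutes: float) -> float:
    """Percentage of the agreed service time during which the service was usable."""
    if agreed_minutes <= 0:
        raise ValueError("agreed_minutes must be positive")
    if not 0 <= outage_minutes <= agreed_minutes:
        raise ValueError("outage_minutes must lie between 0 and agreed_minutes")
    return 100.0 * (agreed_minutes - outage_minutes) / agreed_minutes


# Hypothetical month: service agreed for 12 hours a day over 30 days,
# with a total of 90 minutes of outage inside those hours.
agreed = 30 * 12 * 60
print(round(availability_percent(agreed, 90), 2))  # prints 99.58
```

Only outages inside the agreed time constraints count against the figure, which is why a backup curfew scheduled outside service hours does not reduce availability.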
Clearly, backup and recovery play a significant part in supporting these service level objectives (SLOs). IT management needs to make sure curfews during data backups do not affect the availability of services at agreed times, for both interactive and batch-processing activities. IT management also needs to ensure that data recovery operations--and other remedial activity (such as replacing a defective hardware component)--can be completed within the mandated time to recover.
In addition to ongoing monitoring and management of the backup service, it is highly desirable to proactively analyze trends in backup performance (e.g., the duration and amount of data processed). This enables potential issues to be identified ahead of time and remedial activities to be planned and implemented before the backup service negatively affects the availability of application services.
Backup/Recovery Service Monitoring
Three related areas of a service level agreement are:
- Backup duration
- Restore duration
- Gigabytes of data processed
Backup and recovery solutions track these values and register the data using the Application Response Measurement (ARM) API, an emerging standard for measuring end-to-end response times of transactions in distributed environments. Application programs that use the ARM API act as sources of response time information for ARM-compliant system management and monitoring tools. These tools log ARM transaction information in repositories for subsequent service level reporting and analysis, and can supply the data to spreadsheets.
The service management capabilities of backup and recovery solutions expose data that can feed service management and availability planning activities. The data also helps identify and isolate potential or ongoing issues in delivering backup and recovery services.
In addition, when combined with enterprise management solutions, data backup and recovery solutions provide robust support for service level monitoring and management activities. Through service level management, IT staff can better meet the requirements of their customers and focus on new services and applications.
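To make the idea of an instrumented backup operation concrete, here is a minimal sketch of start/stop timing that records the three measures listed above. It is not the ARM API; `BackupRecord`, `measured`, and the in-memory `log` list are hypothetical names chosen for illustration, and a real deployment would hand the record to a monitoring tool rather than to a list.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class BackupRecord:
    operation: str           # "backup" or "restore"
    duration_seconds: float  # elapsed time between start and stop
    gigabytes: float         # volume of data processed


@contextmanager
def measured(operation, log, clock=time.perf_counter):
    """Time the enclosed block and append a BackupRecord to log when it ends."""
    progress = {"gigabytes": 0.0}
    start = clock()
    try:
        yield progress
    finally:
        # The record is written even if the operation raises an error.
        log.append(BackupRecord(operation, clock() - start, progress["gigabytes"]))


log = []
with measured("backup", log) as progress:
    # Stand-in for the real backup work; the figure is made up.
    progress["gigabytes"] = 42.0
```

The `clock` parameter exists only so that the timing source can be replaced, for example in tests.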
Service Management Reporting Scenarios
In the first scenario, we look at the use of service management data and data from the backup infrastructure to analyze the cause of a backup duration SLA violation (see Figure 1).
Correlating the appropriate metrics makes it clear that an increase in the usage of the network segment between the SAP database and backup servers is the cause of the problem. This increased usage reduces the network throughput available to the backup and results in an unacceptable increase in backup time. Further analysis would be needed to determine and rectify the root cause.
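One simple way to correlate two metrics is the Pearson correlation coefficient. The sketch below is an assumption about how such an analysis might be done, not a description of any particular tool; the nightly figures are invented for illustration. A coefficient close to 1 indicates that backup duration rises and falls with network utilization, which supports (but does not by itself prove) a causal link.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Hypothetical nightly samples: network segment utilization (%) and
# backup duration (hours) for the same seven nights.
utilization = [35, 38, 40, 55, 62, 70, 74]
duration = [6.1, 6.0, 6.3, 7.4, 8.0, 8.9, 9.2]
print(pearson(utilization, duration))
```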
The second scenario examines the use of service management data to proactively predict when SLOs and backup infrastructure limits will be reached (see Figure 2). The figure plots two measures over time: backup duration and the volume of data backed up. Also graphed are the forecasted values for these two measures for the next six months, the SLO for the backup (10 hours), and the physical capacity of the backup device currently being used (a 120GB optical jukebox).
The analysis provides two valuable pieces of information. Based on historical trend analysis, the time taken to perform the backup will exceed the agreed SLO at the beginning of September--in six months' time. In addition, the volume of data backed up will exceed the capacity of the optical jukebox in four months.
To prevent this situation from occurring, IT management must alter its backup infrastructure. An upgrade to a higher-capacity backup device addresses the storage capacity problem. However, further analysis is required to address potential throughput issues. System management and monitoring tools allow for quick analysis. IT management might also perform a similar analysis using time-to-restore data to ensure it can continue to meet the time-to-recover SLOs.
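A forecast of this kind can be approximated with a straight-line trend fitted to monthly history. The sketch below assumes a least-squares linear fit, which may differ from the method used to produce Figure 2; the monthly figures are hypothetical and were chosen only so that the results mirror the six-month and four-month findings described above.

```python
def fit_line(ys):
    """Least-squares line y = slope * x + intercept for x = 0, 1, ..., len(ys) - 1."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def months_until_exceeded(ys, limit, horizon=24):
    """Months after the last observation at which the trend first exceeds limit.

    Returns None if the limit is not exceeded within the horizon.
    """
    slope, intercept = fit_line(ys)
    last = len(ys) - 1
    for ahead in range(1, horizon + 1):
        if slope * (last + ahead) + intercept > limit:
            return ahead
    return None


# Hypothetical six months of history, one value per month.
backup_hours = [5.0, 5.5, 6.0, 6.5, 7.0, 7.5]
backup_gb = [80, 85, 90, 95, 100, 105]

print(months_until_exceeded(backup_hours, limit=10))  # 10-hour SLO
print(months_until_exceeded(backup_gb, limit=120))    # 120GB device capacity
```

A straight line is the simplest possible model; seasonal workloads or step changes in data volume would call for a more careful method.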
A final scenario involves the use of backup and recovery data in cost accounting and chargeback activities. Service management capabilities include the logging of the volumes of data processed by each backup operation. ARM-compliant collection software can log this data. Once logged, the information can be exported to a cost management tool.
Consider a sample excerpt from a cost management billing report. The costs shown are for illustrative purposes only. The actual cost rates would typically include items such as software and hardware depreciation, media and labor costs, and fixed infrastructure charges (such as computer suite floor space and power), and might be broken out separately. Fixed tariffs for the monthly provision of each service (in addition to the per-usage costs) might also be levied, depending on the terms of the SLA.
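A usage-based chargeback of this kind can be sketched in a few lines. The per-gigabyte rate, the fixed monthly fee, the department names, and the log format below are all hypothetical; they stand in for whatever the cost management tool and the SLA actually define.

```python
from collections import defaultdict
from decimal import Decimal

RATE_PER_GB = Decimal("0.50")         # hypothetical per-usage rate
MONTHLY_FIXED_FEE = Decimal("25.00")  # hypothetical fixed tariff per department


def monthly_charges(backup_log, rate_per_gb=RATE_PER_GB, fixed_fee=MONTHLY_FIXED_FEE):
    """Total each department's backed-up gigabytes and price them.

    backup_log is an iterable of (department, gigabytes) pairs, one per
    backup operation logged during the month.
    """
    gb_by_dept = defaultdict(Decimal)
    for dept, gigabytes in backup_log:
        gb_by_dept[dept] += Decimal(str(gigabytes))
    return {
        dept: (fixed_fee + gb * rate_per_gb).quantize(Decimal("0.01"))
        for dept, gb in gb_by_dept.items()
    }


# Hypothetical month of logged backup operations.
log_entries = [("finance", 40), ("sales", 12.5), ("finance", 60)]
for dept, charge in sorted(monthly_charges(log_entries).items()):
    print(dept, charge)
```

`Decimal` is used instead of floating point so that currency amounts round predictably. Setting the rate to zero and splitting the total cost evenly would give the simpler equal-distribution model mentioned earlier.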
Figure 1: Service management data can be used in conjunction with backup data to analyze the cause of a backup duration violation.
Figure 2: Service management data can be used to proactively predict when SLOs and backup limits will be reached.
Juergen Ketterer is storage management marcomm manager and Martin Haworth is operations and availability solutions support manager at Hewlett-Packard Co.