Policy management affects all levels of the storage hierarchy, from disk array controllers to high-level storage management software.
By Bob Rogers
Policy management is one of the newest catchphrases in the storage industry. It is being applied to virtually everything relating to storage. Arrays have internal policy management algorithms to manage the cache, switches have internal policy management algorithms for flow control, and volume managers have internal policy management algorithms to manage allocation. But policy management covers even more than this, because it applies to a broad spectrum of functions.
A policy has three distinct attributes (see table):
- A definition;
- A method of evaluating status, or deciding whether the policy is being met; and
- A method, or set of methods, for resolving an exception condition, either by alleviating it or by enforcing the policy.
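To make the three attributes concrete, here is a minimal sketch that models a policy as a definition plus an evaluation function and an enforcement function. The `Policy` class, its field names, and the scratch-disk example are hypothetical illustrations, not part of any product or standard.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Policy:
    """Hypothetical model of a policy's three attributes."""
    definition: str                    # human-readable statement of intent
    evaluate: Callable[[dict], bool]   # True when the resource is "in spec"
    enforce: Callable[[dict], str]     # action taken on an exception condition

    def apply(self, state: dict) -> Optional[str]:
        # Act only when evaluation reports an exception condition.
        if self.evaluate(state):
            return None
        return self.enforce(state)


# Hypothetical example: a scratch disk that must start each day empty.
scratch_empty = Policy(
    definition="Scratch disk starts the day empty",
    evaluate=lambda s: s["used_gb"] == 0,
    enforce=lambda s: f"purge {s['used_gb']} GB from scratch disk",
)

print(scratch_empty.apply({"used_gb": 0}))    # None
print(scratch_empty.apply({"used_gb": 120}))  # purge 120 GB from scratch disk
```

Keeping evaluation and enforcement as separate functions mirrors the separation in the list: the same detection logic can be paired with different corrective actions.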
Why should you care?
Policy management is a way to differentiate service based on one or more attributes. For example, a data center might formulate a policy that a particular disk should start out empty every day because it fills up over the course of the day. Another example might be a preference that I/O bypass the cache if the program indicates that it is cache-unfriendly.
Policy management is important because not all data is the same. Every environment has some applications that are important and others that are relatively unimportant. Furthermore, every environment has some programs that behave well under resource constraints and others that do not. The minimal goal of any policy manager is to classify, evaluate, and enforce requests for service.
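The classify/evaluate/enforce cycle for a single service request can be sketched as follows. The application names, the importance table, and the rule that only the two most important classes are admitted under constraint are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    app: str
    cache_friendly: bool


# Hypothetical installation-defined ranking (lower number = more important).
IMPORTANCE = {"payroll": 0, "web": 1, "reporting": 2}


def classify(req: Request) -> int:
    # Unknown applications fall into the least important class.
    return IMPORTANCE.get(req.app, max(IMPORTANCE.values()) + 1)


def evaluate(req: Request, constrained: bool) -> bool:
    # Under constraint, only the two most important classes are admitted at once.
    return not constrained or classify(req) <= 1


def enforce(req: Request, constrained: bool) -> dict:
    return {
        "admit": evaluate(req, constrained),
        # Cache-unfriendly programs bypass the cache, as in the example above.
        "use_cache": req.cache_friendly,
    }


print(enforce(Request("payroll", cache_friendly=False), constrained=True))
# {'admit': True, 'use_cache': False}
```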
Finally, policy management is important because a breach in the authentication of a user (i.e., determining a user's identity) may expose the entire storage configuration.
Policies don't exist in a vacuum
There is a significant infrastructure necessary to support policy decisions: discovery, detection, interpretation, modeling and analysis, actions, and feedback.
Discovery is the process of identifying the environment. What does the storage landscape look like? How much storage is there? Where are the "dense areas," and where are there opportunities to "pack more in?"
Detection refers to the task of evaluation and decision-making (as shown in the table). Is anything "out of spec?"
The interpretation task is more difficult to accomplish. The policy manager (human or software) needs to be able to look at the situation and decide what to do. Is there a spare box somewhere to move data onto? Should an application be throttled back on its I/O so that another user or application gets better service? For a knowledgeable expert, this task might be simple, but for a junior administrator or a software-based policy manager, it is relatively difficult. Your definitions, priorities, and the environment all factor into the interpretation of the issue and the options to alleviate the problem.
The modeling and analysis stage is an extension of interpretation, because every good administrator reviews the options and chooses the "best fit" solution to the problem. Consider a situation where resources are constrained and three mission-critical applications are affected. Giving all the spare resources to the most important application may not be enough to solve its problem, whereas dividing them between the two applications behind it might bring both back within their objectives. Do you let all three applications suffer, or provision the two applications and knowingly miss your service level objective on the third? These are difficult decisions for storage administrators and even harder ones for software-based policy managers.
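The trade-off in this scenario can be modeled as a small search over which applications to fully provision. The sketch below assumes hypothetical shortfall figures (extra resource units each application needs to meet its objective) and a simple scoring rule: meet as many objectives as possible, then prefer the set containing more important applications. Real policy managers may weigh objectives very differently.

```python
from itertools import combinations

# Hypothetical shortfalls: extra units each application needs to meet its SLO.
shortfall = {"app_a": 8, "app_b": 3, "app_c": 4}
priority = {"app_a": 1, "app_b": 2, "app_c": 3}  # 1 = most important
spare = 7


def best_fit(shortfall, priority, spare):
    """Choose which applications to fully provision with the spare units.

    Maximizes the number of objectives met; ties go to the set whose
    priority numbers sum lowest (i.e., the more important applications).
    """
    apps = list(shortfall)
    candidates = [
        subset
        for k in range(len(apps) + 1)
        for subset in combinations(apps, k)
        if sum(shortfall[a] for a in subset) <= spare
    ]
    return max(candidates, key=lambda s: (len(s), -sum(priority[a] for a in s)))


print(best_fit(shortfall, priority, spare))  # ('app_b', 'app_c')
```

With 7 spare units, giving everything to `app_a` still leaves it short, so the model provisions the two applications behind it, which is exactly the dilemma described above.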
Enforcement and mitigation actions come out of the interpretation, modeling, and analysis stages. Enforcement actions are triggered by an application or user that is affecting other users so severely that their service level objectives are compromised, or by applications that have exceeded their resource budgets.
Finally, the feedback stage is necessary to determine that the issue has been resolved. It is very common to build policies on top of policies so that if one fails, the next policy deals with the failure.
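Feedback and layered policies can be sketched as a loop that applies one corrective action at a time and re-measures after each. The `resolve` function, the free-space check, and both actions below are hypothetical stand-ins for real measurements and remedies.

```python
from typing import Callable, List, Tuple


def resolve(check: Callable[[], bool],
            layers: List[Tuple[str, Callable[[], None]]]) -> str:
    """Apply layered policies until the feedback check reports success."""
    if check():
        return "in spec"
    for name, action in layers:
        action()
        if check():  # feedback: did this layer fix the problem?
            return f"resolved by {name}"
    return "unresolved: escalate to administrator"


# Hypothetical free-space scenario.
state = {"free_pct": 3}


def check() -> bool:
    return state["free_pct"] >= 10


def purge_temp() -> None:
    state["free_pct"] += 4


def migrate_old_files() -> None:
    state["free_pct"] += 5


layers = [("purge_temp", purge_temp), ("migrate", migrate_old_files)]
print(resolve(check, layers))  # resolved by migrate
```

The first layer frees too little, the re-check fails, and the next policy in the chain handles the failure.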
Policies never exist in a vacuum; there are always reasons behind a policy that must be taken into consideration.
Types of policies
There are several types of policies. The simplest form is an authorization policy (e.g., who can do what). The next type is resource management policy, which refers to the way a device or other resource (e.g., disk I/O capacity) might be used. The third and most complex type is a data management policy, which is related to applications, business processes, or some other abstract grouping and characterizes the importance of that business element compared to other business elements.
All these policies can be expressed as rules (and generally are considered to be rules). The difference between "rule-based" and "policy-based" management arises when you begin to evaluate policies in the data management context: there are more tradeoffs to analyze and consider, and the decision-making process is much more complex.
Authorization policies exist everywhere. For example, a device configurator that requires a login has a simple, effective policy to protect against intruders who might disable, corrupt, or replace installation settings. (The login authenticates the user's identity so that authorization checks can be made.) Security systems such as Kerberos have embodied these types of policies for many years. The primary difficulty has been relying on anything other than a native, built-in scheme to manage authorization. Only in homogeneous environments like Windows or OS/390 have authorization schemes been adopted that extend to storage policies, and even there, they are used on a very limited basis.
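An authorization policy of the "who can do what" kind reduces to a lookup once the user has been authenticated. The roles, users, and operation names below are hypothetical; the sketch shows only the authorization check, not the login itself.

```python
# Hypothetical authorization table for a device configurator:
# which roles may perform which operations.
PERMISSIONS = {
    "admin": {"view", "create_lun", "delete_lun", "change_settings"},
    "operator": {"view", "create_lun"},
    "auditor": {"view"},
}

# Hypothetical user-to-role mapping, valid only after login (authentication).
ROLES = {"alice": "admin", "bob": "operator", "carol": "auditor"}


def is_authorized(user: str, operation: str) -> bool:
    """Authorization check; assumes the user's identity is already authenticated."""
    role = ROLES.get(user)
    return role is not None and operation in PERMISSIONS.get(role, set())


print(is_authorized("bob", "create_lun"))  # True
print(is_authorized("bob", "delete_lun"))  # False
```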
One of the objectives of storage resource management (SRM) software is to insulate the administrator from the difficulty of dealing with tens, hundreds, or even thousands of instances of authentication and authorization needed to collect and manage enterprise storage configurations.
SRM software usually deals with authentication by adding its own layer of user authentication. It insulates the administrator by using a proxy function that impersonates individual users on each of the managed systems "under the covers." This technique works well when the SRM software is simply collecting and distilling information. However, a more complex mapping of authorization properties may be necessary to actually provide management functions.
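The proxy approach can be sketched as a single SRM login followed by a per-system lookup of the native account used on the user's behalf. The `SRMProxy` class and its credential tables are hypothetical, and passwords are held in plain text purely for illustration; a real product would use a protected credential store.

```python
from typing import Dict, Optional


class SRMProxy:
    """Hypothetical SRM layer: one login, per-system accounts used under the covers."""

    def __init__(self, users: Dict[str, str], vault: Dict[str, Dict[str, str]]):
        self._users = users    # SRM user -> password (plain text for illustration only)
        self._vault = vault    # managed system -> {SRM user: native account}
        self._session: Optional[str] = None

    def login(self, user: str, password: str) -> bool:
        # The administrator authenticates once, to the SRM layer only.
        if self._users.get(user) == password:
            self._session = user
            return True
        return False

    def native_account(self, system: str) -> str:
        # Impersonation: map the SRM user to the account used on a managed system.
        if self._session is None:
            raise PermissionError("not authenticated to SRM")
        try:
            return self._vault[system][self._session]
        except KeyError:
            raise PermissionError(f"no mapping for {self._session} on {system}") from None


srm = SRMProxy({"alice": "pw"}, {"array1": {"alice": "svc_array1"}})
srm.login("alice", "pw")
print(srm.native_account("array1"))  # svc_array1
```

The mapping table is where the "more complex mapping of authorization properties" would live once the SRM software moves from collecting information to performing management actions.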
However authorization policies are handled, it is imperative that they meet two criteria:
- Control who is allowed to perform a task; and
- Oversee the process of performing a task to ensure it meets installation goals.
The first of these two criteria is almost self-evident; an authorization policy should always determine who should be allowed to perform a task.
The second criterion may require more explanation, because one of the objectives of data-center management is "process consistency" for complex, multi-step operations. For example, when a new LUN is created, the same set process is followed each time so that it is done correctly. This criterion arises from the disasters almost everyone has observed and traced back to some junior on-call person at 2 AM.
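Process consistency can be enforced by a guard that refuses to run the steps of a multi-step operation out of order. The step names for LUN creation below are hypothetical placeholders; an installation would substitute its own procedure.

```python
# Hypothetical checklist for creating a LUN; step names are illustrative only.
LUN_CREATION_STEPS = [
    "verify_free_capacity",
    "create_lun",
    "mask_lun_to_host",
    "update_zoning",
    "record_in_inventory",
]


class ProcessGuard:
    """Enforces that the steps of a multi-step operation run in the defined order."""

    def __init__(self, steps):
        self.steps = list(steps)
        self.completed = 0

    def perform(self, step: str) -> None:
        if self.completed >= len(self.steps):
            raise RuntimeError("process already complete")
        expected = self.steps[self.completed]
        if step != expected:
            raise RuntimeError(f"out of order: expected {expected!r}, got {step!r}")
        self.completed += 1  # the actual work for this step would run here

    @property
    def done(self) -> bool:
        return self.completed == len(self.steps)
```

Skipping straight to `create_lun` raises an error instead of silently producing a LUN that was never checked against free capacity.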
Authorization policies are a prerequisite for resource and data management policies.
Resource management policies tend to have a device focus because devices are usually the most costly resources in an installation. The policies associated with resource management and devices tend to relate to capacity, performance, and availability, and in some cases, reliability.
The capacity policies of a device or resource are related to the cost of ownership. An expensive device should be used to achieve its maximum value. Ideally, a storage device would operate at 100% of its capacity all the time. However, the real world is far from ideal, so an example of a resource management policy would be to operate a device between 90% and 95% of its capacity so that 5% to 10% of its capacity is reserved for unanticipated surges in utilization.
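The 90% to 95% operating band from this example can be expressed as a simple threshold check. The function name and return labels are illustrative choices; only the band itself comes from the text.

```python
def capacity_status(used: float, capacity: float,
                    low: float = 0.90, high: float = 0.95) -> str:
    """Classify utilization against a target band (90%-95% in the example above)."""
    utilization = used / capacity
    if utilization > high:
        return "over"   # the reserve for unanticipated surges is being consumed
    if utilization < low:
        return "under"  # the device is not delivering its full value
    return "ok"


print(capacity_status(920, 1000))  # ok
print(capacity_status(980, 1000))  # over
```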
The same reasoning might be applied to performance-related policies. Most disk subsystems do reasonably well at servicing I/O as the load increases, until the I/O rate saturates the subsystem. Thus, in terms of resource management policies, it might be wise to watch for this "knee-of-the-curve" I/O rate so that action can be taken before saturation.
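One simple way to locate the knee from measurements is to compare each segment's slope (response time per unit of I/O rate) with the slope at light load. The factor of 3 and the sample data are hypothetical; this is a heuristic sketch, not a standard method.

```python
from typing import List, Optional, Tuple


def find_knee(samples: List[Tuple[float, float]], factor: float = 3.0) -> Optional[float]:
    """Return the I/O rate at which response time starts climbing steeply.

    samples: (io_rate, response_time) pairs sorted by io_rate.
    The knee is the start of the first segment whose slope exceeds
    `factor` times the slope of the first (lightly loaded) segment.
    """
    if len(samples) < 3:
        return None

    def slope(i: int) -> float:
        (x0, y0), (x1, y1) = samples[i], samples[i + 1]
        return (y1 - y0) / (x1 - x0)

    baseline = slope(0)
    for i in range(1, len(samples) - 1):
        if slope(i) > factor * baseline:
            return samples[i][0]
    return None


# Hypothetical measurements: (I/O per second, response time in ms).
samples = [(1000, 5.0), (2000, 5.5), (3000, 6.0), (4000, 7.0), (5000, 12.0)]
print(find_knee(samples))  # 4000
```

A policy could then set its alert threshold somewhat below the detected knee so that action is taken while there is still headroom.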
Availability-based resource management policies can have some interesting permutations. For example, RAID techniques and remote mirroring affect availability.
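For mirroring, a common first approximation assumes the copies fail independently and the data remains available as long as any one copy survives, giving availability 1 - (1 - a)^n for n copies that each have availability a. The independence assumption is an idealization: shared controllers, power, sites, or software can all violate it.

```python
def mirrored_availability(a: float, copies: int = 2) -> float:
    """Availability of data kept on `copies` replicas, each with availability `a`.

    Assumes independent failures and that any single surviving copy suffices.
    """
    return 1 - (1 - a) ** copies


print(round(mirrored_availability(0.99), 6))  # 0.9999
```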
Establishing resource management policies that maximize utilization is almost a black art. In established environments, past history may be a good indicator of how to set policies. However, where there is no history, where past trends are misleading, or where the trend depends on the size of the observation window because of workload cycles, there may be no easy way to establish a resource management policy that maximizes utilization.
Once thresholds have been established and you have metrics to evaluate, the next stage is to define actions to take when thresholds are exceeded. Some actions may be as simple as beginning the procurement process for more disk when you are running out of space. However, unless your IT management staff is very flexible, you may need to tailor this action to include projected requirements for two, four, or even six months of disk provisioning.
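A projection for two, four, and six months can be sketched with an ordinary least-squares line fitted to monthly usage. This assumes roughly linear growth, which, as noted above, may not hold for cyclic workloads; the history values are hypothetical.

```python
from typing import Dict, List, Sequence


def project_capacity(monthly_used_tb: List[float],
                     horizons: Sequence[int] = (2, 4, 6)) -> Dict[int, float]:
    """Project usage `h` months past the last sample with a least-squares line."""
    n = len(monthly_used_tb)
    if n < 2:
        raise ValueError("need at least two months of history")
    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(monthly_used_tb) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, monthly_used_tb))
    slope = sxy / sxx                      # growth per month
    intercept = mean_y - slope * mean_x
    last = n - 1
    return {h: intercept + slope * (last + h) for h in horizons}


# Hypothetical history: terabytes in use at the end of each month.
print(project_capacity([10, 12, 14, 16]))  # {2: 20.0, 4: 24.0, 6: 28.0}
```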
Other, more subtle, actions may also be required, such as re-balancing performance or capacity across devices or changing configuration options to reassign resources.
Regardless of how the actions are accomplished, there are policies that define objectives, policies that describe how to assess a resource, and policies that describe how to alleviate a constraint.
Data management policies usually contain the finest detail because they focus on application-level, or process-level, services. For example, a bank might specify that ATM transactions need to be serviced within x amount of time. However, x is the total time for the transaction's round trip: from the ATM across the network, through the server, querying the database table, updating the balance, and commanding the ATM to spit out cash. In reality, such transactions may span several servers and cross many networks before reaching the point where there is a disk I/O, so the budget left for I/O can become very small.
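The arithmetic behind the shrinking I/O budget is a subtraction from the end-to-end objective, divided among the I/Os each transaction performs. Every number below is hypothetical and chosen only to illustrate the calculation.

```python
# Hypothetical end-to-end objective and component estimates, in milliseconds.
TOTAL_OBJECTIVE_MS = 1500
OTHER_COMPONENTS_MS = {
    "atm_to_bank_network": 400,
    "inter_server_hops": 300,
    "application_server": 250,
    "database_processing": 200,
    "dispense_command_return": 300,
}


def io_budget_ms(total_ms, other_ms, ios_per_transaction=1):
    """Time left per disk I/O after every other component is accounted for."""
    remaining = total_ms - sum(other_ms.values())
    if remaining <= 0:
        raise ValueError("objective is exhausted before any disk I/O")
    return remaining / ios_per_transaction


print(io_budget_ms(TOTAL_OBJECTIVE_MS, OTHER_COMPONENTS_MS))                         # 50.0
print(io_budget_ms(TOTAL_OBJECTIVE_MS, OTHER_COMPONENTS_MS, ios_per_transaction=5))  # 10.0
```

Even with a generous 1.5-second objective, five I/Os per transaction would leave only 10 ms for each in this hypothetical breakdown.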
Service level objectives are the key to defining data management policies. These objectives rank the priorities of applications against one another so that during periods of constraint, decisions can be made about how to distribute resources. Service level objectives also set measurable values to quantify service, so that exceptions can be measured and evaluated. Also, there is the question of how to handle policy exceptions. These exceptions can become complicated because each application or set of applications has its own priority and may have dependent downstream applications with higher priorities.
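One way to handle downstream dependencies is to let an application inherit the most important priority among the applications that depend on it, directly or transitively, and then rank applications for resource distribution by that effective priority. The application names and priority numbers are hypothetical, and inheritance is only one possible rule.

```python
from typing import Dict, List, Optional, Set


def effective_priority(app: str, priority: Dict[str, int],
                       downstream: Dict[str, List[str]],
                       _seen: Optional[Set[str]] = None) -> int:
    """Lower number = more important. An application inherits the most
    important priority among its (transitive) downstream dependents."""
    seen = set() if _seen is None else _seen
    if app in seen:  # guard against dependency cycles
        return priority[app]
    seen.add(app)
    best = priority[app]
    for dep in downstream.get(app, []):
        best = min(best, effective_priority(dep, priority, downstream, seen))
    return best


def allocation_order(priority: Dict[str, int],
                     downstream: Dict[str, List[str]]) -> List[str]:
    # Rank by inherited priority, then own priority, then name for stability.
    return sorted(priority, key=lambda a: (effective_priority(a, priority, downstream),
                                           priority[a], a))


# Hypothetical: billing (priority 1) depends on the output of batch_extract (priority 5).
priority = {"batch_extract": 5, "billing": 1, "reporting": 4}
downstream = {"batch_extract": ["billing"]}
print(allocation_order(priority, downstream))  # ['billing', 'batch_extract', 'reporting']
```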
To further complicate matters, devices or other resources may have several "limiting capabilities." For example, a log file may need to be segregated to its own file system, logical unit, RAID rank, and perhaps even its own device, because its I/O can consume the entire performance capability of the device.
Achieving data management policies requires mapping them onto the set of device capabilities and onto the needs of "neighboring" applications that share those resources, so that the effect of deploying resources to provision an application can be evaluated.
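A first cut at this mapping checks each device's capability against the load its current neighbors already place on it. The `Device` model, the 80% utilization ceiling, and the `exclusive` flag (for workloads such as a busy log that must not share a device) are hypothetical simplifications.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Device:
    name: str
    max_iops: float
    tenants: Dict[str, float] = field(default_factory=dict)  # app -> IOPS it drives

    def load(self) -> float:
        return sum(self.tenants.values())


def candidate_devices(devices: List[Device], need_iops: float,
                      ceiling: float = 0.8, exclusive: bool = False) -> List[str]:
    """Devices that can absorb `need_iops` while staying under `ceiling` utilization.

    exclusive=True models an application that must not share a device.
    """
    result = []
    for d in devices:
        if exclusive and d.tenants:
            continue
        if d.load() + need_iops <= ceiling * d.max_iops:
            result.append(d.name)
    return result


devices = [
    Device("d1", 10000, {"oltp": 7000}),
    Device("d2", 10000, {"web": 2000}),
    Device("d3", 5000),
]
print(candidate_devices(devices, 1500))                  # ['d2', 'd3']
print(candidate_devices(devices, 1500, exclusive=True))  # ['d3']
```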
Solving the policy management gap
There are two groups focusing on methods of addressing policies and policy management in computing systems. The Distributed Management Task Force (DMTF) Policy Working Group is attacking the larger problem of defining service level objectives and implementing evaluation and enforcement mechanisms. The Storage Networking Industry Association (SNIA) Policy Working Group is focused on the storage implications of policies and how to apply policies to storage-related issues. For more information, visit www.snia.org.
Bob Rogers is the chief storage technologist at BMC Software (www.bmc.com) in Houston, TX.