Lately, I've been getting requests from various people for "data management plans." It all sounds really great -- some sort of plan that manages our data. Somehow this is all supposed to make our lives better. The problem is, I don't know what "data management" is (and I defy anyone to come up with a simple definition with no examples). So I started reading and thinking about what data management is and what it means.
I started by examining what data management means for the various stakeholders. What does it mean to the CIO or CTO? What does it mean to the IT director? (In my particular case, it's the director of HPC.) What does it mean to the admin(s)? What does it mean to the users? What does it mean to the funding agencies? In my opinion, data management means something different to each of these people, which is frustrating but understandable.
So the next step I took was to break down the problem, and here is where I think I made more progress. A wise person once told me that before you can manage something, you need to be able to measure it. The really cool corollary is that you need to have a process before you can automate it. In the case of data management, I took this to mean that we need to be able to monitor or measure "data." Once again, it sounds nebulous, but I think there is actually some meat there.
Basically, we need to think of our data as a living, breathing thing. It grows and evolves over time, and we need to have the tools that enable us to monitor it. This does not mean we need to be able to monitor the performance of our storage devices. While noteworthy, that's not really a data management function. Instead, this means something like the following:
- What is the average age of our data, and how is the average changing over time?
- What is the standard deviation of the age of our data, and how is that changing?
- What does the plot of "age" of our data versus the number of files look like? How is this changing over time?
- Which user has the most data?
- Which user has the oldest data?
- What is the oldest file?
- What is the largest file?
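Just to make things a little more interesting, POSIX gives us three timestamps per file: ctime (the last change to the file's metadata or contents, not its creation time), mtime (the last modification of the contents), and atime (the last access). So what do we mean by age?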
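To make those questions a bit more concrete, here is a minimal sketch of how they might be answered by walking a directory tree with Python's standard library. The function names (`scan`, `summarize`, `age_histogram`), the record layout, and the 30-day bin width are my own assumptions for illustration, not an established tool. Which timestamp defines "age" is left as a parameter, defaulting to mtime. Keep in mind that atime can be unreliable on filesystems mounted with options such as `noatime` or `relatime`.

```python
import os
import stat
import statistics
import time

SECONDS_PER_DAY = 86400.0


def scan(root, stamp="st_mtime", now=None):
    """Walk `root` and return one record per regular file.

    `stamp` selects which POSIX timestamp defines "age":
    "st_mtime", "st_atime", or "st_ctime" (an assumption of this sketch).
    """
    now = time.time() if now is None else now
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(st.st_mode):
                continue  # ignore symlinks, sockets, devices, ...
            records.append({
                "path": path,
                "uid": st.st_uid,  # numeric owner ID, not a user name
                "size": st.st_size,
                "age_days": (now - getattr(st, stamp)) / SECONDS_PER_DAY,
            })
    return records


def summarize(records):
    """Answer the list of questions above for one scan."""
    if not records:
        return {"file_count": 0}
    ages = [r["age_days"] for r in records]
    bytes_by_uid = {}
    oldest_by_uid = {}
    for r in records:
        uid, age = r["uid"], r["age_days"]
        bytes_by_uid[uid] = bytes_by_uid.get(uid, 0) + r["size"]
        oldest_by_uid[uid] = max(oldest_by_uid.get(uid, age), age)
    return {
        "file_count": len(records),
        "mean_age_days": statistics.fmean(ages),
        "stdev_age_days": statistics.pstdev(ages),  # population std. dev.
        "oldest_file": max(records, key=lambda r: r["age_days"])["path"],
        "largest_file": max(records, key=lambda r: r["size"])["path"],
        "uid_with_most_data": max(bytes_by_uid, key=bytes_by_uid.get),
        "uid_with_oldest_data": max(oldest_by_uid, key=oldest_by_uid.get),
    }


def age_histogram(records, bin_days=30):
    """Count files per age bin; bin k covers [k*bin_days, (k+1)*bin_days)."""
    counts = {}
    for r in records:
        k = int(r["age_days"] // bin_days)
        counts[k] = counts.get(k, 0) + 1
    return dict(sorted(counts.items()))
```

Calling `summarize(scan(root))` on a project directory gives a single snapshot. To see how the numbers change over time, you would save each snapshot with its date and compare them from one run to the next. A full walk of a large parallel filesystem can be slow and hard on the metadata servers, so treat this as a starting point rather than a production scanner.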
To start developing a data management plan, we first need to develop tools/processes/metrics to enable us to monitor our data. Let's get cracking!