The Digital Preservation Disconnect

I help put together a meeting on digital preservation for a U.S. Government Agency every year. The goal of the meeting is to bring the storage industry together with archivists to discuss technology issues. One of the major areas of disconnect is how the two groups think about data loss. Technology people talk about data loss in terms of 9's, just as we talk about availability in terms of 9's. Preservationists talk about data loss as "we do not accept any data loss." If you lost a book, and that was the only copy, then the book is lost forever. Here is how I think about data loss.

Clearly, no one can afford even 15 9s of data integrity for large data volumes. Data loss, given that storage technology is not perfect and we are not paying for it to be perfect, is inevitable. The digital preservation community, and most importantly its management, must accept this or alternatively pay for the number of 9s that its members want. The problem is that the storage community does not have a consistent way of talking about data integrity; nor do the few vendors that provide it.
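To make the "9's" framing concrete, here is a back-of-the-envelope sketch. The model is my own illustration, not from the article: it treats "N nines of durability" as an annual per-object loss probability of 10^-N, with losses independent across objects. Real systems have correlated failures, so treat the numbers as order-of-magnitude only.

```python
def expected_annual_losses(nines: float, object_count: float) -> float:
    """Expected number of objects lost per year, assuming each object
    is independently lost with probability 10**-nines per year."""
    per_object_loss_probability = 10 ** -nines
    return object_count * per_object_loss_probability

# A hypothetical archive of one billion objects:
objects = 1e9
for nines in (9, 11, 15):
    losses = expected_annual_losses(nines, objects)
    print(f"{nines} nines -> {losses:.2e} expected losses/year")
```

Even under this simplified model, the gap is stark: at 9 nines a billion-object archive expects about one loss per year, while "no data loss at all" corresponds to infinitely many nines, which no one can buy.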

As we move to develop common nomenclature within the storage community to discuss these issues, we must also move toward a way to talk about these issues externally. No one has any idea what the bit-wise data integrity is for data moving from a host to storage and back, and given all of the hardware and software involved, it is really hard to test for end-to-end integrity.

As more and more data gets archived, and both sides hold different and unrealistic expectations for what will happen, I suspect we are all cruising for a bruising. The first vendor that loses lots of high-profile data will get all the bad press, but it is unlikely to be entirely their fault; it is everyone's.