Some Final Thoughts on Data Integrity

By Henry Newman

This is a call to the industry to make changes to standards and to assume that files are not going to be 100 percent reliable 100 percent of the time. I think it is time to get the bodies that control the standards for various file formats to provide frameworks for easier file recovery.

I have said this before, but it bears repeating: we need a per-file method for addressing data corruption. The bodies that maintain file format standards (such as TIFF, JPEG and other media formats) and the vendors that control proprietary formats (such as Adobe with PDF) need to think about things such as multiple file headers, checksums and error correction codes (ECC).

When a file is created, the creator should determine the level of error correction, since only the creator knows how important the file is. The importance of a file can, of course, change over time. At that point, the file should be re-read and re-written at the new level of importance, and hopefully the file will still be intact. We cannot expect the hardware industry to ensure that all files are completely protected, as the cost is far too high. The consumer is driving the storage market, and consumer markets do not pay for anything they cannot cost justify; high data integrity might float my boat, but it will not drive the market.

Many applications that edit JPEG files let you choose a compression level when you save, such as small, medium or large. I believe error correction should have a similar interface. Small might be just a checksum that tells you something has changed, while large might write enough ECC that the file is essentially mirrored. Since I wrote about this about a year ago, nothing has happened, and no one is talking about it, at least as far as I know.
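To make the idea concrete, here is a minimal sketch of what such tiered protection could look like. Everything here is hypothetical, as no such standard exists: the container layout, the level names and the functions are my own illustration. "Small" stores only a SHA-256 checksum for detection; "large" also stores a full duplicate of the payload, the degenerate case of ECC strong enough to rebuild the whole file.

```python
import hashlib

def protect(data: bytes, level: str) -> dict:
    """Wrap file data with integrity metadata at the chosen level.

    Hypothetical container: "small" can only detect corruption,
    "large" mirrors the payload so corruption can be repaired.
    """
    record = {"data": data, "checksum": hashlib.sha256(data).hexdigest()}
    if level == "large":
        # Full-redundancy case: keep a second copy of the payload,
        # analogous to writing ECC strong enough to rebuild the file.
        record["mirror"] = bytes(data)
    return record

def verify_and_recover(record: dict) -> bytes:
    """Return intact data, repairing from the mirror if one exists."""
    data = record["data"]
    if hashlib.sha256(data).hexdigest() == record["checksum"]:
        return data
    mirror = record.get("mirror")
    if mirror is not None and hashlib.sha256(mirror).hexdigest() == record["checksum"]:
        return mirror
    raise ValueError("corruption detected and unrecoverable at this level")
```

A real standard would put the checksum and recovery data inside the file format itself (alongside those multiple headers), and a "medium" level would use a proper erasure code such as Reed-Solomon rather than a full mirror; the point of the sketch is only that the protection level is a per-file choice made at save time.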

This article was originally published on October 26, 2011