The Future of Metadata

By Henry Newman

Needless to say, there are lots and lots of file formats: more than I know about, and likely more than anyone knows about. Each of these formats is very specific to the data type and function of the file. Additionally, there are self-describing formats such as HDF5 (Hierarchical Data Format), which is used a great deal in high-performance computing. HDF5 allows the user to define the format so that other users can use the data, and so that after twenty years the users themselves can actually figure out and remember what they were doing.

None of these formats, even those that carry their own metadata, is searchable by that metadata within a POSIX file system. There are, of course, applications that allow you to index the metadata of almost every file type. But these applications are not interchangeable, and each generally works only for the file types of a specific industry.
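
To make the point concrete, here is a minimal Python sketch of roughly what a POSIX-style stat call tells you about a file. The helper name posix_metadata and the sample file contents are my own illustration, not part of any standard or product.

```python
import os
import tempfile

def posix_metadata(path):
    """Return the per-file metadata that os.stat() exposes for a path."""
    st = os.stat(path)
    return {
        "size_bytes": st.st_size,
        "mode": st.st_mode,
        "owner_uid": st.st_uid,
        "group_gid": st.st_gid,
        "modified": st.st_mtime,
        "accessed": st.st_atime,
    }

# Hypothetical sample file: the descriptive metadata lives in the content.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as f:
    f.write(b"instrument=radar-7\nsite=Boulder\n")
    sample_path = f.name

meta = posix_metadata(sample_path)
os.remove(sample_path)

# The file system reports size, ownership, permissions, and timestamps,
# but nothing about the instrument or site recorded inside the file.
print(sorted(meta))
```

Size, ownership, permissions, and timestamps are all there, but anything describing what the data actually is stays locked inside the file, where only a format-aware application can reach it.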

What if an object store could ingest every file type because it had a framework that allowed the file-specific metadata of all known file types to be indexed? That would be a pretty killer application, and it would likely drive out of business a number of software vendors that work in specific application fields with specific file types. I am sure that this is not far from reality, and it gives object storage a very, very significant advantage over POSIX file systems.
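
What might such a framework look like? The following is only a toy sketch under my own assumptions: an in-memory index with one pluggable extractor per file type. The MetadataIndex class, the extract_key_value function, and the sample object names are all hypothetical and do not describe any existing object store.

```python
from collections import defaultdict

class MetadataIndex:
    """Toy in-memory index mapping (key, value) pairs to object names."""

    def __init__(self):
        self._extractors = {}           # file extension -> extractor function
        self._index = defaultdict(set)  # (key, value) -> set of object names

    def register(self, extension, extractor):
        """Plug in a metadata extractor for one file type."""
        self._extractors[extension] = extractor

    def ingest(self, name, data):
        """Index an object's metadata if its file type has an extractor."""
        extension = name.rsplit(".", 1)[-1].lower()
        extractor = self._extractors.get(extension)
        if extractor is None:
            return  # unknown type: nothing to index in this sketch
        for key, value in extractor(data).items():
            self._index[(key, value)].add(name)

    def search(self, key, value):
        """Return the names of all objects whose metadata has key == value."""
        return sorted(self._index.get((key, value), set()))


def extract_key_value(data):
    """Hypothetical extractor for text files made of 'key=value' lines."""
    fields = {}
    for line in data.decode("utf-8").splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields


index = MetadataIndex()
index.register("txt", extract_key_value)
index.ingest("run1.txt", b"site=Boulder\ninstrument=radar-7\n")
index.ingest("run2.txt", b"site=Boulder\ninstrument=lidar-2\n")
index.ingest("photo.raw", b"\x00\x01")  # no extractor registered for .raw
hits = index.search("site", "Boulder")
print(hits)
```

The hard part, of course, is not the index. It is writing and maintaining an extractor for every known file type, which is exactly the work the industry-specific applications do today.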

What About Unstructured Data?

A great number of people are discussing and writing about unstructured data and how it can be made searchable. We have open source applications such as Hadoop, as well as commercial competitors and implementations from a variety of vendors. Groups like OASIS are trying to put some structure around unstructured data. When you do this, you change the information from unstructured to structured.

The problem is getting every stakeholder to agree to a common structure for a data type. There is a great deal of work being done in this area for all kinds of new data types and interchange formats. The unstructured data of yesteryear might become the structured data of the future, once file formats are agreed upon and someone makes the old unstructured data useful as new structured data. The old data will likely have to be read in, the structured data areas created and populated, and the result written out in the new structured format. The same method is used to update old structured formats to new structured formats.
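
The read-in, populate, write-out cycle described above can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the free-form note layout, the regular expression, and the field names date, station, and temp_c are invented for the example, and real data would need its own agreed-upon rules.

```python
import json
import re

# Assumed layout of the old free-form notes (hypothetical).
PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<station>[A-Z]{4})\s+"
    r"temp\s+(?P<temp_c>-?\d+(?:\.\d+)?)C"
)

def to_structured(lines):
    """Read free-form lines, populate structured fields, set aside the rest."""
    records, rejected = [], []
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            rejected.append(line)  # no agreed structure fits this line
            continue
        record = match.groupdict()
        record["temp_c"] = float(record["temp_c"])
        records.append(record)
    return records, rejected

old_notes = [
    "2014-03-20 KDEN temp 12.5C, light wind",
    "operator went to lunch",
    "2014-03-21 KBOS temp -3C",
]
records, rejected = to_structured(old_notes)

# Write the populated records out in the new structured format (JSON here).
structured_output = json.dumps(records, indent=2)
print(structured_output)
```

Note that one of the three lines cannot be converted at all. Deciding what to do with such leftovers is part of why this work is slow.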

None of this is fast and none of this is easy, but it is happening all of the time.

The Future

Object storage with REST interfaces has very significant advantages in the area of metadata over the meager amount of information that is available as part of the POSIX file system interface. You could argue that thirty years ago, when the POSIX interface was first being discussed, this type of metadata was not important. Given the availability and cost of bandwidth and storage space at the time, storing and accessing this metadata was far too costly to address. This argument is, in my opinion, correct.

The problem is that POSIX did not change, and the world did. The way I see it, in the long term, POSIX file systems and the limited information they make available are going to be a thing of the past.

You might ask me how long it will take. My guess would be about ten years, for the following reasons:

  1. Though REST frameworks and object storage are coming on strong, there still isn't broad industry adoption, and scalability and interoperability issues have not been fully worked out.
  2. Some of the standards need to be fleshed out, especially for access control and other security functionality.
  3. Some POSIX file systems scale to thousands of clients and over 1 terabyte per second. There are no object storage systems that I am aware of that can do both of those things today. This combination is required by a number of high-end application domains like weather forecasting.
  4. As far as I am aware, REST interfaces do not allow an application to read a file or object while it is still being transferred. This is important for applications that need to process the data as soon as possible; security cameras with image recognition come to mind. This will require a restructuring of many applications.
  5. There are lots and lots of old applications whose I/O sections will have to be rewritten to support the new interfaces.

There are many good reasons that REST will likely dominate in the next decade, but the biggest is that you cannot just dig your heels in and say, "I am not going to do X, Y, or Z," when the world around you is changing. That might work for a while, but in the long term it always ends badly, in so many different areas.


This article was originally published on March 20, 2014
