I was just reviewing some blog postings and decided to read the "Best Paper" for USENIX FAST 2011. The paper, titled "A Study of Practical Deduplication," was written by William Bolosky of Microsoft Research and Dutch Meyer of the University of British Columbia. While focused on deduplication, the paper also revealed some interesting things about trends in our "data."
The interesting trends were discussed by Robin Harris at ZDNet. The trends he noticed were:
- The median file size is not changing
- The average file size is larger
- The average file system capacity has tripled
- The variety of file types is increasing
The first observation, that the median file size is not changing, is very interesting because we all know that files have gotten larger, particularly media files such as movies and music. However, for the median file size to stay the same, there must be a large number of small files (ouch!).
The second trend, that the average file size has increased, is also very interesting, especially in light of the first trend about the median file size. According to the paper, the average file size has increased to 318KB over the last 10 years.
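To see how the median can stay put while the average climbs, here is a minimal Python sketch. The file sizes are invented purely for illustration and are not taken from the paper; the names `sizes_2000`, `sizes_2010`, and `summarize` are hypothetical.

```python
from statistics import mean, median

# Hypothetical file sizes in KB, made up for illustration only.
sizes_2000 = [1, 2, 4, 4, 8, 16, 64]

# The same population ten years later: a few more small files,
# plus two very large media files (roughly 700MB and 1.4GB).
sizes_2010 = [1, 2, 2, 4, 4, 4, 8, 16, 64, 700_000, 1_400_000]


def summarize(sizes):
    """Return (median, mean) for a list of file sizes."""
    return median(sizes), mean(sizes)


median_2000, mean_2000 = summarize(sizes_2000)
median_2010, mean_2010 = summarize(sizes_2010)

print(f"2000: median={median_2000}KB mean={mean_2000:.1f}KB")
print(f"2010: median={median_2010}KB mean={mean_2010:.1f}KB")
```

In this toy data set the middle file is the same size in both years, so the median does not move, but the two big media files drag the average up by several orders of magnitude. That is the same shape the two trends describe: lots of small files holding the median down, and a handful of huge ones pulling the mean up.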
The third trend is not surprising because the capacity of drives has greatly increased over the last 10 years or so. According to the paper, in the year 2000, few Windows machines had more than 50GB of file system capacity. The average today is about 194GB. Measured against that 50GB mark, this is roughly a 4-fold increase in 10 years. However, I'm still surprised by the small average capacity of file systems given that we have been able to easily buy 1TB drives for several years.
The fourth trend is also interesting from a Windows perspective (not my area of expertise). The study found that the top 10 most popular file extensions account for less than 45% of the used file system capacity. In the year 2000, that number was over 50%. The study also mentioned that the most common file extension in Windows is no extension at all.
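If you want to check the same kind of number on your own machine, the following is a rough Python sketch, not the paper's methodology. It assumes "most popular" means the extensions that consume the most bytes (the study may rank them differently), and the function names `extension_of`, `top_extension_share`, and `scan` are hypothetical.

```python
import os
from collections import Counter


def extension_of(name):
    """Return the lowercased extension, or '' for files with none."""
    return os.path.splitext(name)[1].lower()


def top_extension_share(files, n=10):
    """Fraction of total bytes held by the n largest extensions.

    `files` is an iterable of (filename, size_in_bytes) pairs.
    Assumption: extensions are ranked by bytes consumed.
    """
    bytes_by_ext = Counter()
    for name, size in files:
        bytes_by_ext[extension_of(name)] += size
    total = sum(bytes_by_ext.values())
    if total == 0:
        return 0.0
    top = sum(size for _ext, size in bytes_by_ext.most_common(n))
    return top / total


def scan(root):
    """Yield (filename, size) for every readable file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                yield name, os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue  # skip files that vanish or cannot be read


# A tiny made-up sample; swap in scan("/some/path") for real data.
sample = [("a.dll", 50), ("b.DLL", 30), ("README", 15), ("c.txt", 5)]
print(f"top-1 share: {top_extension_share(sample, n=1):.0%}")
```

Files with no extension all land in the `''` bucket, so it is easy to see how much space they take up. A lower top-10 share means the bytes are spread across a wider variety of file types.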
Even if you're not an everyday Windows user, these trends are very interesting. Taken in aggregate, they point to more small files and more large files, without much in the middle. Couple that with the increased capacity of file systems, and file systems are now under great pressure to perform well for large files, which most already do, and also for lots of small files, which most file systems don't handle well.
In addition, these trends suggest that SSDs are not going to totally replace hard drives. If capacities are increasing rapidly, then SSDs are less likely to be the only storage media in systems (where will we store all of our stuff?). However, I also think they suggest that coupling SSDs with current file systems allows us to address the small file problem.