As more and more data gets created, I think we are going to see more people looking at archiving data. I am at SC11, and the supercomputing field has always been at the forefront of archival software; it has driven the archival community. Of course, today many believe that Google, Amazon and even Mozy and Carbonite can support large archives, and to some degree they can, but the issue is getting the data in and out. Some of the archives I am aware of today are over 20 PB. Think about that: take an OC-192 channel at around 10 Gbit/s. For rounding’s sake, consider:

20*1024*1024*1024*1024*1024 / (10/8*1024*1024*1024) = 16,777,216 seconds, or about 194 days. Now, first of all, no one gets 100 percent of channel usage, and the values I used are higher than an actual OC-192 channel, but you get the picture. Today, just about all the archives that I am aware of in high performance computing are local to the site or organization. Even with high-speed networking, having users move large files around is not tractable unless there is some way to know a priori when a file will be needed and a way of scheduling the transfer.
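To make the arithmetic concrete, here is a minimal sketch of that back-of-the-envelope calculation. The archive size and link rate are just the rounded figures used above (binary prefixes, a nominal 10 Gbit/s), not measurements from any particular site or the exact OC-192 payload rate.

```python
# Rough transfer time for a 20 PB archive over a ~10 Gbit/s link.
# Assumes 100 percent channel utilization, which no real link achieves.

ARCHIVE_BYTES = 20 * 1024**5               # 20 PB using binary prefixes
LINK_BYTES_PER_SEC = (10 / 8) * 1024**3    # 10 Gbit/s expressed as bytes per second

seconds = ARCHIVE_BYTES / LINK_BYTES_PER_SEC
days = seconds / 86_400

print(f"{seconds:,.0f} seconds ({days:.0f} days)")
# -> 16,777,216 seconds (194 days)
```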

Of course, there are tools that can provide scheduled transfers, but users still need to schedule the transfer and therefore need to know in advance that they will need the data. In some cases this will work; in others it will not. File sizes are growing faster than network performance from what I can see, and that means if you want data quickly from an archive, the archive had better not be located over the WAN. Small-file access over the WAN will work for some applications, given their file sizes, but of course not for everything. I do not see network performance increasing at the rate of data growth. It never has and likely never will.
