My previous blog entry talked about the need for new types of people who can understand both algorithms and data layout to optimize the generation of actionable information for business and industry. The question I have is: Are schools teaching the current and next generation the right kinds of information so they can develop these skills for the job market?
The data analysis skills, with MapReduce, graph analysis, statistics and the like, are difficult enough. Add in the data layout for the information to be processed, which requires an understanding of the application reading or writing the data, the operating system, the kernel, drivers, RAID controllers and the file system, and you have a pretty complex ecosystem, one in which few people fully understand the relationships or how to evaluate them. Something is going to have to give to make it easier.
The move to file system appliances across a broad spectrum of the market, from my home PC with an external iSCSI 4-disk RAID to large parallel file systems moving into the appliance market, gives me hope that people are working on solving this problem. These file system scaling and data layout problems are not going to be solved overnight. Most file systems are still stuck looking at files and have little understanding of the layout those files must have to be processed into information. I think it will take a very special person to understand all that is needed in the current environment.
In the meantime, the best thing organizations can do is build teams with the right domain expertise. Pairing a new data analysis graduate with someone who knows the data path is likely the best anyone can do. Of course, there will be a handful of people who understand both. I hope I can hire one.
Labels: IT professionals, Data Analysis
posted by: Henry Newman
I was on a panel at the IEEE Mass Storage conference last week, and the main topic was how to converge HPC applications and cloud applications. You might ask, what do HPC applications and cloud applications have to do with each other? The panelists all believe that they intersect in the area of data analysis.
No, we are not analyzing the same data, but big data analysis has been a common theme in HPC for a long time. Whether it has been analyzing weather forecast predictions or running car crash simulations and comparing them to crashing real cars, HPC has always had a big data analysis requirement. As people consider consolidating data to public or private clouds, or even locally turning data into actionable information, they are facing what is, in my opinion, the next big challenge in the IT world.
As all of you know, computational performance gains have far outpaced storage performance gains, even if you add NAND flash into the equation. If you read my last blog entry, you realize that the cost of NAND will not drop to the point where we can put all of our data on NAND. This means there is a need for a job description I would call Data Analysis Architect. This person would understand all of the data analysis techniques, from MapReduce to graph analysis to statistical methods, and would have a good understanding of how to lay out the data so it could be processed to create actionable information for the organization. It might be something like a 15-day weather forecast for a commodities trader or farm insurance broker, or what combination of things works best to bring people into a retail or big box store. Who knows what information can be found in the various nuggets of data if only it can be processed fast enough.
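For readers less familiar with the MapReduce technique mentioned above, here is a minimal sketch of the idea in plain Python, using word counting as the classic example. The function names (`map_phase`, `shuffle`, `reduce_phase`) are my own illustrative choices, not part of any framework; a real system such as Hadoop distributes these phases across many machines.

```python
from collections import defaultdict

def map_phase(record):
    # The "map" step: emit (key, value) pairs for each input record
    for word in record.lower().split():
        yield (word, 1)

def shuffle(mapped_pairs):
    # The shuffle/sort step a framework performs: group values by key
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # The "reduce" step: combine each key's values, here by summing counts
    return {key: sum(values) for key, values in groups.items()}

records = ["big data analysis", "data layout for big data"]
pairs = [pair for record in records for pair in map_phase(record)]
counts = reduce_phase(shuffle(pairs))
# counts["data"] == 3, counts["big"] == 2
```

The value of the model is that the map and reduce steps are independent per key, so the framework can parallelize them over the data layout, which is exactly where the data placement expertise discussed above comes in.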
Labels: data storage, analytics, HPC, Data Analytics
posted by: Henry Newman
I am just returning from the IEEE Mass Storage conference and found it to be very interesting. All of the presentations can be found online. Of all of the presentations, the one that I found most interesting was one about the limits of future density for storage technologies.
Dr. Fontana discusses the limits in areal density for various storage technologies, including NAND, HDD and tape, along with the challenges of manufacturing some of these technologies. For example, page 22 offers an excellent discussion of some of the manufacturing challenges hard drive manufacturers will face with bit-patterned media (BPM) and heat-assisted magnetic recording (HAMR). The presentation also covers some of the lithography limits for NAND and the density challenges NAND will face. Dr. Fontana has written on this topic before and is well known in this area.
So you might ask how and why this impacts you and your organization. Understanding planned density increases impacts budgets and often determines the balance between the various tiers of storage. There are many claims from vendors about the densities that will be available in the future; based on this presentation, some of them are right and some are just wrong. It is also important to remember that increases in areal density often do not translate directly into increases in density in our devices. Sometimes there are cost impacts, and sometimes reliability issues require more ECC, so some of the density increase is spent ensuring data integrity.
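The ECC point above is easy to quantify. Here is a back-of-the-envelope sketch with hypothetical numbers of my own choosing (the presentation does not give these figures): if areal density doubles but the fraction of raw bits spent on ECC grows, the user-visible capacity gain is less than 2x.

```python
def effective_capacity_gain(areal_gain, old_ecc_frac, new_ecc_frac):
    # User-visible capacity gain when part of an areal-density increase
    # is spent on stronger ECC. The ecc_frac arguments are the share of
    # raw bits consumed by ECC before and after.
    old_user_frac = 1.0 - old_ecc_frac
    new_user_frac = 1.0 - new_ecc_frac
    return areal_gain * new_user_frac / old_user_frac

# Hypothetical: density doubles, but ECC grows from 10% to 15% of raw bits
gain = effective_capacity_gain(2.0, 0.10, 0.15)
# gain is roughly 1.89, not the full 2.0
```

Again, the 10% and 15% figures are illustrative assumptions, not data from the presentation; the point is only that the headline areal-density number overstates what ends up in the device.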
I think we all believe this to be a good thing. I think everyone should read this presentation, and I encourage you to consider attending this interesting technically oriented conference next year.
Labels: IEEE, storage technologies, NAND, HDD
posted by: Henry Newman
Enterprise SATA/SAS 4 TB drives are here, in case you had not seen. I am sure that RAID vendors will soon be qualifying the drives, and we will see them in storage controllers in a few months or less; some vendors might have already started qualification. My concern is that RAID-6 is not robust enough to deal with this density, given the long rebuild times and the potential for a triple failure. I wrote about this about 2.5 years ago, and I have seen little movement in the industry to support declustered RAID. I would not purchase 4 TB drives from any vendor unless that vendor supported some type of declustering algorithm, as rebuild times with 4 TB drives will increase the likelihood of a triple failure with RAID-6. Do not even consider RAID-5. Of course, you could go to RAID-6 4+2 to reduce the impact, but you are now using 12 drives vs. 10 drives for the same usable capacity. It kind of defeats the purpose, doesn't it?
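The 12-vs.-10 drive comparison above works out like this: one 8+2 RAID-6 group holds 8 drives of data on 10 drives, while covering the same 8 data drives with two 4+2 groups takes 12. A couple of lines of Python make the efficiency cost plain:

```python
def usable_fraction(data_drives, parity_drives):
    # Fraction of raw capacity left for data in one RAID group
    return data_drives / (data_drives + parity_drives)

# One wide RAID-6 group: 8 data + 2 parity = 10 drives
wide = usable_fraction(8, 2)          # 0.8

# Two narrow 4+2 groups covering the same 8 data drives = 12 drives
narrow = 8 / 12                       # about 0.67
```

So the narrower groups shrink each rebuild domain, but you pay roughly 13 points of usable capacity for it, which is the trade-off the entry is complaining about.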
If I were purchasing storage, I would either demand that the vendor provide declustered RAID support or say that I will buy its product with only 2 TB drives. Those vendors that support declustered RAID will have a major price advantage. The Hitachi data was not clear on the performance -- it said 171 MB/sec sustained rate. I am not sure whether that MB is 1024*1024 or 1000*1000, or whether the 171 is sustained for the whole drive or just a portion of it. Anyway, using 1024-based units and 171 MB/sec for the whole drive (just a guess), it takes more than 6 hours to read the whole drive. That, of course, assumes you are doing nothing else. We as customers need to demand declustered RAID with 4 TB drives.
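The arithmetic above can be checked in a few lines of Python. This assumes, as the entry does, that the 171 MB/sec rate holds across the whole drive and that nothing else is touching it, so it is a best case for a full-drive read, and a real RAID rebuild would take longer:

```python
def hours_to_read(capacity_tb, rate_mb_s, binary_units):
    # Best-case time to stream a whole drive at a sustained rate.
    # binary_units=True treats TB/MB as 1024-based, else 1000-based.
    base = 1024 if binary_units else 1000
    capacity_mb = capacity_tb * base * base   # TB -> MB
    return capacity_mb / rate_mb_s / 3600.0

binary = hours_to_read(4, 171, True)     # about 6.8 hours
decimal = hours_to_read(4, 171, False)   # about 6.5 hours
```

Either way the unit question is resolved, a single full read of a 4 TB drive at 171 MB/sec is over 6 hours, which is why long rebuild windows and the triple-failure risk dominate the discussion.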
Labels: disk drives, RAID, SAS, SATA
posted by: Henry Newman