Henry Newman's Storage Blog Archives for September 2011

(Another) POSIX Rant

I regularly rant about the standard POSIX calls (open, read, write) and their C library equivalents (fopen, fread, fwrite), all to no avail. The OpenGroup has no plans to change the work it did oh so many years ago. If you just graduated college this year, much of the work was done before you were born, and there have been no changes since then. Can you name anything else in computing that has lasted that long and not changed?

The fact that the POSIX standards are limiting standardization of so many aspects of I/O, such as storage security, archival management, and atomic and non-atomic operations, irks me. It reminds me of that credit card commercial where the person asks to book a ticket, and David Spade says No, No, No.

The question I ask is, why no? Don't the people who run the group realize these issues are critical to the user community? Or maybe they do, and the vendors do not want standardized solutions, because that way they can sell products and services. I know this is a pretty cynical way of looking at the world. I believe sooner or later the Linux community will address this area: someone with hundreds of thousands of pictures or videos will develop a solution and provide an open source interface, and then the OpenGroup will be playing catch-up.

It continues to boggle my mind that there have been no changes to support the above areas and information lifecycle management. A simple interface and an addition to POSIX extended attributes is all that is needed. All we have to do is get everyone to agree on what each attribute means and on a common interface method, and then change tools like cp and ftp to access the attributes and move them from file system to file system.
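As a rough illustration of what an attribute-aware copy tool would do, here is a minimal Python sketch. It assumes a Linux system, where the standard library exposes os.listxattr, os.getxattr and os.setxattr; the function name copy_with_xattrs and the attribute name in the usage comment are hypothetical, not part of any standard.

```python
import os
import shutil


def copy_with_xattrs(src, dst):
    """Copy file contents, then carry over every extended attribute we can.

    Returns the list of attribute names that were copied.
    """
    shutil.copyfile(src, dst)
    copied = []
    if not hasattr(os, "listxattr"):
        # The os.*xattr functions are only available on Linux.
        return copied
    try:
        names = os.listxattr(src)
    except OSError:
        # Source file system does not support extended attributes.
        return copied
    for name in names:
        try:
            os.setxattr(dst, name, os.getxattr(src, name))
            copied.append(name)
        except OSError:
            # Destination file system or permissions rejected this attribute.
            pass
    return copied


# Hypothetical usage, with a made-up attribute name:
#   os.setxattr("photo.raw", "user.archive.retention_days", b"3650")
#   copy_with_xattrs("photo.raw", "/mnt/archive/photo.raw")
```

The copying itself is the easy part. The hard part is the one described above: nothing in this sketch says what an attribute such as a retention period should mean, and attributes are silently dropped whenever the destination cannot store them.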

Labels: Linux, standards, POSIX, Storage

posted by: Henry Newman

Storage Features: Where Are They?

I know the economic situation has been poor for three-plus years, but a host of technologies are needed to address critical issues, and these technologies were all promised before the downturn. The two critical ones in my opinion are ANSI T10 Data Integrity Field AKA Protection Information (DIF/PI) and declustered RAID. The T10 DIF passes a checksum from the HBA to the disk, which is checked at the disk and back at the HBA on the return. It works only with SAS or Fibre Channel drives and cannot work with SATA drives. With the addition and validation of this checksum, the potential for silent data corruption goes way down. My understanding is that detection of silent data corruption improves by about seven orders of magnitude.
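To make the checksum idea concrete, here is a minimal Python sketch of the per-sector check. It assumes the commonly described protection information layout: 8 bytes per 512-byte sector, made up of a 2-byte CRC-16 guard tag (polynomial 0x8BB7), a 2-byte application tag and a 4-byte reference tag that carries the low 32 bits of the block address. The function names are hypothetical, and real HBAs and drives do this in hardware.

```python
import struct

T10_DIF_POLY = 0x8BB7  # generator polynomial of the CRC-16 guard tag


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16 with the T10 DIF polynomial (initial value 0, no reflection)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ T10_DIF_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def make_pi(sector: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Build the 8-byte tuple: guard tag, application tag, reference tag."""
    return struct.pack(">HHI", crc16_t10dif(sector), app_tag, lba & 0xFFFFFFFF)


def verify_pi(sector: bytes, pi: bytes, lba: int) -> bool:
    """Check performed at each end: the data matches the guard, the address matches the reference."""
    guard, _app_tag, ref = struct.unpack(">HHI", pi)
    return guard == crc16_t10dif(sector) and ref == (lba & 0xFFFFFFFF)


sector = bytes(range(256)) * 2           # one 512-byte sector of sample data
pi = make_pi(sector, lba=1234)           # computed once at the HBA
damaged = bytes([sector[0] ^ 0x01]) + sector[1:]

print(verify_pi(sector, pi, lba=1234))   # clean sector at the right address
print(verify_pi(damaged, pi, lba=1234))  # a flipped bit fails the guard check
print(verify_pi(sector, pi, lba=1235))   # a misdirected block fails the reference check
```

The reference tag is what catches data written to or read from the wrong block, a failure that a checksum over the data alone would miss.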

Declustered RAID is in the same boat. I wrote about this over two years ago in "RAID's Days May Be Numbered". In that time, not much has changed except disk drives have gotten bigger, and we are closer to hitting the wall and going splat. With 4 TB drives announced and nearing the market, I think vendors that do not have declustered RAID are about to get a rude awakening when data loss becomes commonplace.

Both of these features have been promised by many vendors time and time again, with little to no movement forward. I totally get the economic situation, but just making new, denser, faster products does not help underlying reliability and data integrity issues. I am hearing more and more cases of data loss with failures of RAID-6 LUNs. The solution is well known and has been thoroughly discussed. It is just waiting for the vendors to implement it. The question is when? We must require these technologies, or at least an upgrade path to them, when we write specifications.

Labels: RAID, fibre channel, storage technologies

posted by: Henry Newman

Over 1M IOPS and 12 Gb SAS

I do not know if you saw this press release from LSI. I must say it is an impressive number. This performance is for SSDs, and as I have said in the past, this kind of performance is likely going to challenge traditional external RAID storage, given the price/performance, especially for parts of the midrange market.

But what I have on my mind is that if SSD vendors are depending on PCIe 3 and beyond for performance, the limitation is not SSD performance but PCIe and OS performance. I am not sure how LSI got the 1M IOPS, or whether it was a legitimate real-world test or some SBT (slimy benchmark test--as a reformed benchmarker, I understand the desire and methods for SBTs as I did them myself). My question is, how can an operating system do 1.2 million IOPS efficiently and do any other work? OS interrupts are costly in terms of time, and only so many can be done. Now this is nothing against LSI. I am DARN impressed that their chip can deal with the throughput, but what I would like to know is how this translates to real-world performance on Linux or Windows with real applications like a database.

What I am not sure about is, can operating systems scale to match these impressive hardware numbers? Having 1.2 million IOPS, say 4K random, equals only about 4.6 GB/sec of bandwidth, which is still pretty high utilization. Let's say each OS interrupt takes 15,000 clocks, which I think is very low. That equals 18 billion clocks of interrupt time every second for the 1.2 million IOPS, or six 3 GHz cores doing nothing but handling interrupts. Well, you see the picture. Something must change.
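The arithmetic above can be checked with a few lines of Python. The IOPS, block size and clocks-per-interrupt figures come from the text; the 3 GHz clock rate and the one-interrupt-per-I/O model are assumptions added here for illustration.

```python
IOPS = 1_200_000               # from the text
BLOCK_BYTES = 4096             # "4K random"
CLOCKS_PER_INTERRUPT = 15_000  # the deliberately low estimate from the text
CPU_HZ = 3_000_000_000         # assumption: a 3 GHz core


def bandwidth_gib_per_sec(iops, block_bytes):
    """Data rate implied by the I/O rate, in GiB (2**30 bytes) per second."""
    return iops * block_bytes / 2**30


def interrupt_clocks_per_sec(iops, clocks_per_interrupt):
    """Clocks spent in interrupt handling each second, assuming one interrupt per I/O."""
    return iops * clocks_per_interrupt


def cores_consumed(iops, clocks_per_interrupt, cpu_hz):
    """How many cores would do nothing but service interrupts."""
    return interrupt_clocks_per_sec(iops, clocks_per_interrupt) / cpu_hz


print(f"bandwidth: {bandwidth_gib_per_sec(IOPS, BLOCK_BYTES):.2f} GiB/sec")
print(f"interrupt clocks per second: {interrupt_clocks_per_sec(IOPS, CLOCKS_PER_INTERRUPT):,}")
print(f"cores consumed at 3 GHz: {cores_consumed(IOPS, CLOCKS_PER_INTERRUPT, CPU_HZ):.1f}")
```

Interrupt coalescing or polling would change the one-interrupt-per-I/O assumption, which is one way the "something must change" could play out.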

Labels: SSD, PCIe, PCIe SSDs, Storage

posted by: Henry Newman

Follow-up: PCIe 4.0, Not a Good Plan

I want to do a bit more ranting about PCIe 4.0 plans. Having only a doubling of performance between now and 2015 will stifle innovation. Maybe that is the plan--have everyone think you are happily moving down the PCIe yellow brick road and then do something different. Something must be done to improve communications between CPUs. It has become abundantly clear that building memory systems with more than 4 sockets gets very expensive, given the engineering design costs.

However, there are many problems that require many CPUs. Things like search and index have used Hadoop and connected lots of cheap CPUs. But many complex scientific and engineering problems cannot be decomposed in such a way that communication is not a barrier to getting the work done. You are not going to run a complex earthquake, atmospheric or other simulation today without high-speed communications. Doubling performance with IB by 2015 (my bet is at least 2016) is not going to do it for advancing science, unless of course everyone makes some breakthroughs in the area of algorithms that do not require communications. I am thinking that some vendor has something up its sleeve, and presto--we will have high-speed communications outside of the PCIe framework. While search engines do not depend on this, scientific advance does.

I am hard pressed to believe that some of the major vendors have not realized that and are holding back to see if they can gain some market advantage. Since one of the Chinese supercomputers had its own interconnect not requiring PCIe, it is clear to me that someone will get it. I cannot believe that the major CPU vendors do not. The new IBM P775 and current Cray XE6 systems have addressed this problem.

Labels: Hadoop, PCIe, Storage

posted by: Henry Newman

PCIe 4.0, Not a Good Plan

I just saw that the PCI-SIG (PCI Special Interest Group) announced PCIe 4.0 will double PCIe 3.0 performance and--sit down for the next part--it will arrive in 2015 or 2016. Given how late PCIe 3.0 was, I would say that 2016 would be optimistic at best. So from 2004 with PCIe 1.0 to 2016, we will go from 250 MB/sec per lane to 2 GB/sec per lane or a factor of 8x in 12 years. Big deal.

Moore's Law is certainly not in play for PCIe. During that time, everything else has increased far more than 8x. In 2004, we had LTO-2 at 35 MB/sec; today, we have tape drives well over 240 MB/sec. Surely, we will have another generation that is faster still. We have faster memory, faster CPUs and faster communication, with 10 GbE here and 40 GbE coming (100 GbE depends on PCIe 3.0). The only thing in the stack that is not that much faster is spinning disk, but we of course have flash SSDs that are far more than 8x faster.
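One way to compare these numbers fairly is to convert each to a compound annual growth rate. The Python sketch below does that for the PCIe and tape figures quoted above. It assumes "today" means 2011 (the date of this post) and, for reference, a two-year doubling period for Moore's Law; both are assumptions added here.

```python
import math


def annual_growth(start, end, years):
    """Compound annual growth factor that turns start into end over the given years."""
    return (end / start) ** (1 / years)


def doubling_years(growth_factor):
    """Years needed to double at a given annual growth factor."""
    return math.log(2) / math.log(growth_factor)


pcie = annual_growth(250, 2000, 2016 - 2004)  # MB/sec per lane, PCIe 1.0 to 4.0
tape = annual_growth(35, 240, 2011 - 2004)    # MB/sec, LTO-2 to current drives
moore = 2 ** (1 / 2)                          # assumption: doubling every two years

for name, rate in (("PCIe", pcie), ("tape", tape), ("Moore's Law", moore)):
    print(f"{name}: {100 * (rate - 1):.0f}% per year, "
          f"doubling every {doubling_years(rate):.1f} years, "
          f"{rate ** 12:.0f}x over 12 years")
```

Seen this way, tape's roughly 7x came in seven years rather than twelve, so its annual rate is well ahead of PCIe's even though the raw multiple looks similar.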

The PCI-SIG is basically saying, "wow, we are doing just a great job getting you 2x performance by 2015 or 2016." I see this as a major problem. PCIe is required for all communications, and it just is not fast enough as things scale out, whether for storage or for network communication. It is time for the PCI-SIG to realize we need something much faster. 2x every four years is not fast enough, and we are already pretty far behind, since PCIe 3.0 is so late that it is not even here yet.

Labels: PCIe, Moore's Law, Storage

posted by: Henry Newman

Future Development

Since the 1940s, the U.S.A. has been dominant in the development of most of the basic computational hardware technologies used around the world. Examples include CPUs, memory, disk drives and tape channels. We all know that much of the manufacturing no longer happens in our country, and most major companies, such as Intel, AMD, Micron and Seagate, manufacture their products outside the U.S.A.

My question is, will the U.S.A. continue to be the major center for ideas that are used in computers? Will the U.S.A. be the home for, say, the development, not the manufacturing, of CPUs from 2020-2100? I honestly think this question has something to do with politics. Sadly, I am going to step in a mess here.

We must have a great investment in education and allow foreign PhDs to stay and have more U.S. citizens getting PhDs. We need more basic research, as companies rarely fund basic research. Do you think we would have gotten the Internet without basic research and U.S. Government support? No, we would have gotten many different networks that did not communicate well with each other. Just look back to the 1970s and see what IBM, DEC, CDC, HP and others were doing.

We do not need to look far to find countries that realize that basic research is not funded by companies, and that the tools for that research are not just hardware but people and the educational system. The days when we could take for granted that the U.S.A. will continue to lead are over. We must look at why we lead, and in my opinion, having the best educational system, having people who want to use it, and investing in basic research will make or break our future.


Labels: manufacturing, H1-B, technology trends

posted by: Henry Newman

Tape, Here We Go Again

I was recently told that both tape and hard drives are dead. I thought I had dispelled that notion in "The Evolution of Stupidity: Research (Don't Repeat) the Storage Past", but I guess some out there have not read this article.

I recently was sent a detailed technology study of disk, tape and NAND flash futures from someone at IBM. I found it very interesting reading; not because it proves my point, but because the research is pretty difficult to refute. I am sure there were papers of a similar nature written back in the early 1990s, but I cannot seem to find any of them.

Unless there is some major technology breakthrough, I do not see how storage hierarchies of tape, disk and flash change. Major breakthroughs generally cost huge amounts of money, both to develop the technology and to manufacture it, and money for both is lacking in the current worldwide economy. Besides money, major breakthroughs require that a company take a risk, and most companies today are very risk averse, to say the least.

Technology development requires three things:

  1. A company that will take risks
  2. A company that has the capital to bring the technology developed to market
  3. Smart people who can develop the technology and bring it to market

A number of technologies developed cannot be manufactured. Without all three things coming together at the right time, the fourth thing--customers buying the product--never comes about. The technology cannot be so radical and so costly that it is not affordable or is just too disruptive to the customer.

Labels: tape storage, storage technologies, disk storage, NAND Flash, Storage

posted by: Henry Newman

Does Block Storage Have a Future?

I have been thinking about the state of file systems and big block storage (aka midrange and large RAID arrays), and I am wondering if block storage has a future. My issue is that block storage depends on file systems for the most part, except that some databases can manage their own storage. To be sure, many large storage arrays are broken up into much smaller LUNs, and file systems are made using these smaller LUNs. But what if the file systems are not scaling?

How old, for example, is NTFS? Any guesses? It started in 1993. Do you think we have had some storage changes in 18 years? Does the metadata scale? Block storage and file systems have a symbiotic relationship. Block storage mostly depends on file systems for access, and file systems depend on block storage to provide LUNs. Both need to scale for the relationship to work.

The problem as I see it is that file systems are not scaling, and since enterprises generally are the ones buying block storage for reliability and management, and enterprises generally have larger requirements, the symbiotic relationship is at risk. This in my opinion is why appliance-based storage is gaining more and more traction. More vendors are embedding file systems into storage and ensuring that they scale.

I suspect the most common file system on the planet in terms of instantiations is NTFS, and we all know that it does not scale well. Without changes from the file system community to provide high-performance access, more and more users will be moving to appliance storage. This should not surprise anyone.

Labels: file system, storage appliance, Storage, ntfs

posted by: Henry Newman

Moving Data or Moving Code

The current HPC model is to move data to processing. Parallel programming libraries, such as MPI (Message Passing Interface), allow an application to write or read data across the nodes in the cluster. This programming paradigm has worked well for about the last 10 years, but like paradigms before that, such as vector processing, it is showing signs of age.

The problem as I see it is that there is no easy way to connect networks into commodity CPUs at the rates needed. Let's assume that a vendor wanted to connect directly to AMD HT (HyperTransport) or Intel QPI (QuickPath Interconnect). That vendor will find that neither technology has scaled that well with memory bandwidth. Doing these types of direct connections is significantly more difficult for interconnect vendors than plugging an adapter into a PCIe slot, and it is not done very much these days given the engineering costs, the benefits, the time to market and the time a product spends in the market before the next one comes along and supersedes it.

What I think is going to have to happen is that parts of the executable will have to be moved to the data instead of the data being moved to the executable. Instead of moving, say, 300 MB of data from one CPU to another, why not move 1024 bytes of code to the CPU that has the data? This, of course, will not work when everyone must have the same data (broadcast); nor will it work if you need to add a value across all nodes. But it will work for many things. It took 10 years or so for parallel programming to become standardized and widely used (from 1990 to 2000), and I think we will see concepts such as the one I described become mainstream in about another 10 years.
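A toy Python sketch of the idea, run in a single process: the "node" keeps its data in place, receives a small serialized function, runs it locally and returns only the result. The Node class, the ship helper and the sample computation are all hypothetical, and marshal is used here only as a convenient stand-in for whatever a real system would use to move code.

```python
import builtins
import marshal
import types
from array import array


class Node:
    """Hypothetical cluster node that holds one slice of the data in memory."""

    def __init__(self, values):
        self.values = values

    def run_shipped(self, payload):
        # Rebuild the function from the bytes that arrived and run it
        # against the local data; only the small result travels back.
        code = marshal.loads(payload)
        func = types.FunctionType(code, {"__builtins__": builtins})
        return func(self.values)


def count_above_half(values):
    """The work to ship: count the samples greater than 0.5."""
    return sum(1 for v in values if v > 0.5)


def ship(func):
    """Serialize a function's code object (this is what would cross the wire)."""
    return marshal.dumps(func.__code__)


node = Node(array("d", (i / 100_000 for i in range(100_000))))
payload = ship(count_above_half)
result = node.run_shipped(payload)

data_bytes = len(node.values) * node.values.itemsize
print(f"code shipped: {len(payload)} bytes, data left in place: {data_bytes} bytes")
print(f"result: {result}")
```

The shipped function here touches only the data its node already holds, which is the case that works. A computation that needs every node's data, or a sum across all nodes, still has to communicate. Note also that marshal output is tied to the Python version and must never be loaded from an untrusted sender.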

Labels: data storage, Parallel processing, Storage

posted by: Henry Newman