Henry Newman's Storage Blog Archives for June 2012

The Problem With Requirements Gathering

I am constantly discussing requirements gathering with people who seemingly want to look at only part of the requirements rather than the whole end-to-end workflow. Each group in the architecture wants to look only at its own requirements.

For example, server folks want to look at how many cores are needed and how much memory is required, often without looking at memory per core, memory bandwidth requirements, or the number and types of PCIe buses needed. The I/O and file system people often do not think about the amount of memory needed for file system cache, the layout of the server in terms of buses, or the number and types of connections. On top of this, the application requirements from the users and their workflow are often neither well understood nor well described. How can anyone develop an architecture without a good understanding of how resources will be consumed?

This leads me to think the problem stems from organizational behavior. We all seem to want to play in our own sand (SAN) box (forgive the pun), and yet everyone expects the resulting system to meet the requirements of both the users' applications and the people managing the systems. Somehow, everyone seems to forget that it does not work that way most of the time, and the results do not seem to provide the best value to everyone. Working together in a large organization requires everyone to provide information, and everyone to give up information and control for the benefit of the whole.

Have we all become too selfish to look at the group benefit, which, of course, is the organizational benefit? Changing the way we think requires more than leadership from the top today; it requires a complete rethink of how society behaves.

Labels: data center management, requirements, infrastructure management

posted by: Henry Newman

Big Data and Graph 500

As you might know, Jeff Layton and I are writing a series on Big Data. I found it interesting that Graph 500 states that:

Data intensive supercomputer applications are increasingly important for HPC workloads, but are ill-suited for platforms designed for 3D physics simulations. Current benchmarks and performance metrics do not provide useful information on the suitability of supercomputing systems for data intensive applications. A new set of benchmarks is needed in order to guide the design of hardware architectures and software systems intended to support such applications and to help procurements. Graph algorithms are a core part of many analytics workloads.

When it comes to running the benchmark, though, everything is run in core. The reality of Big Data is that the data is not always in memory, nor for most systems can it fit into memory, as most people cannot afford memory sizes in the range of petabytes. Yet some data analysis problems are in the range of petabytes, and no one benchmarks runs that go to disk, as the performance will not be that good. Additionally, consider the need to checkpoint jobs on these large-memory systems -- remember, things do break, and at the worst time.

This is not to say that I am against data analytics benchmarks that measure new, very communications-intensive algorithms, where the floating point arithmetic emphasized in traditional HPC matters less. But leaving out storage means we again have two benchmark types -- storage benchmarks and computational benchmarks -- and once again little to no overlap showing what a real system can do on a real problem. This is not the fault of the hardware vendors, but our fault for not demanding more realistic measurements of systems.

Labels: benchmark, big data

posted by: Henry Newman

Storage Technology Lost to the Recession

The article I recently wrote about 4 storage technologies lost to the recession has received some interesting feedback.

One reader asked what would have happened to file systems if OSD drives had made it. I believe there would have been a resurgence of file system development to take advantage of the new technologies. File systems such as Ceph would have had a much easier time connecting to storage, and the data abstraction between the file system and the device would have been much cleaner.

Another reader asked me what might have happened if PCIe 3.0 had been released in 2010 vs. 2012. I think that had PCIe 3.0 been released in 2010, 12 Gb/sec SAS, 16 Gb/sec FC and 40 Gb/sec Ethernet would have been brought to market much sooner. There was no sense in moving forward more quickly in developing these technologies without PCIe 3.0. The lack of improvement in the external connection technology on which we all depend (PCIe) means that other technologies have to wait, as I stated in the article.

The long-term impact is that storage becomes more of a bottleneck. Storage scaling compared to CPU performance and memory bandwidth is already abysmal. The last thing that is needed is more roadblocks, as that will require people to work around the roadblocks, and that has long-term impacts.

The last email I got was from an old friend. He asked who cares about 3.5-inch drives or 2.5-inch drives -- why do they matter? My answer: rotating 2.5-inch drives have better IOPS per watt and better bandwidth per watt. You are trading capacity per drive for power, but in large configurations you can pack more of these drives into a smaller space and are likely to end up with better density per cubic foot.
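The trade-off in that answer is easy to put into rough numbers. Below is a minimal sketch in Python; the drive and shelf figures are hypothetical placeholders for illustration only, not measurements of any particular product, and are there simply to show how IOPS per watt, bandwidth per watt, and performance density per shelf are computed.

    # Hypothetical drive specifications, for illustration only -- not
    # measurements of any particular product.
    drives = {
        "3.5-inch near-line": {"iops": 75,  "mib_s": 120, "watts": 11.0, "per_shelf": 60},
        "2.5-inch":           {"iops": 140, "mib_s": 160, "watts": 5.5,  "per_shelf": 120},
    }

    for name, d in drives.items():
        print(f"{name}: "
              f"{d['iops'] / d['watts']:.1f} IOPS per watt, "
              f"{d['mib_s'] / d['watts']:.1f} MiB/s per watt, "
              f"{d['iops'] * d['per_shelf']} IOPS and "
              f"{d['mib_s'] * d['per_shelf']} MiB/s per shelf")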

Thanks for the questions.

Labels: recession, budget, storage technologies, PCIe 3.0

posted by: Henry Newman

What Is PCIe Affinity and Why Does It Matter?

I was talking with someone recently who really did not understand why his I/O was going over the network to another PCIe bus and then out to storage. I am not going to mention the vendor, as it really does not matter, but any vendor that has hardware where the SMP is composed of groups of nodes connected to each other via an external interconnect will have this same problem.

This problem occurs because volume managers and file systems allocate space in blocks, and those blocks are distributed across the storage. When a large SMP system is connected to the storage, each file system block is generally associated with a LUN that is connected to the SMP complex via a particular PCIe bus. Let's say, for example, that the volume manager and file system allocate 64 KB per LUN for each stripe, and I have 64 LUNs connected to 16 PCIe buses. A full stripe of data is then 4 MB, and each PCIe bus carries 256 KB of that stripe. The problem is that blocks are allocated sequentially in the file system. In the case of an SMP cluster with eight nodes controlling the 16 PCIe buses, this means that if I am writing from only one node, I must ship the I/O to the other seven nodes to write the data. Each node has affinity to the PCIe buses and the storage it controls.
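To make the affinity issue concrete, here is a minimal sketch of the block-placement math in the example above. The stripe unit and the LUN, bus, and node counts come from the example; the contiguous LUN-to-bus and bus-to-node assignment is an assumption for illustration.

    STRIPE_UNIT = 64 * 1024   # 64 KB written to each LUN per stripe
    LUNS = 64
    BUSES = 16                # 4 LUNs per PCIe bus
    NODES = 8                 # 2 PCIe buses per node

    LUNS_PER_BUS = LUNS // BUSES
    BUSES_PER_NODE = BUSES // NODES

    def placement(offset):
        """Map a byte offset in a striped file to its LUN, PCIe bus, and owning node."""
        element = offset // STRIPE_UNIT
        lun = element % LUNS
        bus = lun // LUNS_PER_BUS
        node = bus // BUSES_PER_NODE
        return lun, bus, node

    # Write one full 4 MB stripe from node 0 and count how much of it stays local.
    full_stripe = STRIPE_UNIT * LUNS
    local = remote = 0
    for offset in range(0, full_stripe, STRIPE_UNIT):
        _, _, node = placement(offset)
        if node == 0:
            local += STRIPE_UNIT
        else:
            remote += STRIPE_UNIT

    print(f"full stripe: {full_stripe // 1024} KB")
    print(f"written through node 0's own PCIe buses: {local // 1024} KB")
    print(f"shipped to the other {NODES - 1} nodes: {remote // 1024} KB")

Of the 4 MB stripe, only 512 KB goes down the two buses node 0 owns; the remaining 3.5 MB has to cross the interconnect to the other nodes.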

There is no way to solve this problem with block-based file systems. Even if we had an object-based file system, it would work only if the processes reading and writing data were on the correct nodes. I might be able to write all the objects down a single PCIe bus; however, if I want to read the data, I had better run the job on the node with that PCIe bus. That is why this matters.

Labels: PCIe, file system, SMP

posted by: Henry Newman

Disk Drive Performance Becoming Insufficient

The latest crop of near-line SAS enterprise drives (aka enterprise SATA drives) gets between 108 MiB/sec and 132 MiB/sec of performance, while some of the fastest RAID controllers get around 40 GiB/sec. When you do the math, you see that the number of disk drives required to sustain the controller performance is:

                         Near-Line Disk            Enterprise 2.5-Inch
                         Average Performance       Average Performance
20 GiB/sec controller    189.63 drives             121.90 drives
40 GiB/sec controller    379.26 drives             243.81 drives

If you take these numbers and multiply them by the size of the largest disk drives in those categories, you get:

                         Near-Line Disk            Enterprise 2.5-Inch
                         Amount of Storage in TB   Amount of Storage in TB
20 GiB/sec controller    567                       109.8
40 GiB/sec controller    1140                      219.6

This is not a great deal of storage, especially if you are considering enterprise 2.5-inch SAS drives: only about 220 TB of space is enough to sustain the full bandwidth of a 40 GiB/sec controller. Of course, you are not going to achieve 100 percent of the bandwidth of the drives, and I am using average performance, not the performance of the inner cylinders. Still, the picture is clear in my opinion: controller bandwidth has not kept up with disk performance and density. The question is, what does this mean for the future design of controllers?
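For readers who want to check the arithmetic, here is a minimal sketch of the calculation behind the two tables. The per-drive rates and largest-drive capacities are the values implied by the tables (roughly 108 MiB/sec and 3 TB for near-line, 168 MiB/sec and 0.9 TB for enterprise 2.5-inch); treat them as illustrative assumptions rather than vendor specifications.

    drive_classes = {
        "near-line":           {"mib_per_sec": 108, "largest_tb": 3.0},
        "enterprise 2.5-inch": {"mib_per_sec": 168, "largest_tb": 0.9},
    }

    for gib_per_sec in (20, 40):
        controller_mib_per_sec = gib_per_sec * 1024
        for name, d in drive_classes.items():
            drives = controller_mib_per_sec / d["mib_per_sec"]   # drives to sustain the controller
            capacity_tb = drives * d["largest_tb"]               # capacity behind those drives
            print(f"{gib_per_sec} GiB/sec controller, {name}: "
                  f"{drives:.2f} drives, about {capacity_tb:.0f} TB")

The table values appear to round the drive counts to whole drives, which is why the TB figures differ slightly from what this prints.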

Ten years ago we still had large SMP machines from a number of vendors, and my understanding is that the biggest cost for those machines was designing the memory interconnect. Today, most of those machines have been replaced by machines with NUMA memory, whose interconnect performance is far lower than that of a standard SMP of yesteryear. The new quad-core Sandy Bridge/Romley systems have great I/O bandwidth, but I am afraid nothing more will be done for these designs, beyond quad-core, to improve controller performance. I think we are going the way of the SMP vendors: in the future, controllers will be designed as clusters of boxes based on commodity hardware.

Labels: disk drives, controller, Storage Performance

posted by: Henry Newman

LTO-6 Announced

In case you have not seen it, LTO-6 has been announced, and it looks like the native cartridge capacity is only 2.5 TB. The LTO Consortium stated that a compressed cartridge will be able to store 6.25 TB of data. The only way I know of achieving that is to increase the compression buffer size and the size of the data you are compressing against, which is often called the data dictionary. Backup data is generally highly compressible -- think of deduplication as the extreme case of a large compression buffer and data dictionary -- so backup streams can be compressed significantly. For those of us not using tape for backup, however, the picture is different, because most archival data cannot be compressed much.
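As a rough illustration of why the history buffer and dictionary size matter, here is a minimal sketch using Python's zlib. It is not the tape drive's actual compressor and the window sizes are stand-ins; it only shows that a larger history window helps redundant, backup-style data and does essentially nothing for already-compressed, archive-style data.

    import os
    import zlib

    def ratio(data, wbits):
        # wbits controls the size of the compressor's history window.
        c = zlib.compressobj(level=6, wbits=wbits)
        compressed = c.compress(data) + c.flush()
        return len(data) / len(compressed)

    chunk = os.urandom(4096)                     # a 4 KB block that recurs in the stream
    backup_like = chunk * 256                    # redundant, backup-style data (1 MB)
    archive_like = os.urandom(len(backup_like))  # pre-compressed, archive-style data

    for name, data in (("backup-like", backup_like), ("archive-like", archive_like)):
        small = ratio(data, wbits=9)             # ~512-byte history window
        large = ratio(data, wbits=15)            # ~32 KB history window
        print(f"{name}: {small:.2f}:1 with a small window, {large:.2f}:1 with a large one")

With the small window the compressor cannot see back far enough to find the repeated 4 KB block, so the backup-like stream barely compresses; with the large window it compresses dramatically, while the archive-like stream stays at roughly 1:1 either way.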

Here is the problem I see: From what I have heard, the biggest growth in the tape market is for archival applications, not backup. Many archival data types are either not compressible because they have already been compressed (e.g., video, music, pictures, and images from MR or CT scans), or they have limited compressibility, such as the output of simulations from various engineering applications (e.g., car crash testing, aircraft and boat design, and oil exploration and discovery). Some simulation data has some compressibility, but from what I have seen and tested in the past, it is limited to at most 30 percent to 40 percent, even with huge compression buffers.

I would really like to see the data and understand why the LTO Consortium decided to approach the problem the way it has. What went into the decision process? What data sets were examined to support the decision and the claim? What market spaces were considered? Given that archive is the long-term growth market for tape, I am very confused that the capacity gain comes not from media and head density but from compression.

Labels: backup, tape, tape backup, LTO

posted by: Henry Newman