As you might know, Jeff Layton and I are writing a series on Big Data. I found it interesting that the Graph500 states:
Data intensive supercomputer applications are increasingly important for HPC workloads, but are ill-suited for platforms designed for 3D physics simulations. Current benchmarks and performance metrics do not provide useful information on the suitability of supercomputing systems for data intensive applications. A new set of benchmarks is needed in order to guide the design of hardware architectures and software systems intended to support such applications and to help procurements. Graph algorithms are a core part of many analytics workloads.
When it comes to running the benchmark, though, the runs are all in-core. The reality of Big Data is that the data is not always in memory, and on most systems it cannot fit into memory, as few people can afford memory in the petabyte range. Yet some data analysis problems are in the petabyte range, and no one benchmarks runs that go to disk, because the performance would not look good. Additionally, consider the need to checkpoint jobs on these large-memory systems; remember, stuff does break at the worst time.
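To make the in-core versus out-of-core distinction concrete, here is a minimal Python sketch (this is not Graph500 code; the file layout, function names, and chunk size are invented for illustration). It streams a dataset from disk in fixed-size chunks, so the working set never exceeds the chunk size, which is the basic shape of any out-of-core computation. The result matches the in-memory answer; only the performance characteristics change.

```python
import os
import struct
import tempfile

def write_dataset(path, n_values):
    """Write n_values 64-bit integers to disk; our stand-in for a
    dataset too large to hold in memory."""
    with open(path, "wb") as f:
        for i in range(n_values):
            f.write(struct.pack("<q", i))

def out_of_core_sum(path, chunk_bytes=4096):
    """Sum the values by streaming fixed-size chunks from disk,
    never holding more than chunk_bytes of data in memory at once."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:
                break
            count = len(chunk) // 8  # 8 bytes per 64-bit integer
            total += sum(struct.unpack(f"<{count}q", chunk))
    return total

path = os.path.join(tempfile.mkdtemp(), "data.bin")
write_dataset(path, 100_000)
print(out_of_core_sum(path))  # same answer as sum(range(100_000))
```

A disk-resident benchmark run would measure exactly this kind of loop, where I/O bandwidth, not memory bandwidth or FLOPS, dominates the runtime.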
This is not to say that I disagree with having data analytics benchmarks that stress new, very communications-intensive algorithms, in which floating-point arithmetic, so heavily emphasized in HPC, matters less. But leaving out storage means we are back to two benchmark types, storage benchmarks and computational benchmarks, with little to no overlap showing what a real system can do on a real problem. This is not the fault of the hardware vendors; it is our fault for not demanding more realistic measurements of systems.