In case you have not seen it, Japan is now on top of the supercomputing list, with the fastest machine in the world for running Linpack. Those of us in HPC know that Linpack is not a good measure of what type of science can be run on a machine, given that it does not measure things like interconnect bandwidth and latency or I/O performance.
One thing that seems to be forgotten in the supercomputer race to have the most Linpack FLOPS is I/O performance. Some of you reading this might ask who cares about supercomputers, but let me remind you that every drug you take, plane you fly in and car you drive was designed on a supercomputer. Supercomputers affect many aspects of our daily lives, and do not forget that the financial industry uses them for trading.
System balance is not a consideration for the industry, and that is hurting the science that can be done, because the arms race for many organizations is about where they fall on the Top500 list. This is not good for the scientists who actually have science to do and need more than just FLOPS. Nodes fail, so it is critical for jobs to checkpoint themselves so that they can restart after a node failure. For example, the K machine has 22,032 four-socket blade servers with either 32 GB or 64 GB of memory each. Let's say a big job runs on ¼ of the machine, which is 5,508 nodes; at 32 GB of memory per node, that is 176,256 GB of memory. Let's say you had an I/O rate of 100 GB/sec. A checkpoint would take about 1,762 seconds, or over 29 minutes. Clearly, checkpointing a job every hour or two is not tractable, nor is every three or four. We need to start looking at balance rather than a FLOPS arms race.
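
The checkpoint arithmetic above is easy to reproduce. Here is a minimal Python sketch using only the figures quoted in this post. The function names are made up for illustration, and the overhead estimate is a deliberately crude assumption (checkpoint time divided by the checkpoint interval) that ignores restart cost, compression and any overlap of I/O with computation.

```python
# Back-of-the-envelope checkpoint cost, using the figures quoted in the post.
TOTAL_NODES = 22_032         # K machine blade servers, as quoted
MEM_PER_NODE_GB = 32         # the smaller of the two memory configurations
IO_RATE_GB_PER_SEC = 100     # hypothetical aggregate I/O rate from the post


def checkpoint_seconds(nodes, mem_per_node_gb, io_rate_gb_per_sec):
    """Time to write every node's memory at a sustained aggregate I/O rate."""
    return nodes * mem_per_node_gb / io_rate_gb_per_sec


def overhead_fraction(ckpt_seconds, interval_hours):
    """Crude assumption: share of each interval spent writing the checkpoint."""
    return ckpt_seconds / (interval_hours * 3600)


job_nodes = TOTAL_NODES // 4                  # a job on 1/4 of the machine
job_memory_gb = job_nodes * MEM_PER_NODE_GB   # total memory to dump
seconds = checkpoint_seconds(job_nodes, MEM_PER_NODE_GB, IO_RATE_GB_PER_SEC)

print(f"{job_nodes} nodes, {job_memory_gb} GB to write")
print(f"checkpoint time: {seconds:.2f} s ({seconds / 60:.1f} min)")
for hours in (1, 2, 3, 4):
    share = overhead_fraction(seconds, hours)
    print(f"every {hours} h: {share:.0%} of wall time spent checkpointing")
```

Under this simple model, an hourly checkpoint eats roughly half of the wall time, and even a four-hour interval costs around 12 percent. That is the imbalance I am complaining about.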
posted by: Henry Newman