I was talking with someone recently who really did not understand why his I/O was going over the network to another PCIe bus and then out to storage. I am not going to mention the vendor, as it really does not matter, but any vendor that has hardware where the SMP is composed of groups of nodes connected to each other via an external interconnect will have this same problem.
This problem occurs because volume managers and file systems allocate space in blocks, and those blocks are distributed across the storage. On a large SMP system, each file system block generally lives on a LUN that is connected to the SMP complex via a particular PCIe bus. Let’s say, for example, that the volume manager and file system allocate 64 KB blocks for each stripe, and I have 64 LUNs connected to 16 PCIe buses. A full stripe of data is 4 MB (64 LUNs x 64 KB), and each PCIe bus writes 256 KB of that full stripe (4 LUNs x 64 KB). The problem is that blocks are allocated sequentially in the file system, so a sequential write is spread across every LUN in the stripe. In the case of an SMP cluster with eight nodes controlling the 16 PCIe buses, this means that if I am writing from only one node, I must communicate the I/O to the other seven nodes to write the data, because each node has affinity for the PCIe buses and the storage it controls.
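To make the arithmetic concrete, here is a minimal Python sketch of the example above. The block size, LUN count, bus count and node count come from the example; the mapping itself is my assumption, since no real volume manager is being modeled here: LUNs are spread evenly and contiguously over the buses (four per bus), buses evenly over the nodes (two per node), and sequential blocks go round-robin across the LUNs. The names `owner_node` and `stripe_summary` are hypothetical helpers, not part of any product.

```python
# Figures taken from the example in the text.
BLOCK_KB = 64    # file system block (stripe unit) written to each LUN
NUM_LUNS = 64
NUM_BUSES = 16
NUM_NODES = 8

# Assumption (not stated in the text): LUNs are spread evenly and
# contiguously over the buses, and buses evenly over the nodes.
LUNS_PER_BUS = NUM_LUNS // NUM_BUSES      # 4
BUSES_PER_NODE = NUM_BUSES // NUM_NODES   # 2


def owner_node(block_index: int) -> int:
    """Return the node whose PCIe bus reaches the LUN holding this block."""
    lun = block_index % NUM_LUNS   # sequential blocks go round-robin over LUNs
    bus = lun // LUNS_PER_BUS
    return bus // BUSES_PER_NODE


def stripe_summary(writer_node: int = 0) -> dict:
    """Split one full stripe into local and remote KB for a single writer."""
    full_stripe_kb = BLOCK_KB * NUM_LUNS
    local_kb = sum(
        BLOCK_KB for block in range(NUM_LUNS) if owner_node(block) == writer_node
    )
    owners = {owner_node(block) for block in range(NUM_LUNS)}
    return {
        "full_stripe_kb": full_stripe_kb,
        "per_bus_kb": full_stripe_kb // NUM_BUSES,
        "local_kb": local_kb,
        "remote_kb": full_stripe_kb - local_kb,
        "remote_nodes": len(owners - {writer_node}),
    }


if __name__ == "__main__":
    print(stripe_summary(writer_node=0))
```

Under these assumptions, a process writing one full stripe from node 0 keeps only 512 KB local and sends the remaining 3,584 KB, seven-eighths of the stripe, across the interconnect to the other seven nodes.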
There is no way to solve this problem with block-based file systems. Even if we had an object-based file system, it would work only if the processes reading and writing data were on the correct nodes. I might be able to write all the objects down a single PCIe bus; however, if I want to read the data, I had better run the job on the node with that PCIe bus. That is why this matters.