The current HPC model is to move data to the processing. Parallel programming libraries, such as MPI (Message Passing Interface), allow an application to write or read data across the nodes in the cluster. This programming paradigm has worked well for about the last 10 years, but like earlier paradigms, such as vector processing, it is showing signs of age.
The problem as I see it is that there is no easy way to connect networks to commodity CPUs at the rates needed. Suppose a vendor wanted to connect directly to AMD HT (HyperTransport) or Intel QPI (QuickPath Interconnect). That vendor would find that neither technology has scaled well with memory bandwidth. For an interconnect vendor, a direct connection of this kind is significantly more difficult than plugging an adapter into a PCIe slot. It is rarely done these days, given the engineering costs, the benefits, the time to market, and the time a product spends in the market before the next one comes along and supersedes it.
What I think will have to happen is that parts of the executable are moved to the data, rather than the data being moved to the executable. Instead of moving, say, 300 MB of data from one CPU to another, why not move 1024 bytes of code to the CPU that has the data? This, of course, will not work when every node must have the same data (a broadcast), nor will it work when you need to add a value across all nodes (a reduction). But it will work for many things. It took 10 years or so, from 1990 to 2000, for parallel programming to become standardized and widely used, and I think concepts such as the one I described will be mainstream in about another 10 years.
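
To make the trade-off concrete, here is a minimal single-process sketch in Python. It is not an MPI program and not a proposal for a real runtime; the `Node` class, its `send_data` and `run` methods, the `KERNEL` source string, and the `count_above` function are all hypothetical names invented for illustration. It only compares how many bytes would cross the network in each model, assuming the data and the code are serialized before being sent.

```python
import pickle


class Node:
    """Hypothetical compute node that owns a block of data in local memory."""

    def __init__(self, data):
        self.data = data

    def send_data(self):
        # Data-to-code model: serialize the whole block so that
        # another node can process it.
        return pickle.dumps(self.data)

    def run(self, source, func_name):
        # Code-to-data model: receive a small piece of source text,
        # define the function locally, and apply it to the local data.
        namespace = {}
        exec(source, namespace)
        return namespace[func_name](self.data)


# The "code" that travels: a short kernel shipped as plain text (assumption).
KERNEL = """
def count_above(values, threshold=0.5):
    return sum(1 for v in values if v > threshold)
"""


def compare():
    """Return (bytes moved when shipping data, bytes moved when shipping code, result)."""
    node = Node([i / 100000 for i in range(100000)])

    # Model 1: move the data to the processing.
    data_bytes = len(node.send_data())

    # Model 2: move the code to the data and return only the small result.
    result = node.run(KERNEL, "count_above")
    code_bytes = len(KERNEL.encode()) + len(pickle.dumps(result))

    return data_bytes, code_bytes, result


if __name__ == "__main__":
    data_bytes, code_bytes, result = compare()
    print("bytes moved when shipping data:", data_bytes)
    print("bytes moved when shipping code:", code_bytes)
    print("result:", result)
```

In this toy case the kernel plus its result is a few hundred bytes, while the serialized data is several hundred kilobytes. The sketch deliberately ignores what a real system would have to solve, such as security, heterogeneous nodes, and the broadcast and reduction cases mentioned above, where every node needs the data or a value must be combined across all nodes.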