The basic metric for this kind of comparison is how long you can keep your 'compute' node working on a part of a problem without any new input data and without any intermediate results that need to be posted for other parts of the code to continue (rendezvous points, I believe these are called).
The longer that time the better suited the problem is for a massive parallel solution.
If that time is low relative to the I/O that needs to be done, then you'll find very soon that the bus carrying data between the host CPU and the number cruncher is the bottleneck.
The big problem is that you can work with fairly large amounts of data and still have the bus be your bottleneck. I was doing some GPU work a few summers ago that focused on matrix multiplication. We were sending matrices with 8,000 numbers on a side to the GPU for multiplication and still ending up with the bus as the slowest part of the computation.
How much RAM was on the card? 64-bit numbers × 8000² = 512 MB per matrix. Granted, today you can have 4 GB per card, but back then you were probably stuck with a fraction of that.
Still, PCIe 2.0 x16 is limited to 8 GB/s, so I guess the real question is how many matrices were you multiplying?
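A rough back-of-envelope sketch of the numbers in this thread (the GPU throughput figure is my own assumption, not something stated above): compare the time to move one pair of 8000×8000 double-precision matrices over PCIe 2.0 x16 against the time to multiply them on the card.

```python
# Back-of-envelope: bus transfer time vs. compute time for one
# 8000x8000 double-precision matrix multiply.
# Assumed numbers: 8 GB/s PCIe 2.0 x16 (theoretical peak) and
# ~500 GFLOP/s double-precision GPU throughput (my guess, era-dependent).

n = 8000
bytes_per_elem = 8                      # 64-bit floats
matrix_bytes = n * n * bytes_per_elem   # 512 MB per matrix

pcie_bw = 8e9       # bytes/s, PCIe 2.0 x16 theoretical peak
gpu_flops = 500e9   # FLOP/s, assumed double-precision throughput

# Two input matrices go to the card, one result comes back.
transfer_s = 3 * matrix_bytes / pcie_bw

# Classic matrix multiply costs roughly 2*n^3 floating-point ops.
compute_s = 2 * n**3 / gpu_flops

print(f"transfer: {transfer_s:.3f} s")   # 0.192 s
print(f"compute:  {compute_s:.3f} s")    # 2.048 s
```

With these assumptions a single big multiply is compute-bound, since the work grows as n³ while the data only grows as n². The bus only becomes the bottleneck when the card's throughput exceeds roughly n × bandwidth / 12 (about 5 TFLOP/s here), or when you stream many smaller matrices so each transfer buys less compute, which is presumably why the number of matrices matters.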