
Technically, yes, a teraflop is a teraflop, and is directly comparable: it just means you can do an awful lot of floating-point operations per second. But many systems are sensitive to memory size and memory bandwidth, and, once you scale out, to communication costs (i.e. the latency/bandwidth of the interconnect between machines).

The benchmark is essentially bottlenecked on FP64 matrix multiplies. If that's what you need to do, then sure, it's indicative.
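A back-of-the-envelope sketch of why dense matmul can be compute-bound at all (`matmul_intensity` is just an illustrative helper, not part of any benchmark): an N×N matmul does ~2N³ flops over ~3N² elements of data, so its arithmetic intensity grows with N and the floating-point units, not the memory bus, become the limit:

```python
# Rough arithmetic intensity of a dense N x N matmul (illustrative only):
def matmul_intensity(n, bytes_per_element=8):  # 8 bytes/element for FP64
    flops = 2 * n ** 3                             # one multiply + one add per term
    bytes_moved = 3 * n ** 2 * bytes_per_element   # read A, B; write C (ignores cache reuse)
    return flops / bytes_moved

print(matmul_intensity(96))     # 8.0 flops/byte
print(matmul_intensity(9600))   # 800.0 -- intensity grows linearly with N
```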

Some machine learning workloads are also bottlenecked on matrix multiply, but don't need FP64 precision. They can use FP16 instead: it fits a bigger model in a given memory size, makes better use of memory bandwidth, and, given the right hardware support, delivers extremely high rates, as on Pascal.
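A trivial sketch of the memory-size point (illustrative numbers only, no real hardware modeled): FP16 stores 2 bytes per element versus 8 for FP64, so the same memory holds 4x the elements:

```python
# Storage for an N x N matrix at different precisions (illustrative only):
def matrix_bytes(n, bytes_per_element):
    return n * n * bytes_per_element

N = 16384
fp64 = matrix_bytes(N, 8)  # FP64: 8 bytes per element
fp16 = matrix_bytes(N, 2)  # FP16: 2 bytes per element
print(fp64 // fp16)        # 4 -- same memory holds 4x the FP16 elements
```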

Personally, I find the memory system on Pascal more interesting than the raw flops rate. Also the use of NVLink to link multiple GPUs.



I agree on the memory model being the most interesting thing about this card. I sort of "under-sold" it on the "better design" part of my last bullet.

People and manufacturers tend to look at clock rates, fill rates (for GPUs), FLOPS, and "crunching power" in general, completely forgetting the memory side. For example, today most CPU workloads end up bound by cache size, and performance tuning focuses on being nice to the cache rather than on optimal instruction selection (see for example Abrash's Pixomatic articles [0-2], which are about high-performance assembly programming in "modern environments").

With GPUs and "classic" HPC (I don't know about the new systems with "compute fabric" interconnects), memory usually becomes the bottleneck (except for embarrassingly parallel problems, of course). In fact, I'm pretty sure it was Cray who said that a supercomputer is a way to turn a CPU-bound problem into an IO-bound problem.

[0] http://www.drdobbs.com/architecture-and-design/optimizing-pi...

[1] http://www.drdobbs.com/optimizing-pixomatic-for-modern-x86-p...

[2] http://www.drdobbs.com/optimizing-pixomatic-for-modern-x86-p...


This. Anything with global interactions (i.e. low flops per byte transferred from memory to core) is poorly suited for GPUs.

There is a taxonomy of HPC-type workloads called Colella's "seven dwarfs" that characterizes the common computational patterns, some compute-bound and some memory-bandwidth-bound. See also the "roofline model". Both of these heuristics were made for reasoning about CPUs, but are also effective for thinking about GPUs.
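A minimal sketch of the roofline idea (the peak and bandwidth numbers are hypothetical, not any real chip): attainable throughput is the lesser of the compute roof and memory bandwidth times arithmetic intensity:

```python
# Roofline model: performance is capped either by the compute peak or by
# memory bandwidth * arithmetic intensity, whichever is lower.
# PEAK and BW below are made-up numbers for illustration.
def attainable_gflops(peak_gflops, bandwidth_gbs, intensity_flops_per_byte):
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)

PEAK, BW = 5000.0, 500.0  # ridge point at 5000/500 = 10 flops/byte
print(attainable_gflops(PEAK, BW, 0.25))  # low intensity: bandwidth bound
print(attainable_gflops(PEAK, BW, 50.0))  # high intensity: compute bound
```

Kernels left of the ridge point (intensity below peak/bandwidth) are bandwidth bound; kernels right of it hit the flops ceiling, which is where dense matmul lives.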



