
What I got from the article:

* ~5.5 TFLOPs on FP64

* "About 2x" performance on FP32 (so 11 TFLOPs)

* "Up to 2x" performance on FP16 (compared to FP32, so about 22 TFLOPs).

* FP16 is also aimed at neural net training, because storing the weights as FP16 gives a more compact representation.

* 3840 general purpose processors.

* More/better texture units, memory units, etc. So it's not only about raw power, but also about a better design.

Guess that covers the important points. I only skimmed the article, reading a bit here and there, but that seemed to be the most remarkable stuff.



How does performance like this compare to the TFLOPs of a supercomputer? I.e., how does one draw parallels between the performance you can get here and what we've historically been able to do, per http://www.top500.org/statistics/perfdevel/?


Pascal won't be cheap, so comparing to a top-end Xeon E5 v4 (a fair match given the price range), it's about 7x the theoretical FP64 performance, assuming the Xeon does 2.2 GHz * 22 cores * 8 AVX FP64 lanes * 2 for FMA per socket. Similar story at FP32. At FP16, however, the GPU pulls even further ahead.
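Back-of-the-envelope version of that comparison (a minimal sketch; the per-cycle FLOP breakdown for the Xeon is my assumption, and the Pascal figure is just the number quoted above):

    # Rough theoretical-peak comparison: Xeon E5 v4 vs. the quoted Pascal FP64 number.
    # Assumes 2 x 256-bit FMA units per core: 8 FP64 lanes * 2 flops (FMA) = 16 flops/cycle.
    xeon_ghz, xeon_cores = 2.2, 22
    xeon_fp64_tflops = xeon_ghz * xeon_cores * 16 / 1000.0   # ~0.77 TFLOPS per socket
    pascal_fp64_tflops = 5.5                                  # figure quoted in the article
    print(pascal_fp64_tflops / xeon_fp64_tflops)              # ~7.1x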

For historical performance, just pick one of the machines that hit x teraflops. E.g. the first teraflop computer (ASCI Red) used roughly 7,000 200 MHz Pentium Pro chips around 1996.


What I'm trying to figure out is whether a teraflop is directly comparable. That is, the first 4.9-teraflop computer appeared on the TOP500 list in 2000; does that mean Pascal could provide performance similar to that supercomputer on the LINPACK benchmark?

In describing the benchmark, they say:

In an attempt to obtain uniformity across all computers in performance reporting, the algorithm used in solving the system of equations in the benchmark procedure must conform to LU factorization with partial pivoting. In particular, the operation count for the algorithm must be 2/3 n^3 + O(n^2) double precision floating point operations. This excludes the use of a fast matrix multiply algorithm like "Strassen's Method" or algorithms which compute a solution in a precision lower than full precision (64 bit floating point arithmetic) and refine the solution using an iterative approach.

So, to summarize: if in 2000 the fastest supercomputer on the planet ran at about 4.9 TFLOPs, does that mean, apples to apples on LINPACK (and only LINPACK), that Pascal today would outperform that supercomputer?


Technically, yes, a teraflop is a teraflop, and is directly comparable. It just means you can do an awful lot of floating point operations per second. But many systems are sensitive to memory size and memory bandwidth, and, across machines, to communication costs (i.e. the latency/bandwidth of the interconnect between nodes).

The benchmark is essentially bottlenecked on FP64 matrix multiplies. If that's what you need to do, then sure, it's indicative.
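To make that concrete, here's a rough sketch using the 2/3 n^3 operation count quoted above; the problem size is made up, and it optimistically assumes the GPU sustains its full theoretical FP64 rate on HPL:

    # HPL solves a dense n x n system and counts 2/3 * n^3 FP64 operations.
    n = 100000                       # hypothetical problem size
    flops = (2.0 / 3.0) * n ** 3
    seconds = flops / 5.5e12         # assumes the 5.5 TFLOPS peak is actually sustained
    print(seconds)                   # ~121 s; real HPL efficiency would be lower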

Some machine learning workloads are also bottlenecked on matrix multiply, but don't need FP64 precision; they can use FP16. That fits a bigger model in a given amount of memory, makes better use of memory bandwidth, and, given the right hardware support, gets you extremely high throughput, as on Pascal.
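The footprint argument is easy to see with NumPy (the layer size here is made up, just to show the halving):

    import numpy as np

    # The same (hypothetical) weight matrix stored at two precisions.
    w32 = np.random.randn(4096, 4096).astype(np.float32)
    w16 = w32.astype(np.float16)
    print(w32.nbytes / 2**20, w16.nbytes / 2**20)   # 64.0 MiB vs 32.0 MiB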

Personally, I find the memory system on Pascal more interesting than the raw flops rate, along with the use of NVLink to connect multiple GPUs.


I agree that the memory system is the most interesting thing about this card. I sort of under-sold it in the "better design" part of my last bullet.

People and manufacturers tend to look at clock rates, fill rates (for GPUs), FLOPs, and "crunching power" in general, while completely forgetting about memory. For example, today most CPU workloads end up bound by cache size, and performance tuning focuses on being nice to the cache rather than on optimal instruction selection (see, for example, Abrash's Pixomatic articles [0-2], which are about high-performance assembly programming in "modern environments").

With GPUs and "classic" HPC (I don't know about the new systems with "compute fabric" interconnects), memory usually becomes the bottleneck (except for embarrassingly parallel problems, of course). In fact, I'm pretty sure it was Cray who said that a supercomputer is a way to turn a CPU-bound problem into an I/O-bound problem.

[0] http://www.drdobbs.com/architecture-and-design/optimizing-pi...

[1] http://www.drdobbs.com/optimizing-pixomatic-for-modern-x86-p...

[2] http://www.drdobbs.com/optimizing-pixomatic-for-modern-x86-p...


This. Anything with global interactions (i.e. low flops per byte transferred from memory to core) is poorly suited for GPUs.

There is a taxonomy of HPC-type workloads called "Colella's seven dwarfs" that classifies workloads by computational pattern, which largely determines whether they end up compute bound or memory-bandwidth bound. See also the "roofline model". Both were originally framed for CPUs, but they're also effective for thinking about GPUs.
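The roofline model itself is one line of arithmetic; a minimal sketch with placeholder numbers (not real spec figures for any particular chip):

    # Roofline: attainable GFLOPS = min(peak compute, memory bandwidth * arithmetic intensity).
    peak_gflops = 5500.0      # placeholder FP64 peak
    bandwidth_gbs = 720.0     # placeholder memory bandwidth
    for intensity in (0.25, 1.0, 4.0, 16.0):          # flops per byte moved
        print(intensity, min(peak_gflops, bandwidth_gbs * intensity))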


What's the memory bandwidth of the Xeon E5 v4? Nvidia made a big deal of their increased memory bandwidth from chip stacking.



