Lower in this thread, bitL pointed out that the prices we used in our analysis are not exactly in line with current market prices. bitL hit the nail on the head in terms of the biggest weakness of our post: choosing the price. So, we've decided to make the spreadsheet that generated our graphs and (performance / $) tables public. You can view it here: https://docs.google.com/spreadsheets/d/1La55B-AVHSv9LiQcs6GM...
You can copy that spreadsheet and insert whatever system price (in kilodollars) you want into B15:F15. Hope this makes everybody's decision making easier.
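If you'd rather script it than edit the spreadsheet, the calculation it performs is trivial. Here's a minimal Python sketch with made-up throughput numbers; substitute your own benchmark results and system prices:

```python
# Minimal sketch of the (performance / $) calculation the spreadsheet does.
# The throughput numbers below are placeholders, not our measured results.

throughput_img_per_sec = {        # hypothetical training throughput
    "1080 Ti system": 210.0,
    "2080 Ti system": 290.0,
}

system_price_kilodollars = {      # edit these just like cells B15:F15
    "1080 Ti system": 2.0,
    "2080 Ti system": 2.5,
}

for name, images_per_sec in throughput_img_per_sec.items():
    dollars = system_price_kilodollars[name] * 1000
    print(f"{name}: {images_per_sec / dollars:.3f} images/sec per dollar")
```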
As a system builder and AI research company, we're trying to make benchmarks that are scientific and reproducible, correlate with real-world training scenarios, and use accurate prices.
The fp16 results versus the 1080 Ti are somewhat surprising. The author specifically pointed out that they are using tensor cores where possible. I would have expected fp16 to be more than 100% faster than the 1080 Ti if they were using tensor cores. Can anyone explain that?
Not everything in the compute pipeline is going to be converted to fp16 operations. Any time you are doing accumulation or exponentials, you have to keep it in fp32.
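As a toy illustration of the accumulation point (made-up numbers, nothing to do with any particular benchmark), here's what happens when the accumulator itself is kept in fp16 versus fp32:

```python
import numpy as np

# fp16 storage is fine, but if the running sum is also rounded to fp16 after
# every add, small contributions vanish once the total grows.

x = np.full(10_000, 0.01, dtype=np.float16)

acc16 = np.float16(0.0)
for v in x:
    acc16 = np.float16(acc16 + v)   # fp16 accumulator: rounds after every add

acc32 = np.float32(0.0)
for v in x:
    acc32 += np.float32(v)          # fp32 accumulator, fp16 storage

print(acc16)   # stalls around 32, far from the true ~100
print(acc32)   # ~100
```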
Right, but the tensor cores should be about 10x faster on the compute side, and about 2x the memory bandwidth. GEMM is usually constrained by compute, which is why the tensor cores exist.
These benchmarks are for training, so the expectation is that they are running them in fp16 all the way through. Also, tensor cores can accumulate in fp32 registers with a slight hit to performance.
Is there a benchmark you've seen that matches the claimed 10x increase in performance for the Tensor Cores? The NVIDIA hype train can sometimes make it difficult to find hard numbers.
I've done my own benchmarks where I've hit over 100 TFLOPS on the V100, and that's about 85% of its peak theoretical throughput. Granted, the matrix size needs to be large enough, but it's definitely doable. Anandtech also showed similar results in their V100 review. I haven't yet seen a comparable GEMM benchmark done on the 2080 Ti, so I don't know how it'll compare. I have some coming in at the end of the month though, so I should know soon.
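For reference, here's roughly how one can measure it; this is just a PyTorch sketch, not my exact harness, and it assumes the commonly quoted 125 TFLOPS FP16 tensor-core peak for the V100:

```python
import time
import torch

# A matmul of two n x n matrices costs ~2 * n^3 FLOPs.
n = 8192
a = torch.randn(n, n, device="cuda", dtype=torch.float16)
b = torch.randn(n, n, device="cuda", dtype=torch.float16)

for _ in range(3):                      # warm-up
    torch.matmul(a, b)
torch.cuda.synchronize()

iters = 10
start = time.time()
for _ in range(iters):
    torch.matmul(a, b)
torch.cuda.synchronize()
elapsed = (time.time() - start) / iters

tflops = 2 * n ** 3 / elapsed / 1e12
print(f"~{tflops:.1f} TFLOPS ({tflops / 125 * 100:.0f}% of the V100's 125 TFLOPS FP16 peak)")
```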
If you can write a better optimized network, go ahead. But like in SSE2 vs AVX2 vs AVX512 benchmarks, FP performance on paper doesn't always translate into better real world FP performance. Now if Nvidia had switched to HBM2 like Google TPU2, it might be different.
Actually, AMD GPUs are the most cost-effective GPUs and have supported full-rate FP16/8 for two years.
But people are locked into the Nvidia proprietary jail and no one seems to care...
That is really sad and against consumer interest, but also, deep learning will never become mainstream if it can only run on 5% of the world's hardware (Nvidia).
> Actually AMD gpus are the most cost effective gpus.
No, AMD GPUs are not cost effective at all, because TensorFlow does not support AMD GPUs.
> But people are locked in the Nvidia proprietary jail and no one seems to care...
Sounds like you want to blame the users, but this is because Nvidia has invested heavily in GPGPU and CUDA for more than 10 years, while AMD focused on other things like HSA. It is AMD's fault.
I agree with kbumsik here. AMD only has themselves to blame. They have great hardware and fantastic theoretical benchmarks. Heck, even their SGEMMs are really fast and in line with the 15TFlops of FP32 on the VEGA 64s that we've benchmarked. However, it comes down to software ecosystem and optimizations for common deep learning subroutines.
MIOpen[1] is a step in this direction but still causes the VEGA 64 + MIOpen to be 60% of the performance of a 1080 Ti + CuDNN based on benchmarks we've conducted internally at Lambda. Let that soak in for a second: the VEGA 64 (15TFLOPS theoretical peak) is 0.6x of a 1080 Ti (11.3TFLOPS theoretical peak). MIOpen is very far behind CuDNN.
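To put those two numbers side by side, a quick back-of-the-envelope calculation:

```python
# If Vega 64 + MIOpen delivers 0.6x the throughput of a 1080 Ti + CuDNN despite
# a higher theoretical peak, the software stack is extracting a much smaller
# fraction of the hardware.

vega_peak, gtx1080ti_peak = 15.0, 11.3   # TFLOPS, FP32 theoretical
relative_throughput = 0.6                # Vega 64 vs 1080 Ti, measured

relative_efficiency = relative_throughput * gtx1080ti_peak / vega_peak
print(f"Vega 64 runs at ~{relative_efficiency:.0%} of the 1080 Ti's efficiency")
# ~45%: per FLOP of peak, MIOpen delivers less than half of what CuDNN does.
```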
Lisa Su, if you're reading this, please give the ROCm team more budget!
If AMD developed CUDA transpilers / DL backends / CUDNN equivalents people would switch over in a heartbeat. It's fully their fault for not investing in the software.
If they did that and had a card that got 2x performance / $ or more I would switch in a heartbeat.
The quality and open source nature of their tools have resulted in much of my research group (real-time vision) increasingly and voluntarily moving to work on AMD platforms (we were previously almost exclusively using CUDA).
> Also, AMD does not limit FP performance on consumer cards.
I don't understand this meme. The consumer cards are different chips with slow fp64 hardware. In what sense is that "limiting" performance relative to the enterprise cards?
On AMD, if I'm not mistaken, FP16 is twice the rate of FP32, as it should be.
"For their consumer cards, NVIDIA has severely limited FP16 CUDA performance. GTX 1080’s FP16 instruction rate is 1/128th its FP32 instruction rate, or after you factor in vec2 packing, the resulting theoretical performance (in FLOPs) is 1/64th the FP32 rate, or about 138 GFLOPs."
I agree with your broader point, but stating that Nvidia comprises a mere 5% of the world's hardware is disingenuous in a post where you've only talked about GPUs. In actual fact, Nvidia has the overwhelming majority of the high-end GPU market, especially in HPC and ML. And as much as I dislike the proprietary lock-in, they have by far the superior software ecosystem.
Actually, Nvidia isn't the high-end leader if you count console APUs (almost all of which are from AMD). And most GPUs in the world are iGPUs, APUs, or ARM SoCs. Discrete GPUs are a small market; high-end discrete GPUs are a few percent at most.
Deep learning generally doesn't run on AMD GPUs, but it runs on a lot more than only Nvidia GPUs.
If we're only counting model training, it runs on CPUs, Google's TPUs, FPGAs, whatever other secret datacenter ASICs are out there, various DL-specific mobile chips, etc.
Way more than 5% of the world's hardware can run inference with deep neural nets, which is the important thing for mass adoption, and definitely more than only Nvidia GPUs can run training.
You started your comment with an objective falsehood. NVIDIA has the most cost-effective GPUs as long as CUDA is king and AMD doesn't invest more on the software front.
> 37% faster than the 1080 Ti with FP32, 62% faster with FP16, and 25% more expensive.
Maybe in theory, but you can get a new 1080 Ti for $700 while a 2080 Ti is impossible to get under $1,200, which makes it about 70% more expensive. At that point, 2x 1080 Ti sounds way better than 1x 2080 Ti for deep learning to me (up to 22 TFLOPS and 22 GB of RAM).
Hey bitL, thanks a lot for pointing this out. I think you hit the nail on the head in terms of the biggest weakness of our post: choosing the price. We've decided to make the spreadsheet that generated our graphs and (performance / $) tables public. You can view it here: https://docs.google.com/spreadsheets/d/1La55B-AVHSv9LiQcs6GM...
You can copy that spreadsheet and insert whatever system price (in kilodollars) you want into B15:F15. Hope this makes everybody's decision making easier.
As a system builder and AI research company, we're trying to make benchmarks that are scientific and reproducible, correlate with real-world training scenarios, and use accurate prices.
Yes. 1080 Tis do not scale linearly, but in most of the benchmarks I've seen, two 1080 Tis are at least 1.75x faster than one when doing multi-GPU training.
1.75 / 1.36 (the speed-up of a 2080 Ti over a single 1080 Ti) ≈ 1.29. So expect 2x 1080 Ti to be about 30% faster.
You can see how multi-GPU training scales in the Titan V benchmarks in the link below; 1080 Tis have a similar scaling profile.
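For clarity, the arithmetic behind that estimate (the 1.36x figure is the single-GPU FP32 speed-up quoted above; 1.75x is a typical two-GPU scaling factor):

```python
# 2x 1080 Ti vs 1x 2080 Ti, both expressed relative to a single 1080 Ti.
scaling_2x_1080ti = 1.75     # 2x 1080 Ti vs 1x 1080 Ti
speedup_2080ti = 1.36        # 1x 2080 Ti vs 1x 1080 Ti

print(scaling_2x_1080ti / speedup_2080ti)   # ~1.29, i.e. roughly 30% faster
```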
Not in theory, in practice: these are actual benchmarks. Your central argument is that prices within the first few weeks of a new product launch are higher than MSRP. That point will likely not hold true 30 days from now. It makes perfect sense to compare the products based on their MSRPs.
The 1080Ti is back at MSRP and the cryptocurrency demand has vanished, so what's left is the demand displaced by the high prices earlier.
In the past, it has been common that new hardware with low initial supply fetched premiums that lasted until supply met demand. It's not dishonest to assume that it's the same case here.
Exactly. We're sort of between a rock and a hard place in terms of choosing a price.
While we could write software that pulls today's market prices and updates the tables and graphs dynamically, we decided to settle on a single number. If you're settling on a single number, the choice is either the market price on the date of publication or MSRP. Given that other GPUs tend toward their MSRP as time goes on, we chose MSRP.
Your article still states that the 2080 Ti is 25% more expensive, but lists the 1080 Ti at $700 and the 2080 Ti at $1,200 (which are the correct prices from the Nvidia website), which is 70% more. Which one is right?
MSRP may be $999, but you won’t find them for much below $1,200 for the time being.
GPU modules are manufactured in China. Their harmonized codes are covered in recently established tariffs. 10% tariffs are already hitting cards arriving at US ports. This tariff will increase to 25% on Jan 1.
You mean it MAY not hold true 30 days from now. Last year around this time the cryptocurrencies started spiking. Perhaps demand for graphics cards will go up again, and this time, maybe the demand for the 1080 Tis won't match the demand for the newer, shinier 2080 Tis.
Nvidia is destroying the TPUs right now, and Google is desperate to keep up its public image as the king of AI (which, tbf, they probably are, compute capabilities aside).
You should try the assumption of good faith some time. It generally leads to a happier place.
In this case the data was from the time Google introduced the TPU internally, when the K80 was very much up-to-date. It also makes sense because the K80 was the only GPU offered in GCP.
TPUv1 was never exposed externally as a thing you could rent in cloud, but with that exception, Google's been pretty consistently trying to offer the best available option to cloud customers.
(disclaimer: while I'm part-time at Google, this is my personal impression, not an official statement, etc., etc.)
They probably feel TPUv3 is stable enough to give cloud customers access now. But I doubt they don't already have TPUv4 versions running internally in some capacity, maybe only for testing.
4x TPUv2 is roughly equal to 4x V100 performance wise.
The price/performance ratio of a rented TPUv2 or V100 can't match the price/performance ratio of owning the system if you are doing lots of training/inference.
If the model fits inside a 2080 Ti and the work is not tightly time-restricted, the 2080 Ti (the whole $2.5k system) should be the more economical choice after six months or less (at full 24/7 utilization).
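A rough break-even sketch (the ~$3/hour figure is my assumption for on-demand single-V100 pricing; substitute your own rates):

```python
# Owning a ~$2.5k 2080 Ti system vs renting a single cloud V100.
system_cost = 2500.0          # whole 2080 Ti workstation
cloud_rate_per_hour = 3.00    # assumed on-demand V100 price
utilization = 1.0             # 24/7

hours_to_break_even = system_cost / (cloud_rate_per_hour * utilization)
print(f"{hours_to_break_even:.0f} hours ≈ {hours_to_break_even / 24 / 30:.1f} months of 24/7 use")
# Cheaper preemptible rates or partial utilization stretch the break-even out,
# but under these assumptions it lands comfortably inside the six-month window.
```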
You really can't do a direct comparison, but we can look at the cost of doing a similar task.
DAWNBench does benchmarks.
"At the time DAWNBench contest closed on April 2018, the lowest training cost by non-TPU processors was $72.40 (for training ResNet-50 at 93% accuracy with ImageNet using spot instance). With Cloud TPU v2 pre-emptible pricing, you can finish the same training at $12.87. It's less than 1/5th of non-TPU cost. "
This is distracting from the original topic. It's literally picking on the analogy for no other reason than to show its absurdity, and has nothing to do with the RTX 2080 Ti.
I feel like he answered that question directly, by comparing it to an ostentatious computer. ;)
“And if you think I'm going overboard with the Porsche analogy, you can buy a DGX-1 8x V100 for $120,000 or a Lambda Blade 8x 2080 Ti for $28,000 and have enough left over for a real Porsche 911. Your pick.“
Compared to e.g. a Lamborghini, it's not nearly as loud in appearance and can almost pass for a normal car in a grocery store parking lot, as long as it's not fire-engine red or something. But I think this analogy breaks down a little when applied to graphics cards.
> Compared to e.g. a Lamborghini, it's not nearly as loud in appearance and can almost pass for a normal car in a grocery store parking lot
Kinda depends on which Lamborghini; an Urus probably does a better job at passing for normal in a grocery store parking lot (and is better for actually carrying groceries) than a 911.
To extend the analogy, things are going to get even more interesting when this kind of hardware hits Honda Civic Si or VW GTI pricing levels (i.e. 25-30% of the price of the Porsche), but with the same tiny performance delta.