Lower in this thread, bitL pointed out that the prices we used in our analysis are not exactly in line with current market prices. bitL hit the nail on the head in terms of the biggest weakness of our post: choosing the price. So, we've decided to make the spreadsheet that generated our graphs and (performance / $) tables public. You can view it here: https://docs.google.com/spreadsheets/d/1La55B-AVHSv9LiQcs6GM...
You can copy that spreadsheet and insert whatever system price (in kilodollars) you want into B15:F15. Hope this makes everybody's decision making easier.
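If you'd rather script it than edit the spreadsheet, the calculation it performs is trivial. Here's a minimal Python sketch with made-up throughput numbers; substitute your own benchmark results and system prices:

```python
# Minimal sketch of the (performance / $) calculation the spreadsheet does.
# The throughput numbers below are placeholders, not our measured results.

throughput_img_per_sec = {        # hypothetical training throughput
    "1080 Ti system": 210.0,
    "2080 Ti system": 290.0,
}

system_price_kilodollars = {      # edit these just like cells B15:F15
    "1080 Ti system": 2.0,
    "2080 Ti system": 2.5,
}

for name, images_per_sec in throughput_img_per_sec.items():
    dollars = system_price_kilodollars[name] * 1000
    print(f"{name}: {images_per_sec / dollars:.3f} images/sec per dollar")
```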
As a system builder and AI research company, we're trying to make benchmarks that are scientific and reproducible, correlate with real-world training scenarios, and use accurate prices.
The fp16 results versus the 1080 Ti are somewhat surprising. The author specifically pointed out that they are using tensor cores where possible. I would have expected fp16 to be more than 100% faster than the 1080 Ti if they were using tensor cores. Can anyone explain that?
Not everything in the compute pipeline is going to be converted to fp16 operations. Any time you are doing accumulation or exponentials, you have to keep it in fp32.
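As a toy illustration of the accumulation point (made-up numbers, nothing to do with any particular benchmark), here's what happens when the accumulator itself is kept in fp16 versus fp32:

```python
import numpy as np

# fp16 storage is fine, but if the running sum is also rounded to fp16 after
# every add, small contributions vanish once the total grows.

x = np.full(10_000, 0.01, dtype=np.float16)

acc16 = np.float16(0.0)
for v in x:
    acc16 = np.float16(acc16 + v)   # fp16 accumulator: rounds after every add

acc32 = np.float32(0.0)
for v in x:
    acc32 += np.float32(v)          # fp32 accumulator, fp16 storage

print(acc16)   # stalls around 32, far from the true ~100
print(acc32)   # ~100
```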
Right, but the tensor cores should be about 10x faster on the compute side, and about 2x the memory bandwidth. GEMM is usually constrained by compute, which is why the tensor cores exist.
These benchmarks are for training, so the expectation is that they are running them in fp16 all the way through. Also, tensor cores can accumulate in fp32 registers with a slight hit to performance.
Is there a benchmark you've seen that matches the claimed 10x increase in performance for the Tensor Cores? The NVIDIA hype train can sometimes make it difficult to find hard numbers.
I've done my own benchmarks where I've hit over 100 TFLOPS on the V100, and that's about 85% of its peak theoretical throughput. Granted, the matrix size needs to be large enough, but it's definitely doable. Anandtech also showed similar results in their V100 review. I haven't yet seen a comparable GEMM benchmark done on the 2080 Ti, so I don't know how it'll compare. I have some coming in at the end of the month though, so I should know soon.
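For reference, here's roughly how one can measure it; this is just a PyTorch sketch, not my exact harness, and it assumes the commonly quoted 125 TFLOPS FP16 tensor-core peak for the V100:

```python
import time
import torch

# A matmul of two n x n matrices costs ~2 * n^3 FLOPs.
n = 8192
a = torch.randn(n, n, device="cuda", dtype=torch.float16)
b = torch.randn(n, n, device="cuda", dtype=torch.float16)

for _ in range(3):                      # warm-up
    torch.matmul(a, b)
torch.cuda.synchronize()

iters = 10
start = time.time()
for _ in range(iters):
    torch.matmul(a, b)
torch.cuda.synchronize()
elapsed = (time.time() - start) / iters

tflops = 2 * n ** 3 / elapsed / 1e12
print(f"~{tflops:.1f} TFLOPS ({tflops / 125 * 100:.0f}% of the V100's 125 TFLOPS FP16 peak)")
```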
If you can write a better optimized network, go ahead. But like in SSE2 vs AVX2 vs AVX512 benchmarks, FP performance on paper doesn't always translate into better real world FP performance. Now if Nvidia had switched to HBM2 like Google TPU2, it might be different.
Actually, AMD GPUs are the most cost-effective GPUs and have supported full-rate FP16/8 for two years.
But people are locked into the Nvidia proprietary jail and no one seems to care...
That is really sad and against consumer interest, but also, deep learning will never become mainstream if it can only run on 5% of the world's hardware (Nvidia).
> Actually AMD gpus are the most cost effective gpus.
No, AMD GPUs are not cost effective at all, because TensorFlow does not support AMD GPUs.
> But people are locked in the Nvidia proprietary jail and no one seems to care...
Sounds like you want to blame the users, but this is because Nvidia has invested heavily in GPGPU and CUDA for more than 10 years, while AMD focused on other things like HSA. It is AMD's fault.
I agree with kbumsik here. AMD only has themselves to blame. They have great hardware and fantastic theoretical benchmarks. Heck, even their SGEMMs are really fast and in line with the 15TFlops of FP32 on the VEGA 64s that we've benchmarked. However, it comes down to software ecosystem and optimizations for common deep learning subroutines.
MIOpen[1] is a step in this direction but still causes the VEGA 64 + MIOpen to be 60% of the performance of a 1080 Ti + CuDNN based on benchmarks we've conducted internally at Lambda. Let that soak in for a second: the VEGA 64 (15TFLOPS theoretical peak) is 0.6x of a 1080 Ti (11.3TFLOPS theoretical peak). MIOpen is very far behind CuDNN.
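To put those two numbers side by side, a quick back-of-the-envelope calculation:

```python
# If Vega 64 + MIOpen delivers 0.6x the throughput of a 1080 Ti + CuDNN despite
# a higher theoretical peak, the software stack is extracting a much smaller
# fraction of the hardware.

vega_peak, gtx1080ti_peak = 15.0, 11.3   # TFLOPS, FP32 theoretical
relative_throughput = 0.6                # Vega 64 vs 1080 Ti, measured

relative_efficiency = relative_throughput * gtx1080ti_peak / vega_peak
print(f"Vega 64 runs at ~{relative_efficiency:.0%} of the 1080 Ti's efficiency")
# ~45%: per FLOP of peak, MIOpen delivers less than half of what CuDNN does.
```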
Lisa Su, if you're reading this, please give the ROCm team more budget!
If AMD developed CUDA transpilers / DL backends / CUDNN equivalents people would switch over in a heartbeat. It's fully their fault for not investing in the software.
If they did that and had a card that got 2x performance / $ or more I would switch in a heartbeat.
The quality and open source nature of their tools have resulted in much of my research group (real-time vision) increasingly and voluntarily moving to work on AMD platforms (we were previously almost exclusively using CUDA).
> Also, AMD does not limit FP performance on consumer cards.
I don't understand this meme. The consumer cards are different chips with slow fp64 hardware. In what sense is that "limiting" performance relative to the enterprise cards?
On AMD, if I'm not mistaken, FP16 is twice the rate of FP32, as it should be.
"For their consumer cards, NVIDIA has severely limited FP16 CUDA performance. GTX 1080’s FP16 instruction rate is 1/128th its FP32 instruction rate, or after you factor in vec2 packing, the resulting theoretical performance (in FLOPs) is 1/64th the FP32 rate, or about 138 GFLOPs."
I agree with your broader point, but stating that Nvidia comprises a mere 5% of the world's hardware is disingenuous in a post where you've only talked about GPUs. In actual fact, Nvidia has the overwhelming majority of the high-end GPU market, especially in HPC and ML. And as much as I dislike the proprietary lock-in, they have by far the superior software ecosystem.
Actually, Nvidia isn't the high-end leader if you count console APUs (almost all of which are from AMD). And most GPUs in the world are iGPUs, APUs, or ARM SoCs. Discrete GPUs are a small market; high-end discrete GPUs are a few percent at most.
Deep learning generally doesn't run on AMD GPUs, but it runs on a lot more than only Nvidia GPUs.
If we're only counting model training, it runs on CPUs, Google's TPUs, FPGAs, whatever other secret datacenter ASICs are out there, various DL-specific mobile chips, etc.
Way more than 5% of the world's hardware can run inference with deep neural nets, which is the important thing for mass adoption, and definitely more than only Nvidia GPUs can run training.
You started your comment with an objective falsehood. NVIDIA has the most cost-effective GPUs as long as CUDA is king and AMD doesn't invest more on the software front.
> 37% faster than the 1080 Ti with FP32, 62% faster with FP16, and 25% more expensive.
Maybe in theory, but you can get a new 1080 Ti for $700 while a 2080 Ti is impossible to get under $1,200, which makes it about 70% more expensive. At that point, 2x 1080 Ti sounds way better than 1x 2080 Ti for deep learning to me (up to 22 TFLOPS and 22 GB of RAM).
Hey bitL, thanks a lot for pointing this out. I think you hit the nail on the head in terms of the biggest weakness of our post: choosing the price. We've decided to make the spreadsheet that generated our graphs and (performance / $) tables public. You can view it here: https://docs.google.com/spreadsheets/d/1La55B-AVHSv9LiQcs6GM...
You can copy that spreadsheet and insert whatever system price (in kilodollars) you want into B15:F15. Hope this makes everybody's decision making easier.
As a system builder and AI research company, we're trying to make benchmarks that are scientific and reproducible, correlate with real-world training scenarios, and use accurate prices.
Yes. 1080 Tis do not scale linearly, but in most of the benchmarks I've seen, two 1080 Tis are at least 1.75x faster than one when doing multi-GPU training.
1.75 / 1.36 (the speed-up of a 2080 Ti over a single 1080 Ti) ≈ 1.29. So expect 2x 1080 Ti to be about 30% faster.
You can see how multi-GPU training scales in the Titan V benchmarks in the link below; 1080 Tis have a similar scaling profile.
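For clarity, the arithmetic behind that estimate (the 1.36x figure is the single-GPU FP32 speed-up quoted above; 1.75x is a typical two-GPU scaling factor):

```python
# 2x 1080 Ti vs 1x 2080 Ti, both expressed relative to a single 1080 Ti.
scaling_2x_1080ti = 1.75     # 2x 1080 Ti vs 1x 1080 Ti
speedup_2080ti = 1.36        # 1x 2080 Ti vs 1x 1080 Ti

print(scaling_2x_1080ti / speedup_2080ti)   # ~1.29, i.e. roughly 30% faster
```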
Not in theory, in practice: these are actual benchmarks. Your central argument is that prices within the first few weeks of a new product launch are higher than MSRP. That point will likely not hold true 30 days from now. It makes perfect sense to compare the products based on their MSRPs.
The 1080Ti is back at MSRP and the cryptocurrency demand has vanished, so what's left is the demand displaced by the high prices earlier.
In the past, it has been common that new hardware with low initial supply fetched premiums that lasted until supply met demand. It's not dishonest to assume that it's the same case here.
Exactly. We're sort of between a rock and a hard place in terms of choosing a price.
While we could write software that pulls today's market prices and updates the tables and graphs dynamically, we decided to settle on a single number. If you're settling on a single number, the choice is either the market price on the date of publication or MSRP. Given that other GPUs tend toward their MSRP as time goes on, we chose MSRP.
Your article still states that the 2080 Ti is 25% more expensive, but lists the 1080 Ti at $700 and the 2080 Ti at $1,200 (which are the correct prices from the Nvidia website), which is 70% more. Which one is right?
MSRP may be $999, but you won’t find them for much below $1,200 for the time being.
GPU modules are manufactured in China. Their harmonized codes are covered in recently established tariffs. 10% tariffs are already hitting cards arriving at US ports. This tariff will increase to 25% on Jan 1.
You mean it MAY not hold true 30 days from now. Last year around this time the cryptocurrencies started spiking. Perhaps demand for graphics cards will go up again, and this time, maybe the demand for the 1080 Tis won't match the demand for the newer, shinier 2080 Tis.
Nvidia is destroying the TPUs right now, and Google is desperate to keep up its public image as the king of AI (which, tbf, they probably are, compute capabilities aside).
You should try the assumption of good faith some time. It generally leads to a happier place.
In this case the data was from the time Google introduced the TPU internally, when the K80 was very much up-to-date. It also makes sense because the K80 was the only GPU offered in GCP.
TPUv1 was never exposed externally as a thing you could rent in cloud, but with that exception, Google's been pretty consistently trying to offer the best available option to cloud customers.
(disclaimer: while I'm part-time at Google, this is my personal impression, not an official statement, etc., etc.)
They probably feel TPUv3 is stable enough to give cloud customers access now. But I doubt they don't already have TPUv4 versions running internally in some capacity, maybe only for testing.
4x TPUv2 is roughly equal to 4x V100 performance wise.
The price/performance ratio of a rented TPUv2 or V100 can't match the price/performance ratio of owning the system if you are doing lots of training/inference.
If the model fits inside a 2080 Ti and the work is not tightly time-restricted, the 2080 Ti (the whole $2.5k system) should be the more economical choice after six months or less (at full 24/7 utilization).
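A rough break-even sketch (the ~$3/hour figure is my assumption for on-demand single-V100 pricing; substitute your own rates):

```python
# Owning a ~$2.5k 2080 Ti system vs renting a single cloud V100.
system_cost = 2500.0          # whole 2080 Ti workstation
cloud_rate_per_hour = 3.00    # assumed on-demand V100 price
utilization = 1.0             # 24/7

hours_to_break_even = system_cost / (cloud_rate_per_hour * utilization)
print(f"{hours_to_break_even:.0f} hours ≈ {hours_to_break_even / 24 / 30:.1f} months of 24/7 use")
# Cheaper preemptible rates or partial utilization stretch the break-even out,
# but under these assumptions it lands comfortably inside the six-month window.
```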
You really can't do a direct comparison, but we can look at the cost of doing a similar task.
DAWNBench does benchmarks.
"At the time DAWNBench contest closed on April 2018, the lowest training cost by non-TPU processors was $72.40 (for training ResNet-50 at 93% accuracy with ImageNet using spot instance). With Cloud TPU v2 pre-emptible pricing, you can finish the same training at $12.87. It's less than 1/5th of non-TPU cost. "
This is distracting from the original topic. It's literally picking on the analogy for no other reason than to show its absurdity, and has nothing to do with the RTX 2080 Ti.
I feel like he answered that question directly, by comparing it to an ostentatious computer. ;)
“And if you think I'm going overboard with the Porsche analogy, you can buy a DGX-1 8x V100 for $120,000 or a Lambda Blade 8x 2080 Ti for $28,000 and have enough left over for a real Porsche 911. Your pick.“
Compared to e.g. a Lamborghini, it's not nearly as loud in appearance and can almost pass for a normal car in a grocery store parking lot, as long as it's not fire-engine red or something. But I think this analogy breaks down a little when applied to graphics cards.
> Compared to e.g. a Lamborghini, it's not nearly as loud in appearance and can almost pass for a normal car in a grocery store parking lot
Kinda depends on which Lamborghini; an Urus probably does a better job at passing for normal in a grocery store parking lot (and is better for actually carrying groceries) than a 911.
To extend the analogy, things are going to get even more interesting when this kind of hardware hits Honda Civic Si or VW GTI pricing levels (i.e. 25-30% of the price of the Porsche), but with the same tiny performance delta.