
It's funny when you then read the latest Nvidia tweet [1] suggesting that their tech is still better, based on pure vibes, like everything else in the (Gen)AI era.

[1] https://x.com/nvidianewsroom/status/1993364210948936055



Not vibes. TPUs have fallen behind or had to be redesigned from scratch many times as neural architectures and workloads evolved, whereas the more general purpose GPUs kept on trucking and building on their prior investments. There's a good reason so much research is done on Nvidia clusters and not TPU clusters. TPU has often turned out to be over-specialized and Nvidia are pointing that out.


You say that like it's a bad thing. Nvidia architectures keep changing and getting more advanced as well, with specialized tensor operations, different accumulators and caches, etc. I see no issue with progress.


That’s missing the point. Things like tensor cores were added in parallel with improvements to the existing compute pipeline, and CUDA kernels from 10 years ago generally run without modification. The hardware architecture may change, but Nvidia has largely avoided changing how you interact with it.
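A minimal sketch of what that looks like in practice (illustrative, not taken from any particular 10-year-old codebase): a plain kernel like this, written in the style of early CUDA samples, still compiles and runs unchanged on current hardware; only the -arch flag you pass to nvcc moves.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Plain scalar kernel (SAXPY), CUDA circa a decade ago.
    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
        cudaDeviceSynchronize();
        printf("y[0] = %f\n", y[0]);  // expect 4.0
        cudaFree(x); cudaFree(y);
        return 0;
    }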


Modern CUDA programs that hit roofline look absolutely nothing like those from 10 or even 5 years ago. Or even 2 if you’re on Blackwell.


They don't have to; CUDA is a high-level API in this respect. The hardware will conform to the demands of the market, and the software will support whatever the compute capability defines. Nvidia is clearer than most about this.
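A rough sketch of what "whatever the compute capability defines" means in source code (the thresholds and values here are purely illustrative): the same kernel can branch on the architecture it is being compiled for, and nvcc's -gencode flags decide which architectures get native code.

    // Built e.g. with:
    //   nvcc -gencode arch=compute_70,code=sm_70 \
    //        -gencode arch=compute_90,code=sm_90 example.cu
    __global__ void arch_probe(int* out) {
    #if __CUDA_ARCH__ >= 900
        // Path that assumes Hopper-or-newer features would go here.
        out[threadIdx.x] = 90;
    #elif __CUDA_ARCH__ >= 700
        // Path for Volta through Ada.
        out[threadIdx.x] = 70;
    #else
        // Fallback for older compute capabilities.
        out[threadIdx.x] = 0;
    #endif
    }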


But for research you often don't have to max out the hardware right away.

And the question is what do programs that max out Ironwood look like vs TPU programs written 5 years ago?


Sure, but you do have to do it pretty quickly. Let’s pick an H100. You’ve probably heard that just writing scalar code is leaving 90+% of the FLOPS idle. But even past that, if you’re using the tensor cores with the wrong instructions, you’re basically capped at 300-400 TFLOPS of the ~1000 the hardware supports. If you’re using the new instructions but poorly, you’re probably not going to hit even 500 TFLOPS. That’s just barely better than the previous generation you paid a bunch of money to replace.
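For a sense of what "tensor cores with the wrong instructions" looks like (a hedged sketch, not a tuned kernel): the warp-level wmma API below was the standard way to program tensor cores on Volta/Ampere, and it still runs on an H100, but on Hopper the peak numbers need the newer warp-group MMA path you normally get via cuBLAS or CUTLASS rather than hand-written wmma.

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // One warp computes one 16x16 tile of C = A*B, FP16 in, FP32 accumulate.
    // A: row-major MxK, B: col-major KxN, C: row-major MxN; M, N, K multiples of 16.
    // Launch with blockDim.x a multiple of 32 so each warp has a uniform tile index.
    __global__ void wmma_gemm_tile(const half* A, const half* B, float* C,
                                   int M, int N, int K) {
        int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;  // tile row
        int warpN = blockIdx.y * blockDim.y + threadIdx.y;               // tile col
        if (warpM * 16 >= M || warpN * 16 >= N) return;

        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
        wmma::fill_fragment(acc, 0.0f);

        for (int k = 0; k < K; k += 16) {
            // Pointers to the top-left element of each 16x16 input tile.
            wmma::load_matrix_sync(aFrag, A + warpM * 16 * K + k, K);
            wmma::load_matrix_sync(bFrag, B + warpN * 16 * K + k, K);
            wmma::mma_sync(acc, aFrag, bFrag, acc);
        }
        wmma::store_matrix_sync(C + warpM * 16 * N + warpN * 16, acc, N,
                                wmma::mem_row_major);
    }

Even launched well, a kernel in this style lands far short of roofline on Hopper without the async copy and warp-group machinery, which is the point about instruction choice above.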


And yet current versions of Whisper GPU will not run on my not-quite-10-year-old Pascal GPU anymore because the hardware's CUDA compute capability is too old.

Just because it's still called CUDA doesn't mean it's portable over a not-that-long timeframe.
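For context on what "too old" means mechanically (a sketch with an illustrative threshold, not Whisper's actual requirement): Pascal reports compute capability 6.x, and a build can simply refuse anything below the floor it was compiled and tested for.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("GPU: %s, compute capability %d.%d\n",
               prop.name, prop.major, prop.minor);
        // Illustrative floor: many modern builds require Volta (7.0) or newer.
        if (prop.major < 7) {
            fprintf(stderr, "This build needs compute capability 7.0+.\n");
            return 1;
        }
        return 0;
    }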


Portable doesn't normally mean that it runs on arbitrarily old hardware. CUDA was never portable, it only runs on Nvidia hardware. The question is whether old versions of Whisper GPU run on newer hardware, that'd be backwards compatibility.


> There's a good reason so much research is done on Nvidia clusters and not TPU clusters.

You are aware that Gemini was trained on TPU, and that most research at Deepmind is done on TPU?


> based on pure vibes

The tweet gives their justification: CUDA isn't an ASIC. Nvidia GPUs were popular for crypto mining, protein folding, and now AI inference too. TPUs are tensor ASICs.

FWIW I'm inclined to agree with Nvidia here. Scaling up a systolic array is impressive but nothing new.


Sure, but their company's $4.3 trillion valuation isn't based on how good their GPUs are for general-purpose computing; it's based on how good they are at AI.


> NVIDIA is a generation ahead of the industry

a generation is 6 months


For GPUs a generation is 1-2 years.



What in that article makes you think a generation is shorter?

* Turing: September 2018

* Ampere: May 2020

* Hopper: March 2022

* Lovelace (designed to work with Hopper): October 2022

* Blackwell: November 2024

* Next: December 2025 or later

With a single exception for Lovelace (arguably not a generation), there are multiple years between generations.



