Nvidia Pascal GPU to Feature 17B Transistors and 32GB HBM2 VRAM (wccftech.com)
134 points by cma on July 24, 2015 | 94 comments


> With 8Gb per DRAM die and 2 Gbps speed per pin, we get approximately 256 GB/s bandwidth per HBM2 stack. With four stacks in total, we will get 1 TB/s bandwidth on NVIDIA’s GP100 flagship Pascal which is twice compared to the 512 GB/s on AMD’s Fiji cards and three times that of the 980 Ti’s 334GB/s.
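
To spell out the arithmetic in that quote, here's a quick sanity check in Python. One assumption not stated in the quote: each HBM2 stack uses a 1024-bit interface.

    # Sanity-checking the quoted bandwidth numbers.
    pins_per_stack = 1024   # assumed 1024-bit interface per HBM2 stack
    gbps_per_pin = 2
    stacks = 4

    gb_per_s_per_stack = pins_per_stack * gbps_per_pin / 8   # bits -> bytes
    total_gb_per_s = gb_per_s_per_stack * stacks
    print(gb_per_s_per_stack, total_gb_per_s)  # 256.0 GB/s per stack, 1024.0 GB/s (~1 TB/s) total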

> The Pascal GPU would also introduce NVLINK which is the next generation Unified Virtual Memory link with Gen 2.0 Cache coherency features and 5 – 12 times the bandwidth of a regular PCIe connection. This will solve many of the bandwidth issues that high performance GPUs currently face.

To point out the obvious, this sounds like it could be fantastic for deep learning. Not only is the RAM big enough to hold a lot of current datasets, it'll also help alleviate the latency bottlenecks in shuttling data and parameter updates back and forth.


Yup, this is exactly the marketing angle Nvidia's CEO was using in his GPU Technology Conference presentation (a 10x speed-up on deep learning vs. their current Maxwell architecture):

http://blogs.nvidia.com/blog/2015/03/17/pascal/


This also seems to be the motivation behind the already available Titan X with its 12GB of RAM. Some of the gaming hardware reviewers are scratching their heads as to why anyone would need 12GB attached to one die. I chuckled when I saw those reviews that were totally oblivious to the deep learning applications.


The community as a whole is completely oblivious. It's pretty funny to see the youtube reviewers get all worked up over how nvidia and amd are going at it again and such. As if gaming is what's driving this battle. Anyone working in computer science knows the battle is over machine learning, not first person shooters.

Soon headless, socketed solutions will be the preferred form factor for HPC. I imagine the desktop and server product lines will diverge at that point. It'll be interesting to see what happens to PC gaming then.


The gaming hardware junky community is aware, if only as a result of most Titan Black/Z reviews arriving at the conclusion, "a gamer's money is better spent elsewhere, as all a Titan really provides over the latest consumer offerings is good workstation performance (double precision floating point)."

For reference:

Titan Z DP FP: 2707 GFLOPS

ATI 5870 (released Sept 2009): 544 GFLOPS

Titan X (current generation; NVidia recognized that gamers don't value DP performance): 192 GFLOPS

I sometimes wonder how it came to be that the Titan Z preceded the Titan X...

E: to clarify, review sites are aware that STEM people need double precision and 6/12GiB GPU memory for something


It's easy to get lost in the noise of the "enthusiast" webring with their semiliterate "benchmarks" and overly paginated "reviews". They punch well above their actual purchasing weight and thus think they're a lot more important than they actually are.


Versus the data scientists and HPC communities that buy hundreds of Nvidia GPUs for CUDA and associated applications.


Well right, but compared to the number of units bought by IT departments, the "enthusiast" market is nothing but decimal dust.


IT departments are not buying Titans and they are not going to be buying Pascals.

IT departments aren't even buying Nvidia at all, they are buying Intel integrated.


That was my point. If all you do is read what passes for news on gamer sites, you might think that the discrete GPU overclocking market was important.


The notion that only the biggest of something is "important" is offensive and wrong.


Sure. I'm a fan of esoteric hardware and software, and no big believer in the religious ideology that market success equals virtue, but I'm also not the one invested in how well this memory stick performs overclocked in Gamer Game IV: Games.


Probably, but remember that if you're maximizing profit then one $100 margin customer is as important as 1000 $0.10 margin customers.


Not to mention that margin decreases linearly, not as a percentage. So those 1000 $0.10-margin customers can easily turn into -$0.50-margin customers.


The 1000 low-margin customers also come with the added benefit of diversification.


No, you are just totally oblivious to Nvidia's product segments.

The deep learning crowd and those who want massive amounts of RAM buy Teslas, which have exactly that. The Titan X is a gaming card and it's marketed as such.


We do SP (single-precision) GPU development requiring large memory, and the Titan X features heavily in nVidia's marketing towards us. Tesla is nowhere to be found.


One note on NVLink: to get that you will, AFAIK, have to buy a POWER8 system. Nvidia would need cooperation from Intel to get support for this on Intel's architecture - and Intel so far likes to keep them on a slow bus.


Yes... with the current deep dreaming fad, this could attract quite a few hobbyists eager to get near-real-time video processing through large neural networks.


The current deep dreaming fad is going to last about one more week before everyone gets sick of looking at those stupid pictures. They really serve no purpose other than "look how crazy this stuff is" - deep learning is cray-cray.


It's not over until it shows up in a few mainstream media pieces.

Maybe some music videos, or the intro of an HBO show..

Until then, enjoy your eyes and dog faces on family pictures.


any thoughts on why all the current pix look like dog faces? Is that peculiar to the training set?

Also I think there's a bit more to it than just ZOMGtrippy. The whole project points out that "if you're looking for it, you'll find it." The system amplifies noise until it sees something familiar. Understanding that humans do this too (to a point) is a critical step in comprehending and discussing the world, opinions, and people. But yeah, it will seem rather "iconic" and dated 10 years from now.


The reason you mostly see dog faces is just that everyone seems to be running the default code Google posted. It optimizes layer 4c, which seems to be biased toward producing dog faces. Running the code on other layers, or on other networks, produces other things. They also recently posted code where you can provide a "guide" image to steer it toward producing features that are in that image.
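
For anyone curious what "optimizes layer 4c" means in practice, here's a rough present-day sketch of the idea: gradient ascent on the input image to maximize a chosen layer's activations. This assumes PyTorch/torchvision rather than the original Caffe code, and the layer choice, learning rate, and step count are just illustrative.

    import torch
    import torchvision.models as models

    model = models.googlenet(weights=models.GoogLeNet_Weights.DEFAULT).eval()

    activations = {}
    def hook(_module, _inp, out):
        activations["layer"] = out

    # Hook the layer to "dream" on; inception4c is the dog-face-prone default.
    # Hooking a different layer produces different kinds of features.
    model.inception4c.register_forward_hook(hook)

    # Start from noise (or load a photo here instead).
    img = torch.rand(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([img], lr=0.05)

    for _ in range(100):
        opt.zero_grad()
        model(img)
        loss = -activations["layer"].norm()  # ascend: make the layer respond strongly
        loss.backward()
        opt.step()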

I'm thinking that using CNNs to generate or change images could be a huge game-changing technology. It could potentially be bigger for Hollywood than 3D graphics. In 10 years, we could see this stuff everywhere, albeit in a more developed state. It would be used to actually transform or enhance images in a more directed way.


AMD get it together, pretty please. Nobody wants to live in a world where only NVidia produces discrete GPUs!

That 32GB seems too high for the cost-constrained consumer market though - maybe they will have a leaner variant for desktops/gaming.


It could be argued that AMD does have it together. If you look at the benchmarks for pretty much any particular price point, AMD & nVidia cards are roughly comparable. AMD is better at some games and resolutions, and nVidia is better at others. Rarely is the difference more than 10%. nVidia used to have a sizable power/noise advantage, but Fury appears to close that gap. AMD even seems to be closing the driver stability gap.

But the Internet effect on the video card market is huge. There are hundreds of online benchmarks, huge fanboy communities, et cetera. The benchmarks magnify small differences causing a winner-take-all effect. The communities create a bandwagon effect. It's a wonder that AMD maintains the share that it does.

Yes, AMD is obviously #2, but it should be a close #2. In most markets close #2 is not a bad position to be in.


As someone who has bought AMD cards for the past 5 years for gaming, I switched over to Nvidia this time.

You've overlooked that Maxwell / 980 Ti has a ton of overclocking headroom. OC to OC, you're looking at 20-30% performance difference at more commonly used resolutions like 1440p and 1080p. The gap narrows only at 4K.

Voltage is locked on Fury right now and AMD isn't saying why. The best OC I've seen on one is 10% higher clocks, with Maxwell 30% is common. Huge difference considering both cards are in the same price bracket.

Lastly, pricing is also an issue with the Fury cards. Usually AMD undercuts Nvidia, but this time the flagship matches the 980 Ti in cost. As the underdog, AMD is going to have a tougher time winning people over at the same price point. This happens in any market.

I like to support the underdog, but I couldn't this time given the large disparity in OC performance.


You're illustrating my point. You're OC'ing the cards (which the vast majority of consumers don't do), but even after taking that step there's only a 20-30% gap (and less at the more important resolution of 4K). That's a significant difference, but it's not huge. But it's winner-takes-all -- there's no reason to buy even a slightly inferior card.


Out of the box they tend not to be that different, hardware-wise.

I didn't care either way until NVidia pulled the HairWorks stunt, which was a pretty controversial move.


Gotta say though, in terms of driver stability and the user experience of related software, NVIDIA has the edge.


Not sure about their graphics division, but AMD has stopped innovating in the CPU department...


They are working on Zen - which is a ground-up rearchitecture. They have limited resources, so the existing designs (Opterons and APUs) are only receiving minor incremental updates until Zen lands. Zen is expected in 2016; AMD's continued existence as a serious CPU maker depends on it.


Good to know, I haven't followed the CPU news lately

Zen definitely looks interesting, let's see if it's as big a success as the K7 architecture

> AMD's continued existence as a serious CPU maker depends on it

I agree


So has Intel. Skylake (2016) and Sandy Bridge (2010) have roughly comparable single-core desktop benchmarks. Power/performance ratios have improved drastically, but that's mostly because of process improvements. AMD has been stuck on 28nm for a long time, but that's not AMD's fault.


> Skylake (2016) and Sandy Bridge (2010) have roughly comparable single core desktop benchmarks.

That's not quite true -- Intel aims for a big IPC (instructions/clock) improvement for each "tock" generation (Nehalem -> Sandy Bridge -> Haswell -> Skylake), and IIRC has pretty much delivered. Some benchmarks are really hard to push because they're memory/cache-miss bound (so it's really just about throwing in more memory channels and clocking them up), but a lot of things have gotten seriously better for tricky integer code, especially in e.g. branch prediction/uop cache in the frontend and available execution ports in the backend.


http://arstechnica.com/gadgets/2015/07/intel-confirms-tick-t...

While the headline is hyperbole, the fact is that Intel has tacitly admitted that newer process nodes are taking more time to achieve.


> AMD has been stuck on 28nm for a long time, but that's not AMD's fault.

How isn't it?


AMD is a fabless chip designer (like nearly everyone except Intel and Samsung). They rely on foundries for manufacturing. AMD relies on GlobalFoundries (and NVidia relies on TSMC). Both have skipped their respective foundries' 20nm processes, since apparently the output didn't suit GPU performance requirements.


> AMD is a fabless chip designer (like nearly everyone except Intel and Samsung)

Okay, but they didn't used to be. They sold all their fabs. And when their reliance on third-party fabs meant they couldn't keep up with Intel, they did... nothing. Now, you could argue that they don't have the resources to do anything. But that doesn't mean it's not their fault that they didn't do something, it just means they couldn't.


I think the issue was they couldn't keep their own fabs running hot enough at a large enough scale with just the volume of silicon they were selling.

If the fab is sitting idle or production is below running cost, then they are in trouble. Better to spend that money on R&D.


Let's not forget that they had to sell the fabs due to Intel's anti-competitive market squeezing. AMD could have offered the chips for free and vendors would still have been better off taking the Intel kickbacks.

They were later compensated, but the damage done was incredibly destructive.


Is this a joke? AMD are currently pioneering the biggest shake-up in CPU architectures in decades through the Heterogeneous System Architecture.


Interesting. Is it different from integrated GPU solutions from Intel?


Yes. The two processors will eventually share the same memory space. But I am not wikipedia and I won't go into more detail here.


I've used http://www.videocardbenchmark.net/high_end_gpus.html and https://www.cpubenchmark.net/high_end_cpus.html for years.

It's somewhat asinine to reduce the complexity down to a single number, but I've found they're usually pretty reflective of what you find in practice.


AMD already has HBM cards out (Fury X and Fury) and already has a shipping card with 32 GB of memory (the S9170). If you are willing to work with OpenCL, the double-precision rating of AMD chips is very compelling (the W9100, which has been out for a while now, has a peak rating of 2.62 TFLOPS double precision!). I am looking forward to HBM-equipped FirePro cards where the double-precision support isn't gimped.


> 32GB seems too high for cost constrained consumer market

This thing is definitely not for the consumer market, much less the cost-constrained one. This is a small dedicated number-crunching machine that, for reasons unfathomable to me, can spit out rendered 3D environments with admirable speed.

I liked to joke that no serious computer has keyboard/mouse/video ports because no serious computer would be used like that. That assumption held well until the late 80's. ;-)

But no. For gaming, this is the superlative of overkill.


Funny you say 'AMD get it together', because it's AMD who designed HBM in the first place. Just like it was AMD who pioneered GDDR5 in graphics cards.

This article you just read is typical Nvidia "we are the best .. in 2 years, you just wait!!1". They produced the same charts, slides, and test results before releasing Tegra, Tegra 2, and Tegra 3. Every time it was supposed to revolutionize the industry and beat the competition; every time it released late and benched slower than products already on the market.


I meant AMD get it together financially; of course AMD is to be thanked for a lot of things, including AMD64!


A 4096-bit memory interface is surely supercomputer territory.

My ZX81 only had 8192 bits of memory (1KB).


And it could do 3D graphics. Sort of. 3D Monster Maze: https://www.youtube.com/watch?v=nKvd0zPfBE4


Nice - who needs 8K. 8. K. I haven't even updated to 4K. I don't think I have anything that runs 1080p.


Virtual reality would benefit a lot from 16000x16000 PER EYE. Rendered at 120+ frames per second with 10,000Hz eye tracking for foveated rendering (120fps may be too low for that though).

For the goal of full VR, computing tech has a long way to go
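
Some rough numbers on just how far, using the (hypothetical) figures above and assuming 4 bytes per pixel:

    # Raw output requirement for the VR target sketched above.
    width = height = 16000
    eyes, fps = 2, 120
    bytes_per_pixel = 4  # assumption: 32-bit RGBA

    pixels_per_second = width * height * eyes * fps
    write_bw_gb_s = pixels_per_second * bytes_per_pixel / 1e9
    print(pixels_per_second / 1e9, write_bw_gb_s)
    # ~61.4 gigapixels/s and ~246 GB/s just to write out the finished frames,
    # i.e. a sizable chunk of the 1 TB/s HBM2 figure before any shading at all.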


> The 17 Billion transistors on the Pascal GPU are twice the transistors found on the GM200 Maxwell and the Fiji XT GPU core which is literally insane

Seems like Moore's law is alive and well in the graphics/attached processor space.


That's because GPUs were frozen at the 28nm node since like 2011. It'll be ~4 years of no die shrinks at the top end of GPUs when they finally transition to 16nm. If anything, Moore's law is behind schedule in the GPU space.

Note that both NVidia and AMD rely on TSMC to manufacture their chips, so they're completely constrained by TSMC's ability to implement new process nodes.


2011: GTX 580, 1.5 TFLOPS

2012: GTX 680, 3.0 TFLOPS (~2.0 attainable)

2013: GTX Titan, 4.4 TFLOPS (~3.2 attainable)

2014: GTX 980, 4.6 TFLOPS

2015: GTX Titan X, 6.7 TFLOPS

Looks to me like they're doubling perf roughly every 2 years.

Meanwhile, my Core i7-5930k's SOL is <1/2 of 2011's GTX 580 at 672 GFLOPS and it still doesn't have fast approximate transcendentals. Skylake begins to fix this, but c'mon, GPUs have had these for almost a decade now...
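
For reference, that 672 GFLOPS figure is just the theoretical peak falling out of a simple calculation (core count and AVX2 FMA width are fixed by the design; I'm assuming the chip's 3.5 GHz base clock):

    # Theoretical single-precision peak for a Core i7-5930K (Haswell-E).
    cores = 6
    clock_ghz = 3.5                 # assumed base clock
    flops_per_cycle = 2 * 8 * 2     # 2 FMA ports x 8 SP lanes x 2 ops (mul+add)

    print(cores * clock_ghz * flops_per_cycle)   # 672.0 GFLOPS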


> Meanwhile, my Core i7-5930k's SOL is <1/2 of 2011's GTX 580 at 672 GFLOPS and it still doesn't have fast approximate transcendentals. Skylake begins to fix this, but c'mon, GPUs have had these for almost a decade now...

Give that GPU a highly serialized workload and watch the actual performance take a nosedive. There's not much of a reason to compare a sniper rifle to a carpet bomb.


Moore's law isn't about performance, it's about the number of transistors on a single IC. Perf is related but irrelevant to the discussion.


GTX 580: 3.0B transistors

GTX 680: 3.5B transistors

GTX Titan: 7.1B transistors

GTX 980: 5.2B transistors

GTX Titan X: 8B transistors

Core i7-5930k: 2.6B transistors

What the data above suggests to me is that relying solely on Moore's Law to predict performance is a fool's errand. Going forward, process transitions are obviously slowing down and IMO victory will go to those who make the best use of the available transistors. Just like programmers who make the best use of the caches and registers in these processors get dramatically better performance than those who can't be bothered to even think about such things.

Intel's business strategy of backwards compatibility is a giant albatross for them here, in that they spend a lot of transistors on it, but it's clearly profitable otherwise. In contrast, while GPUs are mostly backwards-compatible, they usually oops I meant nearly always oops I meant always need some refactoring to hit close to peak performance. But that usually leads to ~2x performance improvements per generation so far.

Whenever someone complains about having to do this, I ask them if they'd prefer hand-coded assembler inner loops for maximally exploiting SSE/SSE2/SSE3/SSE4/AVX2/AVX-512. Usually, I get some dismissive remark about leaving that to the compiler. Good luck with that plan IMO.


Just to nitpick, backwards compatibility isn't really a huge issue for Intel. Most of the really old stuff that's a pain to maintain can be shoved in microcode; compilers won't emit those instructions.

There are obvious downsides to the architecture, but the need to be backwards compatible shouldn't hurt it too much.

GPU workloads are very different in that generally you don't have to look particularly hard to find a bunch of parallelism that you can exploit (if you did, your code would run terribly); so you can generally gain a load of performance by just scaling up your design.

CPUs are super restricted by the single threaded, branching nature of the code you run on them, and this is what makes CPU performance a little more nuanced, and not directly comparable.


That's not really true; backwards compatibility on x86 architectures takes a tremendous amount of power and die space, and the 'throw it in microcode' solution only partially mitigates this issue.

A paper (http://www.ic.unicamp.br/~ra045840/cardoso2013wivosca.pdf) states that a mostly-microcode solution would still require 20% of the die area to be dedicated solely to microcode ROM.

I can't remember where I read it but something like 30+% of an Intel CPU die area/power consumption is due to the x86 ISA. Apparently the original Pentium CPU was 40% instruction decoding by die area. And the ISA has grown enormously since then.


"CPUs are super restricted by the single threaded, branching nature of the code you run on them, and this is what makes CPU performance a little more nuanced, and not directly comparable."

Ironically, to really hit peak performance of a modern AVX2 or later CPU, you have to embrace many of the design principles that lead to efficient GPU code:

1. Multiple threads per core to make use of the dual vector units introduced in Haswell

2. SIMD-like thinking to remap tasks into the 8-way and soon to be 16-way vector units

3. Running multiple threads across multiple cores

4. Micromanaging the L1 cache and treating the AVX/SSE registers as L0 cache

Where the CPU prevails is for fundamentally serial algorithms that cannot be mapped into a SIMD implementation. Mike Acton's Data-Oriented Design covers this case nicely IMO.
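
As a toy illustration of point 2 above (SIMD-like thinking), here's the same operation written element by element and as a whole-array expression. This is NumPy rather than AVX intrinsics, so take it as a sketch of the remapping mindset only; NumPy's vectorized ops do run on the CPU's vector units under the hood.

    import numpy as np

    a = 2.0
    x = np.random.rand(1 << 16).astype(np.float32)
    y = np.random.rand(1 << 16).astype(np.float32)

    # Scalar thinking: one element at a time.
    out_scalar = np.empty_like(x)
    for i in range(x.size):
        out_scalar[i] = a * x[i] + y[i]

    # SIMD-like thinking: express the whole-array op and let it map onto
    # packed multiply-add hardware (8 floats per AVX2 instruction).
    out_vec = a * x + y

    assert np.allclose(out_scalar, out_vec)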


My mistake, I don't follow it very closely anymore, so I made my statement with only 2 datapoints :). Have an upvote.


Moore's law is also about cost per transistor. It has risen, so expect to see a very expensive card. Most of the cost/benefit of this card will probably be due to the huge, fast memory, and not the 17 billion transistors.


Now, if only I could get a MacBook Pro with 32GB of system memory... Does anybody happen to know why that is such a difficult thing to achieve? It's amazing that, starting next year, there will be mainstream graphics cards with more memory than top-of-the-line laptops.


Do you need 32GB of RAM in a laptop? 16GB ought to be enough for anybody ;).

Sarcasm aside, this is probably a purely business-driven decision. The tech is there, and it makes nearly zero difference to the manufacturer whether they put 8GB or 16GB dies on the board. It's the same as with SSDs - companies have to milk the existing capacity tier before offering users the next one, otherwise they will have lower profits.


I don't think you could get 16GB SO-DIMMs until earlier this year, could you? I know you can get 32GB in a W550s, which is probably the thinnest/lightest laptop capable of 32GB.


Can you? I have a W550s and didn't see anyone hawking 16GB SO-DIMMs even earlier this year.


Yeah a colleague ordered one about 2 weeks ago. Lovely machine and Lenovo have done an amazing job with the weight. I still prefer the T series over the W series as I don't need that much power and prefer the much lighter machine.


I'm waiting for the comparisons between NVidia's offerings and Xeon Phi in real benchmarks


I believe the Xeon Phi is doing quite badly in that comparison, so you don't see much benchmark trumpeting from Intel. Here's one from Nvidia though (so add a pinch of salt or two), showing a 2x-5x advantage of the Tesla K80 over Knights Corner:

http://www.nvidia.com/object/justthefacts.html


"The 17 Billion transistors on the Pascal GPU are twice the transistors found on the GM200 Maxwell and the Fiji XT GPU core which is literally insane. "

https://xkcd.com/725/



Looks like the next few generations of GPUs are gonna be fast!


I think these may be the specs for their Tesla cards. I doubt they would put 32GB of HBM2 even on a Titan unless they can keep the price point the same (even folks who can spend 1k USD have limits on their budget, or at least their perception of a budget). But I wouldn't doubt that by Q2 2016 Nvidia will be bringing more competition to the market with HBM-based chips.

IMO, AMD pulled the trigger a tad too soon on their HBM cards, but to be fair their last line of R9 and R7 cards weren't interesting at least to me (and I have a Radeon HD 7870). So, they had to go first to get their customer base excited for the future.


As GPU history has shown, having an extra generation of experience with new things matters a lot (whether it's GDDR3/4/5, fabrication nodes, or specialized hardware).


It surprises me that RAM hasn't increased more; the machine I'm on is 5 years old and has 2GB of RAM on the GPU (an HD 6950), which I paid ~200 quid for 5 years ago.


Memory bandwidth/throughput has been a far larger bottleneck on gaming performance than total video memory for the last few years. They've been trying to deal with it by using ever-faster and slightly wider memory interfaces, but now they've hit a wall, hence the move to HBM.


I think it's due to the heat issues on the boards themselves. HBM promises to make that a bit better, since the memory chips are closer to the GPU, which makes them easier to cool than before. But there's a limit, since HBM grows vertically. So I doubt we'll see massive amounts of memory on consumer graphics cards in the near future.


What are the parts marked "R125" around the edge?


I believe they are inductors, part of a switch-mode power supply to reduce the voltage of incoming power to the level required by the cores (AFAIK ~1V).


They definitely are inductors (being labeled "Lxx" on a PCB is a sure tell-tale sign), and the small 48ish-pin packages near them are most likely SMPS controllers. Seems a bit crazy that there are a total of 18 of them, but I suppose that's what you gotta do when you are pushing possibly hundreds of amps of current.
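
Rough numbers behind "hundreds of amps" (the ~1V core voltage is from the comment above; the 250 W power draw is my own assumption for a big GPU):

    # Why 18 power phases isn't crazy for a big GPU.
    board_watts = 250.0    # assumed GPU power draw
    core_voltage = 1.0     # "AFAIK ~1V" per the earlier comment
    phases = 18

    total_amps = board_watts / core_voltage
    print(total_amps, total_amps / phases)   # 250 A total, ~14 A per phase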


That makes the normal buck converter arrangement too: controller, inductor, diode. The GPU will be taking something like half a volt, so things are definitely going to get toasty.


Those big two-legged components look like caps to me though.


Those look like voltage regulators [0].

[0] https://en.wikipedia.org/wiki/Voltage_regulator_module


R125 is often used to refer to a resistance of 0.125 ohms - they may be high-wattage resistors (I didn't load the article because I am using roaming data while travelling!).


This is not a real site; they have a review from 4 months ago benchmarking a "990 Ti 24GB" card with Minesweeper and Solitaire.


Was that review written on April 1st?


It must have been. They mixed it in with all their other headlines and the date was '4 months ago', so it wasn't labeled April 1st. All their other stories are shaky rumors, so I thought the whole site was joke articles extrapolating the next press release.


[deleted]


>general compute (CPU cores and system RAM) than graphical compute.

Why? As a gamer I would expect you would want the opposite. Is there a particular reason you have this rule of thumb?


People underestimate the CPU load of modern games. There's a lot of graphics API overhead in current D3D and GL (though D3D12 and Vulkan will improve upon that), use of dynamic scripting languages (most notably Lua), and most engines don't make the best use of all cores either. And of course, your game is probably sharing the CPU with a number of other applications, which probably aren't hitting the GPU nearly as hard.

Given that, plus the fact that it's a lot easier to upgrade a graphics card than a CPU, I think it would make sense to spend more on your CPU than your GPU on a fresh build.

(the grandparent comment has been deleted, so I don't know what the context of this discussion is, just throwing my thoughts out there)


> As a gaming enthusiast, my rule of thumb has been to have more general compute (CPU cores and system RAM) than graphical compute

I also have to ask: why? An Intel Pentium with a GTX 980 will run almost all AAA games at max settings, except a few outliers (BF4 is the only one I can think of right now).


> 40+ cores per socket

Uh... where? Unless you're referring to non-x86 processors, the largest Intel has released so far is 18 cores per socket.


Why so much VRAM? Is that really necessary?


I've been looking at the GTX 750 Ti, which has a Maxwell processor in it. I don't care too much about playing games on ultra, but the sheer quiet, cool temperature under load, and low power consumption are what attract me to it. I wonder if the new Pascal GPUs will make room for more such low-power, quiet cards that still pack plenty of punch?



