Some discussion about day 2 in yesterday's thread about day one: https://news.ycombinator.com/item?id=15811229

To cross-pollinate further, I found yesterday's presentation about Esperanto particularly fascinating (or, to be fair, it's close to my own interests; not that it's objectively better or worse than anything else). So let me speculate a bit.

It'll be at least a year or so before they have something on the market. Given that NVidia Volta claims 7 TFlop/s DP, let's say Esperanto is targeting 16 TFlop/s in order to have a competitive product at launch. Further, let's guess a target clock of 2 GHz. To reach that performance they would thus need 16e12 / 2e9 / 2 = 4000 DP FP execution units (the extra divisor of 2 because an FMA counts as 2 flops). That matches pretty well with their 4096 minion cores having 1 DP FP unit each. Now, they also say the minion cores implement the RISC-V vector extension rather than being scalar cores. And the V extension says that the minimum vector length is 4, implying that executing a vector arithmetic instruction occupies at least 4 (consecutive) issue slots, a bit like old-school pipelined vector supercomputers (think: Cray-1). So the primary purpose of the vectors is not wider execution width, but amortizing instruction overhead (fetch, decode, etc.) and driving memory-level parallelism, again like in old-school vector supercomputers.
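To make the back-of-envelope arithmetic explicit, here's a tiny sketch (plain C; the target throughput and clock are my guesses above, not anything Esperanto has stated):

    #include <stdio.h>

    int main(void) {
        double target_flops  = 16e12; /* guessed target: 16 DP TFlop/s */
        double clock_hz      = 2e9;   /* guessed clock: 2 GHz          */
        double flops_per_fma = 2.0;   /* one FMA counts as 2 flops     */

        double fma_units = target_flops / clock_hz / flops_per_fma;
        printf("DP FMA units needed: %.0f\n", fma_units); /* prints 4000 */
        return 0;
    }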

Further, it's mentioned that each minion core has several hw threads. For the sake of argument, let's make that 4. Now, let's look at the aggregate size of the vector register files. Each VRF has 32 registers. Since they claim to be targeting HPC as well, not only ML/AI, let's assume they support double precision FP, meaning the maximum vector element size is 64 bits (8 bytes). And as mentioned before, the maximum vector length must be at least 4. Thus, at a minimum, each VRF is 32 * 4 * 8 = 1 kB. So the total size of the register file with 4096 minions and 4 hw threads per minion is at least 4096 * 4 * 1 kB = 16 MB. With a vector length of 8 it's 32 MB; with 8 hw threads/minion, also 32 MB; with both, 64 MB. That already starts to be a pretty huge number, even on 7 nm. So I wouldn't be surprised if the vector units bypass the caches and go directly to memory. Oh, and you'll need absolutely gargantuan memory BW to feed this thing.
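Same arithmetic as a sketch, with the thread count and vector length as the speculative knobs (only the 4096 cores and 32 vector registers come from the talk, the rest are my assumptions):

    #include <stdio.h>

    int main(void) {
        long cores      = 4096; /* minion cores (from the talk)          */
        long hw_threads = 4;    /* assumption: hw threads per minion     */
        long vregs      = 32;   /* architectural vector registers        */
        long vlen       = 4;    /* minimum vector length in the V ext    */
        long elem_bytes = 8;    /* 64-bit (DP) elements                  */

        long per_thread_vrf = vregs * vlen * elem_bytes;         /* 1 kB  */
        long total_vrf      = cores * hw_threads * per_thread_vrf;

        printf("VRF per hw thread: %ld bytes\n", per_thread_vrf);
        printf("Total VRF:         %ld MB\n", total_vrf >> 20);  /* 16 MB */
        return 0;
    }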

And how is this thing supposed to be programmed? Vectorizing compilers, obviously, but what then? OpenMP? Message passing between the 16 fat cores, with each fat core running the main thread and farming work out to OpenMP threads running on 4096/16 = 256 minion cores?
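If it does end up looking like a host/device split, the kernels might look roughly like standard OpenMP target offload; purely hypothetical on my part, nothing Esperanto has said about their toolchain:

    /* hypothetical: a fat core runs this, the minions execute the target region */
    void daxpy(int n, double a, double *x, double *y) {
        #pragma omp target teams distribute parallel for \
                map(to: x[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i]; /* one FMA per element */
    }

The classic hybrid pattern would then be MPI (or some other message passing) between the 16 fat cores, with OpenMP spreading each fat core's work across its group of 256 minions.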

Anyone care to confirm, deny, or at least poke holes in the argument above?


