
GPUs are typically useful for training (due to massive parallelism), but not for inference.


Why not? You have thousands of tensor cores and teraflops at your disposal, with mature APIs already built for them, and if you're not too latency-sensitive you can batch heavily. Since you'll be running the same inference operation millions of times, you don't have to re-prepare kernels each time; use CUDA graphs (or whatever is the flavour of the day) for low-overhead, repetitive computation. And if you want to scale a bit, you can add GPUs until the PCIe lanes saturate. Apart from Myriad X and TPUs, I'm not sure what would be more useful.
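
A minimal sketch of what "capture once, replay per batch" looks like with PyTorch's CUDA graph API; the model, shapes, and batch size are placeholders, not anything from the thread:

  import torch

  # Placeholder inference model; any frozen forward pass works the same way.
  model = torch.nn.Sequential(
      torch.nn.Linear(512, 512),
      torch.nn.ReLU(),
      torch.nn.Linear(512, 10),
  ).cuda().eval()

  static_input = torch.zeros(256, 512, device="cuda")  # one batch of requests

  # Warm up on a side stream so capture sees steady-state allocations.
  s = torch.cuda.Stream()
  s.wait_stream(torch.cuda.current_stream())
  with torch.cuda.stream(s):
      for _ in range(3):
          with torch.no_grad():
              model(static_input)
  torch.cuda.current_stream().wait_stream(s)

  # Capture the whole forward pass into a graph once...
  g = torch.cuda.CUDAGraph()
  with torch.cuda.graph(g):
      with torch.no_grad():
          static_output = model(static_input)

  # ...then replay it for each incoming batch with near-zero launch overhead.
  new_batch = torch.randn(256, 512, device="cuda")
  static_input.copy_(new_batch)
  g.replay()
  result = static_output.clone()

The point of the replay loop is exactly the repetitive-computation argument above: kernel launches and scheduling are recorded once, so per-batch CPU overhead mostly disappears.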



