
GPUs are typically useful for training (due to massive parallelism), but not for inference.


Why not? You have thousands of tensor cores and teraflops at your disposal, with mature APIs already built for them, and if you're not too latency-sensitive you can batch heavily. Since you'll be running the same inference operation millions of times, you don't have to re-prepare kernels each time; use CUDA graphs (or whatever is the flavour of the day) for low-overhead, repetitive computation. And if you want to scale a bit, you can add GPUs until the PCIe lanes saturate. Apart from Myriad X and TPUs, I'm not sure what would be more useful.
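
A minimal sketch of what "capture once, replay per batch" looks like with PyTorch's CUDA graph API; the model, shapes, and batch size are placeholders, not anything from the thread:

  import torch

  # Placeholder inference model; any frozen forward pass works the same way.
  model = torch.nn.Sequential(
      torch.nn.Linear(512, 512),
      torch.nn.ReLU(),
      torch.nn.Linear(512, 10),
  ).cuda().eval()

  static_input = torch.zeros(256, 512, device="cuda")  # one batch of requests

  # Warm up on a side stream so capture sees steady-state allocations.
  s = torch.cuda.Stream()
  s.wait_stream(torch.cuda.current_stream())
  with torch.cuda.stream(s):
      for _ in range(3):
          with torch.no_grad():
              model(static_input)
  torch.cuda.current_stream().wait_stream(s)

  # Capture the whole forward pass into a graph once...
  g = torch.cuda.CUDAGraph()
  with torch.cuda.graph(g):
      with torch.no_grad():
          static_output = model(static_input)

  # ...then replay it for each incoming batch with near-zero launch overhead.
  new_batch = torch.randn(256, 512, device="cuda")
  static_input.copy_(new_batch)
  g.replay()
  result = static_output.clone()

The point of the replay loop is exactly the repetitive-computation argument above: kernel launches and scheduling are recorded once, so per-batch CPU overhead mostly disappears.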



