Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I agree that utilization by nvidia-smi is a poor proxy for performance. FWIW, I’ve found that for the same architecture the power consumption reported in nvtop very often correlates super nicely with the training performance and the peak performance is always at peak power consumption. Agreed on your advice for getting to tune your architecture details, but once that’s fixed and you have simple things to debug like memory usage, batch size, dataloading bottlenecks the raw power metric is typically a quick proxy. I find the temperature is a second useful macro metric that; you want to be at max power draw and max allowed temp at all times but not exceed the temperature where you throttle.


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: