Hacker News

Do you have any more information to share?

The ultra-bloated unrolled vector implementations may appear a little faster in microbenchmarks, but the instruction-cache footprint and other effects like AVX downclocking (for the implementations that use AVX) mean that in practice they can actually be slower overall.



I've heard this directly from engineers at Intel.


Is this pre- or post-ERMSB?


I don't believe it's been consistently the fastest way to copy memory even post-ERMSB. (It would be great and convenient if it were!)


Perhaps for very small sizes (just use a mov directly) and very large ones (that fall out of your cache, so you’d probably want non-temporal stores). I think it would be difficult to cover all cases.
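The tiered strategy described above can be sketched roughly as follows. This is a hedged illustration, not any real libc's implementation: the function name `tiered_memcpy` and the thresholds (16 bytes, 256 KiB) are made up for the example, and the `rep movsb` path assumes an x86 target built with GCC or Clang (other targets fall back to memcpy).

```c
/* Sketch of a size-tiered copy: byte loop for tiny sizes, rep movsb in
   the middle, and (in a real implementation) non-temporal stores for
   copies too large for cache. Thresholds here are illustrative, not tuned. */
#include <stddef.h>
#include <string.h>

static void copy_rep_movsb(void *dst, const void *src, size_t n) {
#if defined(__x86_64__) || defined(__i386__)
    /* ERMSB path: let the CPU's microcode pick the copy strategy. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    memcpy(dst, src, n);  /* portable fallback for non-x86 builds */
#endif
}

void tiered_memcpy(void *dst, const void *src, size_t n) {
    if (n <= 16) {
        /* Tiny copies: a plain loop; compilers lower this to direct movs. */
        unsigned char *d = dst;
        const unsigned char *s = src;
        while (n--) *d++ = *s++;
    } else if (n < 256 * 1024) {
        copy_rep_movsb(dst, src, n);
    } else {
        /* Copies that fall out of cache: a real implementation would use
           non-temporal stores (e.g. _mm_stream_si128) to avoid polluting
           the cache; this sketch just defers to memcpy. */
        memcpy(dst, src, n);
    }
}
```

Covering all the cases well (alignment, overlap, the exact crossover points) is exactly the hard part the comment alludes to.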


It pretty much is (except for length zero) starting with Ice Lake, which has FSRM (aka ERMSB II, The Reckoning).
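Both features can be queried at runtime. A minimal sketch using GCC/Clang's <cpuid.h>, assuming an x86 build (the bit positions, per the Intel SDM, are CPUID leaf 7 subleaf 0: ERMSB in EBX bit 9, FSRM in EDX bit 4):

```c
/* Runtime detection of ERMSB and FSRM on x86 via CPUID leaf 7, subleaf 0.
   ERMSB = EBX bit 9, FSRM = EDX bit 4 (Intel SDM). Non-x86 builds report 0. */
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

int has_ermsb(void) {
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return 0;
    return (ebx >> 9) & 1;
#else
    return 0;
#endif
}

int has_fsrm(void) {
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return 0;
    return (edx >> 4) & 1;
#else
    return 0;
#endif
}
```

glibc uses checks like these to pick a memcpy variant at load time via ifuncs.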


It'd be great if this were true, since it would finally put an end to the madness. I've hoped for this for many, many years, but I've always been disappointed when someone spends ungodly hours on some hyper-tuned AVX thing that just barely ekes past, and we get stuck with another unwieldy monster for a decade.


I heard this around 2018 or so.



