
Just make memcpy a bona fide hardware instruction already. Since it could take unbounded execution time, you could, e.g., limit it to a fixed, large size, like 4KB. That'd be a huge benefit, just in handling all the alignment/misalignment cases. And it's big enough that the loop you need to put around it has relatively low overhead. You could specify that address translations (and therefore faults) happen at the beginning, before any data is written.
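A sketch of what the proposal looks like from software, in C. The hardware primitive is hypothetical, so it's stubbed with memcpy here; the point is the shape of the short outer loop around a 4KB-capped copy instruction:

```c
#include <stddef.h>
#include <string.h>

#define HWCPY_MAX 4096  /* proposed fixed cap on the hardware instruction */

/* Stand-in for the hypothetical instruction: copies up to HWCPY_MAX bytes,
   with translations/faults (per the proposal) resolved before any byte is
   written. memcpy is just a placeholder for the real hardware behavior. */
static void hw_copy_chunk(void *dst, const void *src, size_t n) {
    memcpy(dst, src, n);
}

/* The outer loop the proposal calls for; its per-iteration overhead is
   amortized over up to 4KB of copying. */
void bounded_memcpy(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n > HWCPY_MAX) {
        hw_copy_chunk(d, s, HWCPY_MAX);
        d += HWCPY_MAX;
        s += HWCPY_MAX;
        n -= HWCPY_MAX;
    }
    hw_copy_chunk(d, s, n);  /* final partial (or only) chunk */
}
```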


ARM did so in ARMv8.8, and x86 has had REP MOVS since forever (though it only became consistently faster than SW sequences relatively recently).


Not just an instruction, but something the MMU can use to coalesce memory in the background, if the MMU can do multiple levels of indirection or RC pages, then it can also do GC. The MMU has been too underpowered for too long.


x86 has had that since the 8086. It's called REP MOVS. As far as I know, it's still the fastest general-purpose way to do copying.
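For reference, a minimal way to invoke REP MOVSB from C via GCC/Clang inline asm (the non-x86 memcpy fallback is our addition for portability, not part of anything standard):

```c
#include <stddef.h>
#include <string.h>

/* On x86, "rep movsb" copies RCX bytes from [RSI] to [RDI], updating all
   three registers as it goes; the "+" constraints reflect that. The System V
   and Windows ABIs guarantee the direction flag is clear, so the copy runs
   forward. */
static void repmovsb_copy(void *dst, const void *src, size_t n) {
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    memcpy(dst, src, n);  /* fallback on non-x86 targets */
#endif
}
```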


It certainly wasn't the fastest general-purpose way to do copying for the decades of CPUs between the 8086 (late 1970s) and the introduction of enhanced rep movsb (ERMSB) in Ivy Bridge (2012). For post-Ivy Bridge CPUs, I believe it was non-optimal for some usage patterns for a while, too. It may be optimal in very recent Intel CPUs, but that's a far cry from saying it's been the fastest since its introduction in the 8086.

E.g., Agner: https://www.agner.org/optimize/blog/read.php?i=372

E.g., Stackoverflow: https://stackoverflow.com/questions/43343231/enhanced-rep-mo...


Right. The follow-on to ERMSB, FSRM ("fast short rep movs"), which first appeared in Icelake, finally makes it consistently competitive with SW sequences¹.

¹ but you still want to branch around it when the length is zero ("when the length is zero?!" I hear you cry; it turns out that if you instrument actual systems, you will find that this happens shockingly often).
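A sketch of the zero-length branch the footnote describes. Even a fast rep movsb pays a fixed startup cost, so you test for n == 0 first; the hardware copy is stubbed with memcpy here, and the hw_copy_calls counter is ours, just to make the skipped path observable:

```c
#include <stddef.h>
#include <string.h>

static unsigned long hw_copy_calls;  /* instrumentation only */

static void copy_with_zero_check(void *dst, const void *src, size_t n) {
    if (n == 0)       /* shockingly common in instrumented real systems */
        return;       /* never pay the rep movsb startup cost for nothing */
    hw_copy_calls++;  /* stand-in for issuing rep movsb */
    memcpy(dst, src, n);
}
```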


> FSRM finally makes it consistently competitive with SW sequences

Mateusz Guzik says it's decent above 128 bytes, but that software sequences still win below that.


It depends on the exact distribution of sizes (and especially if the size[s] are statically knowable—e.g. if you are copying exactly 31 bytes, or something like an unknown size between 48 and 62 bytes, a SW sequence will still win), but it is now _competitive_ if not actually as fast (previously it was often 2-3x slower in that range, even when the length was not fixed).
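A sketch of why a statically known size like 31 bytes still favors a SW sequence: two overlapping 16-byte copies (bytes 0..15 and 15..30) cover the range with no loop and no rep prefix. The constant-length memcpy calls stand in for the two vector load/store pairs a compiler would emit:

```c
#include <string.h>

/* Copy exactly 31 bytes via two overlapping 16-byte blocks. Byte 15 is
   written twice with the same value, which is harmless and avoids any
   tail loop. */
static void copy31(unsigned char *dst, const unsigned char *src) {
    memcpy(dst, src, 16);            /* bytes 0..15  */
    memcpy(dst + 15, src + 15, 16);  /* bytes 15..30 (overlaps byte 15) */
}
```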


I've heard this before and unfortunately it isn't.


Do you have any more information to share?

The ultra-bloated unrolled vector implementations may appear a little faster in microbenchmarks, but the amount of cache they take up and other effects, like AVX downclocking (for those implementations that use AVX), mean that in practice they can actually be slower overall.


I've heard this directly from engineers at Intel.


Is this pre- or post-ERMSB?


I don't believe it's been consistently the fastest way to copy memory even post-ERMSB. (It would be great and convenient if it were!)


Perhaps for very small sizes (just use a mov directly) and very large ones (that fall out of your cache, so you’d probably want non-temporal stores). I think it would be difficult to cover all cases.
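A sketch of the very-large-copy case: non-temporal (streaming) stores bypass the cache so a huge copy doesn't evict your working set. This assumes x86 with SSE2 and 16-byte-aligned buffers whose size is a multiple of 16; the non-x86 fallback is ours for portability:

```c
#include <stddef.h>
#include <string.h>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#endif

/* Copy n bytes using movntdq stores. Requires 16-byte-aligned dst and
   n a multiple of 16 on the x86 path. */
static void nt_copy(void *dst, const void *src, size_t n) {
#if defined(__x86_64__) || defined(__i386__)
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n / 16; i++)
        _mm_stream_si128(d + i, _mm_loadu_si128(s + i));
    _mm_sfence();  /* make NT stores globally visible before returning */
#else
    memcpy(dst, src, n);  /* portable fallback */
#endif
}
```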


It pretty much is (except for length-zero) starting in Icelake, with FSRM (aka ERMSB II, The Reckoning)


It'd be great if this were true, since that would finally put an end to the madness. I've hoped for this for many, many years, but I've always been disappointed when someone spends ungodly hours on some hyper-tuned AVX thing that just barely ekes past, and we get stuck with another unwieldy monster for a decade.


I heard this around 2018 or so.



