
Just make memcpy a bona fide hardware instruction already. Since it could take unbounded execution time, you could, e.g., limit it to a fixed, large size, like 4KB. That'd be a huge benefit, just in handling all the alignment/misalignment cases. And it's big enough that the loop you need to put around it has relatively low overhead. You could specify that address translations (and therefore faults) happen at the beginning, before any data is written.
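A sketch of what the proposal looks like from software, in C. The hardware primitive is hypothetical, so it's stubbed with memcpy here; the point is the shape of the short outer loop around a 4KB-capped copy instruction:

```c
#include <stddef.h>
#include <string.h>

#define HWCPY_MAX 4096  /* proposed fixed cap on the hardware instruction */

/* Stand-in for the hypothetical instruction: copies up to HWCPY_MAX bytes,
   with translations/faults (per the proposal) resolved before any byte is
   written. memcpy is just a placeholder for the real hardware behavior. */
static void hw_copy_chunk(void *dst, const void *src, size_t n) {
    memcpy(dst, src, n);
}

/* The outer loop the proposal calls for; its per-iteration overhead is
   amortized over up to 4KB of copying. */
void bounded_memcpy(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n > HWCPY_MAX) {
        hw_copy_chunk(d, s, HWCPY_MAX);
        d += HWCPY_MAX;
        s += HWCPY_MAX;
        n -= HWCPY_MAX;
    }
    hw_copy_chunk(d, s, n);  /* final partial (or only) chunk */
}
```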


ARM did so in ARMv8.8, and x86 has had REP MOVS since forever (though it only became consistently faster than SW sequences relatively recently).


Not just an instruction, but something the MMU can use to coalesce memory in the background, if the MMU can do multiple levels of indirection or RC pages, then it can also do GC. The MMU has been too underpowered for too long.


x86 has had that since the 8086. It's called REP MOVS. As far as I know, it's still the fastest general-purpose way to do copying.
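For reference, a minimal way to invoke REP MOVSB from C via GCC/Clang inline asm (the non-x86 memcpy fallback is our addition for portability, not part of anything standard):

```c
#include <stddef.h>
#include <string.h>

/* On x86, "rep movsb" copies RCX bytes from [RSI] to [RDI], updating all
   three registers as it goes; the "+" constraints reflect that. The System V
   and Windows ABIs guarantee the direction flag is clear, so the copy runs
   forward. */
static void repmovsb_copy(void *dst, const void *src, size_t n) {
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    memcpy(dst, src, n);  /* fallback on non-x86 targets */
#endif
}
```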


It certainly wasn't the fastest general-purpose way to do copying for the decades of CPUs between the 8086 (late 1970s) and the introduction of enhanced rep movsb (ERMSB) in Ivy Bridge (2012). For post-Ivy Bridge CPUs, I believe it was non-optimal for some usage patterns for a while, too. It may be optimal in very recent Intel CPUs, but that's a far cry from saying it's been the fastest since its introduction in the 8086.

E.g., Agner: https://www.agner.org/optimize/blog/read.php?i=372

E.g., Stackoverflow: https://stackoverflow.com/questions/43343231/enhanced-rep-mo...


Right. The follow-on to ERMSB, FSRM ("fast short rep movs"), which first appeared in Icelake, finally makes it consistently competitive with SW sequences¹.

¹ but you still want to branch around it when the length is zero ("when the length is zero?!" I hear you cry; it turns out that if you instrument actual systems, you will find that this happens shockingly often).
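A sketch of the zero-length branch the footnote describes. Even a fast rep movsb pays a fixed startup cost, so you test for n == 0 first; the hardware copy is stubbed with memcpy here, and the hw_copy_calls counter is ours, just to make the skipped path observable:

```c
#include <stddef.h>
#include <string.h>

static unsigned long hw_copy_calls;  /* instrumentation only */

static void copy_with_zero_check(void *dst, const void *src, size_t n) {
    if (n == 0)       /* shockingly common in instrumented real systems */
        return;       /* never pay the rep movsb startup cost for nothing */
    hw_copy_calls++;  /* stand-in for issuing rep movsb */
    memcpy(dst, src, n);
}
```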


> FSRM finally makes it consistently competitive with SW sequences

Mateusz Guzik says it's decent above 128 bytes, but that software sequences still win below that.


It depends on the exact distribution of sizes (and especially if the size[s] are statically knowable—e.g. if you are copying exactly 31 bytes, or something like an unknown size between 48 and 62 bytes, a SW sequence will still win), but it is now _competitive_ if not actually as fast (previously it was often 2-3x slower in that range, even when the length was not fixed).
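A sketch of why a statically known size like 31 bytes still favors a SW sequence: two overlapping 16-byte copies (bytes 0..15 and 15..30) cover the range with no loop and no rep prefix. The constant-length memcpy calls stand in for the two vector load/store pairs a compiler would emit:

```c
#include <string.h>

/* Copy exactly 31 bytes via two overlapping 16-byte blocks. Byte 15 is
   written twice with the same value, which is harmless and avoids any
   tail loop. */
static void copy31(unsigned char *dst, const unsigned char *src) {
    memcpy(dst, src, 16);            /* bytes 0..15  */
    memcpy(dst + 15, src + 15, 16);  /* bytes 15..30 (overlaps byte 15) */
}
```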


I've heard this before and unfortunately it isn't.


Do you have any more information to share?

The ultra-bloated unrolled vector implementations may appear a little faster in microbenchmarks, but the amount of cache they take up and other effects, like AVX downclocking (for those implementations that use AVX), mean that in practice they can actually be slower overall.


I've heard this directly from engineers at Intel.


Is this pre- or post-ERMSB?


I don't believe it's been consistently the fastest way to copy memory even post-ERMSB. (It would be great and convenient if it were!)


Perhaps for very small sizes (just use a mov directly) and very large ones (that fall out of your cache, so you’d probably want non-temporal stores). I think it would be difficult to cover all cases.
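A sketch of the very-large-copy case: non-temporal (streaming) stores bypass the cache so a huge copy doesn't evict your working set. This assumes x86 with SSE2 and 16-byte-aligned buffers whose size is a multiple of 16; the non-x86 fallback is ours for portability:

```c
#include <stddef.h>
#include <string.h>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#endif

/* Copy n bytes using movntdq stores. Requires 16-byte-aligned dst and
   n a multiple of 16 on the x86 path. */
static void nt_copy(void *dst, const void *src, size_t n) {
#if defined(__x86_64__) || defined(__i386__)
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n / 16; i++)
        _mm_stream_si128(d + i, _mm_loadu_si128(s + i));
    _mm_sfence();  /* make NT stores globally visible before returning */
#else
    memcpy(dst, src, n);  /* portable fallback */
#endif
}
```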


It pretty much is (except for length-zero) starting in Icelake, with FSRM (aka ERMSB II, The Reckoning)


It'd be great if this were true, since that would finally put an end to the madness. I've hoped for this for many, many years, but I've always been disappointed when someone spends ungodly hours on some hyper-tuned AVX thing that just barely ekes past, and we get stuck with another unwieldy monster for a decade.


I heard this around 2018 or so.



