Only 3-cycle swap is nice. I can see how you could write very specialized code f...

moonchild · on Dec 18, 2023

3 cycles is throughput, not latency; latency for normal loads is not very different from desktop parts. They win for atomic ops because they don't have a fence built in; I don't understand why Intel hasn't specced non-fencing atomic ops (there are far atomics, but those are different).

gpderetta · on Dec 18, 2023

Ah, right, misread the table. I think the latency would be more relevant in this case.

Re Intel atomics: yes, being fully fenced by default is annoying, but it kind of falls out of TSO. It would be strange if an RMW were more relaxed than a normal store, so, like every store, it has to wait for prior stores to drain from the store buffer before it can complete, which is high latency. Normally that latency is not observable for stores, but an RMW cannot execute its load ahead of its store, so the wait for the drain shows up directly.
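
To make the point concrete, here is a minimal C++ sketch (the helper names are made up for illustration). The source asks for a relaxed RMW, but as far as I know mainstream compilers targeting x86-64 still emit a LOCK-prefixed instruction for it, so you pay for the full fence anyway; on AArch64 with LSE the relaxed version can map to an instruction with no acquire/release semantics. Treat the codegen comments as typical behaviour and check your own compiler output.

```cpp
#include <atomic>

// Hypothetical helpers, for illustration only.

// Relaxed RMW: the C++ memory model asks for atomicity and nothing else.
// x86-64: typically compiled to `lock xadd`, which is fully fenced.
// AArch64 + LSE: can compile to a plain `ldadd` (no acquire/release).
int bump_relaxed(std::atomic<int>& counter) {
    return counter.fetch_add(1, std::memory_order_relaxed);
}

// Sequentially consistent RMW.
// x86-64: typically the same `lock xadd` as above, so the relaxed
// version buys nothing at the instruction level.
// AArch64 + LSE: typically `ldaddal` (acquire + release).
int bump_seq_cst(std::atomic<int>& counter) {
    return counter.fetch_add(1, std::memory_order_seq_cst);
}

// Relaxed swap. On x86-64 `xchg` with a memory operand is implicitly
// locked, so it is also fully fenced regardless of the requested order.
int swap_relaxed(std::atomic<int>& slot, int value) {
    return slot.exchange(value, std::memory_order_relaxed);
}
```

All three return the previous value, so the observable result is identical; the only difference is the ordering the hardware is allowed to skip, which on x86 is none.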

I haven't really found much documentation on the far atomics, but it is possible they might end up having these relaxed semantics (and I suspect most of the time they won't be implemented as actual far atomics). Do you have any more info on them than what's in the Intel docs?