> Reading from a pointer could be from L1 cache, or it could be from a pci-e card attached to another socket.
The fun one is the TLB (translation lookaside buffer) and the virtual memory system.
Today's AMD cores have more L3 cache than the TLB can cover with 4k pages. You need to enable 2MB or 1GB hugepages to even access L3 cache at full speed in practice...
EDIT: Milan-X has 96MB of L3 cache per CCX. Covering that with 4kB pages would take roughly 24,000 TLB entries (96MB / 4kB = 24,576). IIRC, Milan only has about 2,000 TLB entries. Hurraaahhhhhh....
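For anyone who wants to check the arithmetic, here's a quick back-of-envelope sketch. The 96MB L3 figure is from above. The 2048-entry TLB is my assumption, rounding the "about 2,000" I remember to a power of two. The helper names are made up for illustration. Real TLBs are split into levels and don't necessarily hold every page size in every entry, so treat this as a rough upper bound on reach, not a model of the hardware.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

L3_BYTES = 96 * MIB          # Milan-X L3 per CCX, as quoted above
ASSUMED_TLB_ENTRIES = 2048   # ASSUMPTION: stand-in for the "about 2,000" figure


def tlb_entries_needed(region_bytes: int, page_bytes: int) -> int:
    """Pages (and so TLB entries) needed to map region_bytes, rounded up."""
    return -(-region_bytes // page_bytes)  # ceiling division


def tlb_reach(entries: int, page_bytes: int) -> int:
    """Bytes of memory that `entries` TLB entries can map at one page size."""
    return entries * page_bytes


if __name__ == "__main__":
    for label, page in (("4 KiB", 4 * KIB), ("2 MiB", 2 * MIB), ("1 GiB", GIB)):
        needed = tlb_entries_needed(L3_BYTES, page)
        reach_mib = tlb_reach(ASSUMED_TLB_ENTRIES, page) // MIB
        verdict = "covers L3" if needed <= ASSUMED_TLB_ENTRIES else "does NOT cover L3"
        print(f"{label} pages: {needed} entries needed, "
              f"reach {reach_mib} MiB, {verdict}")
```

With 4k pages, 2048 entries only reach 8 MiB, a small fraction of the 96 MiB L3. With 2MB pages, 48 entries cover the whole thing.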
------
CPUs are devilishly complicated. It makes optimization "fun". Apparently, running "memcpy" requires Ph.D.-level study before you can "memcpy" at full speed these days.