On modern hardware a key lookup in a hash table isn't necessarily a single page read! Sure, it's a single virtual memory access, but if that page isn't in your TLB you need to read the page table... and if the page containing that part of the page table isn't in the TLB you need to read that page...
On modern hardware, every memory access looks very much like a B-tree lookup.
The translation hierarchy is not a B-tree, it is a trie (radix tree) in all of these CPUs. Very different layout (not balanced, and the depth and fan-out are fixed in hardware), and much faster on the happy path.
Making it a B-tree over a 48-bit address space would also have a bit of a memory problem, and would make the TLB huge.
And then the CPU cache is an array. A single virtual memory access is bound to be a single physical access as well, with minor exceptions when NUMA node boundaries are crossed.
> and if the page containing that part of the page table isn't in the TLB you need to read that page...
I thought page tables used physical addresses, which are accessed directly without any TLB lookup (except with nested paging during virtualization, which adds another level of indirection). Of course, the processor still needs to read each level into the data cache(s) while doing a page table walk.
Typically. Although on a virtualized Arm what the guest views as a physical address is really an intermediate physical address that must be translated by the second stage MMU. So it’s possible that reading the first stage page tables can cause a page fault in the second stage MMU. I suspect modern x86 works similarly, but I’m less familiar with that.
Page tables can have multiple levels. For example, in x86_64 you'd have 4 levels, i.e. the virtual->physical mapping is implemented as a tree with depth 4, where each leaf and internal node of the tree is 4 KB (the page size). (As usual, the details are more complicated than that.)
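For illustration, here's roughly how a 48-bit virtual address gets split across those 4 levels (standard x86_64 scheme with 4 KB pages, 9 bits of index per level; just a sketch, the example address is arbitrary):

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch: splitting a 48-bit x86_64 virtual address into the four
 * 9-bit table indices plus a 12-bit page offset. Each table level
 * has 512 entries of 8 bytes = 4 KB, i.e. one page per node. */
int main(void) {
    uint64_t vaddr = 0x00007f1234567abcULL;   /* arbitrary example address */

    unsigned pml4_idx = (vaddr >> 39) & 0x1ff; /* level 4 (root) index */
    unsigned pdpt_idx = (vaddr >> 30) & 0x1ff; /* level 3 index */
    unsigned pd_idx   = (vaddr >> 21) & 0x1ff; /* level 2 index */
    unsigned pt_idx   = (vaddr >> 12) & 0x1ff; /* level 1 (leaf) index */
    unsigned offset   =  vaddr        & 0xfff; /* offset within the 4 KB page */

    printf("PML4=%u PDPT=%u PD=%u PT=%u offset=0x%x\n",
           pml4_idx, pdpt_idx, pd_idx, pt_idx, offset);
    return 0;
}
```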
Yes, and each level of the tree has the physical address of the next level, so no TLB lookup is necessary (the top of the tree, in the TTBRn or equivalent registers, is also a physical address).
The TLB is just one element of the process of resolving a virtual address into a physical one: it's a cache that holds the most recently resolved translations.
When the virtual address you're looking to resolve is not present in that cache (i.e. when you have a TLB miss), the CPU falls back to walking the page table hierarchy. At each level of the tree, the CPU reads the physical address of the next level and performs a memory fetch of that page table entry (in my previous comment I erroneously said a "page fetch", but it's actually only a cache-line-sized fetch), and it repeats this until it reaches the leaves of the tree, which contain the Page Table Entry holding the physical address of the (4 KB) physical page associated with the virtual page address you wanted to resolve.
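For the curious, here's a toy sketch of that fallback walk in C. The "physical memory", the entry format, and the helper names (phys_read64, etc.) are all made up for illustration; huge pages and permission bits are ignored, and on real hardware this walk is done by the MMU itself:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy model of a 4-level page-table walk (the TLB-miss fallback).
 * "Physical memory" is just a byte array; the entry format is
 * simplified to a present bit plus the next level's physical address. */

static uint8_t phys_mem[5 * 4096];          /* 4 table pages + 1 data page */

#define PTE_PRESENT   0x1ULL
#define PTE_ADDR_MASK 0x000ffffffffff000ULL

static uint64_t phys_read64(uint64_t paddr) { /* stand-in for the hardware walker's fetch */
    uint64_t v;
    memcpy(&v, &phys_mem[paddr], sizeof v);
    return v;
}

static void phys_write64(uint64_t paddr, uint64_t v) {
    memcpy(&phys_mem[paddr], &v, sizeof v);
}

/* Each level's entry holds the *physical* address of the next level,
 * so no TLB lookup is needed along the way. */
static int walk(uint64_t root, uint64_t vaddr, uint64_t *paddr_out) {
    uint64_t table = root;
    for (int level = 3; level >= 0; level--) {
        unsigned idx = (vaddr >> (12 + 9 * level)) & 0x1ff;
        uint64_t entry = phys_read64(table + idx * 8); /* cache-line-sized fetch in practice */
        if (!(entry & PTE_PRESENT))
            return -1;                                 /* page fault */
        table = entry & PTE_ADDR_MASK;                 /* next level, or the final page */
    }
    *paddr_out = table + (vaddr & 0xfff);              /* add offset within the 4 KB page */
    return 0;
}

int main(void) {
    uint64_t vaddr = 0x00007f1234567abcULL;
    /* Build one mapping: table pages at physical 0x0000..0x3fff, data page at 0x4000. */
    uint64_t tables[4] = { 0x0000, 0x1000, 0x2000, 0x3000 };
    for (int level = 3; level >= 1; level--) {
        unsigned idx = (vaddr >> (12 + 9 * level)) & 0x1ff;
        phys_write64(tables[3 - level] + idx * 8, tables[4 - level] | PTE_PRESENT);
    }
    phys_write64(tables[3] + ((vaddr >> 12) & 0x1ff) * 8, 0x4000 | PTE_PRESENT);

    uint64_t paddr;
    if (walk(tables[0], vaddr, &paddr) == 0)
        printf("virtual 0x%llx -> physical 0x%llx\n",
               (unsigned long long)vaddr, (unsigned long long)paddr);
    return 0;
}
```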
Depends on your workload and how many TLB entries your CPU has for superpages. The Zen 2 TLB can hold tons (1000s) of 2 MB superpages but relatively few (64) 1 GB superpages. Older CPU models had worse capacity for 1 GB and 2 MB superpages. E.g., Haswell (2013) had only 32 entries for 2 MB superpages and 4 entries for 1 GB superpages (data).
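Back-of-the-envelope reach from those figures (taking "1000s" to mean roughly 2048 entries, which is an assumption on my part): Zen 2 can cover about 2048 × 2 MB ≈ 4 GB with 2 MB entries and 64 × 1 GB = 64 GB with 1 GB entries, whereas Haswell tops out around 32 × 2 MB = 64 MB and 4 × 1 GB = 4 GB. Touch more than that and you're back to page-table walks.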
In addition to the limited number of cache slots available for superpages (which varies depending on the CPU), remember that those entries can be invalidated (again, depending on the CPU). If you're ping-ponging processes on a single CPU, you won't necessarily have what you need in the TLB.
Depends on the design. At a minimum you're ping-ponging between userland and kernel; but you might also be bouncing between a transport layer unwrapper, an authentication front-end, the database core, and a storage back-end.