What is your metric for "screaming fast"? A user-mode cache with direct I/O can outperform any kernel-mode design several-fold. That's empirical fact, and the reason all high-performance database engines do it. I've designed systems both ways and it isn't particularly close; the technical reasons why are well-understood. Typical direct I/O designs enable macro-optimizations that are either not practical or not possible with mmap().
The main advantage of mmap() is portability.