Author here. Long time lurker, but made an account now.
Wow, I did not expect this. I'm really touched. I wrote this as a small utility for my own consumption because I was unsatisfied with the existing selection at the time, so I'm both surprised and delighted to learn that people are finding it useful. Although to be completely frank, I think this library is way too small and insignificant to deserve a spot on HN's front page, but it definitely made my day. So thank you kind stranger who posted it!
OP here. Thank you for creating mio! Your project clearly deserves the attention. I just found it and thought it belonged here. A lot of people seem to share that opinion :)
I am sure this won't be the last top HN post about one of your projects.
I like it, and I took a look at the GitHub repo. I think there is actually a missed opportunity here to make it a single-header library, since only four or five files seem to go into it.
This is definitely unfortunate, but in my defense I was not aware of Rust's mio (or anything related to Rust beyond its existence) at the time of writing and naming my library. I have no emotional investment in the name, so I'm open to suggestions should anyone take issue with it.
I also think Rust's mio is sufficiently behind-the-scenes that the only people who will encounter it will already be well aware of its purpose; typical Rust users will be two levels removed from mio, if they're using it at all. Namespace collisions matter more for end-user software than foundational libraries. :)
There is also Boost.Interprocess which, despite the name, provides a very general way to access shared memory; in particular, it provides shared memory allocators and a full reimplementation of the STL that can take advantage of them.
I've always wanted to try it, but it can't handle unexpected process failure (i.e. a crashed process will leave the memory in an unknown state), which is something I always end up needing.
It's a valid point, I think. I'd probably also trust something as established as Boost more than some random guy's lib on GitHub. However, I specifically wrote mio because I prefer not to use Boost, and from what I understand, many others don't either.
Creating a new memory mapping can be pretty expensive! On both Windows and Linux, it involves taking a process-wide reader-writer lock in exclusive mode (meaning you get to sit and wait behind page faults), doing a bunch of VMA tree manipulation work, doing various kinds of bookkeeping (hello, rmap!) and then, after you return to userspace, entering the kernel again in response to VM faults just to fill in a few pages by doing, inside the kernel, what amounts to a read(2) anyway!
Sure, if you use mmap, you get to look at the page cache pages directly instead of copying from them into some application-provided buffer, but most of the time, it's not worth the cost.
There are exceptions of course, but you should always default to conventional reads.
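To make "conventional reads" concrete, here is a minimal POSIX sketch of the pread(2) path; the names are illustrative and error handling is trimmed:

    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <vector>

    // Read a whole file with pread(2): the kernel copies from the page
    // cache into our buffer, but we never touch the VMA tree, take no
    // mapping lock, and pay no mapping setup/teardown cost.
    std::vector<char> read_whole_file(const char* path) {
        int fd = open(path, O_RDONLY);
        struct stat st{};
        fstat(fd, &st);
        std::vector<char> buf(static_cast<size_t>(st.st_size));
        off_t off = 0;
        while (off < st.st_size) {
            ssize_t n = pread(fd, buf.data() + off, buf.size() - off, off);
            if (n <= 0) break; // real code: handle EINTR and errors
            off += n;
        }
        close(fd);
        return buf;
    }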
Another problem with mmap is that there is no good way to handle I/O errors.
The application will get a signal (SIGBUS, on Linux at least), and no information about what the problem could possibly be. Most applications do not catch this signal and will instead just terminate.
Even if you do catch the signal, it is a real challenge to know what caused it and to keep consistent bookkeeping so that you can perform any sane action in response.
At a previous employer we started seeing this problem when scaling things in production which was no fun.
I wanted to help address this problem with my signal sharing API proposal. Unfortunately, the glibc people have the attitude that nobody should be using signals, and so they refuse to improve the signals API at all. I strongly disagree.
There's a reason for that: signals are horribly abused as a general-purpose signaling mechanism. Originally they were for signaling actual problems in the code (SIGSEGV, SIGILL), but later signals were added that have nothing to do with faults (SIGWINCH, I'm looking at you!).
I read the page you linked to, and just off the top of my head: if you try to manage paging by catching SIGSEGV, how do you determine whether it's in response to a real bug (say, dereferencing an undefined pointer)? In my opinion, by the time you get a SIGSEGV, you can't trust the program at all. While it might be nice to have a process handle page faults itself, I think a better API than signal() is required.
To distinguish the SIGSEGVs that represent crashes from ones representing faults you care about, you look at the fault address. It works perfectly well: every high performance Java or C# runtime on Linux (e.g., HotSpot, ART) does it. You can trust the program, because SIGSEGV delivery isn't magic.
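A minimal single-threaded sketch of that pattern on POSIX (the mapping bounds and the sigsetjmp-based recovery here are illustrative, not any particular runtime's implementation):

    #include <setjmp.h>
    #include <signal.h>
    #include <stddef.h>

    static const char* g_map_begin; // bounds of our mapping, set after mmap()
    static const char* g_map_end;
    static sigjmp_buf g_recover;    // real code would use thread-local state

    static void fault_handler(int sig, siginfo_t* info, void*) {
        const char* addr = static_cast<const char*>(info->si_addr);
        if (addr >= g_map_begin && addr < g_map_end) {
            // Fault inside our mapping (e.g. an I/O error surfacing as
            // SIGBUS): unwind to a known-good point instead of dying.
            siglongjmp(g_recover, 1);
        }
        // Anywhere else it's a genuine bug: restore the default disposition
        // and re-raise so the process crashes as it normally would.
        signal(sig, SIG_DFL);
        raise(sig);
    }

    void install_fault_handler() {
        struct sigaction sa{};
        sa.sa_sigaction = fault_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGBUS, &sa, nullptr);
        sigaction(SIGSEGV, &sa, nullptr);
    }

    // Returns false if touching the mapped range faulted.
    bool copy_from_mapping(const char* src, char* dst, size_t n) {
        if (sigsetjmp(g_recover, 1) != 0) return false;
        for (size_t i = 0; i < n; ++i) dst[i] = src[i];
        return true;
    }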
As I detailed in the doc and on libc-alpha, you really do need some kind of synchronous exception mechanism to match how real hardware behaves, and it would behoove libc authors to make this mechanism not suck instead of pretending that synchronous faults would just go away.
Case in point, a sort-of general purpose IO library was using mmap as the OS interface, and presenting a read/write style API. We had various bugs related to this (e.g. extending or truncating a mmap'ed file is slightly tricky to do correctly etc.).
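For readers who haven't hit this: growing a mapped file portably is a three-step dance, roughly like the hypothetical sketch below (POSIX; Linux also offers mremap(2)). Getting a step out of order, or forgetting that the address can change, is where the bugs come from:

    #include <sys/mman.h>
    #include <unistd.h>

    char* grow_mapping(int fd, char* old_map, size_t old_len, size_t new_len) {
        ftruncate(fd, new_len);          // 1. grow the file itself first
        munmap(old_map, old_len);        // 2. drop the stale mapping
        void* p = mmap(nullptr, new_len, // 3. remap at the new size
                       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        return static_cast<char*>(p);    // note: the address may change!
    }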
I ripped out the mmap codepath, fixing the mmap-related bugs, and for some benchmarks performance improved by a factor of 20.
Now, there are certainly places where mmap is awesome, e.g. if you can push the mmap semantics up to the application level, or if you need the sharing semantics, etc.
This is a valid point. My use case was very frequent reads of large files at pretty much unpredictable positions, so in theory mmap seemed justified. However, I never got around to thoroughly testing this assumption, and may indeed have been better off just using read(2) and its variants.
You seem very experienced, so I hope you don't mind a question. In my use case the files were as large as tens of gigabytes and I was creating read-only mappings of 256KB-1MB chunks in them, keeping the mmap handles around according to a cache policy and RAM usage limit. Do you think in this case using mmap could in theory introduce performance gains?
I think that this is the wrong way to use mmap. Just map the whole file at once. The operating system will automatically read the pages you access from disk. And if memory gets tight, these pages will be flushed to disk if they are dirty and then discarded before the system starts paging. These mmapped pages essentially live in the disk cache.
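Something like this hypothetical sketch, i.e. one mmap call up front plus a hint that access is random (assumes a 64-bit process, since tens of gigabytes of address space are being reserved):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    const char* map_readonly(const char* path, size_t* len) {
        int fd = open(path, O_RDONLY);
        struct stat st{};
        fstat(fd, &st);
        *len = static_cast<size_t>(st.st_size);
        void* p = mmap(nullptr, *len, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd); // the mapping keeps the file referenced
        // Tell the kernel not to bother with sequential read-ahead;
        // pages are faulted in on access and discarded under pressure.
        madvise(p, *len, MADV_RANDOM);
        return static_cast<const char*>(p);
    }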
> The operating system will automatically read the pages you access from disk. And if memory gets tight, these pages will be flushed to disk if they are dirty and then discarded before the system starts paging.
You can tell that you understand how modern OS memory management works when you realize that the OS "automatically read[ing] the pages...from disk" and "flush[ing them] to disk" on memory pressure is paging whether those pages are anonymous pages or mmaped file pages. :-)
[Edit: flushing dirty file-paged pages is analogous to swapping anonymous memory to the swapfile. Discarding clean file-backed pages is a bit like discarding anonymous pages that have been made unused through munmap, process death, etc.]
But to the GP's point: you don't need (except to conserve address space) to limit file mapping size. I think he really wants something like MADV_FREE. But it's complicated.
There is a subtle difference between anonymous and mapped read-only pages: the latter can be discarded right away because their contents were read from permanent storage to begin with. Anonymous pages need to be written to disk first, and that is significantly slower.
Random access to large files is a legitimate use case! LMDB [1] uses a similar technique, and it works well for them. But depending on the specific application, explicitly application-managed caching via O_DIRECT IO with something like threaded pread or AIO might be even better, because with this explicit model, you control the cache sizing and eviction policy, and it's certainly possible with application-level knowledge to do better than the generic kernel-level LRU/active/inactive/kinda-sorta-works-heuristic stuff can do without application-specific knowledge.
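A hedged sketch of that explicit model on Linux (O_DIRECT is Linux-specific and may require _GNU_SOURCE; the 4096-byte alignment is an assumption about the device's block size, and error handling is omitted):

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstddef>
    #include <cstdlib>

    constexpr size_t kAlign = 4096; // assumed logical block size

    // Read one block, bypassing the kernel page cache entirely; the
    // application now owns caching and eviction of the returned buffer.
    char* read_block_direct(const char* path, off_t offset, size_t len) {
        int fd = open(path, O_RDONLY | O_DIRECT);
        void* buf = nullptr;
        posix_memalign(&buf, kAlign, len); // O_DIRECT needs aligned buffers
        pread(fd, buf, len, offset);       // offset/len must also be aligned
        close(fd);
        return static_cast<char*>(buf);
    }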
Another advantage of using application-managed caching is the ability to take advantage of things like huge pages (which can drastically reduce TLB miss rates), whereas with conventional mmap of conventional files, you're limited to regular 4kB (or whatever) small pages and associated management overhead. (There's no reason in principle filesystems can't use huge pages for page cache, but AFAIK, nobody does it yet.)
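For instance, an application-level cache can be backed by anonymous huge pages along these lines (Linux-specific; MAP_HUGETLB needs hugetlb pages reserved by the administrator, and MADV_HUGEPAGE is only a hint):

    #include <cstddef>
    #include <sys/mman.h>

    void* alloc_cache(size_t bytes) {
        void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            // No reserved huge pages: fall back to normal pages and ask
            // for transparent huge pages instead.
            p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            madvise(p, bytes, MADV_HUGEPAGE);
        }
        return p;
    }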
OTOH, kernel management of page cache allows for better integration of cache eviction with system memory pressure signals and allows for multiple users of a single file to share the memory mirroring the contents of that file.
> Do you think in this case using mmap could in theory introduce performance gains?
It depends. The right approach depends on a lot of factors, including workload and developer complexity budget. It's funny, really: the more experience you get, the less likely you are to say "$SOLUTION is the bestest evar!" and the more often you say "well, it really depends, so I can't give you an answer".
What really strikes me as needless is someone using mmap to read a 10kB ~/.myapplication.lol.ini file or something.
Oh, I missed this response (still a little overwhelmed). I'm so glad I checked again because there is some precious wisdom in there. Thank you for taking the time to write it down. And it indeed seems like I was misusing mmap...well, next time I'll know better!
The cost of the second kernel trip (on the first page fault) is often mitigated by speculative read-ahead, or the fact that a given page is often in the UBC already. And file-backed memory doesn't contribute to dirty memory. And mmap() makes it easy to use read-only memory, which catches memory corruption bugs. Plus it's easier to use huge pages, which reduces TLB pressure.
Bit of a plug here. But I used this library in a small side project (C++ publisher-subscriber library) and it worked like a charm. There are a few things you can customize with mmapping but most use cases will do fine with this. Great work.
I wonder if this could replace the cross platform memory mapping in LibreOffice. This is part of the Operating System Layer (OSL) which is at least several decades old and uses a C interface.
That would not be a good idea at all. The OSL memory mapping already works cross-platform, and swapping it out would risk an untested configuration of those low-level mechanisms.
Yes, because there is no standardised build system for C++, which makes integrating non-header-only libraries a pain (or at least somewhat more painful), especially if your aim is cross-platform code. In that case you cannot rely on a reasonable package manager being present and will have to essentially include all your dependencies in the build; this is trivial for header-only libraries.
Overstated problem. It's not that hard under most circumstances to build a static lib.
I think there are two issues. In C++, a header-only library makes sense due to templates, which don't produce object code until instantiation. In C, I have seen the same with entirely macro-based "libraries" that really simulate templates in order to build abstract data structures.
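The macro pattern looks roughly like this (a hypothetical generic vector; one invocation stamps out a complete typed container, the way a C++ template instantiation would):

    #include <stdlib.h>

    #define DEFINE_VEC(T)                                              \
        typedef struct { T* data; size_t len, cap; } vec_##T;          \
        static void vec_##T##_push(vec_##T* v, T x) {                  \
            if (v->len == v->cap) {                                    \
                v->cap = v->cap ? v->cap * 2 : 8;                      \
                v->data = (T*)realloc(v->data, v->cap * sizeof(T));    \
            }                                                          \
            v->data[v->len++] = x;                                     \
        }

    DEFINE_VEC(int)    /* expands to vec_int and vec_int_push */
    DEFINE_VEC(double) /* expands to vec_double and vec_double_push */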
The second issue, though: my theory is that there is a new generation of programmers who don't understand the traditional compile and link phases because they are used to other languages. It doesn't fit their expectation of how things should work, so they bend the tool to their expectation instead of figuring out the old way. The very fact that people on here are saying a programming language should have a "standard package manager" demonstrates the cultural divide: this sounds a little nutty to an old-time C person.
As a newer-generation C++ programmer, I can confirm I'm guilty of writing things header-only when they "probably shouldn't be". But it's not so much that I don't understand the old way. Having everything in one file is just so much easier to work with, and the price you pay for it is relatively low (for small projects like this library).
The advantage is how easy they are to integrate - just put the file in your tree and then #include. No build option or compiler issues.
The disadvantage (and the reason I'm personally not a fan) is that they bloat compile time. A normal library is compiled once and then linked repeatedly, but a header-only library is compiled every time any .cpp file that includes it is compiled.
“Compiled every time” is not really true if the header correctly contains “#pragma once” or #ifdef guards.
It may be that you have used a lot of template-based headers, which may compile nearly every time because they are literally creating new code every time a new combination of template parameters is given.
It'll be compiled once for every .cpp file that includes it. #pragma once (or ifdef guards) means it only gets included once within that one translation unit, and has no effect on any other .cpp files.
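To spell out the distinction with an illustrative header:

    // mylib.h
    #pragma once            // or the classic #ifndef/#define guard
    int mylib_answer();     // declarations are cheap to re-parse...

    // ...but any definitions living in the header, templates especially,
    // are still compiled once per translation unit that includes it:
    template <typename T>
    T mylib_twice(T x) { return x + x; }

The guard only prevents double inclusion within one translation unit; it does nothing across the many .cpp files that each include the header.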
You are correct that I'm mostly talking about template-heavy header files. There is a strong correlation between template-based header files and header-only libraries; the latter generally means the former.
> It'll be compiled once for every Cpp file that includes it.
But are you going to mmap stuff in all your .cpp files? Most of the time when I use an external library, it never leaves a single implementation file, so it being header-only does not really make things worse.
You still have to compile the library each time you recompile whichever source file includes it. If the library had its own source file, you would only have to compile it once and link it thereafter. This adds up if you have a lot of header-only libraries in your codebase, whose total LoC may very well dwarf your own code. This is where precompiled headers come in, but they introduce their own downsides.
Setting up the dependency infrastructure is a bit of work in C++. Recently things have been moving forward, with tools like CMake+Conan or vcpkg becoming more popular.
Once the infrastructure is sorted out, header-only has mainly disadvantages, like making it harder to separate interface from implementation.
This one also serializes a representation of the type structure of recorded data so that it can be safely concurrently mmapped and read either with the same code or with the generic PL/compiler that I've developed in this hobbes project.
Where unpredictable disk latency is a problem, we've got a similar header-only lib for logging into shared memory (then have another process to consume this shared memory ring buffer and dump it to disk for concurrent querying):
This pipeline works well for having lightweight C++ processes feeding large volumes of data to generic query processes that we can run out of band to look at this data in various ways (with a Haskell-like query language).
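The shape of that handoff, as a hedged single-producer sketch (names hypothetical; a real ring also needs a read index and a wrap/overflow policy, and this assumes std::atomic<uint64_t> is lock-free):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <atomic>
    #include <cstdint>
    #include <cstring>

    struct Ring {
        std::atomic<uint64_t> write_pos;
        char data[1 << 20]; // 1 MiB of log payload
    };

    Ring* open_ring(const char* name) { // e.g. "/mylog"; both processes use it
        int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
        ftruncate(fd, sizeof(Ring));
        void* p = mmap(nullptr, sizeof(Ring), PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        close(fd);
        return static_cast<Ring*>(p);
    }

    void append(Ring* r, const char* msg, size_t n) {
        uint64_t pos = r->write_pos.load(std::memory_order_relaxed);
        memcpy(&r->data[pos % sizeof(r->data)], msg, n); // wrap ignored here
        r->write_pos.store(pos + n, std::memory_order_release);
    }

The disk-latency win is that append() never blocks on I/O; only the consumer process touches the filesystem.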
We did hit a slight problem doing things this way: for some cases, the straightforward representation of data (as in memory) just used too much space, with too much time wasted in I/O. Basically, for complex market data the structures aren't trivial, and recording ~100GB/day makes it very awkward to keep around a few weeks of data for random querying.
So I also made this header-only lib to write data into these mmapped files with a simple compression method (I like to describe it as generalizing Curry-Howard to probabilities) that gives us much better throughput, much smaller files, faster query times, and still supports concurrent constant-time random-access queries:
It gives us compression ratios about the same as end-of-day gzip, but much faster, and importantly it works online and with the query use-cases we have with hobbes.
Anyway, maybe I should write up those details somewhere else, I just mean to say that this is a useful technique and you can push it very far and do many things with it in a very straightforward way.
Why, why do they make it header-only? Is it so difficult to integrate a couple of source files along with the existing headers?
We should not forget compilation times. A project I use depends on spdlog, a header-only C++ logging library. The thing adds almost two seconds per compilation unit to single-threaded build times. And since logging is used pretty much everywhere, the whole project takes forever to build (thrice the build time it would have had without spdlog; I've measured).
What benefit is so great that it is worth killing compilation times?
Proper "stb-style" header-only libs split the header into 2 parts: the declaration part that's always included and parsed but only contains the public API declarations, and the implementation part inside an #ifdef XXX_IMPLEMENTATION section which is only included and parsed in a single source file.
The same idea can also be used for C++ headers (unless it's all template interfaces).
With such stb-style headers, you can even get better compile times, because you can include all header-libs into a single implementation source file, giving the same advantages as unity-builds (merging all sources into one file).
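A skeletal stb-style header (names hypothetical):

    // mylib.h
    #ifndef MYLIB_H
    #define MYLIB_H

    int mylib_add(int a, int b); // public API: cheap to parse everywhere

    #endif // MYLIB_H

    #ifdef MYLIB_IMPLEMENTATION
    // Heavy part: parsed and compiled only in the one TU that opts in.
    int mylib_add(int a, int b) { return a + b; }
    #endif

Exactly one source file then does:

    #define MYLIB_IMPLEMENTATION
    #include "mylib.h"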
I once tried to compile nheko, a Matrix client written in C++ and Qt. It makes heavy use of Boost, and building it with 2 or more jobs crashed my system because it would eat up all my RAM (6GB available at the time). Factor in the 2 seconds per compilation unit and it takes ages to compile. I managed to compile the native WebRTC library with 8 jobs faster than this bare-bones chat client on the same computer.
Here is why I used a header only library for my last C++ project.
Earlier this year I wrote a Win32 program to solve an obscure issue for my employer.
My main concern was making it possible for programmers with limited knowledge of C++ to maintain the application. We are a C# shop and I want my coworkers to be able to maintain the application after I leave.
Here are some of the other things I did to make life easier for the maintenance programmer.
I documented how the program works, what APIs it calls, and added links to Pluralsight courses on ATL and COM. The program uses printf format strings so there is a link to a tutorial on them.
I added extensive logging, with line numbers, for every action the program takes. You can reconstruct the control flow by looking at the log file (yes, I tried doing this).
I picked a header only logging library so that it would be easier for a maintenance programmer to update it. All they have to do is update the submodule that contains the library.
The code seems clean. I'm not sure this is a great idea in practice, though. Generally the only good reason for mapping stuff out of the filesystem is performance, and VM behavior with mmap() varies wildly across systems (and filesystem backends, and drivers if it's a hardware device, and hardware if it's a framebuffer, and...). Frankly on windows this is AFAIK a mostly-unheard-of technique. No one does mapping.
This just isn't really something that can be cleanly abstracted to do what you want it to do, even if you can make the code "look" the same.
But again, it looks like a nice, clean, modern C++ library. Just IMHO misapplied.
> "Frankly on windows this is AFAIK a mostly-unheard-of technique. No one does mapping."
Unheard of by you.
In a world consumed by Electron apps and JavaScript, lots of people couldn't care less about performant IPC, sharing data across processes, and multiprocessing in general.
> Frankly on windows this is AFAIK a mostly-unheard-of technique. No one does mapping.
This must be incorrect.
Jeffrey Richter's Advanced Windows, a very widespread Win32 book, introduced this technique to a LOT of developers, back when Windows was the hot stuff for programmers. I don't think I've seen a Win32 codebase >100 kLoC without mapping. Understanding and using it is one of those rites of passage that marks the switch from junior to mid-level Win32 developer.
Mmapping is used in much fault-tolerant software as a simplified method of data persistence. Granted, it's not the only method in play, but it is a common data safety net.
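The pattern is simple enough to sketch: map live state straight onto a file and use msync(2) as the explicit durability point (illustrative POSIX code, error handling omitted):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct State { long counter; };

    State* open_state(const char* path) {
        int fd = open(path, O_RDWR | O_CREAT, 0644);
        ftruncate(fd, sizeof(State));
        void* p = mmap(nullptr, sizeof(State), PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        close(fd);
        return static_cast<State*>(p);
    }

    void checkpoint(State* s) {
        // Block until the dirty page is written back; without this the
        // kernel flushes "eventually", on its own schedule.
        msync(s, sizeof(State), MS_SYNC);
    }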
Flushing behavior is precisely the kind of thing that varies the most between systems. I'd be very suspicious of a "fault tolerant" system that tried to use a library like this to be "cross platform". That's almost a contradiction in terms.