Serious question from a guy made soft by garbage collection: how frequent is mem...

asveikau · on April 6, 2014

> how frequent is memory allocation failure nowadays

I'd guess that it varies a lot by domain and project but from what I've seen, pretty common.

> I'd think that if allocs began to fail, there was no recovery anyhow

I think this is what both high-level languages and the Linux "over-commit-by-default" policy have convinced people is the normal behavior. However in my experience it's not that hard to make OOM simply bubble up the stack and have all the callers up the stack free their resources, then let the rest of the program keep running. It doesn't have to be a catastrophic event. You just have to be consistent about handling it, and write code expecting it.

> Are there lots of ways allocation can fail besides low memory conditions?

To think of a few, there's running out of memory, but there's also running out of address space. The latter is not so hard to accomplish on a 32-bit system. You could ask for a chunk of memory where, if you could coalesce all the free space throughout the heap, you may have enough space, but you can't make it into a contiguous allocation.

On Windows I've also seen the kernel run out of nonpaged pool, which is more space constrained than the rest of memory. I've seen this when a lot of I/O is going on. You get things like WriteFile failing with ERROR_NOT_ENOUGH_MEMORY.

twic · on April 6, 2014

On Linux, somewhat infamously, malloc never fails. It will always return a pointer to some fresh part of the address space. It is able to do this because, in turn, sbrk/anonymous mmap never fails - it always allocates some fresh address space. It is able to do this because Linux does not allocate physical memory (or swap) when it assigns address space, but when that address space is actually used. It will happily allocate more address space than it has memory for - a practice known as 'overcommit'. So, on Linux, you can indeed not worry about malloc failing:

http://www.scvalex.net/posts/6/ http://www.drdobbs.com/embedded-systems/malloc-madness/23160...

There are a few caveats to this.

Firstly, malloc actually can fail, not because it runs out of memory, but because it runs out of address space. If have 2^64 bytes of memory in your address space already (2^48 on most practical machines, i believe), then there is no value malloc could return that would satisfy you.

Secondly, this behaviour is configurable. An administrator could configure a Linux system not to do this, and instead to only allocate address space that can be backed with memory. And actually, some things i have read suggest that overcommit is not unlimited to begin with; the kernel will only allocate address space equal to some multiple of the memory it has.

Thirdly, failure is conserved. While malloc can't fail, something else can. Linux's behaviour is essentially fractional reserve banking with address space, and that means that the allocator will sometimes write cheques the page tables can't cash. If it does, if it allocates more address space than it can supply, and if every process attempts to use all the address space that it has been allocated, we have the equivalent of a run on the bank, and there is going to be a failure. The way the failure manifests is through the action of the out-of-memory killer, which picks one process on the system, kills it, and so reclaims the memory allocated to it for distribution to surviving processes:

http://linux-mm.org/OOM_Killer

The OOM killer is a widely-feared bogeyman amongst Linux sysadmins. It sometimes manages to choose exactly the wrong thing as a victim. At one time, and perhaps still, it had a particular grudge against PostgreSQL:

http://thoughts.davisjeff.com/2009/11/29/linux-oom-killer/

And in the last month or so, on systems where i work, i have seen a situation where a Puppet run on an application server provoked the OOM killer into killing the application, and another where a screwed up attempt to create a swap file on an infrastructure server provoked it into killing the SSH daemon and BIND.

I don't know about what other operating systems do. Apparently all modern unixes overcommit address space in much the same way as Linux. However, i can't believe that FreeBSD handles this as crassly as Linux does.

binarycrusader · on April 7, 2014

  I don't know about what other operating systems do.
  Apparently all modern unixes overcommit address space in
  much the same way as Linux. However, i can't believe that
  FreeBSD handles this as crassly as Linux does.

Solaris does not (generally, unless you use MAP_NORESERVE w/ mmap).

In general, the Linux kernel's default OOM behaviour is undesirable for the vast majority of enterprise use cases. That's why RedHat and many other vendors used to disable it by default (unknown who still does).

Why is it bad? Simple; imagine your giant database is running with a large address space mapped. Another program decides to allocate a large amount of memory. The Linux kernel sees the tasty database target, kills it, and gives the smaller program its memory. Congratulations, your database just went poof.

There's an article that discusses the advantages/disadvantages with respect to Solaris here:

http://www.oracle.com/technetwork/server-storage/solaris10/s...

The article (while old) still applies today.

otterley · on April 6, 2014

> On Linux, somewhat infamously, malloc never fails. It will always return a pointer to some fresh part of the address space. It is able to do this because, in turn, sbrk/anonymous mmap never fails - it always allocates some fresh address space. It is able to do this because Linux does not allocate physical memory (or swap) when it assigns address space, but when that address space is actually used. It will happily allocate more address space than it has memory for - a practice known as 'overcommit'. So, on Linux, you can indeed not worry about malloc failing

True. However, you can disable this behavior if you like by running 'sysctl vm.overcommit_memory=2'; see proc(5).

asveikau · on April 6, 2014

> On Linux, somewhat infamously, malloc never fails.

Pretty close to true but I think that is a bit of a simplification. I seem to recall for instance on 32-bit Linux it's not hard to get malloc to return NULL: ask for some absurd size, like maybe a few allocations of a gigabyte or two, something that fits in a size_t but a 32-bit address space could not possibly accommodate with all the other things in the address space (stacks, your binary, libraries, kernel-only addresses in the page table, etc).

ams6110 · on April 7, 2014

oom killer always seems to target sshd in my encounters. Handy when you're at home.

joosters · on April 6, 2014

Most Linux distributions run using an optimistic memory allocation system, whereby memory (RAM plus swap space) can be over-allocated. On these systems, your program can die due to lack of memory at any point in time. I.e. Even if you test the return values of every malloc() call, you still won't be safe.

smtddr · on April 6, 2014

I did not believe you, but then I did "man malloc" and sure enough, in the NOTES section at the bottom.

>>By default, Linux follows an optimistic memory allocation strategy. This means that when malloc() returns non-NULL there is no guarantee that the memory really is available.

So it's like airlines overbooking seats; the system just hopes that the memory is available when you actually try to use it. I had no idea. That would be an extremely annoying bug to try and track down. How would one even do it? Is there a way to test if you truly have the memory without segfaulting?

comex · on April 6, 2014

If the memory isn't available, it doesn't segfault; it blocks until memory is available, thanks to either swap or the OOM killer doing its thing.

If you want to ensure that you don't block, you can use mlock().

joosters · on April 6, 2014

That's not quite true. The problem is more insidious that that.

Think of the memory requirements of the fork() system call. It clones the current process, making an entire copy of it. Sure, there's lots of copy-on-write optimisation going on, but if you want guaranteed, confirmed memory, you need to have a backing store for that new process. The child process has every right to adjust all of its process memory.

So if a 4GB process calls fork(), you will suddenly need to reserve 4GB of RAM or new swap space for it. Or if you can't allocate that, you will have to make the fork() fail.

This can be terrible for users, since most often a process is going to fork() and then exec() a very small program. And it seems nonsensical for fork() to fail with ENOMEM when it appears that there is lots of free memory left. But to ensure memory is strictly handled, that's what you have to do.

The alternative, which most distributions use, is to optimistically allocate memory. Let the fork() and other calls succeed. But you run the risk of the child process crashing at any point in the future when it touches its own 'confirmed' memory and the OS being unable to allocate space for it. So the memory failures are not discovered at system call points. There's no return code to spot the out of memory condition. The OS can't even freeze up until memory is available because there's no guarantee that memory will become available.

smtddr · on April 6, 2014

Well that's even more interesting! So you can have a program appear to be stuck and not know why! At least now I know I can use mlock() everywhere to determine if it locked on a write to promised-but-not-yet-available memory.

joosters · on April 6, 2014

No. All you can do is try writing to it and see what happens (you can catch the segfault though).

http://opsmonkey.blogspot.co.uk/2007/01/linux-memory-overcom... has some more info.

srean · on April 6, 2014

Failure or not, this is the highway to shitty software with bad user experience (except for very special cases where it makes sense).

For me the funniest part has been that the people who seem entitled to write sloppy software are the exact same set who would have the shrillest voices complaining that firefox is so slow and bloated (although its not anymore)

Many believe that its OK to hog memory, that it is an infinite resource. Many believe it is OK to be slow as long as it meets specs. Many believe your user application is the only application that the user will be running at any point in time. However, when your competition does it leaner and faster, you (not you personally, a generic software) are mostly going to be toast.

ithkuil · on April 7, 2014

many people believe that over engineering is bad. It doesn't mean that it's OK to have crappy software, but that you should focus on things that matter. In other words, it's OK to do X until it's not.

On the other hand that's not an excuse for not understanding how things work and just having faith in some magic layer that somehow would just handle things for you. Doing so it could make it impossible to improve those parts of the system that are important without rewriting everything.

fprawn · on April 6, 2014

Memory allocation failures are virtually non-existent in modern desktop computers. Good practice is to not test return values from malloc, new, etc.

Memory can be allocated beyond RAM size, so by the time a failure occurs your program really should crash and return its resources.

Embedded systems have fewer resources and some will not have virtual memory and so the situation will be different. But unless you know better, the best practice is still to not check the return from allocators. Running out of memory in a program intended for an embedded platform should be considered a bug.

cbab · on April 6, 2014

I respectfully disagree with this. Ignoring return values is _not_ good practice. It is a slippery slope to bad software. By catching these memory errors, a program has the chance to properly teardown and report a message to the user instead of crashing.

yaur · on April 6, 2014

Ugh. I would much rather know that the process died because of allocation failure than try to figure out why some code is trying to write to a random null pointer as these are two very different types of bugs.

htns · on April 6, 2014

I'm having a hard time picturing a situation where it would be tough to figure out. Typically you allocate memory to use it right after. errno will be set too.

Of course, there is no reason to not do all your allocation through a wrapper function which does check and abort on failure. I think the point was that surviving malloc failures is a dubious approach - instead go all in, or if it's a long-running service, provide a configurable max memory cap and assume that much will be available.

deathanatos · on April 6, 2014

In one scenario, the process writes "Out of memory." or similar to stderr. In the other, it segfaults. Maybe.

I'll take the clear error message.

yaur · on April 7, 2014

In the case of my day job (CCTV application 80% C#, 20% C and C++) writing to a bad pointer will get reported as an AccessViolationException with no hope of getting a dump or a stack trace of the native code. An allocation failure will get translated into an OutOfMemoryException and typically includes stats of what is consuming RAM.

fprawn · on April 6, 2014

I'll clarify this. I'm not saying you shouldn't ever check return values, that's obviously not the right thing to do. And of course there are exceptions to the general rule. If you're allocating a large chunk of memory and there's a reasonable expectation that it could fail, that should be reported, of course.

In the general case, however, if allocating 100 bytes fails, reporting that error is also likely to fail. An actual memory allocation failure on a modern computer running a modern OS is a very rare and very bad situation. It's rarely recoverable.

It's not bad to handle allocation failures, but in the vast majority of cases it's very unreasonable to do so. You can write code for it if you want, have fun.

And just to be completely clear, I am ONLY talking about calls to malloc, new, realloc, etc. NOT to OS pools or anything like that. Obviously, if you allocate a 4Mb buffer for something (or the OS does for you), you expect that you might run out. This is ONLY in regards to calls to lower level heap allocators.

I don't think you'll find any experienced programmer recommending that you always check the return from malloc. That's completely absurd. There are always exceptions to the rule, however.

asveikau · on April 6, 2014

> In the general case, however, if allocating 100 bytes fails, reporting that error is also likely to fail. An actual memory allocation failure on a modern computer running a modern OS is a very rare and very bad situation. It's rarely recoverable.

I call BS on this. First of all, it's not the 100 byte allocation that is likely to fail; chances are it's going to be bigger than 100 bytes and the 100 byte allocation will succeed. (Though that is not 100% either.) Second, the thing you're going to do in response to an allocation failure? You're going to unwind the stack, which will probably lead to some temporary buffers being freed. That already gets you more space to work with. (It's also untrue that you can't report errors without allocating memory but that's a whole other story...)

I suspected when I wrote in this thread that I'd see some handwavy nonsense about how it's impossible to cleanly recover from OOM, but the fact is I've witnessed it happening. I think some people would just rather tear down the entire process than have to think about handling errors, and they make up these falsities about how there's no way to do it in order to self justify... Although, when I think back to a time in which I shared your attitudes, I think the real problem was that I hadn't yet seen it being done well.

fprawn · on April 7, 2014

If you have time, can you expound on this? Is there, perhaps, an open source project that handles NULL returns from malloc in this way you could point me to?

asveikau · on April 7, 2014

My first instinct is to say look at something kernel-related. If an allocation fails, taking down the entire system is usually not an option (or not a good one anyway). Searching http://lxr.linux.no/ for "kmalloc" you see a lot of callers handling failure.

asveikau · on April 7, 2014

Adding a bit more after the fact: most well-written libraries in C are also like this. It's not a library's business to decide to exit the process at any time. The library doesn't know if it's some long-running process that absolutely must keep going, for example.

nknighthb · on April 7, 2014

I'm not sure what you imagine when you say "modern computer running a modern OS". Does this not include anything but desktop PCs and laptops? Because phones and tablets have some rather nasty memory limits for applications to deal with, which developers run into frequently.

The space I work in deals with phones and tablets, as well as other embedded systems (TVs, set-top boxes, etc.) that tend to run things people think of as "modern" (recentish Linux kernels, userlands based on Android or centered around WebKit), while having serious limits on memory and storage. My desktop calendar application uses more memory than we have available on some of these systems.

In these environments, it is essential to either avoid any possibility of memory exhaustion, or have ways to gracefully deal with the inevitable. This is often quite easy in theory -- several megabytes of memory might be used by a cached data structure that can easily be re-loaded or re-downloaded at the cost of a short wait when the user backs out of whatever screen they're in.

But one of the consequences of this cavalier attitude to memory allocation is that even in these constrained systems, platform owners have mandated apps sit atop an ever-growing stack of shit that makes it all but impossible for developers to effectively understand and manage memory usage.

pjmlp · on April 6, 2014

That is the open path to security exploits.

aktau · on April 7, 2014

See xmalloc [1] and friends, courtesy of the Git project.

[1]: https://github.com/git/git/blob/master/wrapper.c