Debugging severe memory corruption or memory leaks is annoying, and can occasionally take a lot of time, but it's not necessarily that bad. Here are some pointers that may be helpful.

Tools: valgrind and gdb are obvious. But don't forget your compiler! Crank up the warnings, and look through clang's -fsanitize=<foo> options (address, leak, undefined and so on) and its warning options. (Also, if you're already on OpenBSD, check out the "S" flag to malloc; if you're on Solaris, check out, well, the blog post.) Finally, Boehm's conservative garbage collector has a "find memory leaks" mode, which looks useful for those cases where you can't get valgrind working. If all else fails, shovel through the memory dump looking for repeated patterns.
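To make the compiler suggestion concrete, here is a minimal sketch of a deliberately leaky function. The name leaky_length is made up for illustration. Built with clang and -fsanitize=address, a program that calls it should get a leak report at exit on platforms where leak detection is enabled (Linux by default; other platforms may need extra configuration).

```c
/* Suggested build (an assumption about your setup; adjust to taste):
 *   clang -g -O0 -Wall -Wextra -fsanitize=address leak.c
 */
#include <stdlib.h>
#include <string.h>

/* Deliberately leaky, for demonstration only: copies `s`, measures the
 * copy, and never frees it. Returns 0 if the allocation fails. */
size_t leaky_length(const char *s)
{
    char *copy = malloc(strlen(s) + 1);
    if (copy == NULL)
        return 0;
    strcpy(copy, s);
    return strlen(copy);    /* BUG: `copy` is never freed */
}
```

Call it from main() a few times and the sanitizer's report should point at the malloc() above, with a stack trace if you built with -g.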

Testing: try to reproduce the problem; the first iteration may look something like "it runs out of memory after 36 hours". Then simplify: for instance, the author of the article could have asked "does this still happen if the server closes the connection immediately, without sending any data?" and would have found the bug very quickly. (Of course, you're likely to ask a lot of wrong questions before hitting on the right one; experience and a full knowledge of the system you're working on are useful but not sufficient.) Questions like "does this happen more quickly if we ping 100 times per second instead of once every ten minutes?" are often useful as well. (Finally, just printing memory usage every N seconds is helpful.)
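For the "print memory usage every N seconds" part, a rough sketch using POSIX getrusage() might look like the following. The helper names are made up, and note that this reports peak resident set size rather than current usage, which is still enough to see steady growth from a leak. The units differ by platform (kilobytes on Linux, bytes on macOS).

```c
#include <stdio.h>
#include <sys/resource.h>

/* Peak resident set size as reported by getrusage(), or -1 on failure.
 * Units are platform-dependent: kilobytes on Linux, bytes on macOS. */
long peak_rss(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_maxrss;
}

/* Hypothetical helper: call this from a timer or from the main loop
 * every N seconds and watch whether the number keeps climbing. */
void log_peak_rss(FILE *out)
{
    fprintf(out, "peak RSS: %ld\n", peak_rss());
}
```

Something like log_peak_rss(stderr) once a minute during a soak test is usually enough to tell "grows forever" apart from "grows, then levels off".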

Coding: be careful when writing code. The usual ways of improving code quality (e.g. code reviews) work to reduce memory leaks, too. Try to run a multiple-hour soak test every so often during development (preferably on a CI server); it's a lot easier to debug "hey, we suddenly run out of memory after yesterday's commits" than "well, something goes wrong in production". If you're doing new development, consider alternatives to malloc() - arena/pool allocation (e.g. libtalloc) is convenient and very fast if your memory use is tree-like (e.g. a connection owns a request, which owns some memory to sort the data before returning it). In C, goto a single chunk of cleanup-and-return code rather than duplicating the cleanup at every place where you exit from the function.
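The goto pattern is easier to show than to describe. Here is a minimal sketch, with count_bytes as a made-up example function: every error path jumps to one cleanup label, so there is exactly one place where resources get released.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical example: returns the number of bytes in the file at
 * `path`, or -1 on any error. All exits go through `cleanup`. */
long count_bytes(const char *path)
{
    long result = -1;       /* assume failure until proven otherwise */
    long total = 0;
    FILE *fp = NULL;
    char *buf = NULL;
    size_t n;

    fp = fopen(path, "rb");
    if (fp == NULL)
        goto cleanup;

    buf = malloc(4096);
    if (buf == NULL)
        goto cleanup;

    while ((n = fread(buf, 1, 4096, fp)) > 0)
        total += (long)n;
    if (ferror(fp))
        goto cleanup;

    result = total;

cleanup:
    free(buf);              /* free(NULL) is a no-op */
    if (fp != NULL)
        fclose(fp);
    return result;
}
```

Initializing every resource to NULL up front is what makes the single cleanup block safe to reach from any point in the function.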