A serious fraction of Redis crash reports turn out to be due to memory errors (after a crash we ask users to test their memory, since real segfaults are very rare; not everybody runs the test, and the test may produce false negatives, so the problem is likely bigger than the one we observe).
This experiment also seems to show that there is a lot of corrupted memory out there.
There is a simple fix, I wonder why it is not used:
1) Add a feature at the operating system level that, once per second, tests N memory pages picked at random. No observable performance degradation.
2) Report the problem to the user when one is found.
3) Mark the page as faulty and never use it again.
So you have memory tested basically for free, users aware of their hardware errors, and a lot less consequences for those errors.
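The per-page check itself is simple: write known patterns, read them back, and flag any bit that fails to hold its value. A minimal sketch in C (the function name is hypothetical; a real OS-level implementation would operate on physical pages it has temporarily taken offline, and would save/restore the page contents):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Test a memory region by writing complementary bit patterns and
 * reading them back. Returns 1 if every word holds every pattern,
 * 0 if any bit fails. `words` is the region length in 64-bit words. */
int memtest_region(volatile uint64_t *p, size_t words) {
    static const uint64_t patterns[] = {
        0xaaaaaaaaaaaaaaaaULL, 0x5555555555555555ULL,
        0xffffffffffffffffULL, 0x0000000000000000ULL
    };
    for (size_t i = 0; i < sizeof(patterns)/sizeof(patterns[0]); i++) {
        for (size_t j = 0; j < words; j++) p[j] = patterns[i];
        for (size_t j = 0; j < words; j++)
            if (p[j] != patterns[i]) return 0; /* stuck or flipped bit */
    }
    return 1;
}
```

The `volatile` qualifier keeps the compiler from optimizing the write-then-read sequence away, which would make the test vacuous.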
Guild Wars did this in user mode in the background while playing - it did background testing of the user's CPU, RAM, and GPU. Machines that failed the tests were flagged so that their crash reports got bucketed separately (saving us the time of trying to understand impossible crashes), and it popped up a message telling the user their computer was broken.
So even if the OS should be doing this for you, for long-running processes you could do it yourself in user mode. (I don't know if it's worth the effort, though.)
That's exactly my plan with Redis, and it is awesome to discover that it was used with success in the past! But I have a problem: since I can't access memory at a lower level, when should I test, and what?
Basically I have the following possibilities:
1) Test on malloc(), with a given probability, and perhaps only up to N bytes of the allocation, for latency concerns.
2) Do the same also at free() time.
3) From time to time, allocate a new piece of memory of a fixed size with malloc(), test it, and free it again.
"3" is the one with the minimum overhead, but 1+2 probably have a better chance of eventually hitting all the pages...
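Strategy 1 can be sketched as a thin malloc() wrapper: with some probability, pattern-test the first N bytes of the fresh allocation before handing it to the caller. Everything below is hypothetical (names, probability, byte cap), just to make the latency trade-off concrete:

```c
#include <stdlib.h>
#include <string.h>

#define TEST_ONE_IN 100   /* test roughly 1% of allocations */
#define TEST_MAX   4096   /* cap tested bytes to bound per-call latency */

/* Write-and-verify two complementary patterns over `len` bytes.
 * Returns 1 if the memory holds both patterns, 0 otherwise. */
static int test_bytes(unsigned char *p, size_t len) {
    static const unsigned char pat[2] = { 0xaa, 0x55 };
    for (int k = 0; k < 2; k++) {
        memset(p, pat[k], len);
        for (size_t i = 0; i < len; i++)
            if (p[i] != pat[k]) return 0; /* bit failed to hold value */
    }
    return 1;
}

/* malloc() wrapper implementing strategy 1. On a detected failure the
 * bad block is deliberately leaked so its pages are never reused, and
 * a replacement allocation is returned. */
void *checked_malloc(size_t size) {
    void *p = malloc(size);
    if (p && size && rand() % TEST_ONE_IN == 0) {
        size_t n = size < TEST_MAX ? size : TEST_MAX;
        if (!test_bytes(p, n))
            return malloc(size); /* report + leak the faulty block */
    }
    return p;
}
```

Strategy 2 would be the same check applied inside free(), where clobbering the contents is always safe; strategy 3 is just the test_bytes() loop run on a throwaway fixed-size allocation from a timer.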
I don't have a broken memory module to test the different strategies against; I wonder if there is a kernel module that simulates broken memory.
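Short of a kernel module, a first approximation is user-space fault injection: a helper that occasionally flips a random bit in a buffer, simulating a bad cell, so the detection paths can be exercised without faulty hardware. A hypothetical sketch (name and probability parameter are mine):

```c
#include <stdlib.h>
#include <stddef.h>

/* With probability prob_percent/100, flip one random bit somewhere in
 * `buf`, simulating a single-bit memory error. Useful for exercising a
 * detection strategy; it cannot model stuck-at cells or address-line
 * faults the way real broken hardware (or a kernel module) would. */
void maybe_corrupt(unsigned char *buf, size_t len, int prob_percent) {
    if (len && rand() % 100 < prob_percent) {
        size_t byte = (size_t)rand() % len;
        buf[byte] ^= (unsigned char)(1u << (rand() % 8));
    }
}
```

The limitation is real: transient bit flips injected this way behave differently from a cell that is permanently stuck, so results with a simulator like this are only indicative.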
Note that Redis can already test the computer memory with 'redis-server --test-memory' but of course this requires user intervention.
Even better, albeit more expensive, would be to start using ECC memory more frequently. It seems crazy to me that we're storing billions of bits of often important data in relatively dumb storage that can't even detect if it gets corrupted.