A serious fraction of Redis crash reports turn out to be due to memory errors (after a crash we ask users to test their memory, since real segfaults are very rare; not everybody runs the test, and the test may produce false negatives, so the problem is likely bigger than the one we observe).
This experiment also seems to show that there is a lot of corrupted memory out there.
There is a simple fix, I wonder why it is not used:
1) Add a feature at the operating system level that, once per second, tests N memory pages picked at random. No observable performance degradation.
2) Report the problem to the user when one is found.
3) Mark the page as faulty and never use it again.
So you have memory tested basically for free, users aware of their hardware errors, and a lot less consequences for those errors.
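The per-page check itself is simple: write known patterns, read them back, and flag any bit that fails to hold its value. A minimal sketch in C (the function name is hypothetical; a real OS-level implementation would operate on physical pages it has temporarily taken offline, and would save/restore the page contents):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Test a memory region by writing complementary bit patterns and
 * reading them back. Returns 1 if every word holds every pattern,
 * 0 if any bit fails. `words` is the region length in 64-bit words. */
int memtest_region(volatile uint64_t *p, size_t words) {
    static const uint64_t patterns[] = {
        0xaaaaaaaaaaaaaaaaULL, 0x5555555555555555ULL,
        0xffffffffffffffffULL, 0x0000000000000000ULL
    };
    for (size_t i = 0; i < sizeof(patterns)/sizeof(patterns[0]); i++) {
        for (size_t j = 0; j < words; j++) p[j] = patterns[i];
        for (size_t j = 0; j < words; j++)
            if (p[j] != patterns[i]) return 0; /* stuck or flipped bit */
    }
    return 1;
}
```

The `volatile` qualifier keeps the compiler from optimizing the write-then-read sequence away, which would make the test vacuous.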
Guild Wars did this in user mode in the background while playing - it did background testing of the user's CPU, RAM, and GPU. Machines that failed the tests were flagged so that their crash reports got bucketed separately (saving us the time of trying to understand impossible crashes), and it popped up a message telling the user their computer was broken.
So even if the OS should be doing this for you, for long-running processes you could do it yourself in user mode. (I don't know if it's worth the effort, though.)
That's exactly my plan with Redis, and it is awesome to discover that it was used with success in the past! But I have a problem: since I can't access memory at a lower level, when should I test, and what?
Basically I have the following possibilities:
1) Test on malloc(), with a given probability, and perhaps only up to N bytes of the allocation, for latency concerns.
2) Do the same also at free() time.
3) From time to time, allocate a new piece of memory of a fixed size with malloc(), test it, and free it again.
"3" is the one with the minimum overhead, but 1+2 probably have a better chance of eventually hitting all the pages...
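Strategy 1 can be sketched as a thin malloc() wrapper: with some probability, pattern-test the first N bytes of the fresh allocation before handing it to the caller. Everything below is hypothetical (names, probability, byte cap), just to make the latency trade-off concrete:

```c
#include <stdlib.h>
#include <string.h>

#define TEST_ONE_IN 100   /* test roughly 1% of allocations */
#define TEST_MAX   4096   /* cap tested bytes to bound per-call latency */

/* Write-and-verify two complementary patterns over `len` bytes.
 * Returns 1 if the memory holds both patterns, 0 otherwise. */
static int test_bytes(unsigned char *p, size_t len) {
    static const unsigned char pat[2] = { 0xaa, 0x55 };
    for (int k = 0; k < 2; k++) {
        memset(p, pat[k], len);
        for (size_t i = 0; i < len; i++)
            if (p[i] != pat[k]) return 0; /* bit failed to hold value */
    }
    return 1;
}

/* malloc() wrapper implementing strategy 1. On a detected failure the
 * bad block is deliberately leaked so its pages are never reused, and
 * a replacement allocation is returned. */
void *checked_malloc(size_t size) {
    void *p = malloc(size);
    if (p && size && rand() % TEST_ONE_IN == 0) {
        size_t n = size < TEST_MAX ? size : TEST_MAX;
        if (!test_bytes(p, n))
            return malloc(size); /* report + leak the faulty block */
    }
    return p;
}
```

Strategy 2 would be the same check applied inside free(), where clobbering the contents is always safe; strategy 3 is just the test_bytes() loop run on a throwaway fixed-size allocation from a timer.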
I don't have a broken memory module to test the different strategies against; I wonder if there is a kernel module that simulates broken memory.
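Short of a kernel module, a first approximation is user-space fault injection: a helper that occasionally flips a random bit in a buffer, simulating a bad cell, so the detection paths can be exercised without faulty hardware. A hypothetical sketch (name and probability parameter are mine):

```c
#include <stdlib.h>
#include <stddef.h>

/* With probability prob_percent/100, flip one random bit somewhere in
 * `buf`, simulating a single-bit memory error. Useful for exercising a
 * detection strategy; it cannot model stuck-at cells or address-line
 * faults the way real broken hardware (or a kernel module) would. */
void maybe_corrupt(unsigned char *buf, size_t len, int prob_percent) {
    if (len && rand() % 100 < prob_percent) {
        size_t byte = (size_t)rand() % len;
        buf[byte] ^= (unsigned char)(1u << (rand() % 8));
    }
}
```

The limitation is real: transient bit flips injected this way behave differently from a cell that is permanently stuck, so results with a simulator like this are only indicative.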
Note that Redis can already test the computer memory with 'redis-server --test-memory' but of course this requires user intervention.
Even better, albeit more expensive, would be to start using ECC memory more frequently. It seems crazy to me that we're storing billions of bits of often important data in relatively dumb storage that can't even detect if it gets corrupted.