I've never been very satisfied with these approaches for C where you hope the compiler does the right thing. It makes sense to provide some C implementation for portability's sake, but any sizeable reordering cries out for a hand-tuned, processor-specific approach (and a non-sizeable one probably doesn't require high speed). I would expect any SIMD instruction set to include a shuffle.
It can also be a good idea to swap recursively. First swap the upper and lower halves, then swap the upper and lower quarters within each half (bytes, for a 32-bit value), which can be done with only 2 masks. Then, if it's a 64-bit value, swap alternate bytes, again with only 2 masks. This can be extended all the way to a full bit reverse in 3 more lines (nibbles, bit pairs, single bits), each with 2 masks and shifts.
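To make the recursive swap concrete, here is a rough sketch in portable C. The function names (`bswap32_recursive`, `bswap64_recursive`, `bitrev32`) are just labels I made up for illustration, and this is one plausible way to write it rather than a tuned implementation; whether the compiler turns it into a single byte-swap instruction is exactly the kind of thing you'd have to check on your target.

```c
#include <stdint.h>

/* Hypothetical helper: byte-swap a 32-bit value recursively. */
static uint32_t bswap32_recursive(uint32_t x)
{
    /* Swap the upper and lower 16-bit halves. */
    x = (x << 16) | (x >> 16);
    /* Swap the bytes inside each half, using 2 masks. */
    x = ((x & 0x00FF00FFu) << 8) | ((x & 0xFF00FF00u) >> 8);
    return x;
}

/* Hypothetical helper: the same idea for 64 bits needs one more level. */
static uint64_t bswap64_recursive(uint64_t x)
{
    /* Swap the upper and lower 32-bit halves. */
    x = (x << 32) | (x >> 32);
    /* Swap the 16-bit quarters inside each half. */
    x = ((x & 0x0000FFFF0000FFFFull) << 16) | ((x & 0xFFFF0000FFFF0000ull) >> 16);
    /* Swap alternate bytes. */
    x = ((x & 0x00FF00FF00FF00FFull) << 8) | ((x & 0xFF00FF00FF00FF00ull) >> 8);
    return x;
}

/* Hypothetical helper: full 32-bit bit reverse, 3 more mask-and-shift lines. */
static uint32_t bitrev32(uint32_t x)
{
    x = bswap32_recursive(x);
    /* Swap the nibbles inside each byte. */
    x = ((x & 0x0F0F0F0Fu) << 4) | ((x & 0xF0F0F0F0u) >> 4);
    /* Swap the bit pairs inside each nibble. */
    x = ((x & 0x33333333u) << 2) | ((x & 0xCCCCCCCCu) >> 2);
    /* Swap the adjacent bits inside each pair. */
    x = ((x & 0x55555555u) << 1) | ((x & 0xAAAAAAAAu) >> 1);
    return x;
}
```

For example, `bswap32_recursive(0x11223344u)` gives `0x44332211`, and `bitrev32(1u)` gives `0x80000000`. Each level halves the size of the units being exchanged, so the number of lines grows with the log of the word width rather than with the number of bytes.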