Since music quality and stereo are not required, a speech codec could be used. I think TSAC beats most of them on raw bitrate, but not on energy efficiency or speed. SILK, for example, goes down to 6 kbps, so it could be a contender.
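
For a sense of scale, here is a back-of-the-envelope sketch of what 6 kbps means in storage terms. It assumes a constant bitrate with 1 kbps = 1000 bit/s and ignores container and packet overhead; `bytes_per_hour` is just a name made up for this illustration, not part of any codec's API.

```python
def bytes_per_hour(bitrate_kbps: float) -> float:
    """Payload size of one hour of audio at a constant bitrate.

    Assumes 1 kbps = 1000 bit/s and no container or packet overhead.
    """
    bits_per_second = bitrate_kbps * 1000
    return bits_per_second / 8 * 3600


# SILK's 6 kbps floor, expressed in megabytes per hour of speech.
silk_floor_mb = bytes_per_hour(6) / 1e6
print(f"{silk_floor_mb:.1f} MB per hour")
```

So the 6 kbps floor works out to roughly 2.7 MB per hour of speech, before any overhead.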

Or maybe you do want really good quality, so that the voices can be fingerprinted. Vocoder artifacts can give parties plausible deniability ("that's not my voice").