
The audio corpus Mozilla DeepSpeech is trained on is neither very large (about 2,000 hours) nor very diverse (e.g. mostly male, native American-English voices), so the resulting model copes poorly with noise, accents, and other variation.

By comparison, Baidu had 5,000 hours of English to train its versions of DeepSpeech and DeepSpeech2 on, and so achieved better results years ago. Google, Microsoft, IBM and other companies have users contributing more audio samples every day, enabling much higher-quality speech-to-text.

Mozilla's Common Voice project currently has only 1,492 hours of validated English: https://commonvoice.mozilla.org/en/datasets


