This model seems strongly overtrained on the CV test set. Usually the improvement from LM rescoring is only about 10% relative. In the paper https://arxiv.org/pdf/2206.12693.pdf the improvement is from 10.1% WER to 3.64% WER (Table 6). Such a big improvement suggests that the LM is biased.
Also, the perplexity of the provided n-gram LM on the CV test set is just 86, and most of the 5-gram histories are already in the LM. This also suggests bias.
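For anyone who wants to reproduce this, here is a minimal sketch using the kenlm Python bindings; the LM path and the test sentence are placeholders for the actual files:

```python
# Sketch: measure perplexity and 5-gram coverage of an ARPA LM on test text.
# "lm.arpa" and the sentence list are placeholders for the real files.
import kenlm

model = kenlm.Model("lm.arpa")
sentences = ["das ist ein beispielsatz"]  # CV test transcripts go here

full_hits, total = 0, 0
for sentence in sentences:
    print(f"perplexity: {model.perplexity(sentence):.1f}")
    # full_scores yields (log10 prob, matched n-gram order, is_oov) per token;
    # a matched order of 5 means the full 5-gram history was in the LM.
    for logprob, ngram_len, oov in model.full_scores(sentence):
        total += 1
        full_hits += ngram_len == 5
print(f"{full_hits / total:.0%} of tokens matched a full 5-gram")
```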
Also in Table 6, you can see that Facebook's wav2vec 2.0 XLS-R went from 12.06% without an LM to 4.38% with a 5-gram LM. Compared to that, I found TEVR going from 10.10% to 3.64% unproblematic. The core assumption of my paper is that for German specifically, the language model is very important due to conserved (and usually mumbled) word endings.
Anyway, it's roughly a 64% relative reduction for both wav2vec2 XLS-R and TEVR. So if your criticism that I overtrained the TEVR model turns out to be correct, that would suggest that the Zimmermeister 2022 wav2vec2 XLS-R was equally overtrained, which would still make it a fair comparison w.r.t. the 16% relative improvement in WER.
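Here is the arithmetic, for reference (just a quick sanity check on the Table 6 numbers):

```python
# Relative WER reduction from adding the 5-gram LM (Table 6 numbers):
for name, before, after in [("wav2vec2 XLS-R", 12.06, 4.38),
                            ("TEVR", 10.10, 3.64)]:
    print(f"{name}: {(before - after) / before:.1%}")
# wav2vec2 XLS-R: 63.7%, TEVR: 64.0%

# Headline comparison, both with LM:
print(f"{(4.38 - 3.64) / 4.38:.1%}")  # ~16.9% relative WER improvement
```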
Or are you suggesting that all wav2vec2-derived AI models are strongly overtrained for CommonVoice? Because they seem to do very well on LibriSpeech and GigaSpeech, too.
Could you explain what you mean by "perplexity" here? Can you recommend a paper about it? I haven't read about that in any of the ASR papers I studied, so this sounds like an exciting new technique for me to learn :)
BTW, regardless of the metrics, this is the model that "works for me" in production.
BTW, BTW, it would be really helpful for research if Vosk could also publish a paper. As you can see, PapersWithCode.com currently doesn't list any Vosk WERs for CommonVoice German, despite the website reporting 11.99% for vosk-model-de-0.21.
First of all, thank you for your nice research! It is really inspiring.
> Also in Table 6, you can see that Facebook's wav2vec 2.0 XLS-R went from 12.06% without an LM to 4.38% with a 5-gram LM.
It is probably Jonatas Grosman's model, not Facebook's. Bias is a common sin among Common Voice trainers, partly because they integrate Gutenberg texts into the LM, and partly because for some languages CV sentences overlap between train and test.
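The overlap is easy to check yourself; a rough sketch (the .tsv paths follow the standard CV release layout, adjust as needed):

```python
# Sketch: count Common Voice test sentences that also occur in train.
# Paths are placeholders for a real CV release directory.
import pandas as pd

train = pd.read_csv("cv-corpus/de/train.tsv", sep="\t", usecols=["sentence"])
test = pd.read_csv("cv-corpus/de/test.tsv", sep="\t", usecols=["sentence"])

overlap = set(train["sentence"]) & set(test["sentence"])
print(f"{len(overlap)} of {test['sentence'].nunique()} test sentences also appear in train")
```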
> Zimmermeister 2022 wav2vec2 XLS-R was equally overtrained
Yes
> Or are you suggesting that all wav2vec2-derived AI models are strongly overtrained for CommonVoice? Because they seem to do very well on LibriSpeech and GigaSpeech, too.
Not all the models are overtrained; I mainly complain about the German ones. For example, Spanish is reasonable: the improvement from the LM is from 6.68 to 6.03, as expected.
> Could you explain what you mean by "perplexity" here? Can you recommend a paper about it? I haven't read about that in any of the ASR papers I studied, so this sounds like an exciting new technique for me to learn :)
> BTW, regardless of the metrics, this is the model that "works for me" in production.
Sure, but it could work even better if you take a more generic model.
> BTW, BTW, it would be really helpful for research if Vosk could also publish a paper. As you can see, PapersWithCode.com currently doesn't list any Vosk WERs for CommonVoice German, despite the website reporting 11.99% for vosk-model-de-0.21.
I now had time to do some testing, and the CER is already pretty much excellent for TEVR even without the language model, so it appears to me that what the LM mostly does is fix the spelling. In line with that, recognition performance is still good for medical words, but in some cases the LM will actually reduce quality there by "fixing" a brand name into a sequence of regular words.
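To illustrate with a toy example (made-up transcript, using the jiwer package): a single acoustically correct but misspelled word barely moves the CER, yet counts as a full word error:

```python
# Toy example: acoustically correct but misspelled output.
import jiwer

reference = "der patient bekam ibuprofen"
hypothesis = "der patient bekam ibuprophen"  # sounds right, spelled wrong

print(f"WER: {jiwer.wer(reference, hypothesis):.2f}")  # 1 of 4 words -> 0.25
print(f"CER: {jiwer.cer(reference, hypothesis):.2f}")  # 2 of 27 chars -> ~0.07
```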
Thanks for the perplexity paper :) I'll go read that now.
Sounds interesting, although as someone not that deeply into ML, these terms don't say a lot to me. What would "bias" mean in this case? That the model would recognize a "standard German" speaker but not someone from Bavaria? Because that happens to a lot of (non-Bavarian) Germans, too.
The model knows the recognition text well and demonstrates good results because of it. If you test the same model on some unrelated speech which the model hasn't seen yet, the results will not be that great; the error rate might be significantly worse than with other systems.
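A minimal sketch of such a check, assuming a Hugging Face pipeline model and the jiwer package (model id, transcripts, and file names are all placeholders):

```python
# Sketch: score a model on out-of-domain speech instead of the CV test split.
import jiwer
from transformers import pipeline

# Placeholder model id; substitute the model under discussion.
asr = pipeline("automatic-speech-recognition", model="some/german-wav2vec2")

references = ["..."]         # ground-truth transcripts of unseen recordings
audio_files = ["clip1.wav"]  # the matching audio

hypotheses = [asr(f)["text"] for f in audio_files]
print(f"out-of-domain WER: {jiwer.wer(references, hypotheses):.2%}")
```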