How to assemble a German speech recognition program in under 300 lines.
-
ps: I skimmed the paper cited and what I wrote above is -not- correct. The project is not simply assembling a pipeline; it is claiming an innovation (listed on 'Papers with Code'):
"This paper presents TEVR, a speech recognition model designed to minimize the variation in token entropy w.r.t. the language model. This takes advantage of the fact that if the language model will reliably and accurately predict a token anyway, then the acoustic model doesn't need to be accurate in recognizing it."
...
"We have shown that when combined with an appropriately tuned language model, the TEVR-enhanced model outperforms the best German automated speech recognition result from literature by a relative 44.85% reduction in word error rate. It also outperforms the best self-reported community model by a relative 16.89% reduction in word error rate."
Yes, we actually did research and spent a significant amount of GPU time. Thanks to my luck of stumbling into the right people at the right time, I could afford cloud-scale training because OVH granted me very generous rebates...
The main innovation is that we prevent loss mis-allocation during training of the acoustic AI model by pre-weighting things with the loss of the language model. Or in short:
We don't train what you don't need to hear
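To make the pre-weighting idea concrete, here's a minimal sketch of what "don't spend acoustic training loss on tokens the language model already predicts" could look like. All function names, the entropy measure, and the mean-normalization are my own illustration, not the paper's actual implementation:

```python
import math

def token_entropy(probs):
    """Shannon entropy (in bits) of a next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def entropy_weights(lm_distributions):
    """One weight per target token. Tokens the language model already
    predicts with near-zero entropy get a small weight, so the acoustic
    model's training signal is not spent on them. Normalized so the
    mean weight is 1 and the overall loss scale stays comparable."""
    ents = [token_entropy(d) for d in lm_distributions]
    mean = sum(ents) / len(ents)
    return [e / mean for e in ents]

def weighted_loss(per_token_losses, weights):
    """Average of the acoustic per-token losses, re-weighted beforehand."""
    return sum(l * w for l, w in zip(per_token_losses, weights)) / len(weights)
```

For example, a token the LM predicts with 98% confidence gets a much smaller weight than one where the LM is uniformly uncertain, steering the acoustic model's capacity toward the tokens it actually needs to hear.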
Yes and no. With perfect audio quality, it'll write down almost verbatim what you said. But as the audio gets more noisy, it'll shift more and more towards the most likely interpretation.
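The "shifts toward the most likely interpretation" behavior falls out of combining acoustic and language-model scores at decode time. A toy sketch of that effect (the candidate words, all scores, and the fusion weight are made up for illustration; this is shallow-fusion-style scoring, not the project's actual decoder):

```python
def pick_transcript(candidates, lm_weight=0.5):
    """Each candidate is (word, acoustic_logprob, lm_logprob).
    Combine the acoustic score with a weighted language-model score
    and keep the best-scoring candidate."""
    best = max(candidates, key=lambda c: c[1] + lm_weight * c[2])
    return best[0]

# Clean audio: the acoustic model is confident, so the actually-spoken
# but less common word "mir" beats the more frequent "wir".
clean = [("mir", -0.1, -4.0), ("wir", -3.0, -0.5)]

# Noisy audio: the acoustic scores flatten out, so the language
# model's prior takes over and "wir" wins instead.
noisy = [("mir", -1.0, -4.0), ("wir", -1.0, -0.5)]
```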
> The main innovation is that we prevent loss mis-allocation during training of the acoustic AI model by pre-weighting things with the loss of the language model. Or in short:
> We don't train what you don't need to hear
This does sound a lot more interesting than the ~280 lines of code.
For a researcher, yes. But to understand the trick, you need to have read and understood the CTC loss paper.
For people like my industry clients, on the other hand, "code that is easy to audit and easy to install" is a core feature. They don't care about the research; they just want to make audio files searchable.
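For readers who haven't seen the CTC paper mentioned above: its core trick is that many frame-level output paths collapse to the same transcript, and the loss sums over all of them. A toy sketch of just the collapse rule (not the full forward-backward loss computation):

```python
def ctc_collapse(path, blank="-"):
    """CTC's many-to-one mapping from frame-level paths to transcripts:
    merge consecutive repeated symbols, then drop the blank symbol.
    The CTC loss marginalizes over every path that collapses to the
    target transcript."""
    out, prev = [], None
    for symbol in path:
        if symbol != prev and symbol != blank:
            out.append(symbol)
        prev = symbol
    return "".join(out)
```

So `"hhaal-lloo"` and `"h-all-o-"` both collapse to `"hallo"`, and the blank is what lets genuinely doubled letters survive.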