How to assemble a German speech recognition program in under 300 lines. - ps: I ...

fxtentacle · on Aug 10, 2022

Yes, we actually did research and spent a significant amount of GPU time. Thanks to my luck of stumbling into the right people at the right time, I could afford cloud-scale training because OVH granted me very generous rebates...

The main innovation is that we prevent loss mis-allocation during training of the acoustic AI model by pre-weighting things with the loss of the language model. Or in short:

We don't train what you don't need to hear

If you want to play around with the TEVR tokenizer design, here's the source for that: https://huggingface.co/fxtentacle/tevr-token-entropy-predict...

tialaramex · on Aug 10, 2022

So does this means the recogniser would be worse at recognising unexpected utterances, which is roughly what you'd see with human recognition ?

What's the German equivalent of "How to Wreck a Nice Beach"?

fxtentacle · on Aug 11, 2022

Yes and no. With perfect audio quality, it'll write down almost verbatim what you said. But as the audio gets more noisy, it'll shift more and more towards the most likely interpretation.

yorwba · on Aug 10, 2022

Eishockey, Kanufahren, Wirsing.

(Alles okay. Kann noch fahren. Wiedersehen!)

cgeier · on Aug 10, 2022

> The main innovation is that we prevent loss mis-allocation during training of the acoustic AI model by pre-weighting things with the loss of the language model. Or in short:

> We don't train what you don't need to hear

This does sound a lot more interesting than the ~280 lines of code.

fxtentacle · on Aug 10, 2022

For a researcher, yes. But for understanding the trick there, you need to have read and understood the CTC loss paper.

For people like my industry clients, on the other hand, "code that is easy to audit and easy to install" is a core feature. They don't care about the research, they just want to make audio files search-able.