> M-AILABS data contains 9 language varieties under a modified BSD 3-Clause License, but there is no community-driven aspect.
There is certainly a community-driven aspect, as M-AILABS sources its data mainly from the LibriVox project, which is community-driven.
To give ballpark numbers on what it would have cost if you had had to pay people for providing the data instead of getting it for free:
It's low-skilled labour, so you'll likely find people to do it slightly above minimum wage. Let's take Germany, as I'm most familiar with its rules and because that's where Common Voice is headquartered. Minimum wage here will be €9.35/hr starting on Jan 1st. Let's say you pay them €11. There are also various employer contributions (Arbeitgeberanteile) you have to pay on top, so let's say your per-employee expense comes to €15/hour. Assume you can record and verify at 70% efficiency and you use two people to verify: that's three people per hour of audio (one recorder, two verifiers), so 3 / 0.7 ≈ 4.29 employee-hours per final result hour.
You wouldn't be limited to German at this rate either: Berlin is one of the most linguistically diverse cities in Germany.
This would give you a price of €64.35 per result hour. You'd have to pay €64k for 1,000 hours of validated training data, and €128k for the 2,000-hour figure currently achieved.
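The estimate above can be sketched as a few lines of arithmetic. All numbers are the assumptions stated in the comment (€15/hour fully loaded cost, one recorder plus two verifiers, 70% efficiency), not measured data; the intermediate value is rounded to two decimals as in the text.

```python
# Back-of-the-envelope cost estimate for paid data collection.
hourly_cost = 15.0   # € per employee-hour, incl. employer contributions
people = 3           # one recorder plus two verifiers
efficiency = 0.70    # fraction of paid time that yields usable audio

# 3 / 0.7 ≈ 4.29 employee-hours per validated result hour
hours_per_result_hour = round(people / efficiency, 2)
cost_per_result_hour = hourly_cost * hours_per_result_hour  # ≈ €64.35

for total_hours in (1000, 2000):
    print(f"{total_hours} validated hours: "
          f"€{cost_per_result_hour * total_hours:,.0f}")
```

Running this reproduces the roughly €64k and €128k figures above.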
These €128k are probably on the same order of magnitude as what Mozilla spends on the project (employee time to design, build, and run it), and if the project scales, the economics will look even better. From a business POV, going open source was thus a great idea.
To put the 2,000 hours into perspective, the Deep Speech 2 paper [1] used ~10k-hour datasets per language, while Common Voice's 2k hours are distributed among multiple languages. The record holder is probably Amazon with 1 million hours (although unlabeled) [2].
It's possible though that future breakthroughs will remove the need for tons of training data. So even if the restricted amount of training data can't create practical models in niche languages for now, it might very well be able to in the future.
I have trained with Common Voice data and it certainly performs well. However, decent speech recognition still has a long way to go. These models work well in a controlled environment with a good mic, but real-world use cases involve noise, dynamic environments, varied pitches, etc. On top of that, everyone expects your voice recognition to be as good as Google's or Alexa. I am still looking for a decent deep-learning-based solution that can work in a real environment.
[1]: https://arxiv.org/pdf/1512.02595.pdf
[2]: https://arxiv.org/pdf/1904.01624.pdf