> M-AILABS data contains 9 language varieties under a modified BSD 3-Clause License, but there is no community-driven aspect.
There is certainly a community-driven aspect, as M-AILABS sources its data mainly from the LibriVox project, which is community-driven.
To give ballpark numbers on what it would have cost if you had had to pay people for providing the data instead of getting it for free:
It's low-skilled labour, so you'll likely find people to do it slightly above minimum wage. Let's take Germany, as I'm most familiar with its rules and because that's where Common Voice is headquartered. Minimum wage here will be €9.35/hr starting on Jan 1st. Let's say you pay them €11. There are also various employer contributions (Arbeitgeberanteile) you have to pay on top, so let's say your per-employee expense comes to €15/hour. Assume you can record and verify at 70% efficiency and you use two people to verify: that's three people per hour of audio (one recorder, two verifiers), so 3 / 0.7 ≈ 4.29 employee-hours per final result hour.
You wouldn't be limited to German at this rate either: Berlin is one of the most linguistically diverse cities in Germany.
This would give you a price of €64.35 per result hour. You'd have to pay €64k for 1,000 hours of validated training data, and €128k for the 2,000-hour figure currently achieved.
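The estimate above can be sketched as a few lines of arithmetic. All numbers are the assumptions stated in the comment (€15/hour fully loaded cost, one recorder plus two verifiers, 70% efficiency), not measured data; the intermediate value is rounded to two decimals as in the text.

```python
# Back-of-the-envelope cost estimate for paid data collection.
hourly_cost = 15.0   # € per employee-hour, incl. employer contributions
people = 3           # one recorder plus two verifiers
efficiency = 0.70    # fraction of paid time that yields usable audio

# 3 / 0.7 ≈ 4.29 employee-hours per validated result hour
hours_per_result_hour = round(people / efficiency, 2)
cost_per_result_hour = hourly_cost * hours_per_result_hour  # ≈ €64.35

for total_hours in (1000, 2000):
    print(f"{total_hours} validated hours: "
          f"€{cost_per_result_hour * total_hours:,.0f}")
```

Running this reproduces the roughly €64k and €128k figures above.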
These €128k are probably on the same order of magnitude as what Mozilla spends on the project (employee time to design, build, and run it), and if the project scales, the economics will look even better. From a business POV, going open source was thus a great idea.
To put the 2,000 hours into perspective, the Deep Speech 2 paper [1] used ~10k-hour datasets per language, while Common Voice's 2k hours are distributed among multiple languages. The record holder is probably Amazon with 1 million hours (although unlabeled) [2].
It's possible though that future breakthroughs will remove the need for tons of training data. So even if the restricted amount of training data can't create practical models in niche languages for now, it might very well be able to in the future.
I have trained with Common Voice data and it certainly performs well. However, decent speech recognition still has a long way to go. These models work well in a controlled environment with a good mic, but real-world use cases involve noise, dynamic environments, varied pitches, etc. On top of that, everyone expects your voice recognition to be as good as Google's or Alexa. I am still looking for a decent deep-learning-based solution that can work in a real environment.
[1]: https://arxiv.org/pdf/1512.02595.pdf
[2]: https://arxiv.org/pdf/1904.01624.pdf