The goal of making a large, publicly available training corpus for ASR is incredibly admirable, but this approach is problematic. People speak entirely differently when reading from a script. Models trained on read speech (like LibriSpeech) generally don't perform well on spontaneous speech test sets (like Switchboard). Transcribing speech that was read from a script isn't a particularly interesting problem.
This effort would be more interesting if it could collect speech data in a more specific domain, like web search queries.
I have a number of general English speech recognition models for wav2letter, trained on ~3000 hours of non-Switchboard audio, including 1k hours from Common Voice and several datasets I've collected or found myself.
Common Voice has optional fields for age and accent in the dataset, but I don't use them. I just toss everything into a single bin.
I've had users with many accents (including e.g. thick UK, Indian, and German accents) report that my 3k-hour model including Common Voice performed significantly better for their accent than a LibriSpeech-only model, and also much better than the older macOS English speech recognition models. (And how a model feels for a user in the real world is something you can't really get by just testing against Switchboard.)
I can't afford Switchboard, so I'm very glad that datasets like Common Voice, LibriSpeech, and TED-LIUM exist. I hope at some point I can get the model running in a website; it really does feel pretty good to use.
For Talon we developed a custom speech collection site with prompts we control: https://speech.talonvoice.com - I tried it out with the TIMIT prompt list initially, but right now I'm recording dense command-like speech, which I've found some of my more experienced users are able to say naturally/quickly, not like they're reading from a prompt.
I don't have a separate verification process, because I've had a lot of success with a process that automatically prunes inputs that "obviously make the model much worse". The site is basically designed to record as fast as possible, with keyboard shortcuts to go to the next item and start/stop recording so you can almost record nonstop, which ends up being slightly less forced than just reading one sentence.
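A simple version of that kind of pruning (just a sketch, not necessarily what my pipeline does; `transcribe` is a stand-in for running the current model on a clip, and the threshold is arbitrary): run the existing model over each clip and drop anything that's wildly off from its prompt.

    # Sketch: prune clips the current model can't even roughly reconcile
    # with their prompt. `transcribe` is a hypothetical callable that runs
    # the existing model on an audio file and returns a transcript.

    def cer(hyp: str, ref: str) -> float:
        """Character error rate: Levenshtein distance / reference length."""
        prev = list(range(len(ref) + 1))
        for i, h in enumerate(hyp, 1):
            cur = [i]
            for j, r in enumerate(ref, 1):
                cur.append(min(prev[j] + 1,              # deletion
                               cur[j - 1] + 1,           # insertion
                               prev[j - 1] + (h != r)))  # substitution
            prev = cur
        return prev[-1] / max(len(ref), 1)

    def prune(clips, transcribe, max_cer=0.8):
        """Keep (audio_path, prompt) pairs whose prompt is roughly recoverable."""
        kept = []
        for audio_path, prompt in clips:
            hyp = transcribe(audio_path)        # run the existing model
            if cer(hyp, prompt) <= max_cer:     # drop obviously bad recordings
                kept.append((audio_path, prompt))
        return kept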
The site is pretty reusable, I previously used it at noise.talonvoice.com to record "noise recognition" samples. Here's the source of the current speech site, if someone wants to spin up a common-voice-lite for a specific domain it's pretty easy (just need to run a python app somewhere): https://github.com/talonvoice/noise/tree/speech-dataset
I'm also using test sets to measure my models, including test sets from datasets I'm not training on. But I called it out because the real-world difference is _significant_ for this task: users with heavy accents previously couldn't say most mid-sized sentences without having to try again and "americanize" their voice for at least one of the words. With Common Voice mixed in, that wasn't the case.
As the name implies, LibriTTS is meant for people (researchers, really) to develop text-to-speech (TTS) systems.
The point isn't that LibriSpeech isn't clean enough. Rather, it's that conversational speech is very different from read speech (which is based on written text). Everything has an effect, even planning of utterances (think: hesitations, "uhm", "uh"), turn taking behaviors (think: how speakers negotiate taking turns), how speakers self-correct, phonetic convergence (think: speakers adapting their speech to be more similar to that of their interlocutor), and so on.
The Common Voice data won't help with that, as it's read speech. It's far more expensive to collect conversational speech datasets, as transcription (or correction of automatic transcripts) involves a lot of manual labor.
Why do you view this as admirable? I view it similarly to creating a large cache of explosives and munitions and then giving them away for free. Sure a lot of people might have some fun with them on the weekend but the largest impact of this technology will be used against people, not for them.
With your line of reasoning, would you prefer a world where all the munitions were held by megavillains, or a world where both megavillains and regular people had access to the munitions?
I've tried to contribute to new languages, but they all have a "join us" button that subscribes you to a generic Mozilla mailing list, and the Voice team never sent any instructions on how to contribute! No wonder these have been stuck unfinished for years.
Sometimes it's beyond me how tech-savvy people struggle with the easiest things online.
The page itself has very few links. Randomly clicking through it while logged in gives you a good chance of hitting something that takes you to where you can CONTRIBUTE, either directly (https://voice.mozilla.org/de/speak), via your dashboard (which also links to the CONTRIBUTE part), or via the huge microphone symbol.
I'm talking about new languages, which haven't been launched yet, because they don't have enough text to record yet.
The actual method to contribute to them is finding Mozilla's internal translation tool, proposing translations, then finding a Mozillian to approve them. Then there's an obscure webapp for sentence submission and voting, but it doesn't give any feedback on whether the sentences you've submitted were accepted or went to /dev/null.
(You see people struggling with the easiest things because you assume that's what's happening, rather than considering that you've misunderstood the problem.)
There is feedback in Sentence Collector. If your sentence is too long, it will tell you straight away. If your sentence is rejected by people, it will appear in the "rejected sentences" tab. If your sentences are approved, you can see them in the GitHub repo.
What GitHub? Please realize that none of this is obvious to an outsider who doesn't know how your pipeline is built. None of this is explained on the Voice website. These tools aren't even linked to on the official website.
I've submitted 200 sentences, and none of the progress counters shown to me increased by 200, so I assumed they were lost, and gave up.
In mozilla/voice-web/server/data/. But if the profile tab in Sentence Collector shows 0 sentences added, that means your sentences never even made it into Sentence Collector for approval.
The people that run the voice program at Mozilla seem to be very disconnected from the rest of Mozilla. They are essentially an "innovation team" that jumps from project to project across the company, acting almost as an agency inside Mozilla. This model seems to have really made projects like this suffer, because there isn't a team always there to iterate on the project and make it better.
They're just not very clear. If your language is "in progress", then you need to collect 5k sentences. They have a special web app called "Sentence Collector", in which you can approve sentences or add new ones. Each sentence has to be approved by 3 people. After sentences are approved, they get released to GitHub. Once 5k sentences are approved, you will be able to record audio and listen to recordings just like in any other "big" language. In practice you need only a very small number of people to do it; 4-5 is enough.
I love the project and have participated. What I feel is lacking are cases/words that kids are able to pronounce (with the help of parents saying what they have to repeat).
My 3-year-old son loves to use speech translation and voice search for videos, but it frustrates both me and him that he pronounces things in a way that "all" humans would understand, yet the voice-to-text gets his voice wrong in 3/4 of the cases.
Collecting data from children is nearly impossible for legal reasons. I mostly agree with the reasons, but the side effect is nobody has good data on children. Thus children are forever doomed to a bad experience, like the time my son asked "Hey Mycroft, how do you spell Kansas" "C-A-N-V-A-S" which he knew was wrong.
Maybe we could specifically target voice actors who convincingly voice children in animated stuff
Edit:
- There are already paid child actors; you could somewhat-manually collect their speech from e.g. movies and TV shows to have _something_
- Even if there's some copyright issue with distributing their audio directly, it's not clear (uncertain but dubious?) that a model trained on their audio would have any copyright concerns as long as it can't be used to reproduce the original audio
- What is Mozilla going to do if <1% of their dataset is already children who didn't put an age in? Is that a COPPA violation? There's even the defense of "an adult can sound like a child and we also don't know who this child is so how is it personal information" (I have no idea the usefulness of any of that)
I did some of the "listen" exercises to validate how people pronounced some sentences, and I got a few people who spoke with very strong (Indian, Nigerian, UK, ...) accents. How would you take these things into account? Just average all of them and hope for the best? Not sure how to approach that. I don't think it's very straightforward. Interesting problem though.
However, you can't do anything if you don't even have the data, so props to Mozilla for starting this.
The accents are sometimes tagged in the metadata CSV files, so it’s possible to filter some of them out.
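As a rough sketch of that filtering (treat the file name and column names here as placeholders; they vary between Common Voice releases):

    import csv

    # Sketch: keep only clips whose self-reported accent is in a wanted set.
    # "validated.tsv", "accent", and "path" are placeholders for whatever the
    # release you downloaded actually uses.
    def clips_with_accent(metadata_path, wanted_accents):
        keep = []
        with open(metadata_path, newline='', encoding='utf-8') as f:
            for row in csv.DictReader(f, delimiter='\t'):
                accent = (row.get('accent') or '').strip().lower()
                if accent in wanted_accents:
                    keep.append(row['path'])  # relative path to the audio clip
        return keep

    # e.g. clips_with_accent('validated.tsv', {'indian', 'england'})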
Mozilla DeepSpeech used to release checkpoints trained on Common Voice along with a few others (LibriSpeech etc). But they’ve dropped CV in the latest releases and just rely on the others.
I think the others are more standardised in terms of accents.
It’s likely possible to fine-tune a model with different accents: as long as the language is the same, the model can update the phonemes it recognises.
In the worst case, you might have to treat different accents as different, but related languages. The current trend for low-resource languages seems to be about using one giant model for all languages to make use of shared features (e.g. for translation [1]), so adding even more languages might not be that expensive in terms of training data required.
Locales can help distinguish country-specific variants (e.g. US vs. UK spelling), but pronunciation varies a lot at the sub-national level. If you want to support all possible accents, you'll need a more fine-grained encoding.
I think locales are for dialects, where you can have different terms used for the same concept. Here you can have someone speak the en-NZ dialect, but with a French accent.
Also, we would need en-FR, en-ES, en-IT, etc. All languages as spoken from all other native languages. And obviously the strength of the accent varies.
Yeah, also a Moravian (ancient nation united with the Czechs more than 1000 years ago) person will speak English differently than a Czech (as in nation, not state) person even though we all speak the Czech language.
A solution could be to tag both the dialect and the accent with language codes. Native speakers of Moravian Czech will probably have similar accents when they speak New Zealand English. Using Glottolog IDs as tags for example, this might be represented as { dialect:"newz1240", accent:"czec1259" }. If the program can already recognise the New Zealand English dialect and the Moravian Czech accent, it might then leverage both of those to recognise the speech of a Moravian person speaking New Zealand English.
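As a toy sketch of what that per-utterance metadata could look like (the Glottolog IDs are just the ones from the example above; none of this is an existing Common Voice field):

    from dataclasses import dataclass

    # Toy representation of separate dialect ("what is spoken") and accent
    # ("how it is pronounced") tags, using Glottolog IDs as in the example.
    @dataclass(frozen=True)
    class SpeechTags:
        dialect: str  # e.g. "newz1240" = New Zealand English
        accent: str   # e.g. "czec1259" = (Moravian) Czech

    sample = SpeechTags(dialect="newz1240", accent="czec1259")

    # A recogniser could then select adaptation data matching either tag:
    def matches(tags, dialect=None, accent=None):
        return ((dialect is None or tags.dialect == dialect) and
                (accent is None or tags.accent == accent))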
I don't know if it helps, but I've recorded myself quite a bit. In my profile I could set a native language and accent, as well as an additional language and accent.
I've validated some speech, and like others I sometimes found strong foreign accents from non-native English speakers (as I am myself), so I tried to be neutral, at least with the sentences I could understand, which luckily were the vast majority.
If I may offer a suggestion to improve the service, though: some information should be given about how to produce good-quality recordings before the user starts contributing, e.g. what mic to use and how, sound levels, equalization, background noise, etc. Some of the recordings were truly awful quality-wise and would probably generate false negatives (OK, an AI should learn to sort those out, but maybe later).
Also, some recordings, although correct, were stuttering badly, probably due to network congestion on the contributor's side; a way to tag those sentences as "correct but stuttering" could come in handy, so that in the future the AI could also learn how a formally well-recited text sounds when it arrives over a problematic connection.
Tagging (or maybe scoring) could also be useful for sentences where just about everything is correct save for a single word or part of it. For example, one sentence was "The party was a Sikh-centered political party in the Indian state of Punjab." but the woman said "in this Indian state". I didn't mark it, because either choice would have been not entirely accurate.
It is unclear to me what we're supposed to be validating in the "Listen" section.
1. That the words say what is written?
2. That the words are clear and easy to understand?
3. That the speech is fluent and natural and easy to listen to?
I would guess 1 and maybe 2, and not 3, but it's only a guess because I don't see it written down anywhere.
Many of the clips are very stilted, spoken slowly and unnaturally. (Also, sometimes the text doesn't make sense: "On a normal Hajj, it would be around to walk." But I'm guessing this doesn't matter.)