The goal of making a large, publicly available training corpus for ASR is incredibly admirable, but this approach is problematic. People speak entirely differently when reading from a script. Models trained on read speech (like LibriSpeech) generally don't perform well on spontaneous speech test sets (like Switchboard). Transcribing speech that was read from a script isn't a particularly interesting problem.
This effort would be more interesting if it could collect speech data in a more specific domain, like web search queries.
I have a number of general English speech recognition models for wav2letter, trained on ~3000 hours of non-Switchboard audio, including 1k hours from Common Voice and several datasets I've collected or found myself.
Common Voice has optional fields for age and accent in the dataset, but I don't use them. I just toss everything into a single bin.
I've had users with many accents (including e.g. thick UK, Indian, and German accents) report that my 3k-hour model including Common Voice performed significantly better for their accent than a LibriSpeech-only model, and also much better than the older macOS English speech recognition models. (And how a model feels for a user in the real world is something you can't really get by just testing against Switchboard.)
I can't afford Switchboard, so I'm very glad that datasets like Common Voice, LibriSpeech, and TED-LIUM exist. I hope at some point I can get the model running in a website; it really does feel pretty good to use.
For Talon we developed a custom speech collection site with prompts we control: https://speech.talonvoice.com - I tried it out with the TIMIT prompt list initially, but right now I'm recording dense command-like speech, which I've found some of my more experienced users are able to say naturally/quickly, not like they're reading from a prompt.
I don't have a separate verification process, because I've had a lot of success with a process that automatically prunes inputs that "obviously make the model much worse". The site is basically designed to record as fast as possible, with keyboard shortcuts to go to the next item and start/stop recording so you can almost record nonstop, which ends up being slightly less forced than just reading one sentence.
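A simple version of that kind of pruning (just a sketch, not necessarily what my pipeline does; `transcribe` is a stand-in for running the current model on a clip, and the threshold is arbitrary): run the existing model over each clip and drop anything that's wildly off from its prompt.

    # Sketch: prune clips the current model can't even roughly reconcile
    # with their prompt. `transcribe` is a hypothetical callable that runs
    # the existing model on an audio file and returns a transcript.

    def cer(hyp: str, ref: str) -> float:
        """Character error rate: Levenshtein distance / reference length."""
        prev = list(range(len(ref) + 1))
        for i, h in enumerate(hyp, 1):
            cur = [i]
            for j, r in enumerate(ref, 1):
                cur.append(min(prev[j] + 1,              # deletion
                               cur[j - 1] + 1,           # insertion
                               prev[j - 1] + (h != r)))  # substitution
            prev = cur
        return prev[-1] / max(len(ref), 1)

    def prune(clips, transcribe, max_cer=0.8):
        """Keep (audio_path, prompt) pairs whose prompt is roughly recoverable."""
        kept = []
        for audio_path, prompt in clips:
            hyp = transcribe(audio_path)        # run the existing model
            if cer(hyp, prompt) <= max_cer:     # drop obviously bad recordings
                kept.append((audio_path, prompt))
        return kept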
The site is pretty reusable, I previously used it at noise.talonvoice.com to record "noise recognition" samples. Here's the source of the current speech site, if someone wants to spin up a common-voice-lite for a specific domain it's pretty easy (just need to run a python app somewhere): https://github.com/talonvoice/noise/tree/speech-dataset
I'm also using test sets to measure my models, including test sets from datasets I'm not training on. But I called it out because the real-world difference is _significant_ for this task: users with heavy accents previously couldn't say most mid-sized sentences without having to try again and "americanize" their voice for at least one of the words. With Common Voice mixed in, that wasn't the case.
As the name implies, LibriTTS is meant for people (researchers, really) to develop text-to-speech (TTS) systems.
The point isn't that LibriSpeech isn't clean enough. Rather, it's that conversational speech is very different from read speech (which is based on written text). Everything has an effect, even planning of utterances (think: hesitations, "uhm", "uh"), turn taking behaviors (think: how speakers negotiate taking turns), how speakers self-correct, phonetic convergence (think: speakers adapting their speech to be more similar to that of their interlocutor), and so on.
The Common Voice data won't help with that, as it's read speech. It's far more expensive to collect conversational speech datasets, as transcription (or correction of automatic transcripts) involves a lot of manual labor.
Why do you view this as admirable? I view it similarly to creating a large cache of explosives and munitions and then giving them away for free. Sure a lot of people might have some fun with them on the weekend but the largest impact of this technology will be used against people, not for them.
With your line of reasoning, would you prefer a world where all the munitions were held by megavillains, or a world where both megavillains and regular people had access to the munitions?
I've tried to contribute to new languages, but they all have a "join us" button that subscribes you to a generic Mozilla mailing list, and the Voice team never sent any instructions on how to contribute! No wonder these have been stuck unfinished for years.
Sometimes it's beyond me how tech-savvy people struggle with the easiest things online.
The page itself has very few links. Randomly clicking through it while logged in gives you a good chance of hitting something that takes you to where you can CONTRIBUTE, either directly (https://voice.mozilla.org/de/speak), via your dashboard (which also links to the CONTRIBUTE part), or via the huge microphone symbol.
I'm talking about new languages, which haven't been launched yet, because they don't have enough text to record yet.
The actual method to contribute to them is finding Mozilla's internal translation tool, proposing translations, then finding a Mozillian to approve them. Then there's an obscure webapp for sentence submission and voting, but it doesn't give any feedback on whether the sentences you've submitted were accepted or went to /dev/null.
(You see people struggling with the easiest things because you assume that's what's happening, rather than considering that you've misunderstood the problem.)
There is feedback in Sentence Collector. If your sentence is too long, it will tell you straight away. If your sentence is rejected by people, it will appear in the "rejected sentences" tab. If your sentences are approved, you can see them in the GitHub repo.
What GitHub? Please realize that none of this is obvious to an outsider who doesn't know how your pipeline is built. None of this is explained on the Voice website. These tools aren't even linked to on the official website.
I've submitted 200 sentences, and none of the progress counters shown to me increased by 200, so I assumed they were lost, and gave up.
In mozilla/voice-web/server/data/. But if the profile tab in Sentence Collector shows 0 sentences added, that means your sentences never even made it into Sentence Collector for approval.
The people that run the voice program at Mozilla seem to be very disconnected from the rest of Mozilla. They are essentially an "innovation team" that jumps from project to project across the company, acting almost as an agency inside Mozilla. This model seems to have really made projects like this suffer, because there isn't a team always there to iterate on the project and make it better.
They're just not very clear. If your language is "in progress", then you need to collect 5k sentences. They have a special web app called "Sentence Collector", in which you can approve sentences or add new ones. Each sentence has to be approved by 3 people. After sentences are approved, they get released to GitHub. Once 5k sentences are approved, you will be able to record audio and listen to recordings just like in any other "big" language. In practice you need only a very small number of people to do it; 4-5 is enough.
I love the project and have participated. What I feel is lacking are cases/words that kids are able to pronounce (with the help of parents saying what they have to repeat).
My 3-year-old son loves to use speech translation and voice search for videos, but it frustrates both me and him that he pronounces things in a way that "all" humans would understand, yet the voice-to-text gets his voice wrong in 3/4 of the cases.
Collecting data from children is nearly impossible for legal reasons. I mostly agree with the reasons, but the side effect is nobody has good data on children. Thus children are forever doomed to a bad experience, like the time my son asked "Hey Mycroft, how do you spell Kansas" "C-A-N-V-A-S" which he knew was wrong.
Maybe we could specifically target voice actors who convincingly voice children in animated stuff
Edit:
- There are already paid child actors; you could somewhat-manually collect their speech from e.g. movies and TV shows to have _something_
- Even if there's some copyright issue with distributing their audio directly, it's not clear (uncertain but dubious?) that a model trained on their audio would have any copyright concerns as long as it can't be used to reproduce the original audio
- What is Mozilla going to do if <1% of their dataset is already children who didn't put an age in? Is that a COPPA violation? There's even the defense of "an adult can sound like a child and we also don't know who this child is so how is it personal information" (I have no idea the usefulness of any of that)
I did some of the "listen" exercises to validate how people pronounced some sentences, and I got a few people who spoke with very strong (Indian, Nigerian, UK, ...) accents. How would you take these things into account? Just average all of them and hope for the best? Not sure how to approach that. I don't think it's very straightforward. Interesting problem though.
However, you can't do anything if you don't even have the data, so props to Mozilla for starting this.
The accents are sometimes tagged in the metadata CSV files, so it’s possible to filter some of them out.
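As a rough sketch of that filtering (treat the file name and column names here as placeholders; they vary between Common Voice releases):

    import csv

    # Sketch: keep only clips whose self-reported accent is in a wanted set.
    # "validated.tsv", "accent", and "path" are placeholders for whatever the
    # release you downloaded actually uses.
    def clips_with_accent(metadata_path, wanted_accents):
        keep = []
        with open(metadata_path, newline='', encoding='utf-8') as f:
            for row in csv.DictReader(f, delimiter='\t'):
                accent = (row.get('accent') or '').strip().lower()
                if accent in wanted_accents:
                    keep.append(row['path'])  # relative path to the audio clip
        return keep

    # e.g. clips_with_accent('validated.tsv', {'indian', 'england'})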
Mozilla DeepSpeech used to release checkpoints trained on Common Voice along with a few others (LibriSpeech etc). But they’ve dropped CV in the latest releases and just rely on the others.
I think the others are more standardised in terms of accents.
It’s likely possible to fine-tune a model with different accents: as long as the language is the same, the model can update the phonemes it recognises.
In the worst case, you might have to treat different accents as different, but related languages. The current trend for low-resource languages seems to be about using one giant model for all languages to make use of shared features (e.g. for translation [1]), so adding even more languages might not be that expensive in terms of training data required.
Locales can help distinguish country-specific variants (e.g. US vs. UK spelling), but pronunciation varies a lot at the sub-national level. If you want to support all possible accents, you'll need a more fine-grained encoding.
I think locales are for dialects, where you can have different terms used for the same concept. Here you can have someone speak the en-NZ dialect, but with a French accent.
Also, we would need en-FR, en-ES, en-IT, etc. All languages as spoken from all other native languages. And obviously the strength of the accent varies.
Yeah, also a Moravian (ancient nation united with the Czechs more than 1000 years ago) person will speak English differently than a Czech (as in nation, not state) person even though we all speak the Czech language.
A solution could be to tag both the dialect and the accent with language codes. Native speakers of Moravian Czech will probably have similar accents when they speak New Zealand English. Using Glottolog IDs as tags for example, this might be represented as { dialect:"newz1240", accent:"czec1259" }. If the program can already recognise the New Zealand English dialect and the Moravian Czech accent, it might then leverage both of those to recognise the speech of a Moravian person speaking New Zealand English.
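As a toy sketch of what that per-utterance metadata could look like (the Glottolog IDs are just the ones from the example above; none of this is an existing Common Voice field):

    from dataclasses import dataclass

    # Toy representation of separate dialect ("what is spoken") and accent
    # ("how it is pronounced") tags, using Glottolog IDs as in the example.
    @dataclass(frozen=True)
    class SpeechTags:
        dialect: str  # e.g. "newz1240" = New Zealand English
        accent: str   # e.g. "czec1259" = (Moravian) Czech

    sample = SpeechTags(dialect="newz1240", accent="czec1259")

    # A recogniser could then select adaptation data matching either tag:
    def matches(tags, dialect=None, accent=None):
        return ((dialect is None or tags.dialect == dialect) and
                (accent is None or tags.accent == accent))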
I don't know if it helps, but I've recorded myself quite a bit. In my profile I could set a native language and accent, as well as an additional language and accent.
I've validated some speech, and like others I sometimes found strong foreign accents from non-native English speakers (as I am myself), so I tried to be neutral, at least with the sentences I could understand, which luckily were the vast majority.
If I may offer a suggestion to improve the service, though: some information should be given about how to produce good-quality recordings before the user starts contributing, e.g. what mic to use and how, sound levels, equalization, background noise, etc. Some of the recordings were truly awful quality-wise and would probably generate false negatives (OK, an AI should learn to sort those out, but maybe later).
Also, some recordings, although correct, were stuttering badly, probably due to network congestion on the contributor's side; a way to tag those sentences as "correct but stuttering" could come in handy, so that in the future the AI could also learn how a formally well-recited text sounds when it arrives over a problematic connection.
Tagging (or maybe scoring) could also be useful for sentences where just about everything is correct save for a single word or part of it. For example, one sentence was "The party was a Sikh-centered political party in the Indian state of Punjab." but the woman said "in this Indian state". I didn't mark it, because either choice would have been not entirely accurate.
It is unclear to me what we're supposed to be validating in the "Listen" section.
1. That the words say what is written?
2. That the words are clear and easy to understand?
3. That the speech is fluent and natural and easy to listen to?
I would guess 1 and maybe 2, and not 3, but it's only a guess because I don't see it written down anywhere.
Many of the clips are very stilted, spoken slowly and unnaturally. (Also, sometimes the text doesn't make sense: "On a normal Hajj, it would be around to walk." But I'm guessing this doesn't matter.)