Just tried rating some of the English voices and I am conflicted.
Most of them were definitely speaking English, but with an Indian intonation that I, coming from a country where English is the first language, could barely understand.
Some of them were reading words syllable by syllable, which is definitely English, but I would hate to have to listen to an ebook or webpage read aloud to me in that manner.
By clicking yes am I training the system to speak English with an Indian intonation?
Should I click no, not English?
Should/does English even have a "proper" intonation?
> Be cautious before rejecting a clip on the ground that the reader has mispronounced a word, has put the stress in the wrong place, or has apparently ignored a question mark. There are a wide variety of pronunciations in use around the world, some of which you may not have heard in your local community. Please provide a margin of appreciation for those who may speak differently from you.
> On the other hand, if you think that the reader has probably never come across the word before, and is simply making an incorrect guess at the pronunciation, please reject. If you are unsure, use the skip button.
I think this dataset is mainly for speech recognition, not text-to-speech. Speech recognition should be able to recognize as many different accents as possible.
I think the reality is that there are more speakers of bad English than native speakers. I speak two foreign languages (English among them) daily and two others occasionally, and I know I make mistakes in all of them. In English I don't think I make many pronunciation mistakes (there are certainly some grammar mistakes). In Finnish I make a lot of pronunciation mistakes, although I speak it better than many other non-native speakers. How much that really hurts understanding, I have no idea. The number of misunderstandings between humans doesn't seem to vary greatly between those languages, or even compared with my mother tongue.
Text-to-speech should pronounce things correctly, but speech recognition should tolerate even clear mistakes. Of course, not at the price of misunderstanding correct pronunciation.
Wow, you're right. This is conflicting, as many of the words are not pronounced properly at all. Maybe it doesn't matter to the accuracy of the speech-to-text system, but it feels like training it with bad data.
That's the point! When the postal service has to OCR mailing addresses, it needs to handle the messy scribbles even more than the professionally printed labels.
Different accents aren't bad data. Your vision of a world where "English is only spoken with an American accent" is what leads to horrendous speech recognition APIs, like Google's.
If your ML model can't handle multiple accents, it is worthless.
There's a difference between an accent and pronouncing words wrong. I would expect an English speech recognition system to handle the various accents there are in the world (the US itself has several, of course), but it shouldn't handle incorrect pronunciation of syllables if that comes at the expense of recognizing clean data. If it doesn't come at that expense, then I guess it's fine.
Unfortunately, there's always a trade-off: you want quality data for your use case, but you also want lots of data so the model generalizes well. Those are conflicting goals.
Fortunately, splitting the model into separate accent-specialized variants and backing them with language-model rescoring often helps when a single model can't cope with the conflicting data on its own; a rough sketch of the idea is below.
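To make that concrete, here's a minimal sketch of the routing-plus-rescoring idea. All of the model functions (`classify_accent`, `transcribe`, `lm_score`) are hypothetical stand-in stubs, not a real library API; the point is just the shape of the pipeline: pick the accent-specialized acoustic model, then let a shared language model break ties between hypotheses.

```python
# Sketch of accent-aware ASR routing, assuming (hypothetically) one
# acoustic model per accent plus a shared language model. Every model
# call below is a stub, not a real API.

from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    acoustic_score: float  # log-likelihood from the acoustic model

def classify_accent(audio: bytes) -> str:
    """Stub: a real system would run an accent/dialect classifier here."""
    return "en-IN"

def transcribe(audio: bytes, accent: str) -> list[Hypothesis]:
    """Stub: decode with the acoustic model specialized for this accent."""
    return [Hypothesis("recognise the speech", -4.2),
            Hypothesis("wreck a nice beach", -4.0)]

def lm_score(text: str) -> float:
    """Stub: a shared language model scores how plausible the text is."""
    return -1.0 if "speech" in text else -6.0

def recognize(audio: bytes, lm_weight: float = 0.5) -> str:
    accent = classify_accent(audio)         # pick the specialized model
    hypotheses = transcribe(audio, accent)  # accent-specific decoding
    # Rescore: combine acoustic and language-model scores, so an
    # implausible sentence loses even if it matched the audio slightly
    # better acoustically.
    best = max(hypotheses,
               key=lambda h: h.acoustic_score + lm_weight * lm_score(h.text))
    return best.text

print(recognize(b"...raw audio..."))  # -> "recognise the speech"
```

The language model is what absorbs a lot of the pronunciation variance: even if the accent-specific acoustic model slightly prefers a garbled hypothesis, the rescoring step pulls the output back toward sentences that actually occur in the language.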