The Mozilla Common Voice dataset is awesome - however it's better suited to the opposite purpose, speech-to-text. That's because it consists of many different people, using a wide range of hardware, speaking similar phrases.
For good text-to-speech you need one person speaking many different phrases, but very consistently. Here's an example dataset from Thorsten, a German open voice enthusiast: https://openslr.org/95/
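If you're evaluating a corpus for TTS training, a quick consistency check is worth doing. Here's a minimal sketch, assuming an LJSpeech-style layout (a `wavs/` folder plus a pipe-separated `metadata.csv`) - the folder name and column order are assumptions, so adjust them to whatever the dataset actually ships with:

```python
# Sanity-check a single-speaker TTS corpus: count the clips and make sure
# every recording uses the same sample rate. Paths and metadata layout are
# assumptions (LJSpeech-style), not guaranteed for any particular download.
import csv
import wave
from pathlib import Path

corpus = Path("thorsten-de")  # hypothetical extraction folder
rates = set()
clip_count = 0

with open(corpus / "metadata.csv", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="|"):
        wav_path = corpus / "wavs" / f"{row[0]}.wav"
        with wave.open(str(wav_path), "rb") as w:
            rates.add(w.getframerate())
        clip_count += 1

print(f"{clip_count} clips, sample rates found: {rates}")
# For TTS you want exactly one sample rate (and one speaker) across the
# whole corpus - unlike Common Voice, which mixes speakers and hardware.
```

If that prints more than one sample rate, you'd want to resample everything to a single rate before training.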