SentenceMixer is a simple TTS engine designed to make creating new voices easy. It uses the same technique as sentence-mixing in YouTube Poops, that is to say cutting and pasting small bits of words. A sample French voice (la boule magique !) is provided.
Please keep in mind that this is very hacky!
How it works
A sound file is annotated with the start and end markers of words or bits of sentences. SentenceMixer converts these annotations to phonemes, interpolates the start and end times of each phoneme, and builds a corpus.
To synthesize a sentence, SentenceMixer first converts it into phonemes, then tries to find the largest matching fragments in the corpus. It then puts the pieces back together.
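The interpolation step can be illustrated with a minimal sketch. It assumes the simplest possible scheme, where every phoneme of an annotated word gets an equal share of the word's duration. The class `PhonemeInterpolation`, the record `TimedPhoneme` and the method `interpolate` are hypothetical names chosen for this example and are not taken from the SentenceMixer sources. Records require Java 16 or later.

```java
import java.util.ArrayList;
import java.util.List;

final class PhonemeInterpolation {
    /** A phoneme with estimated start and end times, in seconds. */
    record TimedPhoneme(String phoneme, double start, double end) {}

    /**
     * Splits the interval [wordStart, wordEnd] into equal slices,
     * one per phoneme (assumption: plain linear interpolation).
     */
    static List<TimedPhoneme> interpolate(List<String> phonemes,
                                          double wordStart, double wordEnd) {
        List<TimedPhoneme> result = new ArrayList<>();
        int n = phonemes.size();
        if (n == 0) {
            return result; // nothing to place
        }
        double step = (wordEnd - wordStart) / n;
        for (int i = 0; i < n; i++) {
            double start = wordStart + i * step;
            double end = wordStart + (i + 1) * step;
            result.add(new TimedPhoneme(phonemes.get(i), start, end));
        }
        return result;
    }

    public static void main(String[] args) {
        // A word annotated from 1.0 s to 2.0 s, made of five phonemes.
        for (TimedPhoneme p : interpolate(List.of("f", "i", "l", "i", "p"), 1.0, 2.0)) {
            System.out.println(p);
        }
    }
}
```

Because real phonemes do not all last equally long, slices computed this way only approximate the true boundaries.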
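The idea of looking for the largest fragments can be sketched as a greedy longest-match search. This is only an illustration under simplifying assumptions: the corpus is treated as one flat list of phonemes, and the names `GreedyChunker`, `longestMatch` and `chunk` are hypothetical. It is not the algorithm implemented in `Voice::findBestCandidates`.

```java
import java.util.ArrayList;
import java.util.List;

final class GreedyChunker {
    /**
     * Length of the longest run of phonemes starting at target[from]
     * that also appears contiguously somewhere in the corpus.
     */
    static int longestMatch(List<String> corpus, List<String> target, int from) {
        int best = 0;
        for (int start = 0; start < corpus.size(); start++) {
            int len = 0;
            while (start + len < corpus.size()
                    && from + len < target.size()
                    && corpus.get(start + len).equals(target.get(from + len))) {
                len++;
            }
            best = Math.max(best, len);
        }
        return best;
    }

    /** Greedily cuts the target into the longest chunks found in the corpus. */
    static List<List<String>> chunk(List<String> corpus, List<String> target) {
        List<List<String>> chunks = new ArrayList<>();
        int pos = 0;
        while (pos < target.size()) {
            int len = longestMatch(corpus, target, pos);
            if (len == 0) {
                throw new IllegalArgumentException(
                        "Phoneme not in corpus: " + target.get(pos));
            }
            chunks.add(new ArrayList<>(target.subList(pos, pos + len)));
            pos += len;
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> corpus = List.of("s", "a", "l", "u", "t", "b", "o", "n");
        List<String> target = List.of("s", "a", "l", "o", "n");
        System.out.println(chunk(corpus, target));
    }
}
```

A greedy cut always takes the longest fragment available at the current position, which does not necessarily minimize the total number of chunks.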
java -jar target/sentencemixer-*.jar <voice name>
where the voice name is e.g. boule.
The program reads the sentences to be spoken from its standard input. It speaks them, and prints the number of chunks and the phonemes spoken to its standard output as it goes. For example, using the boule voice and the phrase "philippe salaud je sais où tu t'caches", you can see that it found at most 2 contiguous phonemes. The larger and more diverse the corpus is, the more likely it is to find big chunks of phonemes, and the better the output will be.
filipsaloZ@sEz2utytkaS 13 fi li p s al oZ @s E z2 ut yt ka S
This tool is written in Java, so make sure you have a JDK and Maven installed. You'll also need espeak for phoneme generation and sox for audio processing.
Clone the repo, then build with
Creating your own voices
A voice consists of a directory containing two files: an audio file (audio.wav), and a markers file (wordMarkers.txt). The markers file is essentially a list of start and end timestamps, and of words or phrases.
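As an illustration, here is a minimal sketch of reading such a markers file. It assumes that wordMarkers.txt is an unmodified Audacity label export, where each line holds a start time and an end time in seconds followed by the label text, separated by tabs. The names `MarkerParser`, `Marker` and `parse` are hypothetical and not taken from the SentenceMixer sources.

```java
import java.util.ArrayList;
import java.util.List;

final class MarkerParser {
    /** One annotated span: times in seconds, plus the word or phrase. */
    record Marker(double start, double end, String text) {}

    /** Parses lines of the assumed form "start<TAB>end<TAB>text". */
    static List<Marker> parse(List<String> lines) {
        List<Marker> markers = new ArrayList<>();
        for (String line : lines) {
            if (line.isBlank()) {
                continue; // tolerate empty lines
            }
            String[] fields = line.split("\t", 3);
            if (fields.length < 3) {
                throw new IllegalArgumentException("Malformed marker line: " + line);
            }
            markers.add(new Marker(
                    Double.parseDouble(fields[0]),
                    Double.parseDouble(fields[1]),
                    fields[2].trim()));
        }
        return markers;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "0.500000\t0.920000\tphilippe",
                "1.100000\t1.640000\tje sais");
        parse(lines).forEach(System.out::println);
    }
}
```

To read a real file, the lines could be obtained with `java.nio.file.Files.readAllLines` before being passed to `parse`.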
Creating a voice is pretty straightforward. It consists of:
- Importing your audio into Audacity;
- Creating a Label Track (Tracks -> Add New -> Label Track) or importing an existing one (File -> Import -> Labels);
- Annotating as many words as possible (Ctrl+B creates a marker around the selection);
- Saving the audio and the markers in a new folder with your name of choice in the
You can use the boule voice as an example. Keep in mind that the more audio is annotated, the better the output will be.
- The search algorithm (Voice::findBestCandidates) is not quite correct, because the correct one is supposedly too slow. The one provided is good enough®.
- The phonemes are interpolated linearly within words, so they may end up badly aligned.