Speech synthesis has come a long way: current text-to-speech (TTS) models can now generate natural, human-sounding speech. However, most TTS research focuses on adult speech data, and very little work has been done on child speech synthesis. In this study, a training pipeline for fine-tuning state-of-the-art (SOTA) neural TTS models on child speech datasets was developed and validated. The approach adopts a multispeaker TTS retuning workflow to provide a transfer-learning pipeline. A publicly available child speech dataset was cleaned to produce a smaller subset of approximately 19 hours, which formed the basis of our fine-tuning experiments. Both objective and subjective evaluations were performed: a pretrained MOSNet was used for objective evaluation, and a novel subjective framework was used for mean opinion score (MOS) evaluations. Subjective evaluations yielded MOS values of 3.92 for speech intelligibility, 3.85 for voice naturalness, and 3.96 for voice consistency. Objective evaluation with the pretrained MOSNet showed a strong correlation between real and synthetic child voices. The final trained model was able to synthesize child-like speech from reference audio samples as short as 5 seconds.
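For readers unfamiliar with MOS, the following is a minimal sketch of how a mean opinion score and a normal-approximation 95% confidence interval might be computed from listener ratings. It assumes a 1 to 5 rating scale, and the function name `mos_with_ci` and the example ratings are hypothetical illustrations, not the evaluation code or data from this study.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half-width of an approximate 95% CI) for a list of ratings.

    Assumes ratings lie on a 1-5 opinion scale (an assumption of this sketch).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate a spread")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings are expected to lie between 1 and 5")
    mean = statistics.mean(ratings)
    # Standard error of the mean, using the sample standard deviation.
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical listener ratings for one synthetic utterance (not study data).
example_ratings = [4, 4, 3, 5, 4, 4, 3, 4]
mos, ci = mos_with_ci(example_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

In practice, one such score would be computed per criterion (intelligibility, naturalness, consistency) over all listeners and utterances.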
Speaker ID | Source Filename | Transcript | Source audio | Synthetic audio |
Speaker '002109' | myst_002109_2014-03-12_12-06-01_LS_2.1_004 | the cookie is becoming yiest. | ||
Speaker '004031' | myst_004031_2013-12-03_10-12-27_EE_1.4_004 | a candle burning heat. | ||
Speaker '004031' | myst_004031_2013-12-03_10-12-27_EE_1.4_016 | wood gas food and umm a candle | ||
Speaker '007233' | myst_007233_2015-02-17_11-15-06_LS_2.1_011.wav | the plants basic needs are dirt water and sunlight | ||
Speaker '008061' | myst_008061_2014-05-12_13-48-04_EE_5.2_006 | this is a prism splitting all the different colors of light | ||
Speaker '013064' | myst_013064_2013-11-15_13-51-14_EE_1.1_007 | well they will transfer energy or electricity to something to like a light bulb so it can light up | ||
Speaker '022201' | myst_022201_2014-12-10_10-43-55_LS_1.3_020 | being an herbivore means that you only eat plants | ||
Speaker '993755' | myst_993755_2011-02-03_00-00-00_VB_3.3_002 | we have been testing variables on the airplanes | ||
Speaker '995520' | myst_995520_2010-12-03_00-00-00_ME_4.3_002.wav | um we've been doing electricity and magnetism | ||
Speaker '996228' | myst_996228_2008-04-12_00-00-00_MS_2.1_030 | well it was a long time ago but i think i can remember how many big steps it took |
Note: A label such as Speaker '002113' represents a child speaker ID. We present results for 4 speaker IDs used in our paper.
Speaker '002113' Transcript | Speaker '002113' Audio | Speaker '008045' Transcript | Speaker '008045' Audio | Speaker '013020' Transcript | Speaker '013020' Audio | Speaker '995737' Transcript | Speaker '995737' Audio |
i think it's a migration behavior because the black thing is stuck on the human's knee | you can see that the white light makes the candle green because without the light it would not be any color | well the dog is just waging it's tail the bee is just like pollinating and the geese are flying |
Speaker '002113' Transcript | Speaker '002113' Audio | Speaker '008045' Transcript | Speaker '008045' Audio | Speaker '013020' Transcript | Speaker '013020' Audio | Speaker '995737' Transcript | Speaker '995737' Audio |
Adding fast leads to wrong sums | | The boy was there when the sun rose | | The boy was there when the sun rose | | Rice is often served in round bowls | |
We talked of the side show in the circus | | The box was thrown beside the parked truck | | The beauty of the view stunned the young boy | | The box was thrown beside the parked truck | |
The salt breeze came across from the sea | | The salt breeze came across from the sea | | The salt breeze came across from the sea | | The salt breeze came across from the sea | |
He ran half way to the hardware store | | The friendly gang left the drug store | | Take the winding path to reach the lake | | Read verse out loud for pleasure | |
A pound of sugar costs more than eggs | | Adding fast leads to wrong sums | | Adding fast leads to wrong sums | | The ship was torn apart on the sharp reef | |
The wharf could be seen at the farther shore | | Move the vat over the hot fire | | The show was a flop from the very start | | The wide road shimmered in the hot sun | |
He said the same phrase thirty times | | Leaves turn brown and yellow in the fall | | He ran half way to the hardware store | | The friendly gang left the drug store | |
Bring your problems to the wise chief | | The pennant waved when the wind blew | | Add the sum to the product of these three | | Adding fast leads to wrong sums | |
The just claim got the right verdict | | Hemp is a weed found in parts of the tropics | | Hemp is a weed found in parts of the tropics | | We talked of the side show in the circus | |
The cigar burned a hole in the desk top | | The cigar burned a hole in the desk top | | It caught its hind paw in a rusty trap | | The walled town was seized without a fight | |
Transcript: Peter Piper picked a peck of pickled peppers. A peck of pickled peppers Peter Piper picked. If Peter Piper picked a peck of pickled peppers. Where’s the peck of pickled peppers Peter Piper picked?
Speaker ID | Source Filename | Source audio | Synthetic audio |
Speaker '004034' | myst_004034_2013-11-18_09-01-54_EE_1.1_010.wav | ||
Speaker '993755' | myst_993755_2011-02-03_00-00-00_VB_3.3_002.wav | ||
Speaker '995738' | myst_995738_2011-16-02_00-00-00_ME_2.2_016.wav | ||
Speaker '995739' | myst_995739_2011-10-02_00-00-00_ME_1.4_005.wav | ||
Speaker '997230' | myst_997230_2009-08-01_00-00-00_ME_3.2_003.wav |
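The source filenames listed on this page appear to embed the speaker ID directly after the `myst_` prefix. As a minimal sketch, assuming that pattern holds (it is inferred only from the filenames shown here, not from the dataset documentation), the speaker ID could be recovered as follows. The helper name `speaker_id` is hypothetical.

```python
import re

# Assumed pattern: "myst_" + six-digit speaker ID + "_" + remaining fields,
# with an optional ".wav" extension. Inferred from this page's filenames only.
MYST_NAME = re.compile(r"^myst_(?P<speaker>\d{6})_.+?(?:\.wav)?$")


def speaker_id(filename):
    """Extract the six-digit speaker ID from a MyST-style filename."""
    match = MYST_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected filename format: {filename!r}")
    return match.group("speaker")


print(speaker_id("myst_004034_2013-11-18_09-01-54_EE_1.1_010.wav"))
```

The remaining fields are deliberately left unparsed, since their meaning is not stated on this page.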