Recent newscasts, scientific reports and blogs have been filled with articles on how deepfakes are eroding our trust in truth as they rapidly advance into the mainstream. By 2030, synthetic speech, paired with deepfake video, may well be everywhere.
Baidu, often called the Google of China, has just released a white paper showing its latest development in artificial intelligence (AI): a program that uses a neural network to clone a voice from a clip just seconds long. Not only can the software mimic an input voice, but it can also alter it to reflect a different gender or even a different accent.
Previous iterations of this technology required systems to analyze much longer voice samples. In 2017, the Baidu Deep Voice research team introduced technology that could clone voices from 30 minutes of training material. Adobe has demonstrated a program called VoCo that can mimic a voice with only 20 minutes of audio, and a Canadian startup called Lyrebird can do it with just one minute. Baidu's latest work cuts that time to mere seconds.
A Facebook research team, meanwhile, used audio from TED Talks to train its own voice-cloning system, and has shared clips of it mimicking eight speakers, including Bill Gates, on a GitHub website.
You can listen to some of the Baidu-generated examples here.
On the positive side, imagine your child being read to in your voice while you're far away, or a duplicate voice being created for someone who has lost the ability to speak. This tech could also power personalized digital assistants and more natural-sounding speech-translation services. At FutureWork, we are working with one of our partners to show clients technology that lets professional actors, using different voices and avatars, coach employees and give them feedback as they practice a method of interrupting unconscious bias.
However, as with many technologies, voice cloning also comes with the risk of abuse. New Scientist reports that in tests the program produced a voice that fooled voice recognition software with greater than 95 percent accuracy. Human listeners even rated the cloned voice 3.16 out of 4. This could open up the possibility of AI-assisted fraud. One company, Dessa, won't publicly release its research, model, or datasets for fear that someone could impersonate a government official to enter a high-security facility, or a politician to manipulate an election.
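To see why a cloned voice can slip past voice recognition, it helps to know how such systems typically decide: they convert each audio sample into a numeric "speaker embedding" and accept a voice when its embedding is close enough to the enrolled speaker's. The sketch below illustrates that decision rule with toy vectors and a hypothetical `threshold` value; real systems derive embeddings from audio with a neural encoder, and the specific numbers here are illustrative assumptions, not Baidu's method.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker-embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float = 0.75) -> bool:
    # A verification system accepts a voice when similarity clears a tuned
    # threshold; a convincing clone's embedding lands close enough to pass.
    return cosine_similarity(emb_a, emb_b) >= threshold

# Toy embeddings standing in for neural-encoder output
target = np.array([0.9, 0.1, 0.4])    # enrolled speaker
clone = np.array([0.88, 0.12, 0.38])  # a good clone sits near the target
other = np.array([0.1, 0.9, 0.2])     # a different speaker sits far away

print(same_speaker(target, clone))  # True  (clone accepted)
print(same_speaker(target, other))  # False (impostor rejected)
```

The "95 percent accuracy" figure reported for the cloned voices means the clones' embeddings landed inside the acceptance region in the vast majority of trials.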
Programs already exist that can use AI to replace, alter, or even generate from scratch the faces of individuals in videos. Right now, this is mostly used on the internet for laughs, such as inserting Nicolas Cage into the "Lord of the Rings" series. But coupled with voice-cloning tech, we could soon be bombarded with more "fake news" of politicians doing uncharacteristic things or saying things they never said. For example, click here to see a deepfake video of Barack Obama: University of Washington researchers used a neural network to model the shape of Obama's mouth, mapped that model to 14 hours of footage and audio of the president, and can now match any audio to a synthetic Obama. Listen for yourself to see what the future holds as deepfakes proliferate. Can you tell the difference from the real Obama?