Ever wish you could automatically dub foreign film dialogue into another tongue? Amazon’s on the case. In a paper published this week on the preprint server Arxiv.org, researchers from the tech giant detailed a novel “speech-to-speech” pipeline that taps AI to align translated speech with original speech and fine-tune speech duration before adding background noise and reverberation. They say that it improves the perceived naturalness of dubbing and highlights the relative importance of each proposed step.
As the paper’s coauthors note, automatic dubbing involves transcribing speech to text and translating that text into another language before generating speech from the translated text. The challenge isn’t simply conveying the same content as the source audio, but also matching the original timbre, emotion, duration, prosody (i.e., patterns of rhythm and sound), background noise, and reverberation.
Amazon’s approach synchronizes phrases across languages and follows a “fluency-based” rather than a content-based criterion. It comprises several parts, including a Transformer-based machine translation component trained on over 150 million English-Italian pairs and a prosodic alignment module that computes the relative match in duration between speech segments while measuring the linguistic plausibility of pauses and breaks. In the text-to-speech phase, a model trained on 47 hours of speech recordings generates a context sequence from text, which is fed into a pretrained vocoder that converts the sequence into a speech waveform.
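To give a feel for what matching durations between segments can look like, here is a minimal Python sketch. It is not the paper's prosodic alignment module: it ignores the linguistic plausibility of breaks entirely, and the function names, the fixed seconds-per-character speaking rate, and the brute-force search are all assumptions made for illustration. It splits a translated sentence into as many phrases as the source has, picking the break points whose estimated durations deviate least, in relative terms, from the (positive) source phrase durations.

```python
from itertools import combinations


def estimate_duration(words, seconds_per_char=0.07):
    """Rough duration estimate: characters (including spaces) times a fixed rate.

    The rate is an illustrative assumption, not a value from the paper.
    """
    return len(" ".join(words)) * seconds_per_char


def align_to_source(target_words, source_durations, seconds_per_char=0.07):
    """Split target_words into len(source_durations) contiguous phrases.

    Tries every placement of break points and keeps the one with the smallest
    summed relative duration mismatch against the source phrases.
    """
    k = len(source_durations)
    n = len(target_words)
    if k < 1 or n < k:
        raise ValueError("need at least one target word per source phrase")

    best_cost, best_phrases = float("inf"), None
    for breaks in combinations(range(1, n), k - 1):
        bounds = (0, *breaks, n)
        phrases = [target_words[a:b] for a, b in zip(bounds, bounds[1:])]
        cost = sum(
            abs(estimate_duration(p, seconds_per_char) - d) / d
            for p, d in zip(phrases, source_durations)
        )
        if cost < best_cost:
            best_cost, best_phrases = cost, phrases
    return best_phrases, best_cost
```

Calling `align_to_source("grazie a tutti per essere venuti qui oggi".split(), [1.0, 1.5])` returns the two-phrase split with the lowest mismatch along with its cost. A production system would replace the character-count estimate with durations from the speech synthesizer and would also score how natural each pause position is.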
To make the dubbed speech sound more “real” and similar to the original, the team incorporated a foreground-background separation step that extracts background noise from the original audio and adds it to the dubbed speech. A separate re-reverberation step estimates the environment reverberation from the original audio and applies it to the dubbed audio.
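The audio rendering idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Amazon's implementation: it assumes the background track has already been separated and that the room's reverberation is already available as an impulse response, whereas the paper's system has to estimate both from the original audio. The function names and the `background_gain` parameter are hypothetical. Reverberation is applied by convolving the dry synthetic speech with the impulse response, and the background is then mixed back in.

```python
def convolve(signal, kernel):
    """Plain discrete convolution of two sample lists (full length)."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def render_dubbed_audio(dry_speech, background, impulse_response,
                        background_gain=1.0):
    """Apply reverberation to dry speech, then add the background track.

    All inputs are lists of samples at the same sample rate. The output has
    the same length as dry_speech: the reverb tail is trimmed, and the
    background is truncated or zero-padded to fit.
    """
    n = len(dry_speech)
    # Re-reverberation: convolve with the (assumed known) impulse response.
    wet = convolve(dry_speech, impulse_response)[:n]
    # Fit the background to the speech length.
    bg = list(background[:n]) + [0.0] * max(0, n - len(background))
    # Mix the reverberated speech with the scaled background.
    return [w + background_gain * b for w, b in zip(wet, bg)]
```

With an impulse response of `[1.0]` the speech passes through unchanged, which makes a convenient sanity check. For real audio, an array library with FFT-based convolution would be far faster than these nested loops.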
To evaluate their system, the researchers had 14 volunteers (5 Italian and 9 non-Italian) grade the naturalness of 24 excerpts of TED Talks dubbed into Italian in three different ways: with a speech-to-speech translation baseline, with the baseline with enhanced machine translation and prosodic alignment, and with that second system further enhanced with audio rendering.
The researchers report that they achieved phrase-level synchronization, but that the prosodic alignment step negatively impacted the fluency and prosody of the generated dubbing. “The impact of these disfluencies on native listeners seems to partially mask the effect of the audio rendering with background noise and reverberation, which instead results in a major increase of naturalness for non-Italian listeners,” wrote the paper’s coauthors. “Future work will definitely [be] devoted to improving the prosodic alignment component, by computing better segmentation and introducing more flexible lip synchronization.”