Amazon Alexa scientists reduce speech recognition errors by 20% with semi-supervised learning

Deep neural networks take time to train — and lots of data — and that’s particularly true of speech recognition systems. Conventional models tap corpora comprising thousands of hours of transcribed voice snippets. It’s not exactly surprising, then, that scientists at Amazon’s Alexa division are investigating ways to expedite the process. Today they reported that they’ve made substantial headway.

In a blog post and accompanying paper (“Improving Noise Robustness of Automatic Speech Recognition via Parallel Data and Teacher-Student Learning”), Minhua Wu, an applied scientist in the Alexa Speech group, and colleagues describe a speech recognizer that identifies data patterns in a semi-supervised fashion. Their experimental model learns from 800 hours of annotated data plus 7,200 hours of unannotated data that is “softly” labeled by a teacher model, while a second, student system is fed the same data samples with artificially generated noise added. They claim the design achieves a 20 percent reduction in word error rate compared with the baseline.

“We hope to … improve the noise robustness of the speech recognition system,” Wu wrote.

As she and colleagues explain, automatic speech recognition systems consist of three core components: an acoustic model, a pronunciation model, and a language model. The acoustic model takes as input short audio samples, or frames, and for every frame outputs thousands of probabilities, each indicating the likelihood that the frame belongs to a low-level phonetic representation called a senone. In the proposed approach, the acoustic model’s outputs are fed into the pronunciation model, which converts senone sequences into possible words and passes those to the language model, which encodes the probabilities of word sequences. Finally, all three components work together to find the most likely word sequence given the audio input.
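The division of labor among the three components can be sketched with a toy decoder. This is a minimal illustration, not the production system: the senone alignments, candidate phrases, and language-model scores below are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Acoustic model: for each of 4 frames, a posterior over 8 senones
# (real systems emit thousands of senone probabilities per frame).
n_frames, n_senones = 4, 8
logits = rng.normal(size=(n_frames, n_senones))
posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Pronunciation + language model, collapsed into hypothetical candidates:
# each word sequence has a senone alignment and an LM log-probability.
candidates = {
    "turn on":  {"senones": [1, 3, 3, 5], "lm_logprob": -1.2},
    "turn off": {"senones": [1, 3, 4, 6], "lm_logprob": -1.5},
}

def combined_score(cand, lm_weight=1.0):
    # Acoustic log-probability of the aligned senone per frame,
    # plus the weighted language-model score.
    acoustic = sum(np.log(posteriors[t, s])
                   for t, s in enumerate(cand["senones"]))
    return acoustic + lm_weight * cand["lm_logprob"]

# Decoding: pick the word sequence with the highest combined score.
best = max(candidates, key=lambda w: combined_score(candidates[w]))
```

Real decoders search over vast hypothesis lattices rather than two fixed candidates, but the scoring logic is the same: acoustic evidence and language-model prior combined in log space.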

The paper’s authors first optimized the acoustic model, maximizing its per-frame accuracy and minimizing errors across sequences of outputs through sequence training. They report that this made the student model’s counterpart, the “teacher” model, more accurate and increased the student model’s relative improvement. Next, they added noise to the training data by collecting audio samples from music, television, and other media and processing them to simulate closed-room acoustics. For each speech example in the training set, they randomly selected one to three noise samples to add to it.
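The noise-mixing step can be approximated in a few lines. This is a simplified sketch under assumed parameters (the target signal-to-noise ratio and clip lengths are illustrative, and the paper’s room-acoustics simulation is omitted):

```python
import numpy as np

def add_noise(speech, noise_bank, rng, snr_db=10.0):
    """Mix one to three randomly chosen noise clips into a speech
    waveform, scaling each clip to an assumed signal-to-noise ratio."""
    k = rng.integers(1, 4)  # one to three noise samples, per the paper
    mix = speech.copy()
    speech_pow = np.mean(speech ** 2)
    for idx in rng.choice(len(noise_bank), size=k, replace=False):
        noise = noise_bank[idx][:len(speech)]
        noise_pow = np.mean(noise ** 2) + 1e-12
        # Scale so that speech power / noise power matches the target SNR.
        scale = np.sqrt(speech_pow / (noise_pow * 10 ** (snr_db / 10)))
        mix[:len(noise)] += scale * noise
    return mix

# Hypothetical 1-second waveforms at 16 kHz, standing in for real audio.
rng = np.random.default_rng(42)
speech = rng.normal(size=16000)
noise_bank = [rng.normal(size=16000) for _ in range(5)]
noisy = add_noise(speech, noise_bank, rng)
```

The clean and noisy versions of each utterance then form the parallel data: the teacher sees the clean audio, the student sees the corrupted copy.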

According to Wu and colleagues, forcing the student model to train on only the senones to which the teacher assigned the highest probabilities per frame (the top five to 40, out of thousands) enabled it to devote more of its capacity to distinguishing among the probable ones, and reduced errors even on a noise-free test data set.
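The idea of distilling from only the teacher’s top probabilities can be sketched as follows. This is a minimal illustration under assumed shapes and names (the helper functions, the 100-senone toy output, and the uniform student are all hypothetical, not the paper’s code):

```python
import numpy as np

def topk_soft_targets(teacher_probs, k=20):
    """Keep only the k highest teacher probabilities per frame,
    zero out the rest, and renormalize to a valid distribution."""
    out = np.zeros_like(teacher_probs)
    idx = np.argsort(teacher_probs, axis=1)[:, -k:]
    rows = np.arange(teacher_probs.shape[0])[:, None]
    out[rows, idx] = teacher_probs[rows, idx]
    return out / out.sum(axis=1, keepdims=True)

def distillation_loss(student_log_probs, soft_targets):
    # Cross-entropy between the truncated teacher distribution and
    # the student's per-frame output.
    return -np.mean(np.sum(soft_targets * student_log_probs, axis=1))

# Toy teacher output: 3 frames, 100 senones (real models emit thousands).
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 100))
teacher_probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

soft = topk_soft_targets(teacher_probs, k=20)
# A uniform student for illustration; training would minimize this loss.
student_log_probs = np.log(np.full((3, 100), 1 / 100))
loss = distillation_loss(student_log_probs, soft)
```

Senones outside the top k contribute nothing to the loss, so the student’s gradient signal concentrates on the candidates the teacher considers plausible.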

In tests, the team employed two additional corpora: a set of clean audio samples and a set of samples to which they added noise. The best-performing student model, they say, was first optimized according to the per-frame output from the teacher model, using the entire 8,000 hours of data with noise added and then trained on the 800 hours of annotated data. Relative to a teacher model trained on 800 hours of hand-labeled clean data, it showed a 10 percent decrease in error rate on the clean test data, a 29 percent decrease on the noisy test data, and a 20 percent decrease on re-recorded noisy data.

The research is scheduled to be presented at the International Conference on Acoustics, Speech, and Signal Processing in Brighton this spring.
