Alexa speech normalization AI reduces errors by up to 81%

Text normalization is a fundamental processing step in most natural language systems. In the case of Amazon’s Alexa, a spoken request like “book me a table at 5:00 p.m.” might be transcribed by the assistant’s automatic speech recognizer as “five p m” and further reformatted to “5:00PM.” Conversely, Alexa might convert “5:00PM” to “five p m” for its text-to-speech synthesizer.

So how’s this work? Currently, Amazon’s voice assistant relies on “thousands” of handwritten normalization rules for dates, email addresses, numbers, abbreviations, and other expressions, according to Alexa AI group applied scientist Ming Sun and Alexa Speech machine learning scientist Yuzong Liu. That’s all well and good for English, but because the approach can’t be adapted to other languages without lots of manual labor, Amazon scientists are investigating a more scalable technique driven by machine learning.
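To make the idea of a handwritten rule concrete, here is a minimal sketch of one such rule pair for on-the-hour times, written in Python. It is not Amazon's rule set: the lookup table, the regular expressions, and the function names `spoken_to_written` and `written_to_spoken` are all hypothetical, and a real system would need many more rules to handle minutes, dates, and other expressions.

```python
import re

# Hypothetical lookup table for illustration; it only covers hours one to twelve.
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12,
}
HOUR_WORDS = {value: word for word, value in NUMBER_WORDS.items()}

# Matches recognizer-style output such as "five p m".
SPOKEN_TIME = re.compile(r"\b(" + "|".join(NUMBER_WORDS) + r") ([ap]) m\b")
# Matches written on-the-hour times such as "5:00PM".
WRITTEN_TIME = re.compile(r"\b(1[0-2]|[1-9]):00([AP])M\b")


def spoken_to_written(text):
    """Rewrite spoken-form times ("five p m") as written form ("5:00PM")."""
    def repl(match):
        hour = NUMBER_WORDS[match.group(1)]
        return f"{hour}:00{match.group(2).upper()}M"
    return SPOKEN_TIME.sub(repl, text)


def written_to_spoken(text):
    """Rewrite written-form times ("5:00PM") as spoken form ("five p m")."""
    def repl(match):
        word = HOUR_WORDS[int(match.group(1))]
        return f"{word} {match.group(2).lower()} m"
    return WRITTEN_TIME.sub(repl, text)


print(spoken_to_written("book me a table at five p m"))
print(written_to_spoken("book me a table at 5:00PM"))
```

Even this narrow case needs a word table and two patterns, and all of it is specific to English, which hints at why a rule set covering every expression type grows into the thousands and ports poorly to other languages.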

In a preprint paper (“Neural Text Normalization with Subword Units”) scheduled to be presented at the North American Chapter of the Association for Computational Linguistics (NAACL), Sun, Liu, and colleagues describe an AI text normalization system that breaks words in input and output streams into smaller strings of characters called subword units. These subword units, Sun and Liu explain in a blog post, reduce the number of inputs that the machine learning model must learn, and help to clear up ambiguity in snippets like “Dr.” (which could mean “doctor” or “Drive”) and “2/3” (which could mean “two-thirds” or “February third”).

Furthermore, subword units help the AI model decide how to treat input words it hasn’t seen before. Unfamiliar words might contain familiar subword components, and those might be enough to help the model decide on a course of action.

The researchers’ system created subword units by reducing words in a training data set to individual characters, which an algorithm ingested to identify the most commonly occurring two-character units, then three-character units, and so on until it reached capacity (around 2,000 subwords). These components were used to train an AI system to output subword units, which a separate algorithm stitched together into complete words.
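The procedure described above resembles byte-pair-encoding-style merging, although the article does not name the algorithm, so treat the following Python sketch as an illustration under that assumption and not as the researchers' implementation. The function names (`learn_merges`, `merge_pair`, `segment`, `stitch`) and the toy word list are hypothetical, and `num_merges` stands in for the roughly 2,000-subword capacity mentioned above.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of symbols."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_merges(words, num_merges):
    """Learn up to `num_merges` merges, most frequent adjacent pair first."""
    # Every word starts out as a sequence of single characters.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single unit
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word, seen or unseen, into subword units using learned merges."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


def stitch(subwords):
    """Join subword units back into a complete word."""
    return "".join(subwords)


# Toy training words, invented for this example.
training_words = ["lower", "lowest", "newer", "newest", "wider"]
merges = learn_merges(training_words, num_merges=10)
pieces = segment("lowered", merges)  # "lowered" is not in the training words
print(pieces, "->", stitch(pieces))
```

Because segmentation falls back to single characters when no merge applies, a word the model never saw during training still breaks into known units, and joining those units always reproduces the original word.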

The researchers say that their system, trained on 500,000 examples from a public data set, achieved a 75% reduction in error rate compared with the best-performing machine learning system previously reported, and a 63% reduction in latency, or the time it takes to receive a response to a single request. By factoring in additional information such as words’ parts of speech, position within the sentence, and capitalization, it improved further, to an error rate reduction of 81% and a word error rate of just 0.2%.
