Learning English with Alexa
A new learning experience that will favor the Spanish-speaking population in the United States.
MORE IN THIS SECTION
Earlier this year, Alexa launched a new experience in Spain to help Spanish speakers learn English from scratch.
Developed in collaboration with Vaughn, the leading English learning provider in Spain, the immersive English program focuses on optimizing users' pronunciation.
Daniel Zhang, an applied scientist with Alexa AI, announced via his Amazon Science blog:
We are now expanding this offering to Mexico and the Spanish-speaking population of the United States, and we plan to add more languages in the future. This learning experience includes structured lessons in vocabulary, grammar, expression and pronunciation, with practical exercises and tests.
Zhang points out that the most important thing about this new experience is its pronunciation function, which provides accurate information whenever a customer mispronounces a word or phrase.
“At this year's International Conference on Acoustics, Speech, and Signal Processing (ICASSP), we presented a paper describing our innovative method of mispronunciation detection,” said Zhang.
The method implemented by Alexa uses a novel recurrent neural network (RNN-T) phonetic model that predicts phonemes, the smallest units of speech, from the learner's pronunciation.
In this way, the model can provide a detailed assessment of pronunciation at the word, syllable or phoneme level, meaning that if a learner mispronounces, for example, the word "rabbit" as “rabid,” the model will detect the part that is mispronounced and will compare it with the correct sequence.
To try it out, set your device's language to Spanish and tell Alexa “I want to learn English.”
Zhang highlights what he calls two knowledge gaps that have not been addressed in previous pronunciation models.
By designing a multilingual pronunciation lexicon and creating a vast mixed phonetic dataset for the learning program, the program has the ability to distinguish similar phonemes in different languages, for example, the "r" rolled in Spanish vs. the "r" in English.
“The other knowledge gap is the ability to learn unique mispronunciation patterns from language learners, taking advantage of the autoregressiveness of the RNN-T model, that is, the dependence of its results on previous inputs and outputs. This context awareness means that the model can capture frequent patterns of mispronunciation from the training data,” explains Zhang.
The Science Behind the Experience
Noting that it is usual to use a sequence-to-sequence model in grammatical error correction tasks, in this case the direction of the task was reversed by training the model to mispronounce words instead of correcting pronunciation errors.
In addition, to further enrich and diversify the generated L2 phoneme sequences, a preference-aware diversified decoding component was proposed that combines a diversified bundle search with preference loss that leans towards mispronunciations similar to those humans.
In short, the phoneme paraphraser can generate realistic L2 phonemes for speakers of a specific location, for example, phonemes that represent a native Spanish speaker speaking English.
“We are currently studying various methods to improve our pronunciation evaluation feature. One of them is the creation of a multilingual model that can be used to assess pronunciation in many languages. We are also expanding the model to diagnose more mispronunciation features, such as tone and lexical stress,” added Zhang.