Source: MIT Technology Review
“I see speech approaching a point where it could become so reliable that you can just use it and not even think about it,” says Andrew Ng, Baidu’s chief scientist and an associate professor at Stanford University. “The best technology is often invisible, and as speech recognition becomes more reliable, I hope it will disappear into the background.”
In recent years, thanks to some impressive advances in machine learning, voice control has become a lot more practical.
No longer limited to just a small set of predetermined commands, it now works even in a noisy environment like the streets of Beijing or when you’re speaking across a room.
Jim Glass, a senior research scientist at MIT who has been working on voice technology for the past few decades, agrees that the timing may finally be right for voice control. “Speech has reached a tipping point in our society,” he says. “In my experience, when people can talk to a device rather than via a remote control, they want to do that.”
Last November, Baidu reached an important landmark with its voice technology, announcing that its Silicon Valley lab had developed a powerful new speech recognition engine called Deep Speech 2. It consists of a very large, or “deep,” neural network that learns to associate sounds with words and phrases as it is fed millions of examples of transcribed speech.
Deep Speech 2 can recognize spoken words with stunning accuracy. In fact, the researchers found that it can sometimes transcribe snippets of Mandarin speech more accurately than a person.
Baidu’s progress is all the more impressive because Mandarin is phonetically complex and uses tones that transform the meaning of a word. Deep Speech 2 is also striking because few of the researchers in the California lab where the technology was developed speak Mandarin, Cantonese, or any other variant of Chinese. The engine essentially works as a universal speech system, learning English just as well when fed enough examples.