The Star Trek dream of unlimited speech recognition has come and gone. We were promised a world of unprecedented man-machine interface compatibility – seamless integration of human communication with that of the computational world. We were promised artificial intelligence that could understand our speech, our desires, our needs. We were promised a new world where technology would be the answer to our own limitations, our own inability to communicate with one another. But the promise has not been kept.
Automatic dictation systems suffer from frequent misrecognitions, rely on massive amounts of collected data, and are still far from perfect. Automatic voice activation, navigation, and Interactive Voice Response (IVR) systems have all used automatic speech recognition, with varying levels of customer acceptance and, more importantly, often poor levels of customer satisfaction. Automatic speech recognition for interfacing with household appliances, toys, and other day-to-day objects has been tried, sometimes successfully, sometimes not, but always at the mercy of the uncompromising hand of product acceptance. Will people actually use it? Will consumers actually buy it? Do people actually like it?
Millions upon millions of dollars in research and development have been poured into speech recognition, yet the return has been less than spectacular. So what is the basic problem here? Is it that so much was promised that the task was impossible from the beginning? Or perhaps the problem lies in our inability to use speech recognition in ways that are best suited to the technology and its basic limitations.
So, what is automatic speech recognition really good at? It is really good at recognizing closed sets of possible utterances. Where can it be used in ways that people actually like? This is a very good question – some would say the ‘million dollar question’. Perhaps it would be better to phrase it like this: where would people actually prefer speaking from a closed set of utterances, rather than the free, open-ended responses one is normally accustomed to using when speaking? Let’s face it: if I want to reserve a ticket, I would much rather talk to a person who can understand my natural, freely produced speech than to a machine that only allows me to answer in a highly constrained way, or worse, fails to recognize what I’ve said the moment I stray from its tightly controlled ‘speaking path’. So who would rather speak in a constrained manner?
The answer is actually quite simple: those who are unsure of how to answer. A good example is second language learners, who would much prefer to answer with utterances provided to them rather than ones they must think up from scratch. If you really think about it, this is a case where constraining language usage actually serves a purpose for the human, not the machine. The language student is unsure of the proper words, the exact grammatical constructions, cultural conventions, and the like. Language learning involves providing such a student with examples of usage in ordinary conversations, highly constrained in terms of vocabulary, syntax, and dialog structure. Rather than constraining the language so that the technology can handle speech, the use of automatic speech recognition for language learning exploits these constraints in a pedagogically motivated way. So a good application of speech recognition is in the realm of language learning – one that uses the technology for what it does best, while keeping the best interests of humans at heart.
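To make the closed-set idea concrete, here is a minimal sketch of how a language-learning exercise might score a learner's recognized utterance against the small set of answers a lesson expects. All names here are hypothetical, and the transcript stands in for whatever text a real recognizer would produce; matching uses simple word-sequence similarity from Python's standard library, not any particular ASR product's API.

```python
import difflib

def normalize(text):
    # Lowercase and strip punctuation so minor transcription
    # differences do not affect matching.
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return cleaned.split()

def best_match(transcript, expected_utterances, threshold=0.75):
    """Return the closest utterance from the closed set, or None.

    `transcript` is the recognizer's text output; `expected_utterances`
    is the closed set of answers the lesson accepts from the learner.
    """
    words = normalize(transcript)
    best, best_score = None, 0.0
    for candidate in expected_utterances:
        score = difflib.SequenceMatcher(None, words, normalize(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None

# Closed set of responses for a lesson prompt such as
# "How would you order a coffee?"
answers = [
    "I would like a coffee please",
    "Could I have a coffee",
    "One coffee please",
]

print(best_match("i would like a coffee, please!", answers))
# -> "I would like a coffee please"
print(best_match("where is the train station", answers))
# -> None: the utterance strayed off the expected path
```

The design point is exactly the one argued above: the small answer set is not a workaround for the technology's limits but part of the lesson itself, so an off-path utterance returning None is a teaching moment, not merely a recognition failure.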