Monday, June 8, 2009

Basics of speech recognition

I worked with an early speech-recognition start-up in the first few years of this decade.

(I say "early" because the technology really was the current state of speech recognition; we were just ironing out the kinks, including working closely with the actual speech-engine vendors to improve their products.)

A primer on speech recognition:
  • There are two main types, speaker-dependent and speaker-independent. Speaker-dependent speech recognition is optimized for a given user: mainly one person uses it, and the system should be very, very good at understanding what that one person is saying. Many people have this on their computers and use it for dictation (e.g. Dragon NaturallySpeaking). David Pogue of the New York Times has been pretty happy with it.
  • Speaker-independent speech recognition is set up to handle anyone talking to it, whatever the accent, speed, or tone of voice. This is what you'll encounter on the telephone, e.g. at most airlines. It's generally very good, but understandably not quite as good as speaker-dependent speech recognition -- say, 95% accuracy instead of 98% (note: these percentages are illustrative, not exact; I haven't seen recent good data on this subject).
  • ASR -- acronym for automated speech recognition: the system that compares incoming speech against its vocabulary and decides what was said.
  • DTMF -- acronym for dual-tone multi-frequency. Basically refers to touchtone keypad-punching: each key press sounds two frequencies at once, one for the key's row and one for its column (see the sketch after this list).
  • Voice recognition -- recognizing the identity of the person who is speaking. This is different from speech recognition, but the term is often used incorrectly as a synonym.
  • Voice authentication -- verifying the claimed identity of a person using their speech. Again, this is different from speech recognition.
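To make the "dual-tone" part of DTMF concrete, here is a minimal Python sketch (my illustration, not anything from the system I worked on) that generates the two-frequency waveform for a single keypad press. The row and column frequencies are the standard DTMF assignments; the sample rate and duration are arbitrary choices for the example.

import numpy as np

# Standard DTMF frequency assignments: each key sounds one "row"
# frequency plus one "column" frequency simultaneously.
ROW_HZ = [697, 770, 852, 941]   # keypad rows, top to bottom
COL_HZ = [1209, 1336, 1477]     # keypad columns, left to right
                                # (1633 Hz is the rarely used fourth column)
KEYPAD = ["123", "456", "789", "*0#"]

def dtmf_tone(key, sample_rate=8000, duration=0.2):
    """Return the two-frequency waveform for one keypad press.
    sample_rate and duration are illustrative, not part of any standard."""
    for r, row in enumerate(KEYPAD):
        if key in row:
            f_low, f_high = ROW_HZ[r], COL_HZ[row.index(key)]
            t = np.arange(int(sample_rate * duration)) / sample_rate
            # The sum of the two sinusoids is the "dual tone".
            return 0.5 * (np.sin(2 * np.pi * f_low * t) + np.sin(2 * np.pi * f_high * t))
    raise ValueError("not a DTMF key: %r" % key)

samples = dtmf_tone("5")  # 770 Hz + 1336 Hz, the pair the network decodes as "5"

A decoder on the other end just checks which row frequency and which column frequency are present, which is why DTMF is so much more robust than recognizing actual speech.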
- Bruce
