- First, speech recognition, which allows the machine to catch the words, phrases, and sentences we speak,
- Second, natural language processing, which allows the machine to understand what we say, and
- Third, speech synthesis, which allows the machine to speak.
Speech recognition, or Automatic Speech Recognition (ASR), is central to AI projects such as robotics. Without ASR, it is hard to imagine a cognitive robot interacting with a human. However, building a speech recognizer is not easy.
The difficulty of a speech recognition task can be broadly characterized along a number of dimensions, as discussed below:
1. Size of the vocabulary: The size of the vocabulary affects how easy it is to develop an ASR system. Consider the following vocabulary sizes:
o A small vocabulary consists of 2 to 100 words, as in a voice-menu system
o A medium vocabulary consists of several hundred to several thousand words, as in a database-retrieval task
o A large vocabulary consists of several tens of thousands of words, as in a general dictation task
Note that the larger the vocabulary, the harder recognition becomes.
2. Channel characteristics: Channel quality is also an important dimension. For example, directly recorded human speech is high-bandwidth and covers the full frequency range, while telephone speech is low-bandwidth with a limited frequency range. The latter is harder to recognize.
3. Speaking mode: The ease of developing an ASR system also depends on the speaking mode, that is, whether the speech consists of isolated words, connected words, or continuous speech. Continuous speech is the hardest to recognize.
4. Speaking style: Speech may be read in a formal style, or it may be spontaneous and conversational in a casual style. The latter is harder to recognize.
5. Speaker dependency: A system can be speaker dependent, speaker adaptive, or speaker independent. A speaker-independent system is the hardest to build.
6. Type of noise: Noise is another factor to consider while developing an ASR system. The signal-to-noise ratio (SNR) falls into different ranges, depending on how much background noise the acoustic environment contains:
o If the SNR is greater than 30 dB, it is considered high
o If the SNR lies between 10 dB and 30 dB, it is considered medium
o If the SNR is less than 10 dB, it is considered low
The type of background noise, such as stationary, non-human noise, background speech, and crosstalk from other speakers, also contributes to the difficulty of the problem.
7. Microphone characteristics: The quality of the microphone may be good, average, or below average. The distance between the mouth and the microphone can also vary. These factors should also be considered when building recognition systems.
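The SNR ranges listed under the noise dimension can be illustrated with a short Python sketch. It is only a minimal example under stated assumptions: the function names snr_db and snr_range are hypothetical, the signal and the noise are assumed to be available as separate lists of samples (in practice the noise usually has to be estimated), and the boundary values of exactly 10 dB and 30 dB are assigned to the medium range, which the text above does not specify.

```python
import math


def snr_db(signal, noise):
    """Return the signal-to-noise ratio in decibels.

    Assumes `signal` and `noise` are non-empty sequences of samples
    measured separately (an assumption made for this illustration).
    """
    # Average power is the mean of the squared samples.
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    if signal_power <= 0 or noise_power <= 0:
        raise ValueError("signal and noise must both have non-zero power")
    # SNR in dB is 10 * log10 of the power ratio.
    return 10.0 * math.log10(signal_power / noise_power)


def snr_range(snr):
    """Map an SNR value in dB to the high / medium / low ranges.

    Values of exactly 10 dB and 30 dB are treated as medium here;
    that boundary choice is an assumption, not part of the text.
    """
    if snr > 30:
        return "high"
    if snr >= 10:
        return "medium"
    return "low"


if __name__ == "__main__":
    # Toy data: a constant-amplitude "signal" and a weaker "noise".
    speech = [1.0] * 100
    background = [0.1] * 100
    value = snr_db(speech, background)
    print(round(value, 2), snr_range(value))
```

With these toy samples the power ratio is 100, which corresponds to roughly 20 dB and therefore falls in the medium range.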
Despite these difficulties, researchers have done extensive work on various aspects of speech, such as understanding the speech signal and the speaker, and identifying accents.