You can easily create a simple speaker-dependent audio model for speech recognition by following the VoxForge tutorial, using tools like the Hidden Markov Model Toolkit (HTK), Julius, and CMU Sphinx (see Resources for more on these tools). In less than a few hours, you can have your own 2001: A Space Odyssey HAL 9000 computer responding to your commands and conversing with you with little possibility of finding the door of your room locked if you leave your workstation to go to the bathroom.
However, beginners are sometimes surprised that the recognition accuracy of their first model is unacceptable. It's easy to give up on further experimentation in voice control, thinking that the field has not advanced to a passable level of accuracy. This is far from the truth.
In addition, and perhaps more seriously, the more errors you encounter and the harder you concentrate on speaking carefully into the microphone to suit an apparently balky recognizer, the closer you come to straining your voice as you try to control your vocal cords in unnatural ways to get the expected results. There are good reasons to strive for high accuracy in voice recognition with relaxed input.
Simply adding more audio samples might be an effective remedy for this problem. Adapting a preconfigured speaker-independent model is another possible avenue (see the VoxForge and CMU Sphinx resources for guidance on how you can do this). A third method is to investigate the causes of poor recognition accuracy with programming tools and devise corrections targeted at the fundamental issue.
Programmers can play a role in alleviating poor recognition accuracy. Tools that describe the environment in which you are working can help identify the nature of recognition problems and encourage users to focus quickly on the basic issue and potential solutions.
The HTK provides many tools for the creation phase of the model, and Julius provides descriptive messages about the recognition phase. But there are limits to this approach: For example, the writers of HTK and Julius can make no assumptions about what grammar you want to use. You are responsible for handling errors arising out of your choice of grammar.
The sections that follow discuss some specific problems that can arise, with examples and commentary on whether there is potential for a special tool to help identify and resolve the issue.
Types of problems
Looking at the types of errors that can interfere with accuracy of recognition, you can discern a number of different reasons for errors:
- The familiar concept of "garbage in, garbage out" leads to the fuzzy model problem. Unless the sample prompts provided to the audio model creation process are clean and relevant, HTK cannot create the model properly. For example, a recorded .wav file might be matched with the wrong prompt; the .wav file might be acceptable but deciphered by HTK only with difficulty (perhaps because of volume discrepancies); sounds might be elided because of short silences between words; or clicks, pops, and background sounds might intrude. The solution is to investigate audio file problems and perhaps re-record.
- The narrow model pops up if you try to use a speaker-dependent model trained with a voice other than your own. In this case, your results are likely to be poor. In addition, if you train a model yourself with a wire-connected headset, and then try to use the generated model to issue commands with a Bluetooth headset, chances are that your recognition accuracy will fall below 50%, even though you are using the same voice. And yet further, with the same headset but connected to a different machine, your results will again be a disappointment. You might call this inadequate breadth of the model, and you can remedy it by adding audio files specific to the person, device, or platform where the problem occurs.
- Audio models need both breadth and depth. Individuals have different ways of saying the same thing on different days because of emotional state, atmospheric conditions, and so on. With no change in operating platform, speaker, or headset device, if the speaker has a cold or feels impatient or is slightly tipsy, the voice will change either dramatically or subtly—enough to confuse a model trained under different conditions, leading to a shallow model (1). You can increase the depth by examining the lexicon closely for accurate descriptions of how words sound and editing or adding more lexicon entries if necessary.
- The shallow model (2) is similar to the previous point but not quite the same. On the same day with the same voice, in the same conditions, there are words that we pronounce differently based perhaps on context or whim. For example, Tanzania: is that [Tanza-niya] or [Tan-Zania] or both? Schedule ([sked-ule], [shed-ule]), status ([stay-tus], [stattus])? You can force yourself to be consistent or add more entries in the lexicon. Another subtle issue is long and short silences between words in audio recordings.
- The model is based on phonemes, which are discrete sound representations of what the recognizer expects to hear. The phonemes are stored in the lexicon, and inaccuracies or inconsistencies in the phonemes and their associated sounds can lead to errors. You can smooth out the issue of an incorrect phoneme breakdown of the words used, or of missing alternate pronunciations, by identifying and correcting the entry in the lexicon.
- Unfortunately, many languages offer a poor balance of phoneme usage in normal speech, with the result being an incomplete phonemic balance in the grammar. Where the phonemes are poorly represented because of infrequent use in the language, you can expect problems, because your opportunity to exercise them will be limited. In my own English grammars, phonemes in the set ([ax], [n], [s], [t]) tend to be heavily used, and at the other end of the spectrum, those in the set ([oy], [ar], [el], [ur]) are rarely used. The solution is to identify the problem rules, and then adjust grammar, use synonyms, view output from HDMan, or even invent your own nonstandard terms that exercise the least-used phonemes.
- The phonemically closely related lexemes problem arises from everyday language that coerces us into using words that may be difficult to separate at the recognizer level, so the choice of commands in the grammar is a fine art. My models sometimes confuse pipe and nine: My solution was to change pipe to the compound words pipe_symbol or vertical_bar.
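The phonemic balance problem in the list above lends itself to a quick count. The following is a minimal sketch, not part of the VoxForge tooling, that tallies how often each phoneme occurs across the words of a grammar. It assumes the lexicon layout described later under Tool 3 (three fields separated by five spaces); the function name phoneme_counts and the sample data are hypothetical.

```python
from collections import Counter

SEPARATOR = " " * 5  # assumed field separator, as described for the lexicon


def phoneme_counts(lexicon_lines, grammar_words):
    """Count how often each phoneme occurs across the grammar's words."""
    wanted = set(grammar_words)
    counts = Counter()
    for line in lexicon_lines:
        fields = line.rstrip("\n").split(SEPARATOR)
        # Only well-formed entries for words in the grammar are counted.
        if len(fields) == 3 and fields[0] in wanted:
            counts.update(fields[2].split())
    return counts


# Hypothetical sample data in the lexicon layout used by this article.
lexicon = [
    "BLACK     [BLACK]     b l ae k",
    "BLUE     [BLUE]     b l uw",
    "BROWN     [BROWN]     b r aw n",
]
counts = phoneme_counts(lexicon, ["BLACK", "BLUE", "BROWN"])

# Least-used phonemes first: the grammar barely exercises these.
for phoneme, n in sorted(counts.items(), key=lambda item: (item[1], item[0])):
    print(phoneme, n)
```

Phonemes near the top of the printed list are candidates for extra grammar rules or synonyms. A word with several lexicon entries contributes all of its pronunciations to the count.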
If your grammar has only a few rules and one of them consistently displays recognition problems, then you can focus quite quickly on the problem. However, with larger grammars you need a method of systematically testing the model's accuracy.
Tool 1: Identify problem words
Tool 1 is a testing routine that identifies the word combinations that cause problems and stores them for later review and revision. The benefit is that, rather than adding more samples in general, you add more examples of only those grammar rules that the recognizer has difficulty with, which saves time.
Here, the machine reads the prompts file, stores the entries in an array and randomly shuffles it, then presents the prompts to you one by one on the screen or through audio, asking you to speak that prompt to the recognizer. The script then compares the output from Julius with what it was expecting to hear. If they are not the same, then it records the problem for later analysis and loops to the next prompt.
This script uses Python (see Resources) and stores the issues in a PostgreSQL (see Resources) back end. You can do the same thing with PHP, Perl, and MySQL or other tools. The script makes a few assumptions about your working environment. First, you have a flat file named prompts that consists of a list of testable grammar rules and the associated audio file names. Each line holds the file name followed by the prompt, which can be either a single word or two words. In this tool, you aren't using the audio file name at all, so a separate file with just sample prompts would work as well. Here is an example:
*/mysample1 ONE
*/mysample2 COMPUTER QUIT
Also, you have a PostgreSQL table set up with four fields: an ID number to make records unique and three character-varying fields (one to store the device you are testing with, another to store what the computer asked for, and the third to store what the recognizer thought it heard). In the call shown at the top of Listing 1, testprom.py is the name of the script, headset is an identifier for the headset you are using, and 100 is the number of prompts you want to test before the script stops. The output from Julius is piped into the script.
Listing 1. Test your prompts with Python and PostgreSQL
#
# call with:
# julius (your options) -quiet -C julian.jconf | python testprom.py headset 100
#
import random
import sys

import psycopg2

# database setup
conn = psycopg2.connect("host='xxx' user='yyy' password='zzz' dbname='qqq'")
cur = conn.cursor()

# get the command line arguments
device = sys.argv[1]
limit = int(sys.argv[2])  # convert CLI string to integer

# get the prompts: each line is a file name followed by one or two words
proms = []
with open('prompts', 'r') as f:
    for prompt in f:
        words = prompt.split()
        if len(words) not in (2, 3):
            continue  # skip blank or malformed lines
        thisprom = ' '.join(words[1:])
        if thisprom not in proms:
            proms.append(thisprom)
random.shuffle(proms)

# run the tests; the first utterance is a warm-up and is not scored
i = 0
challengeprom = "please say something"
print(challengeprom)
while i < limit and i < len(proms):
    line = sys.stdin.readline()
    if not line:
        break  # Julius has closed the pipe
    if line[:4] == "sent":
        line = line.replace("sentence1: <s> ", "")
        line = line.replace(" </s>", "")
        heardprom = line.strip()
        if heardprom == challengeprom or i == 0:
            print("OK")
        else:
            sql = "insert into gramerrors (device, prompt, heard) values (%s,%s,%s)"
            data = (device, challengeprom, heardprom)
            cur.execute(sql, data)
            conn.commit()
            print("problem recorded: <" + challengeprom + "> <" + heardprom + ">")
        # present the next prompt
        challengeprom = proms[i]
        print(challengeprom)
        i += 1

# tidy up
cur.close()
conn.close()
In Listing 1 (it is fundamentally a highly specialized dialog manager), you first import the required libraries, then establish the connection to the PostgreSQL back end and read in the command-line parameters. The script then opens the prompts file for reading and reads line by line, storing the prompts in a list but only if they are not already in the list. Once it has finished reading, it randomly shuffles the list.
Next, the script loops through the shuffled prompts starting at the top of the list. It presents each one on the screen (you could also have Festival announce it; see Resources) and waits for output from the speech recognizer, which is triggered when you speak into the microphone; Julius then reports what it hears. The script then compares the expected prompt with the prompt actually heard. Ideally, they will be the same thing, and the script proceeds to the next prompt. If there is a problem, the script records the device, expected prompt, and actual prompt in the back end before proceeding to the next one. It stops when the limit is reached or the file is exhausted.
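The comparison step depends on pulling the recognized words out of the line that Julius prints. As a small sketch of that step in isolation, the hypothetical helper heard_from_julius below handles the "sentence1:" line format that Listing 1 expects and ignores any other line.

```python
def heard_from_julius(line):
    """Return the recognized words from a "sentence1:" line, or None."""
    if not line.startswith("sentence1:"):
        return None  # some other line of recognizer output
    text = line[len("sentence1:"):]
    # Drop the sentence start and end markers, then normalize spacing.
    text = text.replace("<s>", "").replace("</s>", "")
    return " ".join(text.split())


sample = "sentence1: <s> COMPUTER QUIT </s>\n"
heard = heard_from_julius(sample)
print(heard == "COMPUTER QUIT")
```

Keeping the parsing in one function makes it easy to test the script's logic without a microphone or a running recognizer.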
The result is a table of prompts that have been problematic and the device that raised the problem. This table is informative if you regularly use more than one device. As you run more tests on different days with different devices, you build a picture of where problems occur. Is it always the same device? The same prompts? And does Julius consistently output the same incorrect grammar rule? This is your guide for further analysis.
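One way to answer these questions is to tally the recorded rows by device, by prompt, and by confusion pair. The sketch below works on an in-memory list of (device, prompt, heard) tuples, such as you might fetch from the table; the rows shown are invented purely for illustration.

```python
from collections import Counter

# Hypothetical rows as (device, prompt, heard), mirroring the table columns.
rows = [
    ("wired", "PIPE", "NINE"),
    ("bluetooth", "PIPE", "NINE"),
    ("bluetooth", "SEVEN", "ELEVEN"),
    ("bluetooth", "PIPE", "FIVE"),
]

by_device = Counter(device for device, _, _ in rows)
by_prompt = Counter(prompt for _, prompt, _ in rows)
by_confusion = Counter((prompt, heard) for _, prompt, heard in rows)

print(by_device.most_common())      # which device fails most often
print(by_prompt.most_common())      # which prompt fails most often
print(by_confusion.most_common(1))  # the most frequent mix-up
```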
Tool 2: Review existing audio samples
Perhaps your testing shows that one prompt is consistently a problem. Your first suspicion might be that one or more of the audio files for that prompt is bad and needs to be replaced. Maybe the audio file does not match the prompt, or the volume is too high and distorted or too low and HTK had a problem extracting meaningful data. The information you need is in the prompts file. The script in Listing 2 loops through selected prompts and plays the audio.
Listing 2. Review audio samples
#
# called with "$ python audioreview.py word1 word2"
#
import subprocess
import sys

path = './wavs/'
with open('prompts', 'r') as f:
    for line in f:
        words = line.split()
        # words[0] is the audio file name; the rest is the one- or two-word prompt
        if words and words[1:] == sys.argv[1:]:
            mywav = words[0] + '.wav'
            wavfil = path + mywav[2:]  # drop the leading "*/"
            subprocess.call(["aplay", wavfil])  # blocks until playback ends
The script imports a few modules and opens the prompts file for reading. Then, reading one line at a time, it compares the words in the prompts with the words specified on the command line. If they are the same, the script plays the related audio file using a system call to an audio player, in this case aplay. The script pauses as it plays each file. In this case, you are relying on your ears to tell you whether one of the audio files should be replaced. The script is a handy tool to quickly review audio. If you find a bad one, just replace the audio file using your normal recording process.
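Your ears remain the final judge, but a rough volume check can narrow down which files to listen to first. The following sketch reads a recording with Python's standard wave module and reports its loudest sample as a fraction of full scale. It assumes 16-bit PCM files; the function names and the thresholds of 0.1 and 0.99 are illustrative guesses, not values taken from HTK or VoxForge.

```python
import array
import os
import struct
import sys
import tempfile
import wave


def peak_fraction(path):
    """Return the loudest sample as a fraction of full scale (0.0 to 1.0)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples")
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h", frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    if not samples:
        return 0.0
    return max(abs(s) for s in samples) / 32768.0


def describe(path, low=0.1, high=0.99):
    """Label a file 'quiet', 'clipped', or 'ok' using assumed thresholds."""
    peak = peak_fraction(path)
    if peak < low:
        return "quiet"
    if peak >= high:
        return "clipped"
    return "ok"


# Demonstration on a synthetic file, so the sketch runs without a recording.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(16000)
    out.writeframes(struct.pack("<4h", 0, 1000, -16384, 200))
print(describe(demo_path))
```

A file flagged as quiet or clipped is only a candidate for re-recording; play it before you replace it.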
Tool 3: Review the choice of phonemes in the lexicon
The lexicon is a list of words and their phoneme representation. Listing 3 provides an example of what could appear in the lexicon.
Listing 3. Sample lexicon entries
BLACK     [BLACK]     b l ae k
BLUE     [BLUE]     b l uw
BRAVO     [BRAVO]     b r aa v ow
BROWN     [BROWN]     b r aw n
The lexicon contains three fields on each line: the word (lexeme), its representation when printed (grapheme), and finally the ordered phoneme set that represents the sounded word. In this case, the three fields are separated by five spaces. It is important to note this separator, because it allows the split method in Listing 4 to retain the phonemes in the last field as a single unit even though that field contains single-spaced components.
To check one word, you can use a script like that in Listing 4.
Listing 4. Review lexicon phonemes for one word
#
# called with "$ python scanlex.py word"
#
import sys

with open('lextest', 'r') as f:
    for line in f:
        # fields: word, printed form, phonemes (separated by five spaces)
        words = line.rstrip('\n').split(' ' * 5)
        if words[0] == sys.argv[1]:
            print(words)
This script begins by importing required modules and opening the lexicon file for reading. It reads the file line by line and prints the entries for which the first field is the same as the word you are searching for, as specified in the command-line call. Note that more than one entry might appear for any given word, each with a different phonemic representation. The lexicon is alphabetically sorted, so it would be perfectly reasonable to add code to stop processing the file after reading the last "word." I recently had an issue with the word seven; once I added a new entry to the lexicon with the slightly nonstandard phonemic representation [s eh v ih n], the problem went away entirely.
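The early exit mentioned above might be sketched as follows. Because entries for one word sit next to each other in a sorted lexicon, the loop can stop as soon as it has found at least one match and then meets a different word. The function name lookup is hypothetical, and the sample entries are illustrative only.

```python
SEPARATOR = " " * 5  # field separator described for the lexicon


def lookup(lexicon_lines, word):
    """Collect every entry for word, stopping once the file moves past it."""
    found = []
    for line in lexicon_lines:
        fields = line.rstrip("\n").split(SEPARATOR)
        if fields[0] == word:
            found.append(fields)
        elif found:
            break  # entries for one word are adjacent in a sorted lexicon
    return found


# Illustrative entries, including an alternate pronunciation of SEVEN.
lexicon = [
    "SEVEN     [SEVEN]     s eh v ax n",
    "SEVEN     [SEVEN]     s eh v ih n",
    "SIX     [SIX]     s ih k s",
]
entries = lookup(lexicon, "SEVEN")
for entry in entries:
    print(entry)
```

The function accepts any iterable of lines, so you can pass it an open file object as readily as a list.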
Tool 4: Review lexicon choices for the grammar
Tool 4 expands tool 3 by scanning the prompts list and printing the lexicon phoneme representation for each word in the grammar. In addition, it queries Festival for its representation of the word, which can act as a guide to show whether your lexicon version is approximately correct. Listing 5 shows the code.
Listing 5. Review lexicon for grammar words
#
# called with "$ python checkphons.py"
#
import os

# collect the distinct words used in the prompts
wdlist = []
with open('prompts', 'r') as f:
    for line in f:
        for wd in line.split()[1:]:  # skip the audio file name
            if wd not in wdlist:
                wdlist.append(wd)

# read the words and phonemes from the lexicon
lexwd = []
lexphon = []
with open('lextest', 'r') as f:
    for line in f:
        lexs = line.split(' ' * 5)
        if len(lexs) < 3:
            continue  # skip malformed lines
        lexwd.append(lexs[0].strip())
        lexphon.append(lexs[2].strip())

for wd in wdlist:
    if wd in lexwd:
        pos = lexwd.index(wd)
        print(wd, lexphon[pos], flush=True)
        # ask Festival for its own pronunciation of the same word
        cmd = ("/home/colin/downloads/festival/bin/festival -b "
               "'(format t \"%l\\n\" (lex.lookup \"" + wd + "\") ) '")
        os.system(cmd)
Tool 4 works by reading the prompts file and storing words in a list. It then does the same for the words and phonemes in the lexicon. In the final stage, it scans the list of prompt words and outputs the phoneme description for that word from the lexicon. For each word, it calls Festival and prints the Festival version of how it thinks the word should sound as a comparison. Here is some example output when this script is run against the minimal lexicon in Listing 3.
BLACK b l ae k
("black" nil (((b l ae k) 1)))
BLUE b l uw
("blue" nil (((b l uw) 1)))
BROWN b r aw n
("brown" nil (((b r aw n) 1)))
BRAVO b r aa v ow
("bravo" nil (((b r aa v) 1) ((ow) 0)))
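Comparing the two representations by eye gets tedious for larger grammars. As a rough sketch, the helper below flattens a Festival entry of the form shown above into a plain phoneme list so that it can be compared with the lexicon's version. The function names are hypothetical, and the regular expression assumes that each syllable appears as a parenthesized phoneme group followed by a stress digit, as in the sample output.

```python
import re

# One syllable: a parenthesized phoneme group followed by a stress digit.
SYLLABLE = re.compile(r"\(\(([^()]+)\)\s+\d+\)")


def festival_phonemes(entry):
    """Flatten a Festival lexicon entry into a list of phonemes."""
    phonemes = []
    for syllable in SYLLABLE.findall(entry):
        phonemes.extend(syllable.split())
    return phonemes


def agrees(lexicon_phonemes, festival_entry):
    """True when the lexicon and Festival give the same phoneme sequence."""
    return lexicon_phonemes.split() == festival_phonemes(festival_entry)


bravo = '("bravo" nil (((b r aa v) 1) ((ow) 0)))'
print(festival_phonemes(bravo))
print(agrees("b r aa v ow", bravo))
```

A mismatch is not necessarily an error, because Festival and your lexicon may use different phoneme sets; treat it as a prompt to look more closely at that word.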
Note that when examining phonemes, the HTK tool HDMan provides a summary of counts of phonemes used in the grammar, which is generated as part of the output from the VoxForge tutorial. For larger grammars, you can filter in the same way as in tool 3.
Using the tools in this article, a recognition failure rate of one in a thousand seems to me—from my own experiments under excellent, quiet conditions—quite reasonable as a goal. Simply adding many more example prompts from many different sources can help smooth out discrepancies in models, provided there are no serious flaws in grammar and lexicon. It might be possible to achieve this goal more quickly by applying some care and attention to the words and audio files used to build the model. The result is a more satisfying product, a more effective way of communicating with your devices, and more relaxed vocal cords.
Resources
- Find out more about putting together a speech recognition model in the VoxForge tutorial section.
- Get more information on lexemes, graphemes, and phonemes at the Pronunciation Lexicon Specification (W3C Recommendation, 14 October 2008).
- Find more on the Python scripting language.
- Read more about voice recognition using grammars in Look, Ma! No keyboard! Voice input and response using fixed grammars (Colin Beckingham, developerWorks, November 2010).
- Get an overview of open source software working in a voice/speech context in Querying a database using open source voice control software (Colin Beckingham, linux.com, May 2008).
- Browse all of Colin's articles on developerWorks.
- The Open Source developerWorks zone provides a wealth of information on open source tools and using open source technologies.
- In the developerWorks Linux zone, find hundreds of how-to articles and tutorials, as well as downloads, discussion forums, and a wealth of other resources for Linux developers and administrators.
- developerWorks Web development specializes in articles covering various web-based solutions.
- Stay current with developerWorks technical events and webcasts focused on a variety of IBM products and IT industry topics.
- Attend a free developerWorks Live! briefing to get up-to-speed quickly on IBM products and tools, as well as IT industry trends.
- Watch developerWorks on-demand demos ranging from product installation and setup demos for beginners, to advanced functionality for experienced developers.
- Follow developerWorks on Twitter.
Get products and technologies
- Learn more about HTK (Hidden Markov Model Toolkit), a portable toolkit for building and manipulating hidden Markov models.
- Get more information about Carnegie Mellon's CMU Sphinx speech toolkit.
- Find out more about the Julius speech recognition engine.
- Find out more about the Festival text-to-speech engine.
- Get more information on the PostgreSQL relational database system.
- Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.
- Check out developerWorks blogs and get involved in the developerWorks community.
- Get involved in the developerWorks community. Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.