Tweak your audio model for better speech recognition

Fine-tune audio input for accuracy with these strategies and tools

Dealing with an inadequately prepared audio model can be frustrating, particularly for beginners in the speech recognition field who are working with their own speaker-dependent models. Unlike keyboard and mouse input, which is relatively positive in action and easily interpreted by the operating system, audio input to a speech recognizer is less positive and depends heavily on the breadth and depth of the audio model. Programmers can ease the process of analyzing recognition errors by providing tools. A reasonable goal is to move from five errors in 10 to fewer than one in a thousand. This article shows how, using tools built with Python and PostgreSQL.

Colin Beckingham, Writer and Researcher, Freelance

Colin Beckingham is a freelance researcher, writer, and programmer who lives in eastern Ontario, Canada. Holding degrees from Queen's University, Kingston, and the University of Windsor, he has worked in a rich variety of fields including banking, horticulture, horse racing, teaching, civil service, retail, and travel and tourism. The author of database applications and numerous newspaper, magazine, and online articles, his research interests include open source programming, VoIP, and voice-control applications on Linux. You can reach Colin at colbec@start.ca.



12 June 2012


Terminology

  • Grammar. A finite set of single words or combinations of words from which the speech recognizer is allowed to choose
  • Grapheme. How a word is written or printed
  • Lexeme. A general description of how a word is written and pronounced
  • Lexicon. A list of lexemes; in this instance, a file containing on each line how a word is both written and pronounced
  • Phoneme. A marker for a sound; several phonemes strung together represent how a word sounds
  • PLS. See the link to the Pronunciation Lexicon Specification in Resources for more on these terms

You can easily create a simple speaker-dependent audio model for speech recognition by following the VoxForge tutorial, using tools like the Hidden Markov Model Toolkit (HTK), Julius, and CMU Sphinx (see Resources for more on these tools). In less than a few hours, you can have your own 2001: A Space Odyssey HAL 9000 computer responding to your commands and conversing with you with little possibility of finding the door of your room locked if you leave your workstation to go to the bathroom.

However, beginners are sometimes surprised that the recognition accuracy of their first model is unacceptable. It's easy to give up on further experimentation in voice control, thinking that the field has not advanced to a passable level of accuracy. This is far from the truth.

In addition, and perhaps more seriously, the more errors you find and the harder you focus on speaking carefully into the microphone to suit an apparently balky recognizer, the more you may verge on straining your voice as you try to control your vocal cords in unnatural ways to drive the expected results. There are good reasons to strive for high accuracy in voice recognition with relaxed input.

Simply adding more audio samples might be an effective remedy for this problem. Adapting a preconfigured speaker-independent model is another possible avenue (see the VoxForge and CMU Sphinx resources for guidance on how you can do this). A third method is to focus on the problem of poor recognition accuracy using programming tools and devise corrections targeted at the fundamental issue.

Programmers can play a role in alleviating poor recognition accuracy. Tools that describe the environment in which you are working can help identify the nature of recognition problems and encourage users to focus quickly on the basic issue and potential solutions.

The HTK provides many tools for the creation phase of the model, and Julius provides descriptive messages about the recognition phase. But there are limits to this approach: For example, the writers of HTK and Julius can make no assumptions about what grammar you want to use. You are responsible for handling errors arising out of your choice of grammar.

The sections that follow discuss some specific problems that can arise, with examples and commentary on whether there is potential for a special tool to help identify and resolve the issue.

Types of problems

Looking at the types of errors that can interfere with recognition accuracy, you can discern a number of different causes:

  • The familiar concept of "garbage in, garbage out" leads to the fuzzy model problem. Unless the sample prompts provided to the audio model creation process are clean and relevant, HTK cannot create the model properly. For example, a recorded .wav file may be matched with the wrong prompt; a .wav file may be acceptable but decipherable by HTK only with difficulty (perhaps because of volume discrepancies); short silences between words may elide sounds; or clicks, pops, and background sounds may intrude. The solution is to investigate audio file problems and perhaps re-record.
  • The narrow model pops up if you try to use a speaker-dependent model trained with a voice other than your own. In this case, your results are likely to be poor. In addition, if you train a model yourself with a wire-connected headset, and then try to use the generated model to issue commands with a Bluetooth headset, chances are that your recognition accuracy will fall below 50%, even though you are using the same voice. And yet further, with the same headset but connected to a different machine, your results will again be a disappointment. You might call this inadequate breadth of the model, and you can remedy it by adding audio files specific to the person, device, or platform where the problem occurs.
  • Audio models need both breadth and depth. Individuals have different ways of saying the same thing on different days because of emotional state, atmospheric conditions, and so on. With no change in operating platform, speaker, or headset device, if the speaker has a cold or feels impatient or is slightly tipsy, the voice will change either dramatically or subtly—enough to confuse a model trained under different conditions, leading to a shallow model (1). You can increase the depth by examining the lexicon closely for accurate descriptions of how words sound and editing or adding more lexicon entries if necessary.
  • The shallow model (2) is similar to the previous point but not quite the same. On the same day with the same voice, in the same conditions, there are words that we pronounce differently based perhaps on context or whim. For example, Tanzania: is that [Tanza-niya] or [Tan-Zania] or both? Schedule ([sked-ule], [shed-ule]), status ([stay-tus], [stattus])? You can force yourself to be consistent or add more entries in the lexicon. Another subtle issue is long and short silences between words in audio recordings.
  • The model is based on phonemes, which are discrete sound representations of what the recognizer expects to hear. The phonemes are stored in the lexicon, and inaccuracies or inconsistencies in the phonemes and their associated sounds can lead to errors. You can smooth out the issue of incorrect phoneme breakdowns of the words used or missing alternate pronunciations by identifying and correcting the entry in the lexicon.
  • Unfortunately, many languages offer a poor balance of phoneme usage in normal speech, with the result being an incomplete phonemic balance in the grammar. Where the phonemes are poorly represented because of infrequent use in the language, you can expect problems, because your opportunity to exercise them will be limited. In my own English grammars, phonemes in the set ([ax], [n], [s], [t]) tend to be heavily used, and at the other end of the spectrum, those in the set ([oy], [ar], [el], [ur]) are rarely used. The solution is to identify the problem rules, and then adjust grammar, use synonyms, view output from HDMan, or even invent your own nonstandard terms that exercise the least-used phonemes.
  • The phonemically closely related lexemes problem arises from everyday language that coerces us into using words that may be difficult to separate at the recognizer level, so the choice of commands in the grammar is a fine art. My models sometimes confuse pipe and nine: My solution was to change pipe to the compound words pipe_symbol or vertical_bar.

Tools

If your grammar has only a few rules and one of them consistently displays recognition problems, then you can focus quite quickly on the problem. However, with larger grammars you need a method of systematically testing the model's accuracy.

Tool 1: Identify problem words

Tool 1 is a testing routine to identify the word combinations that are a problem and store them for later review and revision. The benefit here is that rather than adding more samples in general, you add more examples only of those grammar rules that the recognizer has difficulty with, which saves time.

Here, the machine reads the prompts file, stores the entries in an array and randomly shuffles it, then presents the prompts to you one by one on the screen or through audio, asking you to speak that prompt to the recognizer. The script then compares the output from Julius with what it was expecting to hear. If they are not the same, then it records the problem for later analysis and loops to the next prompt.

This script uses Python (see Resources) and stores the issues in a PostgreSQL (see Resources) back end. You can do the same thing with PHP, Perl, and MySQL, or with other tools. The approach makes a few assumptions about your working environment: First, you have a flat file of prompts that consists of a list of testable grammar rules and the associated audio file names, arranged as a file name followed by the prompt, which can be either one or two words. (This tool doesn't use the audio file name at all, so a separate file with just the sample prompts would work as well.) Here is an example:

*/mysample1     ONE
*/mysample2     COMPUTER     QUIT

Second, you have a PostgreSQL table set up with four fields: an ID number to make records unique and three character-varying fields, one to store the device you are testing with, another to store what the computer asked for, and a third to store what the recognizer thought it heard.
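If the gramerrors table does not already exist, a one-time setup sketch along the following lines creates it. The table and column names match what Listing 1 inserts; the connection string is a placeholder for your own credentials, and the choice of a serial primary key for the ID column is an assumption, not a requirement.

#
# one-time setup: create the table used by testprom.py
#
import psycopg2
# replace host, user, password, and dbname with your own values
conn = psycopg2.connect("host='xxx' user='yyy' password='zzz' dbname='qqq'")
cur = conn.cursor()
cur.execute("""
  create table gramerrors (
    id serial primary key,      -- unique record number
    device character varying,   -- headset or microphone identifier
    prompt character varying,   -- what the script asked you to say
    heard character varying     -- what the recognizer thought it heard
  )""")
conn.commit()
cur.close()
conn.close()

In Listing 1, testprom.py is the name of the script, headset is an identifier for the headset you are using, and 100 is the number of prompts you want to test before the script stops. The output from Julius is piped into the script.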

Listing 1. Test your prompts with Python and PostgreSQL
#
# call with:
# julius (your options) -quiet -C julian.jconf | python testprom.py headset 100
#
import sys
import string as str
import os
import random
import psycopg2
# database setup
conn = psycopg2.connect("host='xxx' user='yyy' password='zzz' dbname='qqq'")
cur = conn.cursor()
# get the command line arguments
device = sys.argv[1]
limit = int(sys.argv[2]) # convert CLI string to integer
# get the prompts
proms = []
f = open('prompts', 'r')
for prompt in f:
  words = str.split(prompt)
  if len(words) == 2:
    thisprom = words[1]
  if len(words) == 3:
    thisprom = words[1]+' '+words[2]
  if thisprom not in proms:
    proms.append(thisprom)
f.close()
random.shuffle(proms)
# never ask for more prompts than the file provides
if limit > len(proms):
  limit = len(proms)
# run the tests
i = 0
challengeprom = "please say something"
print challengeprom
while i < limit:
  line = sys.stdin.readline()
  if line[:4] == "sent":
    line = str.replace(line,"sentence1: <s> ","")
    line = str.replace(line," </s>","")
    heardprom = str.strip(line)
    if heardprom == challengeprom or i == 0:
      print "OK"
    else:
      sql = "insert into gramerrors (device, prompt, heard) values (%s,%s,%s)"
      data = (device,challengeprom,heardprom)
      cur.execute(sql, data)
      conn.commit()
      print "problem recorded: <"+challengeprom+'> <'+heardprom+'>'
    #
    challengeprom = proms[i]
    print challengeprom
    i += 1
# tidy up
cur.close()
conn.close()

Listing 1 is fundamentally a highly specialized dialog manager. The script first imports the required libraries, then establishes the connection to the PostgreSQL back end and reads the command-line parameters. It then opens the prompts file and reads it line by line, storing each prompt in a list, but only if it is not already in the list. Once it has finished reading, it randomly shuffles the list.

Next, the script loops through the shuffled prompts, starting at the top of the list. It presents each one on the screen (you could also have Festival announce it; see Resources and the sketch below) and waits for output from the speech recognizer, which is triggered when you speak into the microphone and Julius decodes what it hears. The script then compares the expected prompt with the prompt actually heard. Ideally, they are the same, and the script proceeds to the next prompt. If there is a problem, the script records the device, expected prompt, and actual prompt in the back end before proceeding to the next one. It stops when the limit is reached or the prompt list is exhausted.
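As a rough sketch of the Festival option, assuming festival is on your path and reads text from standard input in its --tts mode, a small helper (arbitrarily named announce here) could replace or supplement the print statements in Listing 1:

# optional helper for Listing 1: speak the challenge prompt
# instead of (or as well as) printing it
import os
def announce(text):
  # echo the prompt into Festival's text-to-speech mode
  os.system("echo '" + text + "' | festival --tts")

You would then call announce(challengeprom) at the points where Listing 1 prints the prompt.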

The result is a table of prompts that have been problematic and the device that raised the problem. This table is informative if you regularly use more than one device. As you run more tests on different days with different devices, you build a picture of where problems occur. Is it always the same device? The same prompts? And does Julius consistently output the same incorrect grammar rule? This is your guide for further analysis.
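A short summary script helps answer those questions directly from the back end. This sketch assumes the same connection details and gramerrors table as Listing 1 and simply counts how often each combination of device, expected prompt, and heard prompt has been recorded:

#
# summarize recorded recognition problems, most frequent first
#
import psycopg2
conn = psycopg2.connect("host='xxx' user='yyy' password='zzz' dbname='qqq'")
cur = conn.cursor()
cur.execute("""
  select device, prompt, heard, count(*)
  from gramerrors
  group by device, prompt, heard
  order by count(*) desc""")
for row in cur.fetchall():
  print row
cur.close()
conn.close()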

Tool 2: Review existing audio samples

Perhaps your testing shows that one prompt is consistently a problem. Your first suspicion might be that one or more of the audio files for that prompt is bad and needs to be replaced. Maybe the audio file does not match the prompt, or the volume is so high that the audio is distorted, or so low that HTK had a problem extracting meaningful data. The information you need is in the prompts file. The script in Listing 2 loops through selected prompts and plays the audio.

Listing 2. Review audio samples
#
# called with "$ python audioreview.py word1 word2"
#
import sys
import string as str
import os
f = open('prompts', 'r')
path = './wavs/'
for line in f:
  words = str.split(line)
  if len(words) >= 3 and words[1] == sys.argv[1] and words[2] == sys.argv[2]:
    mywav = words[0] + '.wav'
    wavfil = path + mywav[2:]
    os.system("aplay " + wavfil)

The script imports a few modules and opens the prompts file for reading. Then, reading one line at a time, it compares the words in the prompt with the words specified on the command line. If they are the same, the script plays the related audio file using a system call to an audio player, in this case aplay. The script pauses as it plays each file, and you rely on your ears to tell you whether one of the audio files should be replaced. The script is a handy tool for reviewing audio quickly. If you find a bad file, just replace it using your normal recording process.

Tool 3: Review the choice of phonemes in the lexicon

The lexicon is a list of words and their phoneme representation. Listing 3 provides an example of what could appear in the lexicon.

Listing 3. Sample lexicon entries
BLACK     [BLACK]     b l ae k
BLUE     [BLUE]     b l uw
BRAVO     [BRAVO]     b r aa v ow
BROWN     [BROWN]     b r aw n

The lexicon contains three fields on each line: the word (lexeme), its representation when printed (grapheme), and finally the ordered set of phonemes that represents the sounded word. In this case, the three fields are separated by five spaces. It is important to note this separator: it allows the split method in Listing 4 to keep the phonemes in the last field together as a single unit, because that field itself contains single-spaced components.
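A quick illustration using the BLACK entry shows why the separator matters: the default whitespace split breaks the phoneme field apart, whereas splitting on five spaces keeps it intact.

import string as str
line = "BLACK     [BLACK]     b l ae k"
print str.split(line)           # ['BLACK', '[BLACK]', 'b', 'l', 'ae', 'k']
print str.split(line, " " * 5)  # ['BLACK', '[BLACK]', 'b l ae k']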

To check one word, you can use a script like that in Listing 4.

Listing 4. Review lexicon phonemes for one word
#
# called with "$ python scanlex.py word"
#
import sys
import string as str
f = open('lextest', 'r')
for line in f:
  words = str.split(line," "*5)
  if words[0] == sys.argv[1]:
    print words

This script begins by importing the required modules and opening the lexicon file for reading. It reads the file line by line and prints the entries for which the first field is the same as the word you are searching for, as specified in the command-line call. Note that more than one entry might appear for any given word, each with a different phonemic representation. The lexicon is alphabetically sorted, so it would be perfectly reasonable to add code to stop processing the file once it has read past the word, as in the sketch that follows. I recently had an issue with the word seven; once I added a new entry to the lexicon with the slightly nonstandard phonemic representation [s eh v ih n], the problem went away entirely.
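For reference, a minimal variant of Listing 4 with that early exit might look like the following; like Listing 4, it assumes the lextest file is sorted alphabetically on the first field:

#
# called with "$ python scanlex.py word"
#
import sys
import string as str
f = open('lextest', 'r')
for line in f:
  words = str.split(line, " " * 5)
  if words[0] == sys.argv[1]:
    print words
  elif words[0] > sys.argv[1]:
    break  # past the word alphabetically, so no further matches are possible
f.close()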

Tool 4: Review lexicon choices for the grammar

Tool 4 expands tool 3 by scanning the prompts list and printing the lexicon phoneme representation for each word in the grammar. In addition, it queries Festival for its representation of the word, which can act as a guide to show whether your lexicon version is approximately correct. Listing 5 shows the code.

Listing 5. Review lexicon for grammar words
#
# called with "$ python checkphons.py"
#
import sys
import string as str
import os
with open('prompts', 'r') as f:
  wdlist = []
  for line in f:
    words = str.split(line)
    if len(words) > 1 and words[1] not in wdlist:
      wdlist.append(words[1])
    if len(words) > 2 and words[2] not in wdlist:
      wdlist.append(words[2])
with open('lextest', 'r') as f:
  lexwd = []
  lexphon = []
  for line in f:
    lexs = str.split(line,' '*5)
    lexwd.append(str.strip(lexs[0]))
    lexphon.append(str.strip(lexs[2]))
for wd in wdlist:
  #print wd
  if wd in lexwd:
    pos = lexwd.index(wd)
    print wd, lexphon[pos]
    cmd = ("/home/colin/downloads/festival/bin/festival -b \
	      '(format t \"%l\\n\" (lex.lookup \""+wd+"\") ) '")
    os.system(cmd)

Tool 4 works by reading the prompts file and storing words in a list. It then does the same for the words and phonemes in the lexicon. In the final stage, it scans the list of prompt words and outputs the phoneme description for that word from the lexicon. For each word, it calls Festival and prints the Festival version of how it thinks the word should sound as a comparison. Here is some example output when this script is run against the minimal lexicon in Listing 3.

BLACK b l ae k
("black" nil (((b l ae k) 1)))
BLUE b l uw
("blue" nil (((b l uw) 1)))
BROWN b r aw n
("brown" nil (((b r aw n) 1)))
BRAVO b r aa v ow
("bravo" nil (((b r aa v) 1) ((ow) 0)))

Note that when examining phonemes, the HTK HDMan tool provides a summary of counts of phonemes used in the grammar, which is generated as part of the output from the VoxForge tutorial. For larger grammars, you can filter in the same way as in tool 3.
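If you want a quick count without running HDMan, a small sketch in the same style as the other tools can tally how often each phoneme appears among the words your grammar uses. It assumes the same prompts and lextest files as above (the script name phoncount.py is arbitrary) and prints the least-used phonemes first:

#
# called with "$ python phoncount.py"
#
import string as str
# collect the words used in the prompts file
wdlist = []
f = open('prompts', 'r')
for line in f:
  for wd in str.split(line)[1:]:
    if wd not in wdlist:
      wdlist.append(wd)
f.close()
# tally the phonemes of those words from the lexicon
counts = {}
f = open('lextest', 'r')
for line in f:
  lexs = str.split(line, " " * 5)
  if len(lexs) >= 3 and str.strip(lexs[0]) in wdlist:
    for phon in str.split(str.strip(lexs[2])):
      counts[phon] = counts.get(phon, 0) + 1
f.close()
# least-used phonemes first
for phon in sorted(counts, key=counts.get):
  print phon, counts[phon]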


Conclusion

Using the tools in this article, a recognition failure rate of one in a thousand seems to me—from my own experiments under excellent, quiet conditions—quite reasonable as a goal. Simply adding many more example prompts from many different sources can help smooth out discrepancies in models, provided there are no serious flaws in grammar and lexicon. It might be possible to achieve this goal more quickly by applying some care and attention to the words and audio files used to build the model. The result is a more satisfying product, a more effective way of communicating with your devices, and more relaxed vocal cords.

Resources


Get products and technologies

  • Learn more about HTK (Hidden Markov Model Toolkit), a portable toolkit for building and manipulating hidden Markov models.
  • Get more information about Carnegie Mellon's CMU Sphinx speech toolkit.
  • Find out more about the Julius speech recognition engine.
  • Find out more about the Festival text-to-speech engine.
  • Get more information on the PostgreSQL relational database system.
