Look, Ma! No keyboard! Voice input and response using fixed grammars

Use SRGS to direct the building of both the grammar for a voice recognition acoustic model and the dialog manager that uses it

A variety of plain-text, application-specific formats exist for defining the non-natural language grammars used to prepare a voice recognition model. Programmers can use the Speech Recognition Grammar Specification (SRGS) not only to express many of these formats in an open-standards structure, but also to define rules for the dialog manager that interprets the output generated by the recognition model. This article explores SRGS and Semantic Interpretation for Speech Recognition (SISR)-like methods using PHP in the context of non-natural language-specific grammars.

Colin Beckingham, Writer and Researcher, Freelance

Colin Beckingham is a freelance researcher, writer, and programmer who lives in eastern Ontario, Canada. Holding degrees from Queen's University, Kingston, and the University of Windsor, he has worked in a rich variety of fields including banking, horticulture, horse racing, teaching, civil service, retail, and travel and tourism. The author of database applications and numerous newspaper, magazine, and online articles, his research interests include open source programming, VoIP, and voice-control applications on Linux. You can reach Colin at colbec@start.ca.



09 November 2010


State of voice recognition

Frequently used acronyms

  • DTD: Document Type Definition
  • HTK: Hidden Markov Model Toolkit
  • URI: Uniform Resource Identifier
  • W3C: World Wide Web Consortium
  • XML: Extensible Markup Language

As computers diminish in size and increase in portability, the need to interact with them without using a keyboard or mouse increases. Voice is an alternative. Superficially, much less bandwidth is available with voice communications than with visual interaction. As a result of the impression that a picture equals a thousand words, computers display to screens in response to mechanical peripheral input much more readily than they accept audio input and respond in a like manner.

Successful techniques already exist, however limited, to issue instructions to smaller computers with voice. The goal is to have a computer react to speech and take a specific action based on the command. The general process to achieve this goal is to build (or adapt) a model, apply a spoken command against that model in a recognition process, and then decide on an action in a dialog manager. Models can be broad in the sense of recognizing a variety of voices but few commands, or you can train your own model from a specific grammar that gives the possibility of quite complex interpretation and interaction. Figure 1 shows the speech recognition and interpretation development flow.

Figure 1. Speech recognition and interpretation development flow

It is important to distinguish between natural language processing (NLP) and specific n-gram grammars. The latter state categorically what the recognizer can expect to hear, while the former does its best to decode natural language into a simpler structure by discarding some elements of the received speech and rearranging others. While technologies such as SRGS and SISR lean mostly toward the processing of natural language, programmers look for ways to use those same tools for other types of grammars.

This article uses an SRGS approach to defining fixed grammars and addresses the issues of out-of-vocabulary (OOV) prompts and context in the dialog manager using the example of a set of 2-gram or bigram examples.


The programming perspective

From a programming perspective, a few issues are important. First, you can express a grammar in a number of different ways, depending on which speech recognizer application you use. Second, those grammars are poorly coordinated with the dialog managers that use them. Dialog managers are important in their own right because they offer ways to deal with issues such as intelligent response and OOV recognition. Third, because you are talking about specific grammars designed to do a specific job, you have to rewrite or adapt the model and the dialog manager for each one. The more you can autogenerate them, the better.

Intelligent response implies that the computer takes context into account. If your question asks for a temperature and your previous question was about your computer, then it must be the computer's temperature that you need.
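
The following is a minimal sketch of that idea, not part of the article's download. The resolve_subject() helper and the $last_context variable are hypothetical names chosen for illustration; the sketch assumes that the dialog manager stores the subject of the previous prompt.

```php
<?php
// Hypothetical helper: pick the subject of a request. An explicit subject
// wins; otherwise fall back to the subject of the previous prompt.
function resolve_subject($explicit_subject, $last_context) {
  if ($explicit_subject !== null) {
    return $explicit_subject;
  }
  return $last_context; // may still be null if there was no earlier prompt
}

// The previous question was about the computer.
$last_context = 'computer';
// The new question asks for a temperature but names no subject.
$subject = resolve_subject(null, $last_context);
echo "Report the temperature of: ", $subject, "\n";
```

Keeping the fallback in one small function means that the rest of the dialog manager does not need to know where the subject came from.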

OOV is a common problem in small grammars. Put simply, some prompts matter to the dialog manager, while others are necessary for building the model but should be ignored in later processing.

Autogeneration is straightforward with scripting languages such as Bash, Perl, and PHP, or with compiled languages such as C and C++, provided that there are clear rules. SRGS is designed to encapsulate those rules.


An example

Listing 1 is a plain-text grammar in prototype.

Listing 1. Prototype of grammar
COMPUTER WAKE
COMPUTER STATUS
COMPUTER SLEEP

The grammar in Listing 1 is quite simple and specific. It tells the recognizer to expect only three possible prompts. Each prompt starts with the word COMPUTER and can be followed by WAKE, STATUS, or SLEEP. No other commands are possible. The speech recognizer has only one job, which is to choose whichever of the three options it considers to be the closest to what it heard and pass that command to the next stage. For instance, if I say MAKE COFFEE, it returns COMPUTER plus one of the three alternative words. The dialog manager should apply some intelligence. For example, if it hears COMPUTER SLEEP, it should not respond to any more commands until it hears COMPUTER WAKE. It should respond to COMPUTER STATUS only if it is in a WAKE state, at which point it can announce the processor temperature, free space on disk, and a whole host of other interesting things. It is not a practical grammar by any means: when you build an acoustic model from grammars as small as this, you soon run into problems with insufficient samples. This prototype is intended only as an illustration of the principle.

Training a computer to recognize spoken sounds and apply grammar rules to what it hears is a fairly straightforward process, even in the world of open source. For complete guidance about how to build an effective speech recognition system using a fixed vocabulary, see the VoxForge site. The VoxForge tutorials use tools such as HTK from Cambridge University and the Julius voice recognition engine, which is maintained at the Nagoya Institute of Technology in Japan. See Resources for links to all of these sites.

Building an audio model with HTK requires that you express the grammar in a particular format, as in Listing 2.

Listing 2. Grammar in HTK format
$major =  COMPUTER ;
$minor = WAKE | STATUS | SLEEP ;
( SENT-START ( $major $minor ) SENT-END )

The same process with the Julius engine requires a slightly different format, as in Listing 3.

Listing 3. Grammar in Julius format
S : NS_B SENT NS_E
SENT: MAJOR MINOR
MAJOR: COMPUTER
MINOR: WAKE STATUS SLEEP

The HTK and Julius formats share structural similarities from a programming viewpoint, but they are sufficiently different that they are not interchangeable.

Listing 4 shows a basic dialog manager in PHP that can deal with this grammar.

Listing 4. A plain dialog manager
<?php
...
function dm($prompt_heard) {
  global $wake_state; // FALSE is asleep so do not respond, TRUE is awake
  // split the prompt into words: $parts[0] is the major, $parts[1] the minor
  $parts = explode(" ", trim($prompt_heard));
  // guard against a one-word prompt so that $minor is always defined
  $minor = isset($parts[1]) ? $parts[1] : '';
  switch ($minor) {
    case 'WAKE':
      $wake_state = TRUE;
    break;
    case 'SLEEP':
      $wake_state = FALSE;
    break;
    case 'STATUS':
      if ($wake_state) {
        announce_status();
      } else {
        // asleep - do nothing
      }
    break;
    default:
      // OOV - any other prompt, just ignore it.
    break;
  }
}
?>

This PHP function receives the result from the recognizer as an argument in $prompt_heard. The $wake_state variable is declared as a global and is known throughout the dialog manager. The explode() call splits the prompt into its parts. In this example, you know that the first part will be COMPUTER, so the switch that follows looks only at the minor part. If the minor part is WAKE, the function sets the wake state to TRUE. If it is SLEEP, the function sets the wake state to FALSE. If it is STATUS, the function announces the result, but only if the wake state is TRUE. You can announce the status to speakers using a process similar to the one explored in the developerWorks article "PHP bees and audio honey: Accessible agent-based audio alerts and feedback" (see Resources). The default section catches any additional prompts that you add to exercise the recognizer. For unrecognized prompts, do nothing, except perhaps write to a log for later troubleshooting.
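
To watch the state changes without audio hardware, a sketch along the following lines might help. It is not the article's code: dm_demo() is a hypothetical variant of dm() that takes the wake state by reference instead of using a global and returns a short description of what it did, and announce_status() is replaced by a stub.

```php
<?php
// Stand-in for the article's announce_status(); a real one would speak.
function announce_status() {
  return "status announced";
}

// Same decisions as Listing 4, but the wake state is passed by reference
// and the outcome is returned as text (both are assumptions of this sketch).
function dm_demo($prompt_heard, &$wake_state) {
  $parts = explode(" ", trim($prompt_heard));
  $minor = isset($parts[1]) ? $parts[1] : '';
  switch ($minor) {
    case 'WAKE':
      $wake_state = TRUE;
      return "awake";
    case 'SLEEP':
      $wake_state = FALSE;
      return "asleep";
    case 'STATUS':
      return $wake_state ? announce_status() : "ignored (asleep)";
    default:
      return "ignored (OOV)";
  }
}

$state = FALSE; // start asleep
$session = array('COMPUTER STATUS', 'COMPUTER WAKE',
                 'COMPUTER STATUS', 'COMPUTER SLEEP');
foreach ($session as $prompt) {
  echo $prompt, " -> ", dm_demo($prompt, $state), "\n";
}
```

Run as written, the sketch reports that the first STATUS prompt is ignored because the computer is asleep and that the second one is answered after WAKE.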

From a programming efficiency point of view, if you can even partly autogenerate the grammar and the dialog manager, you'll save time and effort plus improve accuracy. The efforts to create even higher level grammars to do this (see Resources) are largely in the realm of NLP.


How SRGS can help

SRGS is intended to express the same ideas as Listing 2 and Listing 3 but in the more rigorous structure provided by XML, as in Listing 5.

Listing 5. Prototype in SRGS format
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE grammar PUBLIC "-//W3C//DTD GRAMMAR 1.0//EN"
                  "http://www.w3.org/TR/speech-grammar/grammar.dtd">
<grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
         xsi:schemaLocation="http://www.w3.org/2001/06/grammar 
                             http://www.w3.org/TR/speech-grammar/grammar.xsd"
         version="1.0" 
         mode="voice" 
         root="myroot">
<meta name="author" content="Colin Beckingham"/>
<rule id="myroot" scope="public">
  <example> COMPUTER WAKE </example>
  <example> COMPUTER SLEEP </example>
  <example> COMPUTER STATUS </example>
  <ruleref uri="#command1"/>
  <!-- ruleref uri="#command2"/ -->
</rule>
<rule id="command1">
  <ruleref uri="#major"/> <ruleref uri="#minor"/>
</rule>
<rule id="major">
   <one-of>
      <item> COMPUTER </item>
    </one-of>
</rule>
<rule id="minor">
    <one-of>
      <item> WAKE </item>
      <item> SLEEP </item>
      <item> STATUS </item>
    </one-of>
</rule>
</grammar>

Listing 5 shows the customary XML declaration, followed by a DOCTYPE statement that locates the DTD, in this case the one for grammars. The root element <grammar> follows. The grammar element carries a number of important attributes, including namespaces, the mode (which in this case is voice because the destination for the data is a speech recognizer), and the ID of the root rule, which is the place to start when looking for prompts to match (in this case, myroot). The myroot rule contains rulerefs that point to other rules. The DTD also permits <example> elements, which are informational only. The rule with ID command1 follows, and then two more rules, major and minor. The major rule contains the single word COMPUTER. The minor rule contains the alternatives WAKE, SLEEP, and STATUS. These alternatives are <item> elements in a <one-of> structure, which indicates that one and only one can apply at a time. Through the <ruleref> elements, starting at the root rule, the major and minor rules become part of the overall rule structure.

In summary, the rules follow this structure:

  1. The grammar attribute root points at the root rule myroot.
  2. The myroot rule contains one or more rulerefs, each of which has a URI that points to another rule, in this case command1, which is the master rule for an n-gram. The ruleref for command2 is commented out because it is a placeholder for an as yet nonexistent master rule.
  3. Master rule command1 contains rulerefs that show that other rules are used, major followed by minor.
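
The chain of references described above can be followed in a few lines of PHP. The following is a sketch only: find_rule() and list_prompts() are hypothetical helpers, the grammar is a trimmed copy of Listing 5 held in a string, and the code assumes that every ruleref points at an existing rule whose words sit in a single <one-of> block.

```php
<?php
// Trimmed copy of the Listing 5 grammar, held in a string for the demo.
$srgs = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en"
         version="1.0" mode="voice" root="myroot">
<rule id="myroot" scope="public">
  <ruleref uri="#command1"/>
</rule>
<rule id="command1">
  <ruleref uri="#major"/> <ruleref uri="#minor"/>
</rule>
<rule id="major">
  <one-of><item> COMPUTER </item></one-of>
</rule>
<rule id="minor">
  <one-of>
    <item> WAKE </item>
    <item> SLEEP </item>
    <item> STATUS </item>
  </one-of>
</rule>
</grammar>
XML;

// Hypothetical helper: return the <rule> element with the given id, or null.
function find_rule($xml, $id) {
  foreach ($xml->rule as $rule) {
    if ((string) $rule['id'] === $id) {
      return $rule;
    }
  }
  return null;
}

// Hypothetical helper: follow root -> master rule -> word rules and
// list every prompt that the grammar allows.
function list_prompts($xml) {
  $prompts = array();
  $root = find_rule($xml, trim((string) $xml['root']));
  foreach ($root->ruleref as $master_ref) {
    $master = find_rule($xml, substr((string) $master_ref['uri'], 1));
    $partial = array(''); // prompts built so far for this master rule
    foreach ($master->ruleref as $word_ref) {
      $word_rule = find_rule($xml, substr((string) $word_ref['uri'], 1));
      $next = array();
      foreach ($partial as $prefix) {
        foreach ($word_rule->{'one-of'}->item as $item) {
          $next[] = trim($prefix . ' ' . trim((string) $item));
        }
      }
      $partial = $next;
    }
    $prompts = array_merge($prompts, $partial);
  }
  return $prompts;
}

print_r(list_prompts(simplexml_load_string($srgs)));
```

For this grammar the list contains the three prompts of Listing 1, which is a quick way to confirm that the SRGS file says what you intended.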

The model generation processes for HTK and Julius are not user agents in W3C terms because they do not currently read SRGS format directly. Using SRGS to define the grammar therefore calls for a script to translate from SRGS to the other formats. In addition, neither HTK nor Julius generates a dialog manager, because the grammar alone carries insufficient information. How, for example, can a tool guess that SLEEP means stop responding?


Translating from SRGS to HTK or Julius

Listing 6 is a simple translator in PHP that examines the SRGS version and creates output in the form of an equivalent basic HTK or Julius format grammar.

Listing 6. Translator
<?php
// test translator: SRGS to HTK/Julius
$xml = simplexml_load_file("mysrgs.xml");
$roote = trim($xml['root']);
// find the root rule, get the basic rulerefs
foreach ($xml->rule as $rule) { 
  if ($rule['id'] == $roote) { 
    foreach ($rule->ruleref as $rref) {
      $rulerefs[] = substr($rref['uri'],1);
} } }
// find the rules indicated by the rulerefs
foreach ($rulerefs as $ruleid) {
  foreach ($xml->rule as $rule) {
    if ($rule['id'] == $ruleid) {
      $i = 0;
      foreach ($rule->ruleref as $rref) {
        $myrules[$ruleid][$i] = substr($rref['uri'],1);
        $i++;
} } } }
// load the words array
foreach ($myrules as $mr=>$myr) {
  foreach ($myr as $md=>$myd) {
    foreach ($xml->rule as $rule) {
      if ($rule['id'] == $myd) {
        foreach ($rule->{'one-of'}->item as $item) {
          $words[$mr][$md][] = $item;
} } } } }
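// now the output
// initialize the accumulators before appending to them
$varnamestr = "";
$jvarnamestr = "";
$htkv = "";
$jvarwordstr = "";
foreach ($words as $k=>$v) { // master
  $jvarnamestr .= "\nSENT: ";
  foreach ($v as $kk=>$vv) { // sub
    $htkwds[$k][$kk] = ""; // word lists start empty for each position
    $jwds[$k][$kk] = "";
    $i = 0;
    $j = count($vv); // number of alternatives in this position
    foreach ($vv as $wd) { // word
      $htkwds[$k][$kk] .= " $wd ";
      $i++;
      $jwds[$k][$kk] .= " $wd ";
      // OR separator between words, but not after the last one
      if ($i < $j) $htkwds[$k][$kk] .= "|";
    }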
    // htk
    $varname = "$".$k."_".$kk."";
    $htkv .= "$varname = ".$htkwds[$k][$kk].";\n";
    $varnamestr .= " ".$varname;
    // julius
    $jvarname = strtoupper($k."_".$kk);
    $jvarwordstr .= "$jvarname: ".$jwds[$k][$kk]."\n";
    $jvarnamestr .= " ".$jvarname." ";
  }
  $varnamestr = $varnamestr." |";
}
// output as HTK
echo "\n-------------------------\n";
echo "HTK Version\n-------------------------\n";
$varnamestr = substr($varnamestr,0,-2);
$htk = $htkv."( SENT-START (".$varnamestr." ) SENT-END )";
echo "$htk\n-------------------------\n";
// output as Julius
echo "Julius Version\n-------------------------\n";
$julius = "S : NS_B SENT NS_E";
$julius .= "$jvarnamestr\n";
$julius .= "$jvarwordstr";
echo "$julius-------------------------\n";
// end
echo "Done\n\n";
?>

The overall goal of the program in Listing 6 is to scan the SRGS document and fill a multidimensional array with SimpleXML objects that represent the prompt structure. When the array is complete, it then generates the string variables required from that array and outputs to the HTK and Julius formats. The array is filled using a series of foreach statements that pick out the root rule, master rules, and the rules that the masters refer to. The result is an array where the first key is the name of the master rule and the second key is the position—0 (major) or 1 (minor). The $i and $j variables are counters that control the addition of a vertical bar (|) which is an OR symbol in the HTK format. Finally, the output uses variables created from the words and the IDs of the master rules. Listing 7 is the output of a sample session.

Listing 7. Translator output
> php mytrans.php

-------------------------
HTK Version
-------------------------
$command_0 =  COMPUTER ;
$command_1 =  WAKE | SLEEP | STATUS ;
$zoo_0 =  ANIMALS ;
$zoo_1 =  TIGER | LION | LEOPARD ;
( SENT-START ( $command_0 $command_1 | $zoo_0 $zoo_1 ) SENT-END )
-------------------------
Julius Version
-------------------------
S : NS_B SENT NS_E
SENT:  COMMAND_0  COMMAND_1
SENT:  ZOO_0  ZOO_1
COMMAND_0:  COMPUTER
COMMAND_1:  WAKE  SLEEP  STATUS
ZOO_0:  ANIMALS
ZOO_1:  TIGER  LION  LEOPARD
-------------------------
Done

This output brings you back to the format of Listings 2 and 3. For testing purposes, a second master rule was added to the source to confirm that the translator processes multiple master rules. Note that this is a basic translator that does not deal with repeats (for example, where a set of numbers can be repeated, as in a phone number). You still need to define other files, depending on the vocabulary, lexicon, and phoneme structure that you choose, before you can build the model, but this at least gives you a start. See the VoxForge tutorial in Resources for further guidance.
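
The vertical-bar bookkeeping that Listing 6 does with the $i and $j counters can also be expressed with PHP's implode() function. The following sketch shows that alternative; htk_line() is a hypothetical helper, and it assumes that the words have already been trimmed and cast to strings.

```php
<?php
// Hypothetical helper: build one HTK variable line from a list of words.
// implode() puts the OR symbol between words, never after the last one.
function htk_line($name, array $words) {
  return '$' . $name . ' = ' . implode(' | ', $words) . ' ;';
}

echo htk_line('command_0', array('COMPUTER')), "\n";
echo htk_line('command_1', array('WAKE', 'SLEEP', 'STATUS')), "\n";
```

The spacing differs slightly from Listing 7 because the words carry no surrounding blanks, but the structure of each line is the same.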


Dialog manager generator

The programmer now turns to the dialog manager. It is helpful if you can generate at least part of the dialog from the SRGS source. If you work with a context-free grammar, the structure of n-gram (see Resources) might be what you require. In this current n-gram situation, the grammar is fixed. The grammar contains four words, and using those words gives only three possible answers.

While remaining strictly within the standard, the SRGS definition permits you to add a couple of details that are helpful in generating a dialog manager. First, it allows the addition of a weight attribute to the <item> element as an integer or decimal number. Second, it allows the addition of <tag> elements to rules as children that can contain arbitrary strings. These are most often ECMAScript (JavaScript) expressions. They are commonly used to issue SISR instructions to NLP parsers in browsers, but in this instance they might be useful to you for sending hints to a dialog manager generator.

You already have a little information from the grammar: The bigram format calls for two switch statements, which are the minors nested inside the majors. This much is straightforward. But context and OOV call for a bit more than that. This proposal uses the weight attribute to deal with OOV and the tag element to handle context.

Listing 8. Enhanced SRGS with dialog manager instructions
...
<item weight="1"> COMPUTER <tag>$context = "tech";</tag></item>
...
<item weight=""> WAKE <tag>$wake_state = TRUE;</tag></item>
<item weight=""> SLEEP <tag>$wake_state = FALSE;</tag></item>
<item weight=""> STATUS <tag>
      if ($wake_state) {
        announce_status();
      }</tag></item>
...      
<item weight="0"> ANIMALS <tag></tag></item>
...

Note that the major COMPUTER has a weight of 1, but ANIMALS has a weight of 0. In this context, you want COMPUTER * to be recognized, but ANIMALS * is to be ignored as OOV. Additionally, the tag elements contain snippets of PHP code that the generator can insert.
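
Before building the generator, it can help to confirm what SimpleXML returns for an item that mixes a word, a weight, and a tag. The following sketch does that for a fragment in the style of Listing 8. The describe_items() helper is a hypothetical name, and the fragment is wrapped in a single rule and given no namespace purely to keep the demo short.

```php
<?php
// Fragment in the style of Listing 8, wrapped in a rule for the demo.
// A nowdoc (quoted label) keeps PHP from interpolating $context.
$fragment = <<<'XML'
<rule id="major">
  <one-of>
    <item weight="1"> COMPUTER <tag>$context = "tech";</tag></item>
    <item weight="0"> ANIMALS <tag></tag></item>
  </one-of>
</rule>
XML;

// Hypothetical helper: summarize each item as word, weight, and tag code.
function describe_items($rule) {
  $out = array();
  foreach ($rule->{'one-of'}->item as $item) {
    $out[] = array(
      'word'   => trim((string) $item),                  // text directly inside <item>
      'weight' => (float) (string) $item['weight'],      // attribute value as a number
      'code'   => trim((string) $item->tag),             // contents of the <tag> child
    );
  }
  return $out;
}

foreach (describe_items(simplexml_load_string($fragment)) as $it) {
  $action = ($it['weight'] == 1) ? 'act on' : 'ignore as OOV';
  echo $it['word'], ': ', $action, "\n";
}
```

Casting the item to a string yields only the text that sits directly inside the <item> element, so the word and the tag contents can be read separately.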

The goal of the dialog manager generator shown in Listing 9 is to build a dm() function similar to Listing 4.

Listing 9. Dialog manager generator
<?php
// Dialog manager generator
// Colin Beckingham, 2010
// test dmgen
$xml = simplexml_load_file("mysrgs.xml");
$roote = trim($xml['root']);
// find the root rule, get the basic rulerefs
foreach ($xml->rule as $rule) { 
  if ($rule['id'] == $roote) { 
    foreach ($rule->ruleref as $rref) {
      $rulerefs[] = substr($rref['uri'],1);
} } }
// find the rules indicated by the rulerefs
foreach ($rulerefs as $ruleid) {
  foreach ($xml->rule as $rule) {
    if ($rule['id'] == $ruleid) {
      $i = 0;
      foreach ($rule->ruleref as $rref) {
        $myrules[$ruleid][$i] = substr($rref['uri'],1);
        $i++;
} } } }
// load the words array
foreach ($myrules as $mr=>$myr) {
  foreach ($myr as $md=>$myd) {
    foreach ($xml->rule as $rule) {
      if ($rule['id'] == $myd) {
        foreach ($rule->{'one-of'}->item as $item) {
          $words[$mr][$md][] = $item;
} } } } }
$dmout1 = "<?php
...
function dm(\$prompt_heard) {
global \$wake_state; // FALSE is asleep so do not respond, TRUE is awake
  \$parts = explode(\" \",\$prompt_heard);
  \$major = \$parts[0];
  \$minor = \$parts[1];
  switch (\$major) {";
    foreach ($words as $mw) {
      $mww = $mw[0][0]->attributes();
      if ($mww->weight == 1) {
$maj = "    case '".trim($mw[0][0])."':\n      ";
$maj .= "".$mw[0][0]->tag."";
$min1 = "\n      switch (\$minor) {";
        $ins .= "\n$maj$min1";
        foreach ($mw[1] as $mm) {
$min2 = "\n        case '".trim($mm)."':\n";
$min3 = "        ".$mm->tag."\n";
$min4 = "        break;";
        $ins .= "$min2$min3$min4";
      }
$min5 = "\n        default:\n        break;\n      }\n    break;";
        $ins .= $min5;
      }
    }
$dmout2= "
    default:
      // OOV - any other prompt, just ignore it.
    break;
  }
}
?>
";
echo "$dmout1$ins$dmout2\n";
?>

The first half of the code in Listing 9 is exactly the same as the code for the translator in Listing 6. This duplication is intentional because eventually the generation of the grammar and the dialog manager can be accomplished in the same program. Having established the $words array with SimpleXML objects, you can now scan through those objects and pick out values for not only items, but also weights and tags. After output of some of the introductory static code, the dialog manager generator iterates through the SimpleXML objects, rendering the output in PHP code format and nesting the switch statements as required. The generator closes each inner switch with a default branch and ends each major case with a break, so that one major cannot fall through into the next.

Listing 10 shows some example output from a test source SRGS file that contains three master rules, one of which should be ignored as test data only.

Listing 10. Generator output
> php dmgen2.php

<?php
...
function dm($prompt_heard) {
global $wake_state; // FALSE is asleep so do not respond, TRUE is awake
  $parts = explode(" ",$prompt_heard);
  $major = $parts[0];
  $minor = $parts[1];
  switch ($major) {
    case 'COMPUTER':
      $context = "tech";
      switch ($minor) {
        case 'WAKE':
          $wake_state = TRUE;
        break;
        case 'SLEEP':
          $wake_state = FALSE;
        break;
        case 'STATUS':
          if ($wake_state) { announce_status(); }
        break;
        default:
        break;
      }
    break;
    case 'APPLES':
      $context = "fruit";
      switch ($minor) {
        case 'PIPPIN':
          pippin apple
        break;
        case 'DELICIOUS':
          delicious apple
        break;
        case 'SPARTAN':
          spartan apple
        break;
        default:
        break;
      }
    break;
    default:
      // OOV - any other prompt, just ignore it.
    break;
  }
}
?>

The files that generated this output are available in Download.


Conclusion

With SRGS, you can state the requirements for a fixed grammar, in addition to its usual role in NLP, and so provide a central location for the generation of both grammar and dialog manager files. Using the weight attribute to define whether a master rule is to be detected, and the <tag> element to tell the dialog manager what action to take when a specific prompt is detected, makes the autogeneration of grammars and dialog managers more rigorous and effective.

To interact with a computer solely by voice is much more difficult than using various hardware input devices and monitoring the computer state with visual feedback. As voice programmers, our primary objective should be to make it simpler for users to interact by voice, particularly for those who have no choice but to use voice and ears. While the grammar and dialog manager generators presented here are far from simple themselves, their product can make it easier to build simple, stackable tools.


Download

Description | Name | Size
Sample files using PHP to exercise SRGS and SISR | srgsphp.zip | 3KB

Resources

Learn

Get products and technologies

  • Julius: Explore a high-performance, two-pass large vocabulary continuous speech recognition (LVCSR) decoder software for speech-related researchers and developers that is based on word N-gram and context-dependent HMM.
  • HTK: Delve into this portable toolkit for building and manipulating hidden Markov models. Primarily used for speech recognition research, HTK is also used for research in speech synthesis, character recognition, and DNA sequencing.
  • IBM product evaluation versions: Download or explore the online trials in the IBM SOA Sandbox and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
