As computers diminish in size and increase in portability, the need to interact with them without using a keyboard or mouse increases. Voice is an alternative. Superficially, much less bandwidth is available with voice communications than with visual interaction. As a result of the impression that a picture equals a thousand words, computers display to screens in response to mechanical peripheral input much more readily than they accept audio input and respond in a like manner.
Successful techniques already exist, however limited, to issue instructions to smaller computers with voice. The goal is to have a computer react to speech and take a specific action based on the command. The general process to achieve this goal is to build (or adapt) a model, apply a spoken command against that model in a recognition process, and then decide on an action in a dialog manager. Models can be broad in the sense of recognizing a variety of voices but few commands, or you can train your own model from a specific grammar that gives the possibility of quite complex interpretation and interaction. Figure 1 shows the speech recognition and interpretation development flow.
Figure 1. Speech recognition and interpretation development flow
It is important to draw a distinction between natural language processing (NLP) and specific n-gram grammars. The latter states categorically what the recognizer can expect to hear, while the former does its best to decode natural language into a simpler structure by discarding some elements of the received speech and rearranging others. While technologies such as SRGS and SISR lean mostly toward the processing of natural language, programmers look for ways to use those same tools for other types of grammars.
This article uses an SRGS approach to define fixed grammars and addresses out-of-vocabulary (OOV) prompts and context in the dialog manager, using a set of 2-gram (bigram) prompts as the running example.
From a programming perspective, a few issues are important. First, you can express a grammar in a number of different ways, depending on which speech recognizer application you use. Second, those grammars are poorly coordinated with the dialog managers that use them. Dialog managers are important in their own right because they offer ways to deal with issues such as intelligent response and OOV recognition. Third, because you are talking about specific grammars designed to do a specific job, you have to rewrite or adapt the model and the dialog manager for each one. The more you can autogenerate them, the better.
Intelligent response implies that the computer takes context into account. If your question asks for a temperature and your previous question was about your computer, then it must be the computer's temperature that you need.
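As a minimal sketch of this idea, a dialog manager might keep a context value from the previous prompt and use it to interpret the next one. The resolve_subject() helper and the "weather" context below are hypothetical and exist only for illustration; the "tech" context matches the value used later in this article.

```php
<?php
// Hypothetical helper: decide what a bare TEMPERATURE request refers to,
// based on the context left behind by the previous prompt.
function resolve_subject($request, $context) {
    if ($request !== 'TEMPERATURE') {
        return null; // this sketch only knows about temperature requests
    }
    switch ($context) {
        case 'tech':
            return 'processor temperature';
        case 'weather': // assumed second context, not part of the article's grammar
            return 'outdoor temperature';
        default:
            return null; // no usable context, so the request stays ambiguous
    }
}

$context = 'tech'; // assumed to have been set by an earlier prompt about the computer
echo resolve_subject('TEMPERATURE', $context), "\n";
```

With the context set to "tech", the sketch resolves the request to the processor temperature; with no context it returns null, leaving the caller to ask for clarification or ignore the prompt.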
OOV handling is a common problem in small grammars. Put simply, some prompts matter to the dialog manager, while others are needed only to build the model and should be ignored in later processing.
Autogeneration is straightforward with scripting languages such as Bash, Perl, and PHP, or with compiled languages such as C and C++, provided that there are clear rules. SRGS is designed to encapsulate those rules.
Listing 1 is a plain-text grammar in prototype.
Listing 1. Prototype of grammar
COMPUTER WAKE
COMPUTER STATUS
COMPUTER SLEEP
The grammar in Listing 1 is quite simple and specific. It tells the recognizer to expect only three possible prompts. Each prompt starts with the word COMPUTER and is followed by WAKE, STATUS, or SLEEP. No other commands are possible. The speech recognizer has only one job, which is to choose whichever of the three options it considers closest to what it heard and pass that command to the next stage. For instance, if you say MAKE COFFEE, it still returns COMPUTER plus one of the three alternative words. The dialog manager should apply some intelligence. For example, if it hears COMPUTER SLEEP, it should not respond to any more commands until it hears COMPUTER WAKE. It should respond to COMPUTER STATUS only if it is in a WAKE state, at which point it can announce the processor temperature, free space on disk, and a whole host of other interesting things. This is not a practical grammar by any means: when you build an acoustic model from grammars as small as this, you soon run into problems with insufficient samples. The prototype is intended only as an illustration of the principle.
Training a computer to recognize spoken sounds and apply grammar rules to what it hears is a fairly straightforward process, even in the world of open source. For complete guidance about how to achieve an effective speech recognition system using a fixed vocabulary, see the VoxForge site. The VoxForge tutorials use tools such as HTK from Cambridge University and the Julius voice recognition engine, which is maintained at the Nagoya Institute of Technology in Japan. See Resources for links to all of these sites.
Building an audio model with HTK requires that you express the grammar in a particular format, as in Listing 2.
Listing 2. Grammar in HTK format
$major = COMPUTER ;
$minor = WAKE | STATUS | SLEEP ;
( SENT-START ( $major $minor ) SENT-END )
The same process with the Julius engine requires a slightly different format, as in Listing 3.
Listing 3. Grammar in Julius format
S : NS_B SENT NS_E
SENT: MAJOR MINOR
MAJOR: COMPUTER
MINOR: WAKE STATUS SLEEP
The HTK and Julius formats share structural similarities from a programming viewpoint, but they are sufficiently different that they are not interchangeable.
Listing 4 shows a basic dialog manager in PHP that can deal with this grammar.
Listing 4. A plain dialog manager
<?php
...
function dm($prompt_heard) {
global $wake_state; // FALSE is asleep so do not respond, TRUE is awake
$parts = explode(" ",$prompt_heard);
$minor = $parts[1];
switch ($minor) {
case 'WAKE':
$wake_state = TRUE ;
break;
case 'SLEEP':
$wake_state = FALSE ;
break;
case 'STATUS':
if ($wake_state) {
announce_status();
} else {
// do nothing
}
break;
default:
// OOV - any other prompt, just ignore it.
break;
}
}
?>
This PHP function receives the result from the recognizer as the argument $prompt_heard. The $wake_state variable is declared global so that it is known throughout the dialog manager. The explode() call splits the prompt into its parts. In this example, you know that the first part will be COMPUTER, so the switch that follows looks only at the minor part. If it is WAKE, the function sets the wake state to TRUE. If it is SLEEP, it sets the wake state to FALSE. If the minor part is STATUS, the function announces the result, but only if the wake state is TRUE. You can announce the status to speakers using a process similar to the one explored in the developerWorks article "PHP bees and audio honey: Accessible agent-based audio alerts and feedback" (see Resources). The default section catches any additional prompts that you add to exercise the recognizer. For unrecognized prompts, do nothing, except perhaps write to a log for later troubleshooting.
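The logging idea can be sketched as follows. This is a hypothetical variant of the dm() function, not the article's own code: the name dm_logged(), the returned action strings, and the log file location are all assumptions made for illustration. It passes the wake state by reference instead of using a global so that it is easy to exercise on its own.

```php
<?php
// Hypothetical variant of dm() that reports the action taken and logs any
// prompt it does not recognize. The log path is an assumption.
function dm_logged($prompt_heard, &$wake_state, $logfile) {
    $parts = explode(" ", trim($prompt_heard));
    $minor = isset($parts[1]) ? $parts[1] : ''; // guard against one-word prompts
    switch ($minor) {
        case 'WAKE':
            $wake_state = true;
            return 'wake';
        case 'SLEEP':
            $wake_state = false;
            return 'sleep';
        case 'STATUS':
            // respond only when awake
            return $wake_state ? 'status' : 'ignored';
        default:
            // OOV: take no action, but keep a record for troubleshooting
            file_put_contents($logfile, date('c') . " OOV: $prompt_heard\n", FILE_APPEND);
            return 'oov';
    }
}

$wake_state = false;
$logfile = sys_get_temp_dir() . '/dm_oov.log';
$prompts = array('COMPUTER STATUS', 'COMPUTER WAKE', 'COMPUTER STATUS', 'COMPUTER DANCE');
foreach ($prompts as $p) {
    echo $p, ' => ', dm_logged($p, $wake_state, $logfile), "\n";
}
```

The first STATUS prompt is ignored because the manager starts asleep, the second is answered after WAKE, and the final prompt falls through to the default branch and is written to the log.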
From a programming efficiency point of view, if you can even partly autogenerate the grammar and the dialog manager, you'll save time and effort plus improve accuracy. The efforts to create even higher level grammars to do this (see Resources) are largely in the realm of NLP.
SRGS is intended to express the same ideas as Listing 2 and Listing 3 but in the more rigorous structure provided by XML, as in Listing 5.
Listing 5. Prototype in SRGS format
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE grammar PUBLIC "-//W3C//DTD GRAMMAR 1.0//EN"
"http://www.w3.org/TR/speech-grammar/grammar.dtd">
<grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/06/grammar
http://www.w3.org/TR/speech-grammar/grammar.xsd"
version="1.0"
mode="voice"
root="myroot">
<meta name="author" content="Colin Beckingham"/>
<rule id="myroot" scope="public">
<example> COMPUTER WAKE </example>
<example> COMPUTER SLEEP </example>
<example> COMPUTER STATUS </example>
<ruleref uri="#command1"/>
<!-- ruleref uri="#command2"/ -->
</rule>
<rule id="command1">
<ruleref uri="#major"/> <ruleref uri="#minor"/>
</rule>
<rule id="major">
<one-of>
<item> COMPUTER </item>
</one-of>
</rule>
<rule id="minor">
<one-of>
<item> WAKE </item>
<item> SLEEP </item>
<item> STATUS </item>
</one-of>
</rule>
</grammar>
Listing 5 shows the customary XML declaration, followed by a DOCTYPE statement that locates the DTD, in this case the one for grammars. The root element <grammar> follows. It carries a number of important attributes, including namespaces, the mode (which in this case is voice because the destination for the data is a speech recognizer), and the ID of the root rule, which is the place to start when looking for prompts to match (in this case, myroot). The myroot rule contains rulerefs that point to other rules. The DTD also permits <example> elements for information. The rule with ID command1 follows, and then two more rules, major and minor. The major rule contains the single word COMPUTER. The minor rule contains the alternatives WAKE, SLEEP, and STATUS. These are <item> elements in a <one-of> structure, indicating that one and only one can apply at a time. Through the <ruleref> elements, starting at the root rule, the major and minor rules become part of the overall rule structure.
In summary, the rules follow this structure:
- The grammar attribute root points at the root rule myroot.
- The myroot rule contains one or more rulerefs, each of which has a URI that points to another rule, in this case command1, which is the master rule for an n-gram. The ruleref for command2 is commented out because it is a placeholder for an as yet nonexistent master rule.
- Master rule command1 contains rulerefs that show that other rules are used, major followed by minor.
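The chain from root attribute to master rule to major and minor rules can be traced with SimpleXML. The following is a minimal sketch that uses a cut-down grammar with the same rule layout as Listing 5; the rule_targets() helper is a hypothetical name introduced here for illustration.

```php
<?php
// A cut-down grammar with the same rule layout as Listing 5.
$srgs = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en"
         version="1.0" mode="voice" root="myroot">
  <rule id="myroot" scope="public">
    <ruleref uri="#command1"/>
  </rule>
  <rule id="command1">
    <ruleref uri="#major"/> <ruleref uri="#minor"/>
  </rule>
  <rule id="major"><one-of><item> COMPUTER </item></one-of></rule>
  <rule id="minor"><one-of><item> WAKE </item><item> SLEEP </item></one-of></rule>
</grammar>
XML;

// Hypothetical helper: list the rule IDs that the rule with the given ID refers to.
function rule_targets($xml, $id) {
    $targets = array();
    foreach ($xml->rule as $rule) {
        if ((string) $rule['id'] === $id) {
            foreach ($rule->ruleref as $rref) {
                $targets[] = substr((string) $rref['uri'], 1); // drop the leading '#'
            }
        }
    }
    return $targets;
}

$xml = simplexml_load_string($srgs);
$root = trim((string) $xml['root']);
foreach (rule_targets($xml, $root) as $master) {
    echo "$root -> $master -> ", implode(', ', rule_targets($xml, $master)), "\n";
}
```

For this grammar the loop prints a single line showing myroot leading to command1, which in turn refers to major and minor.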
The model generation processes for HTK and Julius are not user agents in W3C terms because they do not currently read SRGS format directly. Using SRGS to define the grammar therefore calls for a script to translate from SRGS to the other formats. In addition, neither HTK nor Julius generates a dialog manager, because the grammar alone carries insufficient information. How, for example, can a tool guess that SLEEP means stop responding?
Translating from SRGS to HTK or Julius
Listing 6 is a simple translator in PHP that examines the SRGS version and creates output in the form of an equivalent basic HTK or Julius format grammar.
Listing 6. Translator
<?php
// test translator: SRGS to HTK/Julius
$xml = simplexml_load_file("mysrgs.xml");
$roote = trim($xml['root']);
// find the root rule, get the basic rulerefs
foreach ($xml->rule as $rule) {
if ($rule['id'] == $roote) {
foreach ($rule->ruleref as $rref) {
$rulerefs[] = substr($rref['uri'],1);
} } }
// find the rules indicated by the rulerefs
foreach ($rulerefs as $ruleid) {
foreach ($xml->rule as $rule) {
if ($rule['id'] == $ruleid) {
$i = 0;
foreach ($rule->ruleref as $rref) {
$myrules[$ruleid][$i] = substr($rref['uri'],1);
$i++;
} } } }
// load the words array
foreach ($myrules as $mr=>$myr) {
foreach ($myr as $md=>$myd) {
foreach ($xml->rule as $rule) {
if ($rule['id'] == $myd) {
foreach ($rule->{'one-of'}->item as $item) {
$words[$mr][$md][] = $item;
} } } } }
// now the output
// initialize the output strings before appending to them
$varnamestr = "";
$jvarnamestr = "";
$htkv = "";
$jvarwordstr = "";
foreach ($words as $k=>$v) { // master
$jvarnamestr .= "\nSENT: ";
foreach ($v as $kk=>$vv) { // sub
$i = 0;
foreach ($vv as $wd) { // word
$j = count($vv);
$htkwds[$k][$kk] .= " $wd ";
$i++;
$jwds[$k][$kk] .= " $wd ";
if ($i < $j) $htkwds[$k][$kk] .= "|";
}
// htk
$varname = "$".$k."_".$kk."";
$htkv .= "$varname = ".$htkwds[$k][$kk].";\n";
$varnamestr .= " ".$varname;
// julius
$jvarname = strtoupper($k."_".$kk);
$jvarwordstr .= "$jvarname: ".$jwds[$k][$kk]."\n";
$jvarnamestr .= " ".$jvarname." ";
}
$varnamestr = $varnamestr." |";
}
// output as HTK
echo "\n-------------------------\n";
echo "HTK Version\n-------------------------\n";
$varnamestr = substr($varnamestr,0,-2);
$htk = $htkv."( SENT-START (".$varnamestr." ) SENT-END )";
echo "$htk\n-------------------------\n";
// output as Julius
echo "Julius Version\n-------------------------\n";
$julius = "S : NS_B SENT NS_E";
$julius .= "$jvarnamestr\n";
$julius .= "$jvarwordstr";
echo "$julius-------------------------\n";
// end
echo "Done\n\n";
?>
The overall goal of the program in Listing 6 is to scan the SRGS document and fill a multidimensional array with SimpleXML objects that represent the prompt structure. When the array is complete, it then generates the string variables required from that array and outputs to the HTK and Julius formats. The array is filled using a series of foreach statements that pick out the root rule, master rules, and the rules that the masters refer to. The result is an array where the first key is the name of the master rule and the second key is the position: 0 (major) or 1 (minor). The $i and $j variables are counters that control the addition of a vertical bar (|), which is an OR symbol in the HTK format. Finally, the output uses variables created from the words and the IDs of the master rules. Listing 7 is the output of a sample session.
Listing 7. Translator output
> php mytrans.php
-------------------------
HTK Version
-------------------------
$command_0 = COMPUTER ;
$command_1 = WAKE | SLEEP | STATUS ;
$zoo_0 = ANIMALS ;
$zoo_1 = TIGER | LION | LEOPARD ;
( SENT-START ( $command_0 $command_1 | $zoo_0 $zoo_1 ) SENT-END )
-------------------------
Julius Version
-------------------------
S : NS_B SENT NS_E
SENT: COMMAND_0 COMMAND_1
SENT: ZOO_0 ZOO_1
COMMAND_0: COMPUTER
COMMAND_1: WAKE SLEEP STATUS
ZOO_0: ANIMALS
ZOO_1: TIGER LION LEOPARD
-------------------------
Done
This code brings you back to the format of Listings 2 and 3. For testing purposes, a second master rule is added to ensure that it processes multiple master rules. Note that this is a basic translator that does not deal with repeats (for example, where a set of numbers can be repeated, as in a phone number). You still need to define other files depending on vocabulary, lexicon, and phoneme structure chosen before you can build the model, but this at least gives you a start. See the VoxForge tutorial in Resources for further guidance.
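As an aside on the counter logic in Listing 6, the vertical bars between HTK alternatives could also be produced with PHP's implode() function, which avoids tracking positions by hand. The sketch below is an alternative formulation, not the translator's own method; the htk_variable() helper and the plain-string word list are assumptions for illustration.

```php
<?php
// Words for one master rule, keyed by position: 0 is the major, 1 is the minor.
$words = array(
    0 => array('COMPUTER'),
    1 => array('WAKE', 'SLEEP', 'STATUS'),
);

// Hypothetical helper: build one HTK variable definition, joining the
// alternatives with the | (OR) symbol.
function htk_variable($master, $position, array $alternatives) {
    return '$' . $master . '_' . $position . ' = ' . implode(' | ', $alternatives) . ' ;';
}

foreach ($words as $position => $alternatives) {
    echo htk_variable('command', $position, $alternatives), "\n";
}
```

Because implode() places the separator only between elements, a rule with a single word such as COMPUTER gets no trailing bar, which is the same result the $i and $j counters achieve.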
The programmer now turns to the dialog manager. It is helpful if you can generate at least part of the dialog from the SRGS source. If you work with a context-free grammar, the N-gram structure (see Resources) might be what you require. In the current n-gram situation, the grammar is fixed: it contains four words, and those words combine to give only three possible answers.
While remaining strictly within the standard, the SRGS definition permits you to add a couple of details that are helpful in generating a dialog manager. First, it allows the addition of a weight attribute to the <item> element as an integer or decimal number. Second, it allows the addition of <tag> elements to rules as children that can contain arbitrary strings. These are most often ECMAScript (JavaScript) expressions. They are commonly used to issue SISR instructions to NLP parsers in browsers, but in this instance they might be useful to you for sending hints to a dialog manager generator.
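A minimal sketch of how a generator might read those two details with SimpleXML follows. It parses a single standalone <item> for brevity and assumes, as this article goes on to do, that the tag holds a PHP snippet; the variable names are illustrative only.

```php
<?php
// A single <item> carrying a weight attribute and a <tag> child. SRGS treats
// the tag body as an arbitrary string.
$item = simplexml_load_string(
    '<item weight="1"> COMPUTER <tag>$context = "tech";</tag></item>'
);

$weight = (float) (string) $item['weight']; // attribute value as a number
$tag    = (string) $item->tag;              // text inside the <tag> child
$word   = trim((string) $item);             // the item's own text, excluding the tag's text

echo "$word | $weight | $tag\n";
```

Casting the item itself to a string yields only its direct text, so the word can be separated from the tag content without further parsing.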
You already have a little information from the grammar: The bigram format calls for two switch statements, which are the minors nested inside the majors. This much is straightforward. But context and OOV call for a bit more than that. This proposal uses the weight attribute to deal with OOV and the tag element to handle context.
Listing 8. Enhanced SRGS with dialog manager instructions
...
<item weight="1"> COMPUTER <tag>$context = "tech";</tag></item>
...
<item weight=""> WAKE <tag>$wake_state = TRUE;</tag></item>
<item weight=""> SLEEP <tag>$wake_state = FALSE;</tag></item>
<item weight=""> STATUS <tag>
if ($wake_state) {
announce_status();
}</tag></item>
...
<item weight="0"> ANIMALS <tag></tag></item>
...
Note that the major COMPUTER has a weight of 1, but ANIMALS has a weight of 0. In this context, you want COMPUTER * to be recognized, but ANIMALS * is to be ignored as OOV. Additionally, the tag elements contain snippets of PHP code that the generator can insert.
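The weight convention can be sketched as a simple test that a generator applies to each major word before emitting a case for it. The is_recognized() helper below is hypothetical, and treating an empty or missing weight as "not recognized" is an assumption of this sketch, not something the grammar format dictates.

```php
<?php
// Major words with their weights, as they might be read from the enhanced grammar.
$majors = array(
    'COMPUTER' => '1',
    'ANIMALS'  => '0',
);

// Hypothetical helper: under this article's convention, a weight of 1 means
// "act on this major"; anything else is treated as OOV.
function is_recognized($weight) {
    return is_numeric($weight) && (float) $weight == 1.0;
}

foreach ($majors as $word => $weight) {
    echo $word, ': ', is_recognized($weight) ? 'generate a case' : 'ignore as OOV', "\n";
}
```

Majors that fail the test get no case of their own, so at run time they fall through to the default branch of the generated switch.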
The goal of the dialog manager generator shown in Listing 9 is to build a dm() function similar to Listing 4.
Listing 9. Dialog manager generator
<?php
// Dialog manager generator
// Colin Beckingham, 2010
// test dmgen
$xml = simplexml_load_file("mysrgs.xml");
$roote = trim($xml['root']);
// find the root rule, get the basic rulerefs
foreach ($xml->rule as $rule) {
if ($rule['id'] == $roote) {
foreach ($rule->ruleref as $rref) {
$rulerefs[] = substr($rref['uri'],1);
} } }
// find the rules indicated by the rulerefs
foreach ($rulerefs as $ruleid) {
foreach ($xml->rule as $rule) {
if ($rule['id'] == $ruleid) {
$i = 0;
foreach ($rule->ruleref as $rref) {
$myrules[$ruleid][$i] = substr($rref['uri'],1);
$i++;
} } } }
// load the words array
foreach ($myrules as $mr=>$myr) {
foreach ($myr as $md=>$myd) {
foreach ($xml->rule as $rule) {
if ($rule['id'] == $myd) {
foreach ($rule->{'one-of'}->item as $item) {
$words[$mr][$md][] = $item;
} } } } }
$dmout1 = "<?php
...
function dm(\$prompt_heard) {
global \$wake_state; // FALSE is asleep so do not respond, TRUE is awake
\$parts = explode(\" \",\$prompt_heard);
\$major = \$parts[0];
\$minor = \$parts[1];
switch (\$major) {";
foreach ($words as $mw) {
$mww = $mw[0][0]->attributes();
if ($mww->weight == 1) {
$maj = " case '".trim($mw[0][0])."':\n ";
$maj .= "".$mw[0][0]->tag."";
$min1 = "\n switch (\$minor) {";
$ins .= "\n$maj$min1";
foreach ($mw[1] as $mm) {
$min2 = "\n case '".trim($mm)."':\n";
$min3 = " ".$mm->tag."\n";
$min4 = " break;";
$ins .= "$min2$min3$min4";
}
// close the inner switch with a valid default branch, then break out of the
// major case so that it does not fall through into the next one
$min5 = "\n default:\n break;\n }\n break;";
$ins .= $min5;
}
}
$dmout2= "
default:
// OOV - any other prompt, just ignore it.
break;
}
}
?>
";
echo "$dmout1$ins$dmout2\n";
?>
The first half of the code in Listing 9 is exactly the same as the code for the translator in Listing 6. This duplication is intentional because eventually the generation of the grammar and the dialog manager can be accomplished in the same program. Having established the $words array with SimpleXML objects, you can now scan through those objects and pick out values for not only items, but also weights and tags. After output of some of the introductory static code, the dialog manager generator iterates through the SimpleXML objects, rendering the output in PHP code format and nesting the switch statements as required. Each inner switch ends with a default branch, and each major case ends with a break so that one major does not fall through into the next.
Listing 10 shows some example output from a test source SRGS file that contains three master rules, one of which should be ignored as test data only.
Listing 10. Generator output
> php dmgen2.php
<?php
...
function dm($prompt_heard) {
global $wake_state; // FALSE is asleep so do not respond, TRUE is awake
$parts = explode(" ",$prompt_heard);
$major = $parts[0];
$minor = $parts[1];
switch ($major) {
case 'COMPUTER':
$context = "tech";
switch ($minor) {
case 'WAKE':
$wake_state = TRUE;
break;
case 'SLEEP':
$wake_state = FALSE;
break;
case 'STATUS':
if ($wake_state) { announce_status(); }
break;
default:
break;
}
break;
case 'APPLES':
$context = "fruit";
switch ($minor) {
case 'PIPPIN':
pippin apple
break;
case 'DELICIOUS':
delicious apple
break;
case 'SPARTAN':
spartan apple
break;
default:
break;
}
break;
default:
// OOV - any other prompt, just ignore it.
break;
}
}
?>
The files that generated this output are available in Download.
With SRGS, you can state requirements for a fixed grammar in addition to its usual role in NLP, providing a central location for the generation of both grammar and dialog manager files. By using the weight attribute to define whether a master rule is to be detected and the <tag> element to instruct the dialog manager as to what action to take when a specific prompt is detected, autogeneration of grammars and dialog managers becomes more rigorous and effective.
Interacting with a computer solely by voice is much more difficult than using various hardware input devices and monitoring the computer state with visual feedback. As voice programmers, a primary objective should be to make it simpler for users to interact by voice, particularly for those who have no choice but to use voice and ears. While the grammar and dialog manager generators presented here are far from simple themselves, their product can make it easier to build simple, stackable tools.
| Description | Name | Size | Download method |
|---|---|---|---|
| Sample files using PHP to exercise SRGS and SISR | srgsphp.zip | 3KB | HTTP |
Learn
- VoxForge tutorials: Build your own voice-interactive model.
- "Speech Recognition Grammar Compilation in Grammatical Framework" (ACL Anthology Network, 2007): Learn to generate grammar-based language models for speech recognition systems from Grammatical Framework grammars.
- Speech Recognition Grammar Specification Version 1.0: Peruse the W3C syntax to represent grammars for speech recognition and to specify the words and word patterns listened for by a speech recognizer.
- Semantic Interpretation for Speech Recognition Version 1.0: Review the W3C syntax and semantics of semantic interpretation tags for speech recognition grammars.
- Stochastic Language Models (N-Gram) Specification: Examine the W3C syntax for representing N-Gram (Markovian) stochastic grammars. Stochastic grammars support large vocabulary and open vocabulary applications, and represent concepts or semantics.
- PHP bees and audio honey: Accessible agent-based audio alerts and feedback (Colin Beckingham, developerWorks, Oct 2009): Feed audio information to computer speakers in this PHP approach.
- Querying a database using open source voice control software (Colin Beckingham, Linux.com, May 2008): Check out a successful attempt to query a database by voice and have the computer respond verbally.
Get products and technologies
- Julius: Explore a high-performance, two-pass large vocabulary continuous speech recognition (LVCSR) decoder software for speech-related researchers and developers that is based on word N-gram and context-dependent HMM.
- HTK: Delve into this portable toolkit for building and manipulating hidden Markov models. Primarily used for speech recognition research, HTK is also used for research in speech synthesis, character recognition, and DNA sequencing.
Colin Beckingham is a freelance researcher, writer, and programmer who lives in eastern Ontario, Canada. Holding degrees from Queen's University, Kingston, and the University of Windsor, he has worked in a rich variety of fields including banking, horticulture, horse racing, teaching, civil service, retail, and travel and tourism. The author of database applications and numerous newspaper, magazine, and online articles, his research interests include open source programming, VoIP, and voice-control applications on Linux. You can reach Colin at colbec@start.ca.




