The Carnegie Mellon University Sphinx project creates open source speech-recognition tools for developers and users. This article uses the Sphinx-4 code base to provide automatic recognition of a very small dictionary of common letters and numbers. Converting this spoken information to text and processing the strings for certain data structures, such as phone numbers and acronyms, allows for the creation of an automated descriptive annotation of verbal conversations.
One of the more useful areas in which to implement this project is in a teleconference annotation application. Next time you join the developmental meeting, fire up your conversation annotator, and you can have automatic lookups of individuals based on their phone numbers when spoken in the meeting, or see what the acronym of the day is according to a Web search engine. You won't have to stop what you are doing to enter in the latest acronym or employee serial number mentioned in the meeting to find out the associated data. Sphinx-4 and the conversation annotator we build here can take care of a large portion of the drudgery for you.
Sphinx is very resource-intensive, and, as a result, you will need fast hardware to make the software useful. A large heap of dedicated memory is required for useful performance, so plan on running the Sphinx application on an Intel® Pentium® 4-class machine with at least 1 GB of RAM. By contrast, the text-processing hardware requirements are negligible and can be run on the same machine without affecting the performance of the speech-recognition processing.
You can run the applications we create in this article on hardware running Linux® or Microsoft® Windows®. Sphinx-4 depends on a recent JDK and Apache Ant to create a custom grammar processor. We need Perl and the associated lookup modules of your choice. See the Resources section for links to learn more about and download the software packages mentioned.
Sphinx comes in many forms for various types and capabilities of speech recognition. This article makes use of the Sphinx-4 package, which is the most user- and developer-friendly of the recent releases. Installing Sphinx-4 can be intimidating, so consider the following steps highlighted from the installation instructions:
- Download and extract Apache Ant.
- Download and extract the Sun JDK (as of this writing, V1.6.0_02 appears to be the current release).
- Download and extract the Sphinx-4 source package because we'll be modifying one of the demo programs to suit our purposes.
- Set up your environment variables with the following commands:
export ANT_HOME=${PWD}/apache-ant-1.7.0
export JAVA_HOME=${PWD}/jdk1.6.0_02
export PATH=${PATH}:${ANT_HOME}/bin
On Windows, you may need to set up your environment variables under Control Panel > System > Advanced > Environment Variables.
- Change to the sphinx4-beta directory, then to the lib subdirectory.
- Activate the JSAPI binary license by running the jsapi.sh shell script. Sphinx-4 provides support for JSAPI with a binary license, so you'll need to accept the agreement.
- You may be asked to install uudecode to unpack the components that JSAPI requires. Most Linux distributions have a package that includes uudecode in some form, so consider checking your available packages first if a uudecode installation is required. On Windows, double-click the jsapi.exe file and accept the license agreement.
- Back out and change to the main Sphinx4 directory.
- Run the command ant, and the build process should begin.
The status message of "BUILD SUCCESSFUL" means you've got your environment set up correctly and you're ready to move on to modification steps. If you receive a different message, check your build directory and environment variables or consult the Apache Ant and Sphinx-4 documentation for detailed installation instructions for your environment.
Strategy for extracting letters and numbers from speaker-independent voices
Speech recognition is a technology that always seems 2 to 10 years away from speaker-independent recognition of a large vocabulary. Annotating a meeting with multiple voices, including overlapping speech, globally influenced accents, and a broad range of technical and colloquial vocabularies, is nearly impossible for any consumer-level software available on the market. Sphinx, and specifically the Sphinx-4 package, delivers all the options we need to reliably recognize a very small (yet useful) vocabulary in a speaker-independent context.
We've already specified our limited vocabulary: the letters A-Z and numbers 0-9. Our strategy is to simply extract any location where these letters or numbers are uttered. A common description for this approach is word spotting. Although Sphinx-4 does not currently support word spotting, we can still achieve useful results by forcing all utterances to match at least one of the words in the grammar. Once we have this list of best-guess letters and numbers, we can apply standard text-processing tools and informational lookups to extract useful information.
Custom dictionary, modification of Hello World example
The first step in creating the pseudo word-spotting setup is to build the desired
dictionary file. In the Sphinx-4 directory tree, there is a directory called
bld/models/edu/cmu/sphinx/model/acoustic/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/.
This directory contains the alpha.dict and digit.dict dictionary files. At first
glance, it appears that combining these two dictionary files will produce the desired
file. This is not the case, however, as we'll need to build our dictionary files from
the cmudict.0.6d file in the same directory.
Change to the bld/models/edu/cmu/sphinx/model/acoustic/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/
directory and issue these commands to build the desired dictionary file:
perl -ne 'print if( /^[A-Z]\ / )' cmudict* > alN.dict
perl -ne 'print if(/^(ZERO|ONE|TWO|THREE|FOUR)[ (]/)' cmudict* >> alN.dict
perl -ne 'print if(/^(FIVE|SIX|SEVEN|EIGHT|NINE)[ (]/)' cmudict* >> alN.dict
Listing 1 shows the alN.dict file as created with simple letters and numbers as the sole part of the dictionary.
Listing 1. Snippet from alN.dict dictionary file
...
W D AH B AH L Y UW
X EH K S
Y W AY
Z Z IY
FOUR F AO R
ONE HH W AH N
ONE(2) W AH N
THREE TH R IY
...
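Before wiring the new dictionary into the demo, it can be worth confirming that alN.dict really contains an entry for each of the 36 words the grammar will use. The following is a minimal sketch of such a check in Java, written for a modern JDK (8 or later) rather than the 1.6 release used in this article. The class DictCoverage and its methods are hypothetical helpers, not part of Sphinx-4, and the comparison ignores case on the assumption that the upper-case dictionary entries are meant to match the lower-case grammar words.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not part of Sphinx-4): reports grammar words that
// have no pronunciation entry in a cmudict-style dictionary file.
class DictCoverage {
    static final List<String> DIGIT_WORDS = Arrays.asList(
        "zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine");

    // The 36 words the grammar references: the digit words plus a-z.
    static List<String> expectedWords() {
        List<String> words = new ArrayList<>(DIGIT_WORDS);
        for (char c = 'a'; c <= 'z'; c++) {
            words.add(String.valueOf(c));
        }
        return words;
    }

    // Head word of a dictionary line, lower-cased, with any "(2)"-style
    // alternate-pronunciation suffix removed; null for blank lines.
    static String headWord(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return null;
        }
        String head = trimmed.split("\\s+")[0];
        return head.replaceAll("\\(\\d+\\)$", "").toLowerCase();
    }

    // Expected words that do not appear as a head word in the given lines.
    static List<String> missingWords(List<String> dictLines) {
        Set<String> present = new HashSet<>();
        for (String line : dictLines) {
            String head = headWord(line);
            if (head != null) {
                present.add(head);
            }
        }
        List<String> missing = new ArrayList<>();
        for (String word : expectedWords()) {
            if (!present.contains(word)) {
                missing.add(word);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // The default file name is an assumption; pass another path as the first argument.
        String path = args.length > 0 ? args[0] : "alN.dict";
        List<String> lines = Files.readAllLines(Paths.get(path));
        System.out.println("Missing entries: " + missingWords(lines));
    }
}
```

Run from the dict directory, an empty list means every letter and digit word has at least one pronunciation. Any word that is reported is one the grammar would reference without a matching dictionary entry.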
Modification of Hello World example
Sphinx-4 provides many configuration options to meet almost any need in the field of
speech recognition. For our purposes, the most efficient approach is to simply modify
the existing Hello World example. Under the Sphinx-4 root directory, change to the
demo/sphinx/helloworld directory and edit the
helloworld.config.xml file. Listing 2 shows the one line of change required to use the
alN.dict dictionary file we built.
Listing 2. helloworld.config.xml changes
original (line 114):
<property name="dictionaryPath"
value="resource:/edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.
Model!/edu/cmu/sphinx/model/acoustic/WSJ_8gau_13d
Cep_16k_40mel_130Hz_6800Hz/dict/cmudict.0.6d"/>
new:
<property name="dictionaryPath"
value="resource:/edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.
Model!/edu/cmu/sphinx/model/acoustic/WSJ_8gau_13d
Cep_16k_40mel_130Hz_6800Hz/dict/alN.dict"/>
Modifications are also necessary to the hello.gram grammar file in the same directory. Listing 3 shows the changes required to pick up just the letters and numbers in our dictionary file.
Listing 3. hello.gram changes
original:
public <greet> = (Good morning | Hello)
( Bhiksha | Evandro | Paul | Philip | Rita | Will );
new:
public <greet> = ( zero | one | two | three | four | five | six |seven | eight | nine |
a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v |
w | x | y | z) * ;
You'll also need to make a cosmetic change to the HelloWorld.java file, as shown below.
Listing 4. HelloWorld.java change
original (line 59):
System.out.println
("Say: (Good morning | Hello) " +
"( Bhiksha | Evandro | Paul | Philip | Rita | Will )");
new:
System.out.println
("Listening for letters and numbers");
With the above changes in place, you can build and run the modified example. Change to the
Sphinx-4 home directory and issue the command ant (the same
"BUILD SUCCESSFUL" message will let you know if your changes were correct). Run the
updated example with the command $JAVA_HOME/bin/java -mx312m -jar
bin/HelloWorld.jar (on Linux). The command for Windows is: java -mx312m
-jar bin/HelloWorld.jar. Speak this sentence: "The phone
number for IBM tech support is one eight zero zero four two six seven three seven
eight," and you should see output like that shown below:
f o nine r four i b m x a four t one eight zero zero four two six seven three seven eight
As you can see, the sentence uttered will be processed for letters and numbers semi-correctly. The letters "IBM" and the numbers in the phone number are recognized correctly, but the remainder of the words are incorrectly categorized as various letters and numbers that are the best match for a particular sound.
You may be asking yourself: Why not simply use a multithousand-word dictionary to recognize those incorrect best guesses? After all, Sphinx-4 provides large vocabulary dictionaries and language models. Why not simply configure the demonstration example to recognize the remaining: "The phone number for tech support is" and any other words that may be uttered?
The answer is that Sphinx-4 is good, but not perfect. Expanding the dictionary file to recognize many thousands of words will drastically reduce the effectiveness of the simple number and letter matching. You can test this yourself by checking some of the other programs in the Sphinx-4 "demo" directory or by modifying the existing examples to use large dictionary files and expanded grammar lists. Post-processing the text of only letters and numbers for higher-order data is a much easier method of developing a useful annotation system with available open source systems.
With two simple rules, extracting acronyms and phone numbers from the output text becomes relatively simple: Any three consecutive letters are considered an acronym, and any five or more digits together are considered a phone number. Listings 5, 6, and 7 show the components of the annotateAcrNum.pl program that perform these extractions and lookups:
Listing 5. annotateAcrNum.pl part 1 — Main program logic
#!/usr/bin/perl -w
# annotateAcrNum.pl - extract and lookup acronyms and numbers from speech
# recognition text output
use strict;
use Yahoo::Search;
use Net::Dict;
$|=1; # non buffered output for better user feedback
my %numHash =
("zero" => "0",
"one" => "1",
"two" => "2",
"three" => "3",
"four" => "4",
"five" => "5",
"six" => "6",
"seven" => "7",
"eight" => "8",
"nine" => "9" );
while( my $line = <STDIN> )
{
print "$line" if( $line =~ /(Start|You said:)/ );
next unless ( $line =~ /You said:/ );
my @words = split " ", substr($line,9); # ignore the "You said:" prefix
my @numArr = ();
my @letArr = ();
foreach my $chunk ( @words )
{
if( length($chunk) == 1 )
{
phoneNmSearch(@numArr) if( @numArr > 4 );
@numArr = ();
push @letArr, $chunk;
if( @letArr > 2 )
{
acronymSearch( @letArr );
shift( @letArr );
}
}elsif( length($chunk) > 1 )
{
push @numArr, $numHash{$chunk};
@letArr = ();
}#if length greater
}#for each word
phoneNmSearch( @numArr ) if( @numArr > 4 );
acronymSearch( @letArr ) if( @letArr > 2 );
}#while stdin
The main program logic above searches for letter and number strings matching our simplistic criteria. For each line of speech-recognition text output by the modified Hello World code, it builds separate arrays of letters and numbers. The letters array is searched using the acronymSearch subroutine described below. Note that the letters array is shifted after each acronym lookup in order to search for both "ibm" and "bmx" from the string "i b m x." The numbers array is not shifted in the same way; instead, the longest run of consecutive digits is collected and used for a single Web search.
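The same two rules can also be sketched in Java, for example if you would rather do the extraction inside the modified demo than in a separate Perl process. This is only an illustration of the sliding three-letter window and the five-or-more-digit run described above. The class RunExtractor and its method names are hypothetical, and the lookups themselves are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical Java counterpart to the main loop of annotateAcrNum.pl;
// it only finds candidates and performs no dictionary or Web lookups.
class RunExtractor {
    static final Map<String, String> DIGITS = new HashMap<>();
    static {
        String[] names = {"zero", "one", "two", "three", "four",
                          "five", "six", "seven", "eight", "nine"};
        for (int i = 0; i < names.length; i++) {
            DIGITS.put(names[i], String.valueOf(i));
        }
    }

    // Every window of three consecutive single-letter tokens.
    static List<String> acronymCandidates(String recognized) {
        List<String> found = new ArrayList<>();
        StringBuilder letters = new StringBuilder();
        for (String token : recognized.trim().split("\\s+")) {
            if (token.length() == 1) {
                letters.append(token);
                if (letters.length() == 3) {
                    found.add(letters.toString());
                    letters.deleteCharAt(0);   // slide the window by one letter
                }
            } else {
                letters.setLength(0);          // any longer token ends the letter run
            }
        }
        return found;
    }

    // Every maximal run of five or more consecutive digit words, as digits.
    static List<String> numberCandidates(String recognized) {
        List<String> found = new ArrayList<>();
        StringBuilder digits = new StringBuilder();
        for (String token : recognized.trim().split("\\s+")) {
            String digit = DIGITS.get(token);
            if (digit != null) {
                digits.append(digit);
            } else {
                if (digits.length() >= 5) {
                    found.add(digits.toString());
                }
                digits.setLength(0);           // a letter ends the digit run
            }
        }
        if (digits.length() >= 5) {
            found.add(digits.toString());      // run that reaches the end of the line
        }
        return found;
    }

    public static void main(String[] args) {
        String said = "f o nine r four i b m x a four t one eight zero zero "
                    + "four two six seven three seven eight";
        System.out.println(acronymCandidates(said));
        System.out.println(numberCandidates(said));
    }
}
```

For the sample sentence shown earlier, this prints [ibm, bmx, mxa] followed by [18004267378]. The spurious "mxa" window shows why each candidate still needs a dictionary lookup to be useful.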
Listing 6. annotateAcrNum.pl part 2 — acronymSearch
sub acronymSearch
{
my $dict = Net::Dict->new('dict.org');
  my $str = join( "", @_ );  # join the letters into one lookup string
  my $eref = $dict->define($str);
  return unless( defined($eref) && @$eref );  # nothing to print without definitions
foreach my $entry (@$eref)
{
my ($db, $definition) = @$entry;
next if ( !(defined($definition)) || !(defined($db)) );
if( $db =~ /(wn|vera|gazetteer|foldoc)/ ){ print "$db: $definition\n" }
}#for each definition
}#acronymSearch
Subroutine acronymSearch makes use of the helpful Net::Dict module.
Simply specify a dictionary server and a query to look up in the large variety of
databases available. Regular expression /(wn|vera|gazetteer|foldoc)/ limits the printout to those databases
that provide relatively terse descriptions. You may find that your acronym space is
better represented by other databases available at dict.org, requiring removal of this
regular expression limiter.
Listing 7. annotateAcrNum.pl part 3 — phoneNmSearch
sub phoneNmSearch
{
  my $str = join( "", @_ );  # join the digits into one string
if( length($str) == 11 )
{
$str =~ /(\d)(\d\d\d)(\d\d\d)(\d\d\d\d)/;
$str = "$1-$2-$3-$4\n";
}elsif( length($str) == 10 )
{
$str =~ /(\d\d\d)(\d\d\d)(\d\d\d\d)/;
$str = "$1-$2-$3\n";
}elsif( length($str) == 7 )
{
$str =~ /(\d\d\d)(\d\d\d\d)/;
$str = "$1-$2\n";
}
print "Results for: $str\n";
my @results = Yahoo::Search->Results(Doc => "$str", AppId => "PhNmLookup" );
warn $@ if $@; # report any errors
my $recCount = 0;
for my $res (@results)
{
print "Title: ", $res->Title, " \n";
print $res->Summary, "\n";
print $res->Url, "\n";
print "\n";
last if( $recCount > 1 ); # print first 3 results only
$recCount++;
}#for each result
}#phoneNmSearch
For certain search engines, drastically more accurate search results can be attained by the addition of formatting to the phone number digits. For example, changing 18004267378 to 1-800-426-7378 or 4152042 into 415-2042 is performed by the first portion of the phoneNmSearch subroutine. This slightly modified phone number is then used as the query in a Yahoo! search parameter using Jeffrey Friedl's handy Yahoo::Search Perl module.
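The formatting step on its own is small enough to show in isolation. The sketch below applies the same 11-, 10-, and 7-digit patterns in Java. PhoneFormat is a hypothetical helper name, and strings of any other length are returned unchanged.

```java
// Hypothetical helper mirroring the formatting step of phoneNmSearch.
class PhoneFormat {
    // Insert hyphens into 11-, 10- and 7-digit strings; return others unchanged.
    static String format(String digits) {
        switch (digits.length()) {
            case 11:
                return digits.replaceFirst(
                    "(\\d)(\\d{3})(\\d{3})(\\d{4})", "$1-$2-$3-$4");
            case 10:
                return digits.replaceFirst(
                    "(\\d{3})(\\d{3})(\\d{4})", "$1-$2-$3");
            case 7:
                return digits.replaceFirst(
                    "(\\d{3})(\\d{4})", "$1-$2");
            default:
                return digits;
        }
    }

    public static void main(String[] args) {
        System.out.println(format("18004267378")); // 1-800-426-7378
        System.out.println(format("4152042"));     // 415-2042
    }
}
```

These groupings follow North American number conventions, as in the Perl listing. Other numbering plans would need their own patterns.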
With your custom Sphinx-4 speech recognition and the annotateAcrNum Perl program,
you're ready to start annotating spoken conversations. Run the annotator with the
command $JAVA_HOME/bin/java -mx312m -jar bin/HelloWorld.jar | perl
annotateAcrNum.pl (on Linux). For Windows, the command is
java -mx312m -jar bin/HelloWorld.jar | perl
annotateAcrNum.pl.
Figure 1 shows the output of the annotator setup in "Terminal" on Vector Linux. Note the underlined link text available to launch pages based on the Web search results.
Figure 1. Conversation annotator screenshot in Terminal on Vector Linux
The types of queries and databases chosen to search in this article are just general examples of useful annotation. You may find that using Google for your Web search lookups is more effective, or you can link your phone number lookups to your employer's address book. Similar flexibility applies to the methods chosen to extract higher-order data from the recognized letters and numbers. Perhaps your conversations focus more on IP addresses or employee serial numbers. Using some of the techniques described, you can extract your dotted quads and unique identifiers, and link the lookups to your own databases.
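As one possible illustration of the dotted-quad idea, the sketch below assumes the grammar and dictionary have been extended with the word "dot", which is not part of the setup built in this article. It maps digit words and "dot" onto a compact string, then picks out anything shaped like an IPv4 address whose four parts are each at most 255. The class DottedQuads and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical extractor; assumes "dot" has been added to the grammar and dictionary.
class DottedQuads {
    static final List<String> DIGIT_WORDS = Arrays.asList(
        "zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine");

    // Four groups of 1-3 digits separated by dots, not touching other digits or dots.
    static final Pattern QUAD =
        Pattern.compile("(?<![\\d.])\\d{1,3}(?:\\.\\d{1,3}){3}(?![\\d.])");

    static List<String> extract(String recognized) {
        // Digit words become digits, "dot" becomes '.', everything else a space.
        StringBuilder compact = new StringBuilder();
        for (String token : recognized.trim().split("\\s+")) {
            int digit = DIGIT_WORDS.indexOf(token);
            if (digit >= 0) {
                compact.append(digit);
            } else if (token.equals("dot")) {
                compact.append('.');
            } else {
                compact.append(' ');
            }
        }
        List<String> quads = new ArrayList<>();
        Matcher m = QUAD.matcher(compact);
        while (m.find()) {
            if (allOctetsInRange(m.group())) {
                quads.add(m.group());
            }
        }
        return quads;
    }

    // True when every dot-separated part is at most 255.
    static boolean allOctetsInRange(String quad) {
        for (String octet : quad.split("\\.")) {
            if (Integer.parseInt(octet) > 255) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(extract(
            "a b one nine two dot one six eight dot zero dot one x"));
    }
}
```

The main method prints [192.168.0.1]. Recognition errors will be more costly here than with phone numbers, because a single misheard "dot" breaks the pattern.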
Sphinx-4 also provides many options for enhancing the effectiveness of speech recognition. Consider creating your own trained acoustic models for you and members of your team to provide a much higher accuracy rate. Expand the dictionary file to include tens of thousands of commonly spoken words and test Sphinx-4's real-time transcription qualities.
| Description | Name | Size | Download method |
|---|---|---|---|
| Sample code | os-sphinxspeechrecAnnotations_0.1.zip | 4KB | HTTP |
Learn
- Read the article "Enable C++ applications for Web services using XML-RPC" for a step-by-step guide to exposing C++ methods as services.
- To listen to interesting interviews and discussions for software developers, check out developerWorks podcasts.
- Stay current with developerWorks' Technical events and webcasts.
- Check out upcoming conferences, trade shows, webcasts, and other Events around the world that are of interest to IBM open source developers.
- Visit the developerWorks Open source zone for extensive how-to information, tools, and project updates to help you develop with open source technologies and use them with IBM's products.
- Watch and learn about IBM and open source technologies and product functions with the no-cost developerWorks On demand demos.
Get products and technologies
- Learn more about and download Sphinx-4.
- Learn more about and grab Apache Ant to build Sphinx-4.
- The JDK file used for this project was from the SE binary extractor V1.6.0_02.
- The annotating Perl script makes use of the Net::Dict and Yahoo::Search modules at CPAN.
- If your system doesn't have Perl, get it from Perl.org.
- Innovate your next open source development project with IBM trial software, available for download or on DVD.
- Download IBM product evaluation versions, and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
Discuss
- Participate in developerWorks blogs and get involved in the developerWorks community.