Many well-established software projects have used flat text configuration and resource files for several years without major issues. As projects grow and become more complex, the need increases for greater rigor and adaptability. With XML, and the application of XML using concrete standards, you can potentially benefit from: cross-project and cross-platform compatibility, robustness, and extensibility in areas such as Unicode.
By converting flat text files to the relevant open source standard, you can also increase flexibility and reliability. The lexicon in voice recognition work provides a good example that's used in this article. Whether or not your open source projects move to XML for resource files, you can employ XML standards in your work without loss of functions.
In this article, learn to easily move between flat and Pronunciation Lexicon Specification (PLS) formats. Examples show how to store customized lexicons in the PLS format and extract data into the required flat file.
Lexicons are lists of words that you use in speech recognition tools. They contain information on how the word must be printed or rendered graphically and how it sounds using phonemes. The lexicon in regular use with the Hidden Markov Model Toolkit (HTK) is widely used in voice control projects (see Resources). Listing 1 is an extract from a VoxForge HTK lexicon.
Listing 1. Extract from a VoxForge HTK lexicon
AGENCY    [AGENCY]    ey jh ih n s iy
AGENDA    [AGENDA]    ax jh eh n d ax
AGENT    [AGENT]    ey jh ih n t
AGENTS    [AGENTS]    ey jh ih n t s
AGER    [AGER]    ey g er
AGES    [AGES]    ey jh ih z
The file in Listing 1 consists of three tab-separated fields:
- The label that describes the word generally
- The word as you want it printed or shown on the screen (the grapheme), surrounded by square brackets
- A sequence of single-space-separated phonemes from the Arpabet (see Resources) set that describe how the word sounds
In the example above, the pronunciations are from the English language, which is easily encompassed by American Standard Code for Information Interchange (ASCII) characters.
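As a minimal sketch of the three-field layout described above, the following splits one HTK lexicon line into its parts. It assumes Python 3 (the article's own listings use Python 2), and the function name parse_htk_line is hypothetical, chosen only for this illustration.

```python
def parse_htk_line(line):
    """Split one HTK lexicon line into (label, grapheme, phoneme list)."""
    # The three fields are separated by tabs.
    label, bracketed, phonemes = line.rstrip("\n").split("\t")
    # The grapheme field is expected to be wrapped in square brackets.
    if not (bracketed.startswith("[") and bracketed.endswith("]")):
        raise ValueError("grapheme is not in square brackets: %r" % bracketed)
    # The phonemes are separated by single spaces.
    return label, bracketed[1:-1], phonemes.split(" ")


label, grapheme, phonemes = parse_htk_line("AGENCY\t[AGENCY]\tey jh ih n s iy\n")
print(label, grapheme, phonemes)
```

A line with the wrong number of tab-separated fields makes the unpacking fail with a ValueError, which is one simple way to notice a damaged record early.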
The CMU Sphinx project (see Resources) stores the lexicon (or dictionary in CMU Sphinx context) in a similar manner. Listing 2 shows an extract.
Listing 2. Extract from a CMU Sphinx lexicon
agency    EY JH AH N S IY
agenda    AH JH EH N D AH
agendas    AH JH EH N D AH Z
agent    EY JH AH N T
agents    EY JH AH N T S
ager    EY JH ER
Listing 2 has only two fields: the word/grapheme and its phoneme rendering. The two example lexicons have some subtle differences:
- The words and phonemes are in different cases.
- The phonemes have some slight differences.
- Punctuation (comma, exclamation point, and so on) is treated slightly differently.
You can see the entire dictionary in the cmu07a.dic file in the current download of PocketSphinx (see Resources).
Because the lexicon describes specific pronunciations of words, you might need to edit the file to suit specific people or dialects. Over time, you build up knowledge capital in your own customized lexicon. It's easy to edit the flat file with a text editor, but it's also easy to introduce errors, such as: using a separator other than the standard for the file, inserting non-ASCII characters, putting the fields in the wrong order, sorting the records incorrectly, missing square brackets where required, and so on.
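The kinds of editing errors listed above can be caught with a small checker before any conversion. The sketch below assumes Python 3 and the HTK layout from Listing 1. The function name check_htk_lexicon is hypothetical, and the sort-order test uses plain string comparison, which is an assumption about the ordering your toolkit expects.

```python
def check_htk_lexicon(lines):
    """Return a list of (line_number, message) problems found in HTK-style lines."""
    problems = []
    previous_label = None
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        # Non-ASCII characters cannot be encoded as ASCII.
        try:
            line.encode("ascii")
        except UnicodeEncodeError:
            problems.append((number, "non-ASCII character"))
        # A wrong separator shows up as the wrong number of fields.
        fields = line.split("\t")
        if len(fields) != 3:
            problems.append((number, "expected 3 tab-separated fields, found %d" % len(fields)))
            continue
        label, grapheme, _phonemes = fields
        if not (grapheme.startswith("[") and grapheme.endswith("]")):
            problems.append((number, "grapheme is not in square brackets"))
        # Assumption: records are sorted by plain string comparison of the label.
        if previous_label is not None and label < previous_label:
            problems.append((number, "label is out of sort order"))
        previous_label = label
    return problems


sample = [
    "AGENCY\t[AGENCY]\tey jh ih n s iy\n",
    "AGENT [AGENT] ey jh ih n t\n",        # spaces used instead of tabs
    "AGENDA\tAGENDA\tax jh eh n d ax\n",   # square brackets missing
]
for number, message in check_htk_lexicon(sample):
    print("line %d: %s" % (number, message))
```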
There's another subtle disadvantage of flat files. As you build your customized file, you remain incompatible with other speech projects. A lexicon in a standard XML format such as PLS, if recognized by both projects, is immediately compatible in both.
Pronunciation Lexicon Specification
The PLS has a straightforward, basic format, as in Listing 3.
Listing 3. Basic PLS format
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
alphabet="ipa" xml:lang="en-US">
<lexeme ...>
<grapheme>...</grapheme>
<phoneme ...>...</phoneme>
</lexeme>
</lexicon>
The XML describes the root element lexicon that can
contain multiple lexeme child elements. Each lexeme can contain multiple grapheme
elements and multiple phoneme elements. The
specification lets you override the alphabet attribute
but does not allow you to override the xml:lang
language attribute. To store lexemes for different languages, you strictly need
separate PLS lexicon files. The default alphabet in this lexicon is ipa, which refers to the International Phonetic Alphabet (IPA) system
of representing sounds (see Resources). The IPA
representations of phonemes are multibyte Unicode characters. Both HTK and Sphinx
use plain ASCII codes. This important consideration is addressed later in this article.
The advantage of using the PLS specification is that it adds more rigorous
structure and lets you store more information, such as part of speech and
specific alphabets. Part-of-speech detail is important in English because some
words that can be spelled identically (homographs) are pronounced differently
depending on their grammatical role. For example, perfect as an adjective
sounds different from the same word as a verb because the stress is in a
different place. The extra information stored in attributes allows you to extract specific records from the entire file, depending on need. Using this method, you can search for a specific alphabet among multiple phoneme elements.
Consider the PLS lexicon as a database of lexicon information from which you can extract detail relevant to the voice tool that you use. Listing 4 is an example in PLS format.
Listing 4. One word in PLS format
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
alphabet="ipa" xml:lang="en">
<lexeme role="noun">
<grapheme>agency</grapheme>
<phoneme alphabet="x-htk-voxforge">ey jh ih n s iy</phoneme>
<phoneme alphabet="x-cmusphinx">EY JH AH N S IY</phoneme>
</lexeme>
</lexicon>
The example in Listing 4 stores only one word that has two
possible phonemic representations. You can filter out one of the phoneme strings by using the alphabet attribute. The lexeme element shows the role attribute of noun. While this is informative, it is redundant in this case because the word is only used as a noun, with no complicating pronunciation scenarios.
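Filtering on the alphabet attribute can be sketched as follows. This assumes Python 3 and the standard library's xml.etree.ElementTree rather than the standalone elementtree package used in the article's listings. The function name phonemes_for is hypothetical, and the sample document is a shortened copy of Listing 4 without the schema location.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"

SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
  xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
  alphabet="ipa" xml:lang="en">
  <lexeme role="noun">
    <grapheme>agency</grapheme>
    <phoneme alphabet="x-htk-voxforge">ey jh ih n s iy</phoneme>
    <phoneme alphabet="x-cmusphinx">EY JH AH N S IY</phoneme>
  </lexeme>
</lexicon>"""


def phonemes_for(root, alphabet):
    """Map each grapheme to the phoneme strings stored under one alphabet."""
    ns = {"pls": PLS_NS}
    result = {}
    for lexeme in root.findall("pls:lexeme", ns):
        grapheme = lexeme.findtext("pls:grapheme", namespaces=ns)
        for phoneme in lexeme.findall("pls:phoneme", ns):
            # Keep only the phoneme elements tagged with the requested alphabet.
            if phoneme.get("alphabet") == alphabet:
                result.setdefault(grapheme, []).append(phoneme.text)
    return result


# Parse from bytes so the encoding declaration in the prologue is honoured.
root = ET.fromstring(SAMPLE.encode("utf-8"))
print(phonemes_for(root, "x-cmusphinx"))
```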
By placing the phoneme representations from two
different sources side by side, you can already discern subtle differences. This information could be helpful in resolving speech recognition problems.
Neither CMU Sphinx nor HTK can use a PLS lexicon directly, but the simon (see Resources) front end to the HTK toolkit can. If you're using straight HTK or Sphinx, you need to be sure you can get easily from plain text to PLS and back again without loss of information.
The following sections show how to use Python to get from a flat file to PLS and back to a flat file. It is assumed that you have customized information in a flat lexicon file.
The code in Listing 5 uses Python, but you can accomplish the same thing many other ways. (For an example, see the developerWorks tutorial on Extensible Stylesheet Language Transformations (XSLT) in Resources.) Some people want to use libraries that check the robustness of the XML at each minor step for more immediate feedback on where the problem is, especially if source files are large and liable to contain errors and inconsistencies. The example below leaves the checking to the last step, which implies a certain level of confidence that the flat files are in good shape.
Listing 5. Conversion to PLS
from elementtree.ElementTree import parse
import string as str
import sys
import cgi
#
# call with
# python flat2pls.py vox
# or
# python flat2pls.py spx
#
if len(sys.argv) == 2:
    src = sys.argv[1]
else:
    exit("wrong args")
#
outfile = "mylex"+src+".pls"
print "out is "+outfile
out = open(outfile,"w")
out.write('<?xml version="1.0" encoding="UTF-8"?>\n\
<lexicon version="1.0"\n \
xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"\n\
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"\n \
xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon\n \
http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"\n\
alphabet="ipa" xml:lang="en">')
# now the lexemes
if src == "vox":
    f = open("vf.lex","r")
    for line in f:
        line = str.strip(line)
        word = str.split(line,"\t")
        #gr = str.strip(word[1],"[]")
        gr = cgi.escape(word[0])
        out.write('\n\
<lexeme>\n\
<grapheme>'+gr+'</grapheme>\n\
<phoneme alphabet="x-htk-voxforge">'+word[2]+'</phoneme>\n\
</lexeme>')
else: # src is sphinx
    f = open("cmu.dic","r")
    for line in f:
        line = str.strip(line)
        word = str.split(line,"\t")
        gr = cgi.escape(word[0])
        out.write('\n\
<lexeme>\n\
<grapheme>'+gr+'</grapheme>\n\
<phoneme alphabet="x-cmusphinx">'+word[1]+'</phoneme>\n\
</lexeme>')
# ended lexemes
out.write('\n</lexicon>\n')
out.close()
# now check the output is ok
tree = parse(outfile)
lexicon = tree.getroot()
mylexcount = 0
for lexeme in lexicon:
    mylexcount += 1
print 'Found %(number)d lexemes' % {"number":mylexcount}
Listing 5 begins by importing the parse function from the XML parsing library elementtree (see Resources) and some supporting libraries. Importing ElementTree on different distributions can involve a slightly different syntax, depending on how you install the module. The example code is from openSUSE with the module installed from source, but Ubuntu might require from xml.etree.ElementTree import parse. The string module (imported here under the name str) allows some string manipulations, sys gives you access to the command-line arguments, and cgi provides an all-important escaping routine essential in handling data for XML. The code expects to get a command-line interface (CLI) argument telling it whether to translate from CMU Sphinx format or HTK/VoxForge. The example code then opens the file for output and writes the XML prologue suitable for PLS. Because you're not storing any Unicode characters at this stage, it's sufficient to open the file for plain ASCII access only.
At this point, the code in Listing 5:
- Processes the source file line by line, splitting the fields into separate strings and writing the lexeme, grapheme, and phoneme components
- Identifies the phoneme with attribute alphabet="x-htk-voxforge" if the incoming data is from the VoxForge lexicon and with alphabet="x-cmusphinx" if the data is from Sphinx
- Retains the case of the phonemes
When the first field is imported it may contain characters, such as ampersand (&), that cause problems in the XML unless escaped with cgi.escape().
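For readers on a recent Python 3, where cgi.escape is no longer available (it was removed in Python 3.8), the same escaping step can be sketched with xml.sax.saxutils.escape from the standard library. The helper name lexeme_xml and the "R&D" entry, including its phoneme string, are made up purely for illustration.

```python
from xml.sax.saxutils import escape


def lexeme_xml(grapheme, phonemes, alphabet):
    """Build one lexeme element as text, escaping &, < and > in the character data."""
    return ("<lexeme>\n"
            "<grapheme>%s</grapheme>\n"
            '<phoneme alphabet="%s">%s</phoneme>\n'
            "</lexeme>") % (escape(grapheme), alphabet, escape(phonemes))


# Hypothetical entry: the ampersand must become &amp; to keep the XML well formed.
print(lexeme_xml("R&D", "aa r ae n d d iy", "x-htk-voxforge"))
```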
Finally, the code:
- Writes the closing tags
- Closes the PLS file and then reloads it as an XML file
- Reads through the file counting lexeme elements
- Reports the count of lexemes
If the count is reported, then the XML appears to be robust and well-formed.
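The final check can also be made explicit by catching the parser's error instead of letting the script stop with a traceback. This is a minimal sketch assuming Python 3 and the standard library's xml.etree.ElementTree. The function name count_lexemes is hypothetical, and it counts the direct children of the root in the same way that the loop in Listing 5 does.

```python
import xml.etree.ElementTree as ET


def count_lexemes(xml_text):
    """Return the number of children of the root, or None if the text is not well formed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The error message includes the line and column of the problem.
        print("not well formed:", err)
        return None
    return len(root)


good = "<lexicon><lexeme/><lexeme/></lexicon>"
bad = "<lexicon><lexeme><grapheme>R&D</grapheme></lexeme></lexicon>"  # unescaped ampersand
print(count_lexemes(good))
print(count_lexemes(bad))
```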
Listing 6 is an extract of the result from VoxForge HTK lexicon.
Listing 6. Extract of result from VoxForge HTK lexicon
...
<lexeme>
<grapheme>AGENDA</grapheme>
<phoneme alphabet="x-htk-voxforge">ax jh eh n d ax</phoneme>
</lexeme>
<lexeme>
<grapheme>AGENT</grapheme>
<phoneme alphabet="x-htk-voxforge">ey jh ih n t</phoneme>
</lexeme>
...
It's important to know that you can easily get back from PLS format to a flat file. The code in Listing 7 assumes that you have your lexicon stored in a PLS-formatted file and that your speech recognition project can only use a flat file in either HTK or CMU Sphinx format.
Listing 7. Conversion from PLS
from elementtree.ElementTree import parse
import string as str
import sys
#
# call with
# python pls2flat.py x-htk-voxforge > mylexicon
# or
# python pls2flat.py x-cmusphinx > mylexicon.dic
#
if len(sys.argv) > 1:
    alpha = sys.argv[1]
else:
    # without an argument, alpha would be undefined below
    sys.exit("wrong args")
#
if alpha == "x-htk-voxforge":
    tree = parse("mylexvox.pls")
else:
    tree = parse("mylexspx.pls")
lexicon = tree.getroot()
for lexeme in lexicon:
    for child in lexeme:
        #print child.tag
        if child.tag[-8:] == 'grapheme':
            if alpha == 'x-htk-voxforge':
                gr = str.upper(child.text)
                # join the fields first so print adds no stray spaces around the tabs
                print gr+"\t["+gr+"]\t",
            else:
                gr = child.text
                print gr+"\t",
        if child.tag[-7:] == 'phoneme':
            if child.get('alphabet') == alpha:
                print child.text
This short script uses the elementtree library to parse the PLS XML file. It establishes the root element and then iterates over the child lexemes, looking for the grapheme and phoneme, and prints the values in the relevant format to standard output, which the calling command redirects to a text file.
The script asks for the last eight characters in the tag for grapheme since there will be a namespace prefix returned with the tag. It recreates the three-field format for HTK and two fields for CMU Sphinx.
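An alternative to slicing a fixed number of characters is to strip the namespace part of the tag, which ElementTree returns in the form {namespace}name. The helper below is a sketch assuming Python 3 and the standard library's xml.etree.ElementTree, and the name local_name is hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Return the tag without its leading {namespace} part, if it has one."""
    return tag.rsplit("}", 1)[-1]


doc = ('<lexicon xmlns="http://www.w3.org/2005/01/pronunciation-lexicon">'
       '<lexeme><grapheme>agent</grapheme></lexeme></lexicon>')
root = ET.fromstring(doc)
grapheme = root[0][0]
print(grapheme.tag)              # the tag as the parser reports it
print(local_name(grapheme.tag))  # the element name on its own
```

Comparing the local name as a whole also avoids accidental matches on any other element whose name merely ends in the same characters.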
Merging and dealing with Unicode
The script in Listing 8 uses two PLS files
to create one common PLS file that contains information for both of the original
files. It also translates the VoxForge phoneme string
to Unicode and stores the Unicode version in a separate phoneme element identified with the alphabet="ipa" attribute.
Listing 8. Merging and Unicode
#! /usr/bin/python -u
# -*- coding: utf-8 -*-
#
# challenge is to merge two pls files
# given two pls files, merge them into one
#
import elementtree.ElementTree as ET
from elementtree.ElementTree import parse
import string as str
import codecs
import cgi
#
treevox = ET.parse("mylexvox.pls")
treespx = ET.parse("mylexspx.pls")
#
lexvox = treevox.getroot()
lexspx = treespx.getroot()
#
phons = { 'aa':u'ɑ','ae':u'æ','ah':u'ʌ','ao':u'ɔ','ar':u'ɛr','aw':u'aʊ',
    'ax':u'ə','ay':u'aɪ','b':u'b','ch':u'tʃ','d':u'd','dh':u'ð','eh':u'ɛ',
    'el':u'ɔl','en':u'ɑn','er':u'ər','ey':u'eɪ','f':u'f',
    'g':u'ɡ','hh':u'h','ih':u'ɪ','ir':u'ɪr','iy':u'i','jh':u'dʒ','k':u'k','l':u'l',
    'm':u'm','n':u'n','ng':u'ŋ','ow':u'oʊ','oy':u'ɔɪ','p':u'p','r':u'r','s':u's',
    'sh':u'ʃ','t':u't','th':u'θ','uh':u'ʊ','ur':u'ʊr','uw':u'u','v':u'v',
    'w':u'w','y':u'j','z':u'z','zh':u'ʒ','sil':'' }
#
def to_utf(s):
    myp = str.split(s)
    myipa = []
    for p in myp:
        myipa.append(phons[p])
    return str.join(myipa,'')
#
outfile = "my2lexmrg.pls"
out = codecs.open(outfile, encoding='utf-8', mode='w')
#
out.write('<?xml version="1.0" encoding="UTF-8"?>\n\
<lexicon version="1.0"\n \
xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"\n\
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"\n \
xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon\n \
http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"\n\
alphabet="ipa" xml:lang="en">')
#
# scan the two pls, create dictionary
voxdict = {}
for lexeme in lexvox:
    gr = str.lower(lexeme[0].text)
    ph = lexeme[1].text
    voxdict[gr] = {ph,}
#
for lexeme in lexspx:
    gr = lexeme[0].text
    ph = lexeme[1].text
    if gr in voxdict:
        voxdict[gr].add(ph)
    else:
        voxdict[gr] = {ph,}
#
for gr in sorted(voxdict.iterkeys()):
    out.write('\n\
<lexeme>\n\
<grapheme>'+cgi.escape(gr)+'</grapheme>')
    #print "%s: %s" % (key, voxdict[key])
    for ph in sorted(voxdict[gr]):
        alph = 'x-htk-voxforge' if ph.islower() else 'x-cmusphinx'
        out.write('\n\
<phoneme alphabet="'+alph+'">'+ph+'</phoneme>')
        if ph.islower():
            phipa = to_utf(ph)
            out.write(u'\n\
<phoneme alphabet="ipa">'+phipa+'</phoneme>')
    out.write('\n\
</lexeme>')
# done, close files
out.write('\n</lexicon>')
out.close()
# now check the output is ok
tree = parse(outfile)
lexicon = tree.getroot()
mylexcount = 0
for lexeme in lexicon:
    mylexcount += 1
print 'Found %(number)d lexemes' % {"number":mylexcount}
You begin with a hashbang (#!) expression followed by a special indicator
to the Python interpreter, on the second line, that this code contains Unicode
characters. The script then imports a number of modules, including elementtree, codecs, and cgi, that are useful when dealing with Unicode. You tell the interpreter where the two PLS files are and point to their root elements.
The phons variable stores a special dictionary that
contains a mapping from the CMU Arpabet codes to an equivalent Unicode
combination. This dictionary translates existing phoneme strings to a Unicode version. Feel free to modify the mapping
for your own purposes—for example, you might feel that the equivalent of
'aa' in Unicode is u'ɑ:',
which lengthens the a sound.
A single function, to_utf(), translates an ASCII Arpabet string to Unicode. The final part of the foundation is to open a file for storage of the output, making sure that this file knows it must be prepared to accept Unicode, and to write the PLS prologue into it.
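The translation step can be tried in isolation with a small subset of the mapping. The sketch below assumes Python 3, where every string is Unicode. The names PHONS and to_ipa are hypothetical, and the table holds only the codes needed for the two sample words, copied from the phons dictionary in Listing 8.

```python
# Subset of the Arpabet-to-IPA mapping from Listing 8, enough for two sample words.
PHONS = {"ax": "ə", "jh": "dʒ", "eh": "ɛ", "n": "n", "d": "d",
         "ey": "eɪ", "ih": "ɪ", "t": "t"}


def to_ipa(arpabet, mapping=PHONS):
    """Translate a space-separated Arpabet string into an IPA string."""
    codes = arpabet.split()
    # Report every unmapped code at once rather than stopping at the first one.
    missing = [code for code in codes if code not in mapping]
    if missing:
        raise KeyError("no IPA mapping for: " + " ".join(missing))
    return "".join(mapping[code] for code in codes)


print(to_ipa("ax jh eh n d ax"))
print(to_ipa("ey jh ih n t"))
```

Passing the mapping as a parameter is one way to swap in a modified table, such as a lengthened vowel, without editing the function.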
Now you're ready to process the files by building a single internal Python dictionary that merges the two PLS files, scanning each of them with the elementtree library. It is assumed that the grapheme will be the first child and the phoneme the second child of the lexeme. The script adds all of the records from the first file to a new merge dictionary. When scanning the second file, if a key already exists in the new merge dictionary you add to its set of phonemes. If not, you create a new key item in the merged dictionary. At the end of the loop, the new merged dictionary contains keys from both the original files and an associated set of one or two phoneme strings.
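The merge logic on its own, without any XML handling, can be sketched as a dictionary of sets. This assumes Python 3 and takes plain (grapheme, phoneme string) pairs as input, and the function name merge_lexicons is hypothetical. Unlike Listing 8, it uses setdefault for both sources, so a repeated grapheme in the first lexicon adds to the set instead of replacing it.

```python
def merge_lexicons(vox_entries, sphinx_entries):
    """Merge two lists of (grapheme, phonemes) pairs into {grapheme: set of phoneme strings}."""
    merged = {}
    for grapheme, phonemes in vox_entries:
        # VoxForge graphemes are upper case, so fold them to match the Sphinx keys.
        merged.setdefault(grapheme.lower(), set()).add(phonemes)
    for grapheme, phonemes in sphinx_entries:
        merged.setdefault(grapheme, set()).add(phonemes)
    return merged


vox = [("AGENDA", "ax jh eh n d ax"), ("AGENT", "ey jh ih n t")]
sphinx = [("agenda", "AH JH EH N D AH"), ("agendas", "AH JH EH N D AH Z")]
merged = merge_lexicons(vox, sphinx)
for grapheme in sorted(merged):
    print(grapheme, sorted(merged[grapheme]))
```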
Write the new PLS file from the merge you just created. As you scan through the
dictionary, you add the alphabet attribute to
distinguish one phoneme from another. Having written
out the existing phonemes, you create a new phoneme
string that is the Unicode equivalent of the CMU Arpabet string, which you can get from the HTK or the Sphinx version (or both) according to your needs.
Finally, close out the root element, close the file, and parse it again as you did before to check that it is well formed.
The result should look something like Listing 9.
Listing 9. Merge results
...
<lexeme>
<grapheme>agenda</grapheme>
<phoneme alphabet="x-cmusphinx">AH JH EH N D AH</phoneme>
<phoneme alphabet="x-htk-voxforge">ax jh eh n d ax</phoneme>
<phoneme alphabet="ipa">ədʒɛndə</phoneme>
</lexeme>
<lexeme>
<grapheme>agendas</grapheme>
<phoneme alphabet="x-cmusphinx">AH JH EH N D AH Z</phoneme>
</lexeme>
<lexeme>
<grapheme>agent</grapheme>
<phoneme alphabet="x-cmusphinx">EY JH AH N T</phoneme>
<phoneme alphabet="x-htk-voxforge">ey jh ih n t</phoneme>
<phoneme alphabet="ipa">eɪdʒɪnt</phoneme>
</lexeme>
<lexeme>
<grapheme>agent's</grapheme>
<phoneme alphabet="x-cmusphinx">EY JH AH N T S</phoneme>
</lexeme>
...
With the merged PLS dictionary, you can apply Extensible Stylesheet Language (XSL) or any other procedures to generate the results you need, whether flat file or a new specific PLS file. In theory, you can store other phoneme strings in this file as well, even from other languages. However, that is a nonstandard use of the PLS specification.
Kai Schott has done a lot of work with PLS and has several already prepared files for download in different languages, particularly German (see Resources).
Though you can get a lot of information from the flat files, the following issues remain unresolved.
- Selecting from multiple graphemes: On occasion, you need to deal with multiple spellings for a word in the same language. The role and the phoneme are identical for both orthographies, so you have multiple graphemes in the same lexeme. However, there is nothing in the PLS that allows you to add an attribute to the grapheme element as there is with the alphabet attribute of the phoneme element.
- Acronyms: Lexicons frequently contain acronyms. The PLS deals with these in a child element of lexeme called an <alias>. To build a PLS automatically from a flat file, you need a way to distinguish the acronyms from the real words in the lexicon. The flat files don't necessarily have that information.
- Role/part of speech: As with acronyms, part-of-speech information is not available from the flat files to build role attributes into the PLS.
In this article, you learned to move quite easily between flat and PLS formats. Storing lexicons in PLS format has potential advantages. Open source projects may or may not move to XML for resource files. That is a decision for project managers to make—they alone can assess resources and apply them in the best way generally accepted in their communities.
In the meantime, you can employ XML standards in your own work without loss of functions. For lexicons and dictionaries used in voice recognition work, you can increase cross-project adaptability, robustness, and utility by storing customized lexicons in the PLS format. You can easily extract data as needed into the required flat file.
Learn
- W3C Pronunciation Lexicon Specification (PLS): Get more information on lexemes, graphemes, and phonemes.
- International Phonetic Alphabet: Read more on Wikipedia about this alphabetic system of phonetic notation, including the Arpabet and Unicode equivalents.
- HTK: Learn about the Hidden Markov Model Toolkit (HTK) portable toolkit for building and manipulating hidden Markov models.
- CMU Sphinx: Explore the open source toolkit for speech recognition from Carnegie Mellon University.
- ElementTree: Read more about the ElementTree XML library for Python.
- simon: Learn more about this speech recognition platform.
- Testing simon, Kai Schott's blog: Find several already prepared PLS files for various languages.
- Analyze with XSLT, Part 1: Analyze non-XML data with XSLT (Chuck White, developerWorks, December 2003): Provides information about using XSLT to work with XML.
- More articles by this author (Colin Beckingham, developerWorks, March 2009-current): Read articles about XML, voice recognition, XQuery, PHP, and other technologies.
- New to XML? Get the resources you need to learn XML.
- XML area on developerWorks: Find the resources you need to advance your skills in the XML arena, including DTDs, schemas, and XSLT. See the XML technical library for a wide range of technical articles and tips, tutorials, standards, and IBM Redbooks.
- IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.
Get products and technologies
- VoxForge: Create acoustic models and find out more about putting together a speech recognition model.
- CMU Sphinx: Get the speech recognition toolkit.
- Python: Get more information about this programming language, including download links.
- IBM product evaluation versions: Download or explore the online trials in the IBM SOA Sandbox and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
Colin Beckingham is a freelance researcher, writer, and programmer who lives in eastern Ontario, Canada. Holding degrees from Queen's University, Kingston, and the University of Windsor, he has worked in a rich variety of fields including banking, horticulture, horse racing, teaching, civil service, retail, and travel and tourism. The author of database applications and numerous newspaper, magazine, and online articles, his research interests include open source programming, VoIP, and voice-control applications on Linux. You can reach Colin at colbec@start.ca.



