Move toward open source standards in speech processing

Convert flat lexicon files to XML with Python

Many open source projects began before the advent of free and open source software (FOSS) standards, so their configuration and resource files are simple flat text files. By converting these files to the relevant open source standard, you potentially increase cross-project compatibility, flexibility, and reliability. The lexicon in voice recognition work is a good example. In this article, learn to use Python to convert existing flat lexicons to the XML format defined in the Pronunciation Lexicon Specification (PLS) and how to convert the new PLS file back to a flat file. Explore how to use the XML format to add extra information and rigor to the maintenance of lexicons. Issues such as Unicode, and merging the new lexicon with other PLS files while still being able to use the data in audio model generation, are also addressed.

Colin Beckingham, Writer and Researcher, Freelance

Colin Beckingham is a freelance researcher, writer, and programmer who lives in eastern Ontario, Canada. Holding degrees from Queen's University, Kingston, and the University of Windsor, he has worked in a rich variety of fields including banking, horticulture, horse racing, teaching, civil service, retail, and travel and tourism. The author of database applications and numerous newspaper, magazine, and online articles, his research interests include open source programming, VoIP, and voice-control applications on Linux. You can reach Colin at colbec@start.ca.



07 August 2012


Introduction

Many well-established software projects have used flat text configuration and resource files for years without major issues. As projects grow and become more complex, the need increases for greater rigor and adaptability. With XML, applied through concrete standards, you can potentially gain cross-project and cross-platform compatibility, robustness, and extensibility in areas such as Unicode.

Frequently used abbreviations

  • HTK: Hidden Markov Model Toolkit
  • PLS: Pronunciation Lexicon Specification
  • XML: Extensible Markup Language

By converting flat text files to the relevant open source standard, you can also increase flexibility and reliability. The lexicon in voice recognition work provides a good example that's used in this article. Whether or not your open source projects move to XML for resource files, you can employ XML standards in your work without loss of functions.

In this article, learn to easily move between flat and Pronunciation Lexicon Specification (PLS) formats. Examples show how to store customized lexicons in the PLS format and extract data into the required flat file.


Example: The lexicon

Lexicons are lists of words used by speech recognition tools. Each entry records how a word is printed or rendered graphically and how it sounds, expressed as phonemes. The lexicon in regular use with the Hidden Markov Model Toolkit (HTK) is widely used in voice control projects (see Resources). Listing 1 is an extract from a VoxForge HTK lexicon.

Listing 1. Extract from a VoxForge HTK lexicon
AGENCY  [AGENCY]        ey jh ih n s iy
AGENDA  [AGENDA]        ax jh eh n d ax
AGENT   [AGENT] ey jh ih n t
AGENTS  [AGENTS]        ey jh ih n t s
AGER    [AGER]  ey g er
AGES    [AGES]  ey jh ih z

Add a single tab if you copy and paste code from this article

It is recommended that you get the lexicons directly from the source. This article displays in HTML, which replaces the tab separations with spaces. If you copy and paste from this article, you need to replace the multiple intervening spaces with a single tab separator (\t) or the script will fail.
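If you do work from a pasted copy, a small repair script can rebuild the tab separation. The sketch below is not part of the article's toolchain; it assumes a hypothetical pasted file named pasted.lex in the three-field layout of Listing 1 and rewrites it as a properly tab-separated vf.lex.

# Minimal repair sketch (not one of the article's scripts): rebuild tab
# separation for pasted VoxForge/HTK lines.  Assumes a hypothetical file
# named pasted.lex whose tabs became runs of spaces when copied from HTML.
fin = open("pasted.lex")
fout = open("vf.lex", "w")
for line in fin:
  parts = line.split()
  if len(parts) < 3:
    continue                      # skip blank or malformed lines
  label, shown, phones = parts[0], parts[1], parts[2:]
  fout.write(label + "\t" + shown + "\t" + " ".join(phones) + "\n")
fin.close()
fout.close()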

The file in Listing 1 consists of three tab-separated fields:

  • The label that describes the word generally
  • The word, enclosed in square brackets, as you want it printed or shown on the screen (the grapheme)
  • A sequence of single-space-separated phonemes from the Arpabet (see Resources) set that describe how the word sounds

In the example above, the pronunciations are from the English language, which is easily encompassed by American Standard Code for Information Interchange (ASCII) characters.

The CMU Sphinx project (see Resources) stores the lexicon (or dictionary in CMU Sphinx context) in a similar manner. Listing 2 shows an extract.

Listing 2. Extract from a CMU Sphinx lexicon
agency  EY JH AH N S IY
agenda  AH JH EH N D AH
agendas AH JH EH N D AH Z
agent   EY JH AH N T
agents  EY JH AH N T S
ager    EY JH ER

The file in Listing 2 has only two fields: the word/grapheme and its phoneme rendering. The two example lexicons have some subtle differences:

  • The words and phonemes are in different cases.
  • The phonemes have some slight differences.
  • Punctuation (comma, exclamation point, and so on) is treated slightly differently.

You can see the entire dictionary in the cmu07a.dic file in the current download of PocketSphinx (see Resources).

Because the lexicon describes specific pronunciations of words, you might need to edit the file to suit specific people or dialects. Over time, you build up knowledge capital in your own customized lexicon. It's easy to edit the flat file with a text editor, but it's also easy to introduce errors, such as: using a separator other than the standard for the file, inserting non-ASCII characters, putting the fields in the wrong order, sorting the records incorrectly, missing square brackets where required, and so on.
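A few lines of Python can catch most of these slips before they reach a conversion script. The following sketch is not part of the article's toolchain; it assumes a VoxForge/HTK-style file named vf.lex and reports lines with the wrong number of tab-separated fields, missing square brackets, or non-ASCII characters.

# Minimal sanity check for a VoxForge/HTK-style lexicon (assumed file
# name vf.lex).  Adapt the rules to the flat format you actually use.
lineno = 0
for line in open("vf.lex"):
  lineno += 1
  fields = line.rstrip("\n").split("\t")
  if len(fields) != 3:
    print "line %d: expected 3 tab-separated fields, found %d" % (lineno, len(fields))
    continue
  if not (fields[1].startswith("[") and fields[1].endswith("]")):
    print "line %d: second field is not wrapped in square brackets" % lineno
  try:
    line.decode("ascii")
  except UnicodeDecodeError:
    print "line %d: contains non-ASCII characters" % lineno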

There's another subtle disadvantage of flat files. As you build your customized file, it remains incompatible with other speech projects. A lexicon in a standard XML format such as PLS, if recognized by both projects, is immediately usable in both.


Pronunciation Lexicon Specification

The PLS has a straightforward, basic format, as in Listing 3.

Listing 3. Basic PLS format
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" 
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
      xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon 
        http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
      alphabet="ipa" xml:lang="en-US">
  <lexeme ...>
    <grapheme>...</grapheme>
    <phoneme ...>...</phoneme>
  </lexeme>
</lexicon>

The XML describes the root element lexicon that can contain multiple lexeme child elements. Each lexeme can contain multiple grapheme elements and multiple phoneme elements. The specification lets you override the alphabet attribute but does not allow you to override the xml:lang language attribute. To store lexemes for different languages, you strictly need separate PLS lexicon files. The default alphabet in this lexicon is ipa, which refers to the International Phonetic Alphabet (IPA) system of representing sounds (see Resources). The IPA representations of phonemes are multibyte Unicode characters. Both HTK and Sphinx use plain ASCII codes. This important consideration is addressed later in this article.
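As a quick illustration of the difference, compare the IPA rendering of agenda (which appears later in Listing 9) with its ASCII Arpabet equivalent. The snippet below is only an illustration, not part of the article's scripts.

# -*- coding: utf-8 -*-
# Illustration: an IPA phoneme string uses multibyte characters, while the
# HTK and Sphinx Arpabet strings stay within single-byte ASCII.
ipa = u'ədʒɛndə'                    # "agenda" in IPA (see Listing 9)
print len(ipa)                      # 7 characters
print len(ipa.encode('utf-8'))      # 11 bytes when stored as UTF-8
print len('ax jh eh n d ax')        # 15 ASCII bytes/characters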

The advantage of using the PLS specification is that it adds more rigorous structure and lets you store more information, such as part of speech and specific alphabets. Part-of-speech detail is important in English because some words that can be spelled identically (homographs) are pronounced differently depending on their grammatical role. For example, perfect as an adjective sounds different from the same word as a verb because the stress is in a different place. The extra information stored in attributes allows you to extract specific records from the entire file, depending on need. Using this method, you can search for a specific alphabet among multiple phoneme elements.

Consider the PLS lexicon as a database of lexicon information from which you can extract detail relevant to the voice tool that you use. Listing 4 is an example in PLS format.

Listing 4. One word in PLS format
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" 
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
      xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon 
        http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
      alphabet="ipa" xml:lang="en">
  <lexeme role="noun">
    <grapheme>agency</grapheme>
    <phoneme alphabet="x-htk-voxforge">ey jh ih n s iy</phoneme>
    <phoneme alphabet="x-cmusphinx">EY JH AH N S IY</phoneme>
  </lexeme>
</lexicon>

The example in Listing 4 stores only one word that has two possible phonemic representations. You can filter out one of the phoneme strings by using the alphabet attribute. The lexeme element shows the role attribute of noun. While this is informative, it is redundant in this case because the word is only used as a noun, with no complicating pronunciation scenarios.

By placing the phoneme representations from two different sources side by side, you can already discern subtle differences. This information could be helpful in resolving speech recognition problems.
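To see the filtering idea in code, the short sketch below parses the Listing 4 document (assumed here to be saved as a hypothetical file named onelexeme.pls) and prints only the phoneme string whose alphabet attribute matches the one requested.

# Minimal sketch: pull one phoneme rendering out of a PLS lexeme by its
# alphabet attribute.  Assumes Listing 4 is saved as the hypothetical
# file onelexeme.pls.
from xml.etree.ElementTree import parse  # or elementtree.ElementTree on older installs

PLS = "{http://www.w3.org/2005/01/pronunciation-lexicon}"
wanted = "x-cmusphinx"

tree = parse("onelexeme.pls")
for lexeme in tree.getroot():
  grapheme = lexeme.find(PLS + "grapheme").text
  for phoneme in lexeme.findall(PLS + "phoneme"):
    if phoneme.get("alphabet") == wanted:
      print grapheme + "\t" + phoneme.text    # agency   EY JH AH N S IY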

Neither CMU Sphinx nor HTK can use a PLS lexicon directly, but the simon (see Resources) front end to the HTK toolkit can. If you're using straight HTK or Sphinx, you need to be sure you can get easily from plain text to PLS and back again without loss of information.

The following sections show how to use Python to get from a flat file to PLS and back to a flat file. It is assumed that you have customized information in a flat lexicon file.


Conversion to PLS

The code in Listing 5 uses Python, but you can accomplish the same thing many other ways. (For an example, see the developerWorks tutorial on Extensible Stylesheet Language Transformations (XSLT) in Resources.) Some people want to use libraries that check the robustness of the XML at each minor step for more immediate feedback on where the problem is, especially if source files are large and liable to contain errors and inconsistencies. The example below leaves the checking to the last step, which implies a certain level of confidence that the flat files are in good shape.

Listing 5. Conversion to PLS
from elementtree.ElementTree import parse
import string as str
import sys
import cgi
#
# call with 
#	python flat2pls.py vox
# or 
#	python flat2pls.py spx
#
if len(sys.argv) == 2:
  src = sys.argv[1]
else:
  exit("wrong args")
#
outfile = "mylex"+src+".pls"
print "out is "+outfile
out = open(outfile,"w")
out.write('<?xml version="1.0" encoding="UTF-8"?>\n\
<lexicon version="1.0"\n \
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"\n\
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"\n \
      xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon\n \
        http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"\n\
      alphabet="ipa" xml:lang="en">')
# now the lexemes
if src == "vox":
  f = open("vf.lex","r")
  for line in f:
    line = str.strip(line)
    word = str.split(line,"\t")
    #gr = str.strip(word[1],"[]")
    gr = cgi.escape(word[0])
    out.write('\n\
  <lexeme>\n\
    <grapheme>'+gr+'</grapheme>\n\
    <phoneme alphabet="x-htk-voxforge">'+word[2]+'</phoneme>\n\
  </lexeme>')
else: # src is sphinx
  f = open("cmu.dic","r")
  for line in f:
    line = str.strip(line)
    word = str.split(line,"\t")
    gr = cgi.escape(word[0])
    out.write('\n\
  <lexeme>\n\
    <grapheme>'+gr+'</grapheme>\n\
    <phoneme alphabet="x-cmusphinx">'+word[1]+'</phoneme>\n\
  </lexeme>')
# ended lexemes
out.write('\n</lexicon>\n')
out.close()
# now check the output is ok
tree = parse(outfile)
lexicon = tree.getroot()
mylexcount = 0
for lexeme in lexicon:
  mylexcount += 1
print 'Found %(number)d lexemes' % {"number":mylexcount}

Listing 5 begins by importing modules from the XML parsing library elementtree (see Resources) and some supporting libraries. Importing ElementTree on different distributions can involve a slightly different syntax, depending on how you install the module. The example code is from openSUSE with the module installed from source, but Ubuntu might require from xml.etree.ElementTree import parse. The module str allows some string manipulations, sys gives access to the command-line arguments, and cgi provides an all-important escaping routine that is essential when handling data for XML. The code expects a command-line interface (CLI) argument telling it whether to translate from CMU Sphinx format or from HTK/VoxForge format. The example code then opens the file for output and writes the XML prologue suitable for PLS. Because you're not storing any Unicode characters at this stage, it's sufficient to open the file for plain ASCII access only.

At this point, the code in Listing 5:

  • Processes the source file line by line, splitting the fields into separate strings and writing the lexeme, grapheme, and phoneme components
  • Identifies the phoneme with attribute alphabet="x-htk-voxforge" if the incoming data is from the VoxForge lexicon and with alphabet="x-cmusphinx" if the data is from Sphinx
  • Retains the case of the phonemes

When the first field is imported, it may contain characters, such as the ampersand (&), that cause problems in the XML unless they are escaped with cgi.escape().
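For example, cgi.escape() converts the XML-significant characters into entity references:

import cgi

# cgi.escape() replaces the characters that would otherwise break the markup
print cgi.escape("AT&T")      # AT&amp;T
print cgi.escape("<unk>")     # &lt;unk&gt;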

Finally, the code:

  • Writes the closing tags
  • Closes the PLS file and then reloads it as an XML file
  • Reads through the file counting lexeme elements
  • Reports the count of lexemes

If the count is reported, then the XML appears to be robust and well-formed.

Listing 6 is an extract of the result of converting the VoxForge HTK lexicon.

Listing 6. Extract of result from VoxForge HTK lexicon
...
<lexeme>
  <grapheme>AGENDA</grapheme>
  <phoneme alphabet="x-htk-voxforge">ax jh eh n d ax</phoneme>
</lexeme>
<lexeme> 
  <grapheme>AGENT</grapheme>
  <phoneme alphabet="x-htk-voxforge">ey jh ih n t</phoneme>
</lexeme>
...

Conversion from PLS

It's important to know that you can easily get back from PLS format to a flat file. The code in Listing 7 assumes that you have your lexicon stored in a PLS-formatted file and that your speech recognition project can only use a flat file in either HTK or CMU Sphinx format.

Listing 7. Conversion from PLS
from elementtree.ElementTree import parse
import string as str
import sys
#
# call with 
#	python pls2flat.py x-htk-voxforge > mylexicon
# or 
#	python pls2flat.py x-cmusphinx > mylexicon.dic
#
if len(sys.argv) == 2:
  alpha = sys.argv[1]
else:
  exit("wrong args")
#
if alpha == "x-htk-voxforge":
  tree = parse("mylexvox.pls")
else:
  tree = parse("mylexspx.pls")
lexicon = tree.getroot()
for lexeme in lexicon:
  for child in lexeme:
    #print child.tag
    if child.tag[-8:] == 'grapheme':
      if alpha == 'x-htk-voxforge':
        gr = str.upper(child.text)
        # write label, bracketed grapheme, and a trailing tab; no newline yet
        sys.stdout.write(gr+"\t["+gr+"]\t")
      else:
        gr = child.text
        sys.stdout.write(gr+"\t")
    if child.tag[-7:] == 'phoneme':
      if child.get('alphabet') == alpha:
        print child.text

This short script uses the elementtree library to parse the PLS XML file. It establishes the root element, iterates over the child lexemes looking for the grapheme and phoneme, and writes the values to standard output (redirected into a file) in the relevant format. The script compares only the last eight characters of the grapheme tag because ElementTree returns each tag with its namespace prefix attached. It recreates the three-field format for HTK and the two-field format for CMU Sphinx.
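To see why the slice is needed, print one tag from the file generated earlier (mylexvox.pls). A minimal sketch:

# Why the script compares only the tail of the tag: ElementTree reports
# every tag in {namespace-uri}localname form.
from xml.etree.ElementTree import parse

tree = parse("mylexvox.pls")     # file produced by Listing 5
lexeme = tree.getroot()[0]       # first lexeme
print lexeme[0].tag
# {http://www.w3.org/2005/01/pronunciation-lexicon}grapheme
print lexeme[0].tag.endswith("grapheme")   # True; an equivalent, more readable test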


Merging and dealing with Unicode

The script in Listing 8 uses two PLS files to create one common PLS file that contains information for both of the original files. It also translates the VoxForge phoneme string to Unicode and stores the Unicode version in a separate phoneme element identified with the alphabet="ipa" attribute.

Listing 8. Merging and Unicode
#! /usr/bin/python -u
# -*- coding: utf-8 -*-
#
# challenge is to merge two pls files
# given two pls files, merge them into one
# 
import elementtree.ElementTree as ET
from elementtree.ElementTree import parse
import string as str
import codecs
import cgi
#
treevox = ET.parse("mylexvox.pls")
treespx = ET.parse("mylexspx.pls")
#
lexvox = treevox.getroot()
lexspx = treespx.getroot()
#
phons = { 'aa':u'ɑ','ae':u'æ','ah':u'ʌ','ao':u'ɔ','ar':u'ɛr','aw':u'aʊ',
'ax':u'ə','ay':u'aɪ','b':u'b','ch':u'tʃ','d':u'd','dh':u'ð','eh':u'ɛ',
'el':u'ɔl','en':u'ɑn','er':u'ər','ey':u'eɪ','f':u'f',
'g':u'ɡ','hh':u'h','ih':u'ɪ','ir':u'ɪr','iy':u'i','jh':u'dʒ','k':u'k','l':u'l',
'm':u'm','n':u'n','ng':u'ŋ','ow':u'oʊ','oy':u'ɔɪ','p':u'p','r':u'r','s':u's',
'sh':u'ʃ','t':u't','th':u'θ','uh':u'ʊ','ur':u'ʊr','uw':u'u','v':u'v',
'w':u'w','y':u'j','z':u'z','zh':u'ʒ','sil':'' }
#
def to_utf(s):
  myp = str.split(s)
  myipa = []
  for p in myp:
    myipa.append(phons[p])
  return str.join(myipa,'')
#
outfile = "my2lexmrg.pls"
out = codecs.open(outfile, encoding='utf-8', mode='w')
#
out.write('<?xml version="1.0" encoding="UTF-8"?>\n\
<lexicon version="1.0"\n \
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"\n\
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"\n \
      xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon\n \
        http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"\n\
      alphabet="ipa" xml:lang="en">')
#
# scan the two pls, create dictionary
voxdict = {}
for lexeme in lexvox:
  gr = str.lower(lexeme[0].text)
  ph = lexeme[1].text
  voxdict[gr] = {ph,}
#
for lexeme in lexspx:
  gr = lexeme[0].text
  ph = lexeme[1].text
  if gr in voxdict:
    voxdict[gr].add(ph)
  else:
    voxdict[gr] = {ph,}
#
for gr in sorted(voxdict.iterkeys()):
  out.write('\n\
  <lexeme>\n\
    <grapheme>'+cgi.escape(gr)+'</grapheme>')
  #print "%s: %s" % (key, voxdict[key])
  for ph in sorted(voxdict[gr]):
    alph = 'x-htk-voxforge' if ph.islower() else 'x-cmusphinx'
    out.write('\n\
    <phoneme alphabet="'+alph+'">'+ph+'</phoneme>')
    if ph.islower():
      phipa = to_utf(ph)
      out.write(u'\n\
    <phoneme alphabet="ipa">'+phipa+'</phoneme>')
  out.write('\n\
  </lexeme>')
# done, close files
out.write('\n</lexicon>')
out.close()
# now check the output is ok
tree = parse(outfile)
lexicon = tree.getroot()
mylexcount = 0
for lexeme in lexicon:
  mylexcount += 1
print 'Found %(number)d lexemes' % {"number":mylexcount}

The script begins with a hashbang (#!) line followed, on the second line, by a coding declaration that tells the Python interpreter the source code contains Unicode characters. The script then imports a number of modules that are useful when dealing with XML and Unicode, including elementtree, codecs, and cgi. You tell the interpreter where the two PLS files are and point to their root elements.

The phons variable stores a special dictionary that contains a mapping from the CMU Arpabet codes to an equivalent Unicode combination. This dictionary translates existing phoneme strings to a Unicode version. Feel free to modify the mapping for your own purposes—for example, you might feel that the equivalent of 'aa' in Unicode is u'ɑ:', which lengthens the a sound.

A single defined function, to_utf(), translates an ASCII Arpabet string to Unicode. The final part of the setup is to open a file for storage of the output, making sure that the file is prepared to accept Unicode (codecs.open with encoding='utf-8'), and to write the PLS prologue into it.
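A quick check of to_utf(), run inside the Listing 8 script (or in an interactive session with phons and to_utf() pasted in), shows output that matches the ipa entries in Listing 9:

# Assumes a UTF-8 terminal; run after phons and to_utf() are defined.
print to_utf('ax jh eh n d ax')   # ədʒɛndə  (agenda)
print to_utf('ey jh ih n t')      # eɪdʒɪnt  (agent)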

Now you're ready to process the files by creating two special internal Python dictionaries, one from each of the PLS files, by scanning them with the elementtree library. It is assumed that the grapheme is the first child and the phoneme the second child of each lexeme. The script adds all of the records from the first file to a new merge dictionary. When scanning the second file, if a key already exists in the merge dictionary, the script adds the phoneme to that key's set; if not, it creates a new key in the merged dictionary. At the end of the loop, the merged dictionary contains keys from both of the original files, each with an associated set of one or two phoneme strings.

Next, write the new PLS file from the merge you just created. As you scan through the dictionary, you add the alphabet attribute to distinguish one phoneme from another. Having written out the existing phonemes, you create a new phoneme string that is the Unicode equivalent of the Arpabet string. The example code derives it from the lowercase HTK/VoxForge version, but you could derive it from the Sphinx version (or both) according to your needs.

Finally, close out the root element, close the file, and parse it again as you did before to check that it is well formed.

The result should look something like Listing 9.

Listing 9. Merge results
...
  <lexeme>
    <grapheme>agenda</grapheme>
    <phoneme alphabet="x-cmusphinx">AH JH EH N D AH</phoneme>
    <phoneme alphabet="x-htk-voxforge">ax jh eh n d ax</phoneme>
    <phoneme alphabet="ipa">ədʒɛndə</phoneme>
  </lexeme>
  <lexeme> 
    <grapheme>agendas</grapheme>
    <phoneme alphabet="x-cmusphinx">AH JH EH N D AH Z</phoneme>
  </lexeme>
  <lexeme> 
    <grapheme>agent</grapheme>
    <phoneme alphabet="x-cmusphinx">EY JH AH N T</phoneme>
    <phoneme alphabet="x-htk-voxforge">ey jh ih n t</phoneme> 
    <phoneme alphabet="ipa">eɪdʒɪnt</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>agent's</grapheme>
    <phoneme alphabet="x-cmusphinx">EY JH AH N T S</phoneme>
  </lexeme>
...

With the merged PLS dictionary, you can apply Extensible Stylesheet Language (XSL) transformations or any other procedure to generate the results you need, whether a flat file or a new, more specific PLS file. In theory, you can store other phoneme strings in this file as well, even from other languages. However, that is a nonstandard use of the PLS specification.
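For example, a few lines of Python (not one of the article's scripts) pull just the IPA renderings out of the merged file, which you could then feed to any tool that accepts Unicode phonemes.

# -*- coding: utf-8 -*-
# Minimal sketch: list grapheme/IPA pairs from the merged file my2lexmrg.pls
# produced by Listing 8.  Not one of the article's scripts.
import sys
import codecs
from xml.etree.ElementTree import parse

PLS = "{http://www.w3.org/2005/01/pronunciation-lexicon}"
out = codecs.getwriter("utf-8")(sys.stdout)   # make sure Unicode reaches the console

tree = parse("my2lexmrg.pls")
for lexeme in tree.getroot():
  gr = lexeme.find(PLS + "grapheme").text
  for ph in lexeme.findall(PLS + "phoneme"):
    if ph.get("alphabet") == "ipa":
      out.write(gr + "\t" + ph.text + "\n")   # for example: agenda  ədʒɛndə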

Kai Schott has done a lot of work with PLS and has several already prepared files for download in different languages, particularly German (see Resources).


Unresolved issues

Though you can get a lot of information from the flat files, the following issues remain unresolved.

Selecting from multiple graphemes
On occasion, you need to deal with multiple spellings for a word in the same language. The role and the phoneme are identical for both orthographies, so you have multiple graphemes in the same lexeme. However, there is nothing in the PLS that allows you to add an attribute to the grapheme element as there is with the alphabet attribute of the phoneme element.
Acronyms
Lexicons frequently contain acronyms. The PLS deals with these in a child element of lexeme called <alias>. To build a PLS automatically from a flat file, you need a way to distinguish the acronyms from the real words in the lexicon. The flat files don't necessarily have that information.
Role/part of speech
As with acronyms, part-of-speech information is not available from the flat files to build role attributes into the PLS.

Conclusion

In this article, you learned to move quite easily between flat and PLS formats. Storing lexicons in PLS format has potential advantages. Open source projects may or may not move to XML for resource files. That is a decision for project managers to make—they alone can assess resources and apply them in the best way generally accepted in their communities.

In the meantime, you can employ XML standards in your own work without loss of functions. For lexicons and dictionaries used in voice recognition work, you can increase cross-project adaptability, robustness, and utility by storing customized lexicons in the PLS format. You can easily extract data as needed into the required flat file.

Resources

