Wielding Tools for Voice Application Development

Exploring the WebSphere Voice Toolkit

This article describes the attributes of a conversational application and the tools available to facilitate VoiceXML application development.


David Reich (dreich@us.ibm.com), Senior Software Engineer, IBM Voice Systems

David E. Reich is a Senior Software Engineer in IBM Voice Systems working on the VoiceXML project. He is leading the effort for IBM's voice application developer tools, focusing on VoiceXML and speech recognition development tools and on making speech application development easier for all developers. You can contact him at dreich@us.ibm.com.

04 February 2002


What is a voice application?

Originally, voice applications meant dealing with your bank or credit card company over the telephone by responding to commands, such as "Please press or say one." These Interactive Voice Response (IVR) systems evolved over time from recognizing one word or discrete digit to allowing a few basic commands, such as "please say, 'Operator'" or "call Mom." You might also think of a voice application as a product such as IBM ViaVoice, which is a general-purpose computer dictation system.

In describing how to develop a voice application, this article goes beyond these types of voice applications. The fundamental idea behind a voice application is conversation -- one in which the user converses with a system, either in a structured dialog or menu (directed dialog application) or a more natural, freeform conversation (natural language understanding (NLU)). In any event, voice applications have gone beyond one-word systems and provide a more natural system interface than pressing buttons on a telephone.

VoiceXML, a new presentation markup standard, is to Web voice applications what HTML is to visual applications. A voice application for the Web is a Web application that generates VoiceXML and provides a voice interface to data that resides behind the Web infrastructure: the Hypertext Transfer Protocol (HTTP) server, Web application server, Common Gateway Interface (CGI) programs, Practical Extraction and Report Language (Perl) scripts, servlets, JavaServer Pages™ (JSPs™), and so on.

With visual applications, you use presentation markup, graphics, image maps, and more, and software packages assist you in developing these components. In voice applications, you use VoiceXML for the presentation markup, and grammars and vocabularies are used to specify the words and phrases the users can say.

Voice application basics

A VoiceXML application consists of several parts. VoiceXML defines the conversation flow, and just as an HTML application has graphics and image maps, a VoiceXML application has supporting constructs of its own, such as grammars and pronunciations.

Grammars and pronunciations are critical to the functioning of a Web voice application. When a visual Web application requires some data from a user, such as a name or street address, the user just types it into an entry field. In a voice application, the voice browser must know what to listen for. Just as two people need a common frame of reference to carry on a conversation, so does a computer speech recognizer. Whether the user is interacting with a menu or a field in a form, the voice recognizer needs a frame of reference: a set of valid words and phrases (utterances) for these values. We call these sets grammars.

These grammars are active in the recognition engine and define the set of words the engine can return to the application (the voice browser) when the audio is decoded and analyzed. The recognizer matches the user utterance to an entry in an active grammar, and the voice browser responds based on how you design the VoiceXML program. Unlike a person, a computer cannot draw on experience to infer words and phrases that it does not explicitly recognize. A common misconception is that you can say anything and have it turned into a text string. In reality, the recognition system must have a set of valid utterances from which to choose a match for the user utterance.
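To make the idea of active grammars concrete, here is a minimal sketch in Python. It is not how a recognition engine works internally (a real engine decodes audio); it only illustrates that a result can come back solely from the set of valid utterances. The function name `recognize`, the grammar names, and the "navigation" entries are assumptions made for this illustration.

```python
def recognize(utterance, active_grammars):
    """Return (grammar_name, phrase) for the first active grammar that
    contains the utterance, or None when nothing matches."""
    # Normalize case and whitespace so "  Main   Menu " equals "main menu".
    heard = " ".join(utterance.lower().split())
    for name, phrases in active_grammars.items():
        if heard in phrases:
            return name, heard
    return None


# Hypothetical set of grammars that are active at one point in a dialog.
active = {
    "drinks": {"coffee", "tea", "milk", "soda", "nothing"},
    "navigation": {"operator", "main menu"},
}

print(recognize("Tea", active))                # matched in "drinks"
print(recognize("  main   menu ", active))     # matched in "navigation"
print(recognize("a double espresso", active))  # no match: not free dictation
```

The last call returns nothing at all, rather than a best-effort text string, which is the point of the paragraph above.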

Pronunciations play a key role in a high-quality voice application. IBM ViaVoice has a vocabulary of more than 100,000 words. With the variety of data (and many made-up words) on the Web and the random nature of topics and domains, there are inevitably going to be words that need to be recognized or pronounced by the text-to-speech engine that do not sound as you might expect. As part of developing a voice application, you'll need to provide new words and their pronunciations.

Figure 1. Overall structure of a voice application

Figure 1 above shows the overall structure of a voice application. On the client side of a voice Web interaction, the user interacts with a speech browser that talks to the Web on the user's behalf. The browser gives the speech recognition engine one or more grammars containing the valid user utterances. When the user speaks, the recognition engine uses these grammars to identify the spoken words and returns them to the browser. Based on the VoiceXML, the browser takes an action, such as making a URL request. On the server side, similar processes occur for both visual and voice applications. In a voice application, however, the browser receives and interprets more VoiceXML in response and renders it by speaking (through text-to-speech (TTS)), by playing prerecorded audio, or by activating more grammars and waiting for the user to say something else. This process is similar to an HTML application in its flow and structure; the significant difference lies in how the browser renders information and accepts user input.

What you need to develop a voice application

To create a VoiceXML application, you need to develop the following relevant parts:

  • VoiceXML and scripting (ECMAScript)
  • Grammars
  • Pronunciations

For quite a long time, voice applications have been a mystery: a black art that required some combination of PhDs to do anything useful beyond "press or say one." Making voice Web applications as easy and as pervasive as HTML applications is a big challenge because the voice interface to computer systems is not nearly as intuitive as the visual screen, keyboard, and mouse modality.

While part of the WebSphere® product line, the IBM WebSphere Voice Toolkit, hereafter called Voice Toolkit, is a standalone toolkit dedicated to developing all of the components of voice applications for IBM voice middleware. The Voice Toolkit interface in Figure 2 below looks similar to other Web toolkits. This design makes developing voice applications as familiar as creating visual ones. In addition to supporting the new Web-centric VoiceXML programming model, the Voice Toolkit also supports the elements necessary for traditional voice applications using other IBM voice middleware products, such as IBM DirectTalk.

Figure 2. IBM WebSphere Voice Toolkit interface

Whether or not you already have a visual Web site or have done some voice application work, what you ultimately need is a project-oriented toolkit to help you develop the unfamiliar components, such as grammars and lexicons, along with the more traditional Web logic, such as servlets, beans, and JSPs.

Figure 3. The steps in writing a voice Web program

Figure 3 above shows the steps in writing a voice Web program. The following example illustrates these steps. Suppose you want to write a voice-enabled drink selector program. The first thing to do is give the user a way to say the desired beverage. Once the beverage name has been gathered, the program submits it as a Web address (such as http://www.myserver.com/servlet/getdrink?drink='soda') to the server, which ultimately responds with markup telling the voice browser to display (in this case, speak) the transaction result and to present other navigation options. This article includes a listing of the parts of such a voice application, but first you will want to know the design flow of this selector and how the different components interact.

In any program that interacts with a user, you should first diagram or storyboard the flow or dialog with the user. This step outlines what the user inputs are, what the resulting outputs are, and how the application flows from start to finish. In this case, you want to provide the user with a way to say the desired beverage. Whether the input is visual (typing) or voice (saying) is irrelevant: the user specifies a desired beverage and the system outputs something. This flow becomes the VoiceXML program. The VoiceXML details how the browser should behave in accepting user input, calling server-side logic, speaking output, and performing transitions to other VoiceXML "pages" or documents.
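One way to capture such a storyboard before writing any VoiceXML is as a small table of dialog states. The Python sketch below is only an illustration of that planning step; the state names, the confirmation and goodbye prompts, and the `walk` helper are all hypothetical and are not part of the Voice Toolkit.

```python
# Each state has a prompt (the output) and the user inputs it accepts,
# mapped to the state that follows. All names here are made up.
STORYBOARD = {
    "ask_drink": {
        "prompt": "What would you like to drink?",
        "transitions": {
            "coffee": "confirm", "tea": "confirm", "milk": "confirm",
            "soda": "confirm", "nothing": "goodbye",
        },
    },
    "confirm": {"prompt": "Your order has been placed.", "transitions": {}},
    "goodbye": {"prompt": "Goodbye.", "transitions": {}},
}


def walk(storyboard, start, user_inputs):
    """Follow the storyboard and return (final_state, prompts_spoken)."""
    state = start
    spoken = [storyboard[state]["prompt"]]
    for said in user_inputs:
        next_state = storyboard[state]["transitions"].get(said)
        if next_state is None:
            break  # input not planned for in this state; stop the walk
        state = next_state
        spoken.append(storyboard[state]["prompt"])
    return state, spoken


print(walk(STORYBOARD, "ask_drink", ["tea"]))
print(walk(STORYBOARD, "ask_drink", ["nothing"]))
```

Walking the table with a few sample inputs is a cheap way to spot missing transitions before they turn into dead ends in the spoken dialog.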

Put simply, you write VoiceXML that says "What would you like to drink?", accepts some input, and, based on that input, formulates a Web address request to a server (such as http://www.myserver.com/servlet/getdrink?drink='soda'). The server ultimately responds with markup telling the voice browser to display (in this case, speak) the transaction result and to offer other navigation options. Listing 1 below shows the first step of the development process.
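The Web address request itself is ordinary HTTP, so it can be sketched outside VoiceXML. The Python fragment below shows roughly how a recognized field value becomes a query string on a GET request. The helper name `build_request_url` is an assumption, and the quotation marks that appear around soda in the example address are left out here as a simplification.

```python
from urllib.parse import urlencode


def build_request_url(base_url, field_values):
    """Append the collected field values to base_url as a query string."""
    query = urlencode(field_values)  # percent-encodes names and values
    return f"{base_url}?{query}" if query else base_url


url = build_request_url(
    "http://www.myserver.com/servlet/getdrink", {"drink": "soda"}
)
print(url)
```

Because the values are encoded, a multiword answer such as "orange juice" would still produce a valid address.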

As you work with the VoiceXML code, the editor assists you with highlighting and blocking to enhance readability, and it checks the syntax when you save the file or on request (by selecting the appropriate menu item). The Content Assist feature can also show you which tags are valid and where. The most powerful feature of the Voice Toolkit is the set of links between its subsystems. Recall that the voice recognition system must know the words (specifically, their pronunciations) for them to be recognized or spoken.

More precisely, the TTS engine can make a reasonable guess at pronouncing almost anything you can spell, including words that are not in the built-in vocabularies. The recognizer, however, must have a correct pronunciation for every word it is expected to recognize. Supplying these pronunciations is a function of the Lexicon subsystem, which is described shortly. The editors (grammar and VoiceXML) flag unknown words and provide a link to the Lexicon subsystem to help you build each pronunciation and store it with the application.

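Listing 1. VoiceXML instructions
<vxml version="1.0">
  <form>
    <field name="drink">
      <prompt>Hello. What would you like to drink?</prompt>
      <grammar src="drinks.gram"/>
    </field>
    <!-- Runs after the field is filled: send the drink value to the server -->
    <block><submit next="http://www.myserver.com/servlet/getdrink" namelist="drink" method="get"/></block>
  </form>
</vxml>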

Grammar editing

After writing your VoiceXML, how do you specify what the user can say? You do this through the use of grammars. This is step 2 of your voice application development. Remember, to be recognized, words must be in the vocabulary so that the recognition engine can turn the spoken words into text, and the set of valid utterances also must be defined and scoped within a grammatical context. Every time you require input from the user, you must identify a grammar specifying the valid utterances for that interaction.

Continuing with the drink selector example, the voice browser has asked the user what the user would like to drink. With the field tag in the form, the voice browser waits for the recognizer to tell it what the user said. If the user says something that is not in the grammar, an out-of-grammar (OOG) exception is thrown and handed to the VoiceXML program to respond (or the browser can handle it, in which case it simply reprompts the user). If the user utters a valid grammar entry, the word or phrase is returned to the VoiceXML program, which continues (in this case, by making the URL request).
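The out-of-grammar path can be pictured as a small loop: prompt, listen, and reprompt when nothing matches. The following Python sketch simulates that behavior with typed strings standing in for speech. The `OutOfGrammar` class, the `collect_field` helper, the reprompt wording, and the limit of three attempts are assumptions of this sketch, not behavior documented for any particular voice browser.

```python
class OutOfGrammar(Exception):
    """Signals that an utterance matched no entry in the active grammar."""


def match(utterance, grammar):
    """Return the normalized utterance, or raise OutOfGrammar."""
    heard = " ".join(utterance.lower().split())
    if heard not in grammar:
        raise OutOfGrammar(utterance)
    return heard


def collect_field(prompt, grammar, utterances, max_attempts=3):
    """Prompt until an utterance matches; return (value, prompts_spoken)."""
    spoken = []
    for attempt, utterance in zip(range(max_attempts), utterances):
        # The first attempt uses the plain prompt; later ones reprompt.
        spoken.append(
            prompt if attempt == 0 else "Sorry, I did not understand. " + prompt
        )
        try:
            return match(utterance, grammar), spoken
        except OutOfGrammar:
            continue  # fall through to the next attempt
    return None, spoken  # gave up without a valid answer


DRINKS = {"coffee", "tea", "milk", "soda", "nothing"}
value, spoken = collect_field(
    "What would you like to drink?", DRINKS, ["banango juice", "Soda"]
)
print(value)
print(spoken)
```

In this run the first answer is out of grammar, so the user is reprompted once before "soda" is accepted.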

Listing 2. Grammar for the drink selector (drinks.gram)
#JSGF V1.0;
grammar drinks;
public <drink> = coffee | tea | milk | soda | nothing;

The IBM WebSphere VoiceXML programming environment currently supports the Java™ Speech Grammar Format (JSGF). The IBM DirectTalk and WebSphere Voice Server Speech Technologies packages (also part of the WebSphere Voice Server) use a different grammar format, called Speech Recognition Control Language (SRCL) or modified BNF. The syntax details of these formats are beyond the scope of this article, but you can find them in the documentation for the Voice Server or the Voice Server SDK/Voice Toolkit. The SDK is the runtime environment for executing the VoiceXML that you develop, while the toolkit helps you develop it.

For this simple example, the grammar is a basic pick list of drinks; that is, the user can say one of "coffee," "tea," "milk," "soda," or "nothing." Note that these are all words in a standard vocabulary. If you want to allow the user to say a word that is not in a standard vocabulary, such as a brand name or a made-up drink name, you need to provide a pronunciation for it.
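To see how little structure a pick list needs, the Python sketch below pulls the alternatives out of a grammar like the one in this example. It is deliberately not a JSGF parser: it handles only a single public rule made of flat alternatives separated by vertical bars, and the function name `parse_pick_list` is hypothetical.

```python
import re

# Matches one rule of the form: public <name> = a | b | c;
RULE = re.compile(r"public\s+<(\w+)>\s*=\s*([^;]+);")


def parse_pick_list(jsgf_text):
    """Return (rule_name, [phrases]) for a flat pick-list rule."""
    found = RULE.search(jsgf_text)
    if found is None:
        raise ValueError("no public rule with flat alternatives found")
    name, body = found.groups()
    # Split on the alternative separator and tidy up whitespace.
    phrases = [" ".join(alt.split()) for alt in body.split("|")]
    return name, [phrase for phrase in phrases if phrase]


GRAMMAR_TEXT = """#JSGF V1.0;
grammar drinks;
public <drink> = coffee | tea | milk | soda | nothing;
"""

print(parse_pick_list(GRAMMAR_TEXT))
```

Real grammars can nest rules, mark optional words, and attach weights, so anything beyond a flat list calls for the grammar tooling itself rather than a regular expression.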

If your grammar contains any unknown words (words that are not in a standard vocabulary, such as "banango" in "banango juice"), the grammar editor tells you that a pronunciation is needed. You can then invoke the Lexicon subsystem from the grammar editor to create and store pronunciations for these words.

Using the grammar editor of the Voice Toolkit, you can easily develop your grammars and find out whether pronunciations for any words (such as user IDs, passwords, or other nonstandard words such as "banango") need to be defined for the recognizer. Once you have taken your lists of valid responses for user input, created the grammars, and placed the grammar references in your VoiceXML, you are almost done. Creating the pronunciations for any of these unknown words is the final step.
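The check that the grammar editor performs can be approximated in a few lines. This Python sketch compares every word in a list of grammar phrases against a vocabulary and reports the ones that would need a pronunciation. The tiny vocabulary and the function name `words_needing_pronunciations` are stand-ins invented for the example; the toolkit checks against its own vocabularies.

```python
def words_needing_pronunciations(phrases, vocabulary):
    """Return the words used in phrases that the vocabulary lacks,
    in first-seen order and without duplicates."""
    unknown = []
    for phrase in phrases:
        for word in phrase.lower().split():
            if word not in vocabulary and word not in unknown:
                unknown.append(word)
    return unknown


# A stand-in for a standard vocabulary (a real one is far larger).
vocabulary = {"coffee", "tea", "milk", "soda", "nothing", "juice"}
phrases = ["coffee", "tea", "banango juice", "Banango smoothie"]

print(words_needing_pronunciations(phrases, vocabulary))
```

Note that the check works word by word: "juice" is known, so only "banango" is flagged from the phrase "banango juice".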

Lexicon subsystem/pronunciation builder

Perhaps the least understood part of the voice application is the lexicon. A lexicon is "a dictionary or vocabulary." In the voice systems world, the term lexicon means the set of words for which a recognition engine has access to pronunciations. A pronunciation is a sequence of individual sounds (called phonemes) that represents how a given spelling is spoken.

A speech recognition system must expend significant processing to turn your speech into text. The details of this processing are beyond the scope of this article, and there are many references on this subject. In an oversimplified description, the speech recognition system processes and cleans up the audio, then mathematically matches it against language and acoustic models to turn the audio stream into a sequence of phonemes. The system then attempts to match this phoneme stream with an entry in an active grammar.

A computer must have the phoneme string, or pronunciation, for each word that it is expected to recognize. While many speech recognition products have large vocabularies, the variety of words on the Web, as well as many pseudo-words (such as acronyms and other made-up words like passwords and user IDs), prevents products from being able to cover all of the words in product vocabularies. Combine this seemingly boundless set with task-specific vocabularies (such as computer terms, medical or legal terms, and so on), and it is critical to enable voice application developers to create (and correct) pronunciations for these new applications.
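A toy model shows why a missing pronunciation makes a word unrecognizable. In the Python sketch below, the lexicon maps each word to one or more phoneme sequences, and a phoneme stream matches a grammar phrase only if every word in the phrase has a pronunciation. The phoneme symbols are ARPAbet-style labels chosen for illustration, not the notation used by the Voice Toolkit, and the pronunciation given for "banango" is invented.

```python
def decode(phonemes, grammar, lexicon):
    """Return the grammar phrase whose pronunciation equals the phoneme
    stream, or None when no phrase can be matched."""
    for phrase in grammar:
        candidates = [[]]  # every way of pronouncing the phrase so far
        for word in phrase.split():
            # A word with no lexicon entry leaves no candidates at all.
            candidates = [c + p for c in candidates for p in lexicon.get(word, [])]
        if list(phonemes) in candidates:
            return phrase
    return None


LEXICON = {
    "coffee": [["K", "AO", "F", "IY"], ["K", "AA", "F", "IY"]],
    "tea": [["T", "IY"]],
    "milk": [["M", "IH", "L", "K"]],
    "soda": [["S", "OW", "D", "AH"]],
    "juice": [["JH", "UW", "S"]],
}
GRAMMAR = ["coffee", "tea", "milk", "soda", "banango juice"]
heard = ["B", "AH", "N", "AE", "NG", "G", "OW", "JH", "UW", "S"]

print(decode(heard, GRAMMAR, LEXICON))  # "banango" has no pronunciation yet

# Adding a pronunciation makes the same audio recognizable.
LEXICON["banango"] = [["B", "AH", "N", "AE", "NG", "G", "OW"]]
print(decode(heard, GRAMMAR, LEXICON))
```

A real recognizer scores approximate matches rather than requiring exact equality, but the dependency is the same: no pronunciation, no recognition.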

The Voice Toolkit has a first-of-its-kind pronunciation composer that insulates you from the subtleties of linguistics and word construction while enabling you to create pronunciations for words in the format that recognition engines require. Figure 4 below shows the pronunciation builder dialog.

Figure 4. Pronunciation builder dialog

If you start the pronunciation builder from the VoiceXML or grammar editor, it is seeded with the unknown word and a pronunciation for it. The TTS engine provides this pronunciation, which gives you a good starting point: the TTS engine's best guess at how the word should sound. Click Play to hear it; if it sounds right, you can save it. If you want to change how it sounds, click the Composer button to see the dialog in Figure 5 below.

Figure 5. Composer dialog showing the US English phonemes

The composer dialog has a button for each of the phonemes of a language. If you are developing a voice application for French, for example, the French phonemes display. Figure 5 above shows the US English phonemes. For each one, you see the character representation of the phoneme along with an example word that shows what the phoneme sounds like. To hear a phoneme, right-click its button.

As you use the composer to create or tune your pronunciation, you can click on each phoneme to add it at the composer's cursor position; you can click on Play to hear the resulting word. Note that while you may be inclined to type into that field, many of those characters do not easily map to keys on your computer keyboard. So, while you can use the keyboard to delete characters, use the graphical buttons to add them. Once you have your pronunciation and click OK, it is stored and available for your application.

At this point, you have your application. The VoiceXML defines the application flow or conversations with the user, the grammar entries specify the valid utterances to cause the application to take action, and the pronunciations (system supplied as well as composed by you) make up the lexicon for the application. Glue this to the existing backend Web architecture for your server-side logic, and you are off into your journey of the voice Web.

Finally, you can take these Web address requests from VoiceXML and point them to the same servlets, beans, JSPs, and other server-side constructs that you use for HTML. By reusing exactly the same Web logic and writing an alternative or complementary presentation in VoiceXML, you leverage all of the work you put into your visual Web sites and add voice interaction capability to them. This is the goal of IBM: to make it easy for you to add these advanced capabilities with little or no rework.
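The reuse described here comes down to keeping the business logic separate from the markup that presents its result. The Python sketch below illustrates the idea with one shared function and two renderers. In practice this logic would live in your servlets, beans, or JSPs; the function names, the confirmation message, and the `wants_voice` flag are all hypothetical.

```python
from xml.sax.saxutils import escape


def order_drink(drink):
    """Shared business logic (hypothetical): produce a result message."""
    return f"Your {drink} is on its way."


def render_html(message):
    """Visual presentation of the result."""
    return f"<html><body><p>{escape(message)}</p></body></html>"


def render_vxml(message):
    """Voice presentation of the same result."""
    return (
        '<vxml version="1.0"><form><block>'
        f"<prompt>{escape(message)}</prompt>"
        "</block></form></vxml>"
    )


def handle_request(drink, wants_voice):
    """Run the shared logic once, then pick the presentation."""
    message = order_drink(drink)
    return render_vxml(message) if wants_voice else render_html(message)


print(handle_request("soda", wants_voice=True))
print(handle_request("soda", wants_voice=False))
```

Only the two small renderers differ between the visual and voice versions; `order_drink` is untouched, which is the rework that the approach avoids.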


Creating a voice Web application is not much different (structurally) from creating an HTML application. You have the core markup (the VoiceXML) and the enabling constructs, such as grammars and pronunciations. You have seen how they need to be interconnected and coordinated to create a cohesive application. The Voice Toolkit is an integrated development environment (IDE) that provides more than just simple editors that understand VoiceXML markup. From the VoiceXML, you can launch the grammar builder; for unknown words, you can get, create, and tune pronunciations from either the grammar or VoiceXML editors. You can add these elements to server-side logic, and to pull it all together, you can test and finally deploy your application to the server, all within this one IDE.

As this technology, product, and toolset evolve, you will be able to do more to bring this black art of voice application programming to the mainstream. Speech enabling Web sites is not as trivial as throwing away the graphics and reading the text. It does not, however, mandate a whole application rewrite. In fact, using existing infrastructure and business logic, in many cases unchanged, you can enhance the user experience and open up your sites to a whole new audience. Using the Voice Toolkit, Voice Server SDK, and VoiceXML, you can design robust speech interfaces using the same infrastructure and logic, just by adding a new set of presentation markup. No longer do your users need to be tethered to a computer to access your Web sites, to conduct transactions, or to enter or retrieve data. Through VoiceXML, you can give them access anywhere, any time, and ultimately on any device.


