VoiceXML 2.1

The upgrades are few, but significant

Jeff Kusnitz (jk@us.ibm.com), Senior Software Engineer, IBM
Jeff Kusnitz has been with IBM for nearly 20 years.  He spent 16 of those years working with speech and telephony technologies, including several years as IBM's representative to several industry standards organizations.  He now spends his time on Web search technologies, focusing primarily on indexing open source information.

Summary:  VoiceXML 2.1 was never intended to be a replacement for VoiceXML 2.0. The intent was that it include a small number of features that were not included in VoiceXML 2.0, but were deemed significant enough to warrant documenting and standardizing. VoiceXML 2.1 has met these goals -- it contains just eight features, of which only two are completely new; the other six are enhancements to existing VoiceXML elements. In this article, learn about these features, as well as a few other significant areas of work in the voice application/standards arena.

Date:  15 Aug 2006
Level:  Introductory

VoiceXML 2.1: the <data> tag

By far, the most interesting change in VoiceXML 2.1 is the introduction of the <data> tag. The <data> tag is the VoiceXML equivalent of Asynchronous JavaScript + XML (Ajax) in HTML. With the <data> tag, a VoiceXML application no longer has to transition from dialog to dialog when it needs to send information to or receive information from a Web server. Instead, the <data> tag performs the equivalent of an HTTP GET (or POST) request in situ and fetches a block of XML data. When the request completes, the VoiceXML application can parse the results using ECMAScript and DOM.

Take a look at an application for the fictitious business "Joe's Tuna Shack." Joe's would like to put together an application that lets callers find the closest shack and provide directions. The dialog might go something like this:

    Browser: Welcome to Joe's Tuna Shack store locator; if you tell me your zip code, I 
    can help you find the closest shack.
    Caller: 94086
    B: 94086, got it. One moment while I look that up.
    B: Berkeley, California. There are 3 shacks in Berkeley. 
    Say "that one" or the name of the street the shack is on for directions:
    B: 500 Telegraph Avenue;
    B: 486 Dwight Way;
    B: 1719 Shattuck Avenue;
    C: That one
    B: Got it. Let me get you directions to the shack at 1719 Shattuck Avenue
    ...

If this were to be implemented in VoiceXML 2.0, you could imagine at least two separate dialogs and corresponding transitions -- an initial dialog that greets the caller, collects the zip code, and sends it to the server, and another dialog that figures out which address the caller is interested in.

Thanks to the <data> tag, this can all be done in a single dialog. There is still, of course, a round trip to the Web server to send the zip code and retrieve the restaurant locations, but the dialog never changes, so the application code is simpler. The code to get the zip code and then get the corresponding addresses from the server might look like the code in Listing 1.


Listing 1. Get zip code and corresponding addresses
<form>
  <var name="stores" expr="''"/>
  <block>
    <prompt>
      Welcome to Joe's Tuna Shack Store Locator
    </prompt>
  </block>
  <field name="zipcode" type="digits">
    <prompt>
      What's your zip code?
    </prompt>
    <filled>
      <!-- use the zip code to fetch the closest stores.. -->
      <data name="locations"
            src="http://shacklocs.com/cgi-bin/getlocs.pl"
            namelist="zipcode"/>

      <!-- parse the results to get the addresses and store numbers;  -->
      <!-- I cheat a little here; I know there are three location     -->
      <!-- elements, each with an address child and a storenum child. -->
      <script>
        <![CDATA[
          stores = new Array();

          var root = locations.documentElement;

          var addrs = root.getElementsByTagName( "address" );
          var nums = root.getElementsByTagName( "storenum" );

          for( var i = 0; i < addrs.length; i++ ){
            var addrNode = addrs.item(i);
            var numNode = nums.item(i);

            stores[i] = new Object();
            stores[i].addr = addrNode.firstChild.data;
            stores[i].num = numNode.firstChild.data;
            stores[i].mark = i;
          }
        ]]>
      </script>
    </filled>
  </field>
  ...
</form>

Additionally, the XML file retrieved by the <data> tag would look like Listing 2.


Listing 2. Retrieved XML file

<?xml version="1.0" encoding="UTF-8"?>
<?access-control allow="*"?>
<locations>
  <location>
     <address> 500 Telegraph Avenue </address>
     <storenum> 880 </storenum>
  </location>
  <location>
     <address> 486 Dwight Way </address>
     <storenum> 237 </storenum>
  </location>
  <location>
     <address> 1719 Shattuck Avenue </address>
     <storenum> 101 </storenum>
  </location>
</locations>

Note that I included the optional access-control processing instruction (<?access-control?>) in my XML document. With allow="*" it admits any host, so it's obviously not necessary in this example, but thinking about security is always a good thing.

The zipcode field's filled handler uses the <data> element to fetch an XML document containing the addresses of the nearest stores, then parses the document and stores the results in an ECMAScript array of objects. But what do you do with this information? Ideally, I would make a prompt that iterated over them, and maybe even a grammar so that someone could select them by voice.
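As a rough sketch of that parsing step, the function below walks each location element and reads its own address and storenum children, so it does not depend on knowing the number of locations or on two parallel node lists staying in step. The helper names (makeText, makeElement, trim, childText, parseLocations) and the tiny stand-in DOM are assumptions added here so the snippet can run outside a VoiceXML browser; inside a browser you would pass locations.documentElement instead of the hand-built tree.

```javascript
// Minimal stand-in for the read-only DOM that <data> exposes.
// Hypothetical helpers, present only so this sketch runs outside a VoiceXML browser.
function makeText(data) {
  return { data: data, childNodes: [] };
}

function makeElement(tagName, children) {
  var el = {
    tagName: tagName,
    childNodes: children || [],
    getElementsByTagName: function (name) {
      var found = [];
      (function walk(node) {
        for (var i = 0; i < node.childNodes.length; i++) {
          var child = node.childNodes[i];
          if (child.tagName === name) { found.push(child); }
          if (child.childNodes.length > 0) { walk(child); }
        }
      })(el);
      return { length: found.length, item: function (i) { return found[i]; } };
    }
  };
  el.firstChild = el.childNodes.length > 0 ? el.childNodes[0] : null;
  return el;
}

// Strip the padding whitespace that surrounds the text in Listing 2.
function trim(s) {
  return s.replace(/^\s+|\s+$/g, "");
}

// Text of the first <name> descendant of parent, or "" if it is missing or empty.
function childText(parent, name) {
  var nodes = parent.getElementsByTagName(name);
  if (nodes.length === 0 || nodes.item(0).firstChild === null) { return ""; }
  return trim(nodes.item(0).firstChild.data);
}

// Build the same array of { addr, num, mark } objects that Listing 1 builds.
function parseLocations(root) {
  var result = [];
  var locs = root.getElementsByTagName("location");
  for (var i = 0; i < locs.length; i++) {
    var loc = locs.item(i);
    result.push({
      addr: childText(loc, "address"),
      num: childText(loc, "storenum"),
      mark: i
    });
  }
  return result;
}

// Hand-built equivalent of the document in Listing 2.
var root = makeElement("locations", [
  makeElement("location", [
    makeElement("address", [makeText(" 500 Telegraph Avenue ")]),
    makeElement("storenum", [makeText(" 880 ")])
  ]),
  makeElement("location", [
    makeElement("address", [makeText(" 486 Dwight Way ")]),
    makeElement("storenum", [makeText(" 237 ")])
  ]),
  makeElement("location", [
    makeElement("address", [makeText(" 1719 Shattuck Avenue ")]),
    makeElement("storenum", [makeText(" 101 ")])
  ])
]);

var stores = parseLocations(root);
```

Trimming is a design choice of this sketch: the text nodes in Listing 2 are padded with spaces, and a trimmed store number is easier to compare against later.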

The <foreach> tag

That's exactly what the <foreach> tag, which is new in VoiceXML 2.1, was intended to do. When an application has a list of items (an ECMAScript array) that it needs to speak to the user, it can use a <foreach> tag within a prompt.

The syntax of the <foreach> tag is amazingly simple. It takes two attributes, one specifying the name of the array (called "array"), and one specifying the name of the variable that will be used to reference array items within the <foreach> tag (called "item").
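For readers who think in ECMAScript, the behavior of <foreach> inside a prompt can be approximated by the loop below. This is only an analogy, not how a browser implements it: speakList and renderItem are hypothetical names, and the returned array stands in for the sequence of fragments that would be queued for playback.

```javascript
// Rough ECMAScript analogue of <foreach item="st" array="stores"> in a prompt.
// speakList and renderItem are hypothetical names used only for this sketch.
function speakList(array, renderItem) {
  var spoken = [];
  // Iterate over a shallow copy, so changes made to the original array
  // while the loop runs do not disturb the iteration.
  var copy = array.slice(0);
  for (var i = 0; i < copy.length; i++) {
    var item = copy[i];            // plays the role of the "item" variable
    spoken.push(renderItem(item)); // plays the role of the <foreach> body
  }
  return spoken;
}

var spoken = speakList(
  [{ addr: "500 Telegraph Avenue" }, { addr: "486 Dwight Way" }],
  function (st) { return st.addr; }
);
```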

To continue the example application, a field listing the addresses nearest the caller would look something like Listing 3.


Listing 3. Field listing addresses
<field name="getstore">
  <grammar .../>
  <prompt count="1">Ok, I've got a list of nearby stores.</prompt>
  <prompt>
    You can either say <emphasis level="strong">that one</emphasis>
    or say the shack address.
  </prompt>
  <prompt>
    <foreach item="st" array="stores">
      <value expr="st.addr"/><break time="250ms"/>
    </foreach>
  </prompt>
  <filled>
   ...
  </filled>
</field>

So now the output part is complete, but there are still a couple of things missing.

First, you want a grammar that includes all of the addresses so the caller can either say "that one" at the appropriate time, or blurt out the actual address to select a shack location; and second, if the caller does say "that one," you need a way to tell which address was currently being played.

As fate would have it, there are two more features in VoiceXML 2.1 that you can take advantage of to address both of these items:

  • The <mark> tag: To detect where/when within a prompt a user barged in, you can use the <mark> tag. Any number of <mark> tags can be placed within a prompt. As the prompt is played to the caller, the VoiceXML browser keeps track of the marks as they are encountered; when the caller barges in, the last encountered mark is returned in a shadow variable along with the recognition results.
    For my example, I'll use the stores[n].mark variables that I set up as the <mark> tag's nameexpr attribute to keep track of where the caller barged in.
  • The <grammar> srcexpr attribute: The grammar problem is also readily solvable with the existing <grammar> tag and an attribute that is new in VoiceXML 2.1, srcexpr. In the past, the URL for a grammar was specified by a static string (the src attribute), fixed when the dialog was "written" (either by the application developer, or by some combination of application servers or servlets at run time). Customizing a grammar based on input to a dialog meant that the dialog itself had to be regenerated, which meant a trip back to the Web server.

    The srcexpr attribute avoids this headache. With it, the URI for the grammar can be constructed at run time, based on user input. In my example, I can include the zip code itself in the URI for the grammar, so that the grammar-generating servlet knows which addresses to include in the grammar it produces.
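Because srcexpr is an ordinary ECMAScript expression, the URI can also be assembled by a small helper. The following is a minimal sketch under stated assumptions: buildGrammarUri and GRAMMAR_BASE are hypothetical names, the base URL is the one the example application uses, and the percent-encoding is a precaution added here rather than something the example requires for a digits-only zip code.

```javascript
// Base URL of the example's grammar-generating script.
// GRAMMAR_BASE and buildGrammarUri are hypothetical names for this sketch.
var GRAMMAR_BASE = "http://shacklocs.com/cgi-bin/getgram.pl";

function buildGrammarUri(zipcode) {
  // encodeURIComponent keeps unexpected characters in the value
  // from corrupting the query string.
  return GRAMMAR_BASE + "?zipcode=" + encodeURIComponent(String(zipcode));
}

var uri = buildGrammarUri("94086");
```

If a function like this were defined in a <script> element of the document, the grammar could refer to it with srcexpr="buildGrammarUri(zipcode)" instead of concatenating the string inline.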

When you add the <mark> tag and updated <grammar> tag to the field above, you end up with code like Listing 4.


Listing 4. Adding <mark> and <grammar> tags
<field name="getstore">
  <grammar type="application/srgs+xml"
           srcexpr="'http://shacklocs.com/cgi-bin/getgram.pl?zipcode=' + zipcode"/>
  <prompt count="1">Ok, I've got a list of nearby stores.</prompt>
  <prompt>
    You can either say <emphasis level="strong">that one</emphasis>
    or say the shack address.
  </prompt>
  <prompt>
    <foreach item="st" array="stores">
      <value expr="st.addr"/><break time="250ms"/>
      <mark nameexpr="st.mark"/>
    </foreach>
  </prompt>
  <filled>
    <var name="idx" expr="0"/>
    <if cond="getstore$.markname != undefined">
      <assign name="idx" expr="getstore$.markname"/>
    </if>
    <prompt>
      Got it. Let me get you directions to shack number 
      <value expr="stores[idx].num"/> at <value expr="stores[idx].addr"/>.
    </prompt>
    <exit/>
  </filled>
</field>

I simplified the filled handler a bit (but it still works, of course). In a real application, you'd probably not want to assume that the caller wanted the first item if markname was undefined; you'd want to check the marktime shadow variable to make sure a significant amount of the prompt was played.

And, of course, if the caller spoke an actual address rather than saying "that one," the markname isn't particularly useful -- the code should figure out which location was spoken, most likely with semantic interpretation tags embedded within the grammar.
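One way to fold both cases into the filled handler is sketched below. Everything here is an assumption made for illustration: resolveStoreIndex is a hypothetical helper, the grammar is assumed to return the store number as its semantic interpretation when an address is spoken, and maxMarkAgeMs is an invented threshold. The policy is deliberately conservative: when no mark was reached, or when the last mark is old enough that the next address was probably already playing, the function returns -1 so the application can confirm or re-prompt instead of guessing.

```javascript
// Hypothetical helper: decide which store the caller meant.
//   stores       - array of { addr, num, mark } objects built in Listing 1
//   shadow       - object shaped like the field's shadow variable
//                  (markname, marktime in milliseconds since the last mark)
//   spokenNum    - store number from the grammar's semantic interpretation,
//                  or undefined if the caller said "that one" (assumed convention)
//   maxMarkAgeMs - assumed threshold after which the last mark is considered stale
// Returns an index into stores, or -1 when the application should re-prompt.
function resolveStoreIndex(stores, shadow, spokenNum, maxMarkAgeMs) {
  // Case 1: the caller spoke an address; look the store up by number.
  if (spokenNum !== undefined) {
    for (var i = 0; i < stores.length; i++) {
      if (stores[i].num === String(spokenNum)) { return i; }
    }
    return -1;
  }

  // Case 2: the caller said "that one"; rely on the last mark reached.
  if (shadow.markname === undefined) { return -1; }
  var idx = parseInt(shadow.markname, 10);
  if (isNaN(idx) || idx < 0 || idx >= stores.length) { return -1; }

  // A stale mark suggests the caller may have meant a later address.
  if (maxMarkAgeMs !== undefined && shadow.marktime > maxMarkAgeMs) { return -1; }
  return idx;
}

var sampleStores = [
  { addr: "500 Telegraph Avenue", num: "880", mark: 0 },
  { addr: "486 Dwight Way", num: "237", mark: 1 },
  { addr: "1719 Shattuck Avenue", num: "101", mark: 2 }
];

var byMark = resolveStoreIndex(sampleStores, { markname: "2", marktime: 300 }, undefined, 1000);
var byAddress = resolveStoreIndex(sampleStores, {}, "237", 1000);
```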

Capturing utterances

Another new feature of VoiceXML 2.1 is the ability to capture a user's speech while it's being recognized. In a stock trading application for example, a caller can be prompted with "Are you sure you want to sell all of your stock?" and in addition to recognizing whether they said "yes" or "no," you would capture their speech, just in case they later change their mind and deny having said "yes." The captured audio is made available to the VoiceXML application through a shadow variable along with the other recognition results. Listing 5 demonstrates the use of this feature:


Listing 5. Use of captured audio
<form>
  <property name="recordutterance" value="true"/>

  <field name="rusure" type="boolean">
    <prompt>Are you sure you want to sell all of your stock?</prompt>
    <filled>
      <var name="captured" expr="application.lastresult$.recording"/>
      <submit namelist="rusure captured" method="post"
              enctype="multipart/form-data"
              next="http://example.com/cgi-bin/sellshares.pl"/>
    </filled>
  </field>
</form>

In the example, the recordutterance property is used to "turn recording on" for the current form. When the field is filled in, the sellshares.pl script is sent the variable rusure along with the caller's audio.

Once recording is enabled, utterances are captured for input collected by the <field>, <initial>, <link>, and <menu> tags and, optionally, by the <record> and <transfer> tags.

VoiceXML 2.1 wrap-up

VoiceXML 2.1 also includes an updated <script> tag. As was the case with the <grammar> tag, a srcexpr attribute was added to let applications dynamically specify external ECMAScript files to be loaded.

In addition, both the <disconnect> tag and the <transfer> tag have been enhanced. A namelist attribute has been added to the <disconnect> tag. With this attribute, an application can indicate "results," which are returned to whatever environment started the VoiceXML application, a CCXML application for example.

A new type attribute has been added to the <transfer> tag. It lets an application specify whether a transfer is "blind," "bridge," or "consultation." The "blind" and "bridge" types behave exactly like the blind and bridge transfers that VoiceXML 2.0 selected with the bridge attribute. A "consultation" transfer is similar to a blind transfer, except that the caller isn't disconnected from the VoiceXML browser if the transfer fails to complete.

Related work

The W3C's Voice Browser Working Group isn't working solely on VoiceXML improvements and enhancements. It is also actively developing standards for many aspects of VoiceXML application development. Since releasing the VoiceXML 2.0 standard, the group has nearly completed the Semantic Interpretation Specification (currently a Candidate Recommendation), released a Last Call Working Draft of the Call Control Markup Language (CCXML) specification, and released a Working Draft of a Pronunciation Lexicon Specification.

But the W3C isn't the only group that's involved with voice standards; there are a number of groups. Two specifically worth mentioning here are the VoiceXML Forum and the Internet Engineering Task Force (IETF).

The VoiceXML Forum

The VoiceXML Forum, which many will remember as the group that delivered the VoiceXML 1.0 specification, has more or less moved out of the specification-writing arena and is now focused primarily on marketing and education in the VoiceXML industry. To this end, the Forum has put together a pair of certifications: one for VoiceXML platforms, to certify that they comply with the VoiceXML 2.0 specification, and one for developers, to demonstrate that they have a well-balanced understanding of the VoiceXML 2.0 language. At the time this article was written, 17 VoiceXML platforms had been certified by the VoiceXML Forum, and more than 100 developers had taken and passed the Forum's developer certification examination.

Media Resource Control Protocol (MRCP)

For those who are not necessarily interested in developing voice applications, but are instead interested in building voice application platforms (that is, a VoiceXML browser), the Speech Services Control (SPEECHSC) working group in the IETF is working on Media Resource Control Protocol version 2 (MRCP v2). MRCP v2 builds on the MRCP specification jointly developed by Cisco, Nuance, and SpeechWorks several years ago, which is now supported by most, if not all, of the major players in the industry, either by providing MRCP-based speech resources, or by providing VoiceXML browsers that "sit on top of" MRCP-based speech resources.

MRCP defines a messaging protocol for voice application platforms to use to configure and control speech resources. When an MRCP-based VoiceXML browser needs to play synthesized speech (that is, when it is processing a <prompt> tag), it would allocate an MRCP-based synthesis resource and pass on the content to be spoken in a "SPEAK" request. The synthesis resource would then render the audio and send it wherever the browser said to send it.

Likewise, for recognition, the browser would allocate a speech recognizer, pass it a number of grammars in "DEFINE-GRAMMAR" requests, then start sending audio to the recognizer along with a "RECOGNIZE" request. When something was recognized, the speech recognizer would send the recognition results back to the browser.

One of the beauties of this messaging protocol is that it's platform-independent. A voice application platform that is designed to support MRCP can, with minimal work, switch between different technology vendors' products. If, for example, I've written a VoiceXML Browser that uses MRCP and works great with vendor X's U.S. English speech engines, but I have a need to support German recognition and synthesis, I should be able to just point my browser at vendor Y's German speech engines and everything would magically work.

A detailed look at MRCP is, unfortunately, well beyond the scope of this article. But for interested readers, whether platform developers or otherwise, the specification is well worth reading.

Conclusion

Someone once said "the nice thing about standards is that there are so many to choose from" (I searched online for the originator of the quote, but there were just as many purported originators as there are standards). This may have been true at one time, and perhaps is still true in some areas, but in the voice application space, VoiceXML and related standards (SRGS, SSML, and so on) have emerged as "the" standards for building voice applications and platforms. And judging by the usefulness of the features released in VoiceXML 2.1, there's no reason to think that VoiceXML won't continue to be the standard for building voice applications for a long time to come.

