By far, the most interesting change in VoiceXML 2.1 is the introduction of the <data> tag. The <data> tag is the VoiceXML equivalent of Asynchronous JavaScript + XML (Ajax) in HTML. With the <data> tag, it's not necessary for a VoiceXML application to transition from dialog to dialog when it needs to send information to or receive information from a Web server. Instead, the <data> tag is used to perform the equivalent of an HTTP GET (or POST) request in situ and fetch a block of XML data. When the request completes, the VoiceXML application can parse the results using ECMAScript and the DOM.
Take a look at an application for the fictitious business "Joe's Tuna Shack." Joe's would like to put together an application that lets callers find the closest shack and provide directions. The dialog might go something like this:
Browser: Welcome to Joe's Tuna Shack store locator; if you tell me your zip code, I
can help you find the closest shack.
Caller: 94086
B: 94086, got it. One moment while I look that up.
B: Berkeley, California. There are 3 shacks in Berkeley.
Say "that one" or the name of the street the shack is on for directions:
B: 500 Telegraph Avenue;
B: 486 Dwight Way;
B: 1719 Shattuck Avenue;
C: That one
B: Got it. Let me get you directions to the shack at 1719 Shattuck Avenue
...
If this were to be implemented in VoiceXML 2.0, you could imagine at least two separate dialogs and corresponding transitions -- an initial dialog that greets the caller, collects the zip code, and sends it to the server, and another dialog that figures out which address the caller is interested in.
Thanks to the <data> tag, this can all be done in a single dialog. There is still, of course, a round trip to the Web server to send the zip code and retrieve the restaurant locations, but the dialog never changes, so the application code is simpler. The code to get the zip code and then get the corresponding addresses from the server might look like the code in Listing 1.
Listing 1. Get zip code and corresponding addresses
<form>
  <var name="stores" expr="''"/>
  <block>
    <prompt>
      Welcome to Joe's Tuna Shack Store Locator
    </prompt>
  </block>
  <field name="zipcode" type="digits">
    <prompt>
      What's your zip code?
    </prompt>
    <filled>
      <!-- use the zip code to fetch the closest stores.. -->
      <data name="locations"
            src="http://shacklocs.com/cgi-bin/getlocs.pl"
            namelist="zipcode"/>
      <!-- parse the results to get the addresses and store numbers; -->
      <!-- I cheat a little here; I know there are three location -->
      <!-- elements, each with an address child and a storenum child. -->
      <script>
        <![CDATA[
          stores = new Array();
          var root = locations.documentElement;
          var addrs = root.getElementsByTagName( "address" );
          var nums = root.getElementsByTagName( "storenum" );
          for( var i = 0; i < addrs.length; i++ ){
            var addrNode = addrs.item(i);
            var numNode = nums.item(i);
            stores[i] = new Object();
            // the text of each element lives in its first child (a text node)
            stores[i].addr = addrNode.firstChild.data;
            stores[i].num = numNode.firstChild.data;
            stores[i].mark = i;
          }
        ]]>
      </script>
    </filled>
  </field>
  ...
</form>
Additionally, the XML file retrieved by the <data> tag would look like Listing 2.
Listing 2. Retrieved XML file
<?xml version="1.0" encoding="UTF-8"?>
<?access-control allow="*"?>
<locations>
  <location>
    <address> 500 Telegraph Avenue </address>
    <storenum> 880 </storenum>
  </location>
  <location>
    <address> 486 Dwight Way </address>
    <storenum> 237 </storenum>
  </location>
  <location>
    <address> 1719 Shattuck Avenue </address>
    <storenum> 101 </storenum>
  </location>
</locations>
Note that I included the optional <?access-control?> processing instruction in my XML document. It's not strictly necessary in this example, but it's a good habit: the instruction tells the VoiceXML browser which hosts' documents are allowed to read the data.
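To make the instruction actually restrictive, the allow list can name specific hosts instead of the wildcard. The following is only a sketch: the host name apps.shacklocs.com is a hypothetical stand-in for wherever the VoiceXML documents are served from, and the exact host-matching rules should be checked against the VoiceXML 2.1 specification.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical host: only VoiceXML documents served from
     apps.shacklocs.com would be allowed to read this data -->
<?access-control allow="apps.shacklocs.com"?>
<locations>
  <location>
    <address> 500 Telegraph Avenue </address>
    <storenum> 880 </storenum>
  </location>
</locations>
```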
The zipcode field's filled handler uses the <data> element to fetch an XML document containing the addresses of the nearest stores, then parses the document and stores the results in an ECMAScript array of objects. But what do you do with this information? Ideally, I would make a prompt that iterated over them, and maybe even a grammar so that someone could select them by voice.
That's exactly what the <foreach> tag, which is new in VoiceXML 2.1, was intended to do. When an application has a list of items (an ECMAScript array) that it needs to speak to the user, it can use a <foreach> tag within a prompt.
The syntax of the <foreach> tag is amazingly simple. It takes two attributes, one specifying the name of the array (called "array"), and one specifying the name of the variable that will be used to reference array items within the <foreach> tag (called "item").
To continue the example application, a field listing the addresses nearest the caller would look something like Listing 3.
Listing 3. Field listing addresses
<field name="getstore">
  <grammar .../>
  <prompt count="1">Ok, I've got a list of nearby stores.</prompt>
  <prompt>
    You can either say <emphasis level="strong">that one</emphasis>
    or say the shack address.
  </prompt>
  <prompt>
    <foreach item="st" array="stores">
      <value expr="st.addr"/><break time="250ms"/>
    </foreach>
  </prompt>
  <filled>
    ...
  </filled>
</field>
So now the output part is complete, but there are still a couple of things missing.
First, you want a grammar that includes all of the addresses so the caller can either say "that one" at the appropriate time, or blurt out the actual address to select a shack location; and second, if the caller does say "that one," you need a way to tell which address was currently being played.
As fate would have it, there are two more features in VoiceXML 2.1 that you can take advantage of to address both of these items:
- The <mark> tag: To detect where/when within a prompt a user barged in, you can use the <mark> tag. Any number of <mark> tags can be placed within a prompt. As the prompt is played to the caller, the VoiceXML browser keeps track of the marks as they are encountered; when the caller barges in, the last encountered mark is returned in a shadow variable along with the recognition results.
  For my example, I'll use the stores[n].mark variables that I set up as the <mark> tag's nameexpr attribute to keep track of where the caller barged in.
- The <grammar> srcexpr attribute: The grammar problem is also readily solvable using the existing <grammar> tag, but with an attribute that is new in VoiceXML 2.1, the srcexpr attribute. In the past, when a grammar was specified in a VoiceXML application, the URL for the grammar was specified by a static string (the src attribute), decided upon when the dialog was "written" (either by the application developer, or by some combination of application servers or servlets at run time). To customize a grammar based on input to a dialog meant the dialog itself would need to be rewritten (which meant a trip back to the Web server).
  The srcexpr attribute avoids this headache. With it, the URI for the grammar can be constructed at run time, based on user input. In my example, I can include the zip code itself in the URI for the grammar, so that the grammar-generating servlet knows which addresses to include in the grammar it produces.
When you add the <mark> tag and updated <grammar> tag to the field above, you end up with code like Listing 4.
Listing 4. Adding <mark> and <grammar> tags
<field name="getstore">
  <!-- the grammar URI is built at run time from the caller's zip code -->
  <grammar type="application/srgs+xml"
           srcexpr="'http://shacklocs.com/cgi-bin/getgram.pl?zipcode=' + zipcode"/>
  <prompt count="1">Ok, I've got a list of nearby stores.</prompt>
  <prompt>
    You can either say <emphasis level="strong">that one</emphasis>
    or say the shack address.
  </prompt>
  <prompt>
    <foreach item="st" array="stores">
      <value expr="st.addr"/><break time="250ms"/>
      <!-- this mark is reached once the address above has been played -->
      <mark nameexpr="st.mark"/>
    </foreach>
  </prompt>
  <filled>
    <!-- default to the first store if no mark was reached -->
    <var name="idx" expr="0"/>
    <if cond="getstore$.markname != undefined">
      <assign name="idx" expr="getstore$.markname"/>
    </if>
    <prompt>
      Got it. Let me get you directions to shack number
      <value expr="stores[idx].num"/> at <value expr="stores[idx].addr"/>.
    </prompt>
    <exit/>
  </filled>
</field>
I simplified the filled handler a bit (but it still works, of course). In a real application, you'd probably not want to assume that the caller wanted the first item if markname was undefined; you'd want to check the marktime shadow variable to make sure a significant amount of the prompt was played.
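A more defensive filled handler might look something like the sketch below. It assumes the caller said "that one," that markname and marktime are both undefined when no mark was reached, and that marktime reports the milliseconds elapsed since the last mark. The 2000 ms cutoff and the wording of the prompts are hypothetical choices of mine, not values from the specification.

```xml
<filled>
  <if cond="getstore$.markname == undefined">
    <!-- The caller spoke before the first mark was reached, so there is
         nothing to point at; ask again instead of assuming item 0. -->
    <prompt>Sorry, I hadn't finished reading the first address.</prompt>
    <clear namelist="getstore"/>
  <elseif cond="Number(getstore$.markname) &lt; stores.length - 1
                &amp;&amp; getstore$.marktime &gt; 2000"/>
    <!-- Hypothetical cutoff: the caller spoke well into the next address,
         so it is unclear which one was meant; ask again. -->
    <prompt>Sorry, I'm not sure which shack you meant.</prompt>
    <clear namelist="getstore"/>
  <else/>
    <var name="idx" expr="Number(getstore$.markname)"/>
    <prompt>
      Got it. Let me get you directions to shack number
      <value expr="stores[idx].num"/> at <value expr="stores[idx].addr"/>.
    </prompt>
    <exit/>
  </if>
</filled>
```

Clearing the field sends the caller back through the list, which is blunt but avoids giving directions to the wrong shack.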
And, of course, if the caller spoke an actual address rather than saying "that one," the markname isn't particularly useful -- the code should figure out which location was spoken, most likely with semantic interpretation tags embedded within the grammar.
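As a rough illustration, the grammar returned for the example zip code could look like the following SRGS document. This is a sketch under assumptions: the rule name, the choice to return an index into the stores array, and the "mark" sentinel for "that one" are all hypothetical, and the tag syntax follows the Semantic Interpretation "semantics/1.0" tag format, which a given platform may or may not support in this form.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         xml:lang="en-US" mode="voice" root="shack"
         tag-format="semantics/1.0">
  <rule id="shack" scope="public">
    <one-of>
      <!-- "that one": the application resolves the choice from the mark -->
      <item>that one <tag>out = "mark";</tag></item>
      <!-- spoken street names return the index into the stores array -->
      <item>Telegraph <item repeat="0-1">Avenue</item> <tag>out = "0";</tag></item>
      <item>Dwight <item repeat="0-1">Way</item> <tag>out = "1";</tag></item>
      <item>Shattuck <item repeat="0-1">Avenue</item> <tag>out = "2";</tag></item>
    </one-of>
  </rule>
</grammar>
```

With a grammar like this, the filled handler could test whether getstore equals 'mark' and fall back to markname only in that case; otherwise the field value is already the index of the chosen store.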
Another new feature of VoiceXML 2.1 is the ability to capture a user's speech while it's being recognized. In a stock trading application for example, a caller can be prompted with "Are you sure you want to sell all of your stock?" and in addition to recognizing whether they said "yes" or "no," you would capture their speech, just in case they later change their mind and deny having said "yes." The captured audio is made available to the VoiceXML application through a shadow variable along with the other recognition results. Listing 5 demonstrates the use of this feature:
Listing 5. Use of captured audio
<form>
  <property name="recordutterance" value="true"/>
  <field name="rusure" type="boolean">
    <prompt>Are you sure you want to sell all of your stock?</prompt>
    <filled>
      <!-- the captured audio arrives alongside the recognition result -->
      <var name="captured" expr="application.lastresult$.recording"/>
      <submit namelist="rusure captured" method="post"
              enctype="multipart/form-data"
              next="http://example.com/cgi-bin/sellshares.pl"/>
    </filled>
  </field>
</form>
In the example, the recordutterance property is used to "turn recording on" for the current form. When the field is filled in, the sellshares.pl script is sent the variable rusure along with the caller's audio.
After it's enabled, audio is captured for <field>, <initial>, <link>, <menu> and, optionally, <record> and <transfer> tags.
VoiceXML 2.1 also includes an updated <script> tag. As was the case with the <grammar> tag, a srcexpr attribute was added to let applications dynamically specify external ECMAScript files to be loaded.
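For instance, an application could pick a helper script based on a value that is only known at run time. In this sketch the region variable, the example.com URL, and the script naming scheme are all hypothetical:

```xml
<form>
  <!-- hypothetical: a value determined earlier in the call -->
  <var name="region" expr="'west'"/>
  <!-- the script URI is evaluated when the form is initialized -->
  <script srcexpr="'http://example.com/scripts/directions-' + region + '.js'"/>
  <block>
    <prompt>Scripts for the <value expr="region"/> region are loaded.</prompt>
  </block>
</form>
```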
In addition, both the <disconnect> tag and the <transfer> tag have been enhanced. A namelist attribute has been added to the <disconnect> tag. With this attribute, an application can indicate "results," which are returned to whatever environment started the VoiceXML application, a CCXML application for example.
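A minimal sketch of returning results at the end of a call follows. The variable names storenum and outcome are hypothetical, and how the values surface in the calling environment depends on the platform.

```xml
<form>
  <!-- hypothetical results the calling environment might care about -->
  <var name="storenum" expr="'101'"/>
  <var name="outcome" expr="'directions_given'"/>
  <block>
    <prompt>Thanks for calling Joe's Tuna Shack.</prompt>
    <!-- hand both variables back to whatever started the application -->
    <disconnect namelist="storenum outcome"/>
  </block>
</form>
```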
A new type attribute has been added to the <transfer> tag. It lets an application specify whether a transfer is "blind," "bridge," or "consultation." The "blind" and "bridge" types behave exactly as VoiceXML 2.0 transfers did with the bridge attribute set to "false" and "true," respectively. A "consultation" transfer is similar to a blind transfer, except that callers aren't disconnected from the VoiceXML browser if the transfer fails to complete.
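A consultation transfer might be coded along the lines of the sketch below. The telephone number, timeout, and prompts are made up for illustration. The sketch assumes that a failed attempt fills the transfer's form item variable with a status such as 'busy' or 'noanswer', and that a successful transfer throws connection.disconnect.transfer.

```xml
<form>
  <transfer name="toshack" type="consultation"
            dest="tel:+15105550123" connecttimeout="20s">
    <prompt>Let me connect you to the shack.</prompt>
    <filled>
      <!-- reached only when the transfer could not be completed;
           the caller is still connected to the browser -->
      <if cond="toshack == 'busy'">
        <prompt>That line is busy right now.</prompt>
      <elseif cond="toshack == 'noanswer'"/>
        <prompt>Nobody is answering at the moment.</prompt>
      <else/>
        <prompt>Sorry, I couldn't complete the transfer.</prompt>
      </if>
    </filled>
  </transfer>
  <catch event="connection.disconnect.transfer">
    <!-- the caller has been handed off to the shack -->
    <exit/>
  </catch>
</form>
```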
The W3C's Voice Browser Working Group isn't working solely on VoiceXML improvements and enhancements. They're also actively developing standards for many aspects of VoiceXML application development. Since releasing the VoiceXML 2.0 Standard, the group has nearly completed the Semantic Interpretation Specification (currently a Candidate Recommendation), released a Last Call Working Draft of the Call Control Markup Language (CCXML) specification, and released a Working Draft of the Pronunciation Lexicon Specification.
But the W3C isn't the only group that's involved with voice standards; there are a number of groups. Two specifically worth mentioning here are the VoiceXML Forum and the Internet Engineering Task Force (IETF).
The VoiceXML Forum, which many will remember as the group that delivered the VoiceXML 1.0 specification, has more or less moved out of the specification-writing arena and is now focused primarily on marketing and education in the VoiceXML industry. To this end, the Forum has put together a pair of certifications: one for VoiceXML platforms, to certify that they are compliant with the VoiceXML 2.0 specification, and one for developers, to demonstrate that they have a well-balanced understanding of the VoiceXML 2.0 language. At the time this article was written, 17 VoiceXML platforms had been certified by the VoiceXML Forum, and more than 100 developers had taken and passed the Forum's developer certification examination.
Media Resource Control Protocol (MRCP)
For those who are not necessarily interested in developing voice applications, but are instead interested in building voice application platforms (that is, a VoiceXML browser), the CATS group in the IETF is working on Media Resource Control Protocol version 2 (MRCP v2). MRCP v2 builds on the MRCP specification jointly developed by Cisco, Nuance, and SpeechWorks several years ago. That specification is now supported by most, if not all, of the major players in the industry, either by providing MRCP-based speech resources, or by providing VoiceXML browsers that "sit on top of" MRCP-based speech resources.
MRCP defines a messaging protocol for voice application platforms to use to configure and control speech resources. When an MRCP-based VoiceXML browser needs to play synthesized speech (that is, when it is processing a <prompt> tag), it would allocate an MRCP-based synthesis resource and pass on the content to be spoken in a "SPEAK" request. The synthesis resource would then render the audio and send it wherever the browser said to send it.
Likewise, for recognition, the browser would allocate a speech recognizer, pass it a number of grammars in "DEFINE-GRAMMAR" requests, then start sending audio to the recognizer along with a "RECOGNIZE" request. When something was recognized, the speech recognizer would send the recognition results back to the browser.
One of the beauties of this messaging protocol is that it's platform-independent. A voice application platform that is designed to support MRCP can, with minimal work, switch between different technology vendors' products. If, for example, I've written a VoiceXML Browser that uses MRCP and works great with vendor X's U.S. English speech engines, but I have a need to support German recognition and synthesis, I should be able to just point my browser at vendor Y's German speech engines and everything would magically work.
A detailed look at MRCP is, unfortunately, well beyond the scope of this article. But for interested readers, whether platform developers or otherwise, the specification is well worth reading.
Someone once said "the nice thing about standards is that there are so many to choose from" (I searched online for the originator of the quote, but there were just as many purported originators as there are standards). This may have been true at one time, and perhaps is still true in some areas, but in the voice application space, VoiceXML and related standards (SRGS, SSML, and so on) have emerged as "the" standards for building voice applications and platforms. And judging by the usefulness of the features released in VoiceXML 2.1, there's no reason to think that VoiceXML won't continue to be the standard for building voice applications for a long time to come.
Jeff Kusnitz has been with IBM for nearly 20 years. He spent 16 of those years working with speech and telephony technologies, including several years as IBM's representative to several industry standards organizations. He now spends his time on Web search technologies, focusing primarily on indexing open source information.