Level: Intermediate
Gerald McCobb (mccobb@us.ibm.com), Advisory Software Engineer, IBM
15 Nov 2005
Adding multimodal interaction to mobile applications as a Web service makes it easier to fill in forms, get information, make selections, and be verified for access to secure applications. This article, the first in a series, discusses the multimodal auto-fill capability.
Imagine the following scenario: A woman is on vacation in a city far from home and decides she would like to have dinner at a highly rated restaurant. She takes out her mobile phone and asks a local restaurant service for the best Italian restaurant in that city. The phone displays several choices and she picks the one closest. The restaurant takes reservations online, so she reserves a table. The restaurant also lets customers order and pay for meals in advance, so she decides to order and pay for the food online. She asks for a menu and from the menu displayed on the phone selects her dinners.
Next she pays for the food with her credit card. She signs into her credit card service by saying a sequence of displayed words into the phone. After being verified securely by the credit card service, she is asked for her credit card number and expiration date. Instead of typing in the numbers, she says "my credit card," and her credit card information is entered for her into the payment form. She says "my profile," and her home address and telephone information are entered. When she arrives at the restaurant her table is ready, and after only a few minutes at the table her dinner is served.
This scenario illustrates the important role of multimodal interaction in making it easier and faster to locate a resource, enter information, and securely verify information. It can be difficult to perform these tasks with small hand-held devices such as PDAs and cell phones. With cell phones, entering text is especially difficult because there is no stylus and the keypad is dedicated primarily to entering telephone numbers. This series of articles describes how multimodal interaction can be added as a configurable service to help you automatically perform a task, such as filling in personal information. In this first installment, I cover multimodal auto-fill as a mobile Web service. Examples in this series are written in the XHTML+Voice (X+V) multimodal markup language.
Multimodal auto-fill
Multimodal auto-fill assists you in entering form input information. By extending the auto-fill capabilities of a visual browser with voice interaction, a mobile user can easily add name, address, and other personal data, also known as user preferences, to form input fields.
Auto-fill automatically completes the form input fields with the contents of the browser's user preferences dialog; the name of each user preference entry is matched to a form input field. Figure 1 shows sample entries for a user preferences dialog for a fictitious "Sandra Sandwich."
Figure 1. User preferences dialog
A visual browser uses the content of the XHTML label element, the field names defined in the ECML (Electronic Commerce Modeling Language) standard (RFC 3106), or some other heuristic to match a form input field with a user preference entry. Listing 3 shows XHTML input fields for e-mail address, first name, last name, address, city, state, and phone number. The visual browser (shown in Figure 2) matches these inputs with the corresponding entries stored with the user preferences.
Figure 2. Web page filled with user preferences
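The matching step can be illustrated with a small sketch. It is written in Python purely for illustration (the article's own listings are XML), and it does not describe how any particular browser implements auto-fill. The preference keys, the alias table, and the function names are all assumptions. The sketch normalizes each form field name and looks it up among the user preference entries, which is one example of the "some other heuristic" case mentioned above.

```python
import re

# Hypothetical user preferences, mirroring a few entries used in this article.
USER_PREFS = {
    "first_name": "Sandra",
    "last_name": "Sandwich",
    "email": "sandwich@example.net",
    "city": "Saturn",
    "state": "FL",
}

# Hypothetical alias table: form field names that map to a differently
# named preference key.
ALIASES = {
    "zipcode": "code",
    "address": "address_one",
}


def normalize(name):
    """Lower-case a field name and collapse spaces and punctuation to underscores."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def autofill(field_names, prefs, aliases=ALIASES):
    """Return {field_name: value} for every field that matches a preference entry."""
    filled = {}
    for field in field_names:
        key = normalize(field)
        key = aliases.get(key, key)  # fall back to the normalized name itself
        if key in prefs:
            filled[field] = prefs[key]
    return filled
```

Fields with no matching entry, such as a password field, are simply left out of the result, so the user still has to fill them in by hand.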
Multimodal auto-fill associates a speech grammar, generated from the entries in the user preferences dialog, with a VoiceXML form that prompts you to speak to the application and stores the results of what you said.
Listing 1 is an example grammar that would be generated from the user preferences table example shown in Figure 1. The grammar is in SRGS format, a standard XML grammar format defined by the World Wide Web Consortium (W3C).
Listing 1. SRGS grammar generated for auto-fill
<?xml version="1.0" encoding="ISO-8859-1"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/06/grammar
http://www.w3.org/TR/speech-grammar/grammar.xsd"
xml:lang="en-US" version="1.0" mode="voice"
root="user_profile">
<rule id="user_profile" scope="public">
<one-of>
<item>my <item repeat="0-1">personal</item> profile
<tag><![CDATA[$.first_name="Sandra";
$.last_name="Sandwich";
$.email="sandwich@example.net";
$.phone="921-555-2329";
$.address_one="22 Dandelion Way";
$.city="Saturn";
$.state="FL";
$.code="33872";
$.country="USA";]]></tag>
</item>
<item>my first name
<tag><![CDATA[$.first_name="Sandra";]]></tag>
</item>
<item>my last name
<tag><![CDATA[$.last_name="Sandwich";]]></tag>
</item>
<item>my e-mail <item repeat="0-1">address</item>
<tag><![CDATA[$.email="sandwich@example.net";]]>
</tag>
</item>
<item>my phone <item repeat="0-1">number</item>
<tag><![CDATA[$.phone="921-555-2329";]]></tag>
</item>
<item>my street address
<item repeat="0-1">line one</item>
<tag><![CDATA[$.address_one="22 Dandelion Way";]]>
</tag>
</item>
<item>my city
<tag><![CDATA[$.city="Saturn";]]></tag>
</item>
<item>my
<one-of>
<item>state</item>
<item>province</item>
</one-of>
<tag><![CDATA[$.state="FL";]]></tag>
</item>
<item>my
<one-of>
<item>zip</item>
<item>postal</item>
</one-of> code
<tag><![CDATA[$.code="33872";]]></tag>
</item>
<item> my country
<tag><![CDATA[$.country="USA";]]></tag>
</item>
</one-of>
</rule>
</grammar>
Semantic interpretation is used to set a field to its preferences value when the entry name is recognized. The semantic interpretation is the text within the <tag> elements. One or more VoiceXML fields may be filled with this text. For example, the field with the name "city" is filled with "Saturn" when Sandra says "my city." The grammar supports mixed-initiative, which means that multiple VoiceXML fields are filled at once when you say "my profile." The prefix "my" is used with the grammar in Listing 1 to prevent a collision with the grammars in the application.
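To make the idea of a generated grammar concrete, here is a minimal sketch of how a grammar shaped like Listing 1 might be produced from a table of preferences. It uses Python rather than the JSP page the article relies on, and the phrase table and function names are assumptions. It is also simplified compared with Listing 1: optional words such as "personal" are omitted, and the tag text is written as escaped XML text rather than a CDATA section.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical mapping from preference key to the words spoken after "my".
PHRASES = {
    "first_name": "first name",
    "last_name": "last name",
    "city": "city",
}


def tag_text(assignments):
    """Build the semantic interpretation script, e.g. $.city="Saturn";"""
    parts = []
    for key, value in assignments.items():
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        parts.append('$.{}="{}";'.format(key, escaped))
    return "".join(parts)


def build_grammar(prefs, phrases=PHRASES):
    """Return an SRGS grammar string with one item per known preference entry."""
    ET.register_namespace("", SRGS_NS)  # serialize SRGS as the default namespace
    grammar = ET.Element("{%s}grammar" % SRGS_NS, {
        "version": "1.0",
        "mode": "voice",
        "root": "user_profile",
        "{%s}lang" % XML_NS: "en-US",
    })
    rule = ET.SubElement(grammar, "{%s}rule" % SRGS_NS,
                         {"id": "user_profile", "scope": "public"})
    one_of = ET.SubElement(rule, "{%s}one-of" % SRGS_NS)

    def add_item(phrase, assignments):
        item = ET.SubElement(one_of, "{%s}item" % SRGS_NS)
        item.text = phrase
        tag = ET.SubElement(item, "{%s}tag" % SRGS_NS)
        tag.text = tag_text(assignments)

    # Only entries with a known spoken phrase are placed in the grammar.
    known = {key: value for key, value in prefs.items() if key in phrases}
    add_item("my profile", known)  # one utterance that sets every field
    for key, value in known.items():
        add_item("my " + phrases[key], {key: value})  # one field per utterance
    return ET.tostring(grammar, encoding="unicode")
```

The "my profile" item carries an assignment for every known entry, while each remaining item carries exactly one, which mirrors the structure of Listing 1.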
The VoiceXML form in Listing 2 references the multimodal auto-fill grammar as a dynamically generated JavaServer Pages (JSP) page, http://example.net/user/prefs.jsp?id=sand01. The identifier parameter "sand01" uniquely specifies the user "Sandra Sandwich" and her personal information.
Listing 2. VoiceXML auto-fill Form
<?xml version="1.0" encoding="UTF-8"?>
<?xv version="1.2"?>
<vxml xmlns="http://www.w3.org/2001/vxml"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/vxml
http://www.voicexml.org/specs/multimodal/x+v/mobile/12/schema/vxml.xsd"
xmlns:xv="http://www.voicexml.org/2002/xhtml+voice"
version="2.0">
<!-- voice handler -->
<form id="entryForm">
<grammar
src="http://example.net/user/prefs.jsp?id=sand01"/>
<initial name="init">
<prompt>Say "my profile" to enter your profile</prompt>
<nomatch count="2">
Let's take this step by step.
<assign name="init" expr="true"/>
</nomatch>
</initial>
<field name="email" xv:id="email">
<prompt>Say "my e-mail" to enter your e-mail address.
</prompt>
</field>
<field name="first_name" xv:id="first_name">
<prompt>Say "my first name" to enter your first name.
</prompt>
</field>
<field name="last_name" xv:id="last_name">
<prompt>Say "my last name" to enter your last name.
</prompt>
</field>
<field name="address_one" xv:id="street">
<prompt>
Say "my street address" to enter your street address.
</prompt>
</field>
<field name="city" xv:id="city">
<prompt>Say "my city" to enter your city.</prompt>
</field>
<field name="state" xv:id="state">
<prompt>Say "my state" to enter your state.</prompt>
</field>
<field name="code" xv:id="zipcode">
<prompt>Say "my zip code" to enter your zip code.
</prompt>
</field>
<field name="phone" xv:id="phone">
<prompt>
Say "my phone" to enter your phone number.
</prompt>
</field>
</form>
</vxml>
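The way a single recognized utterance can fill one field or several can be simulated in a few lines. The following Python sketch is only a rough model: a real VoiceXML interpreter evaluates the tag content as script, whereas this toy parser just pattern-matches assignments of the form used in Listing 1. The function names are assumptions, and the form below holds only a few of the field names from Listing 2.

```python
import re

# Matches one assignment of the form  $.name="value";  as used in Listing 1.
ASSIGNMENT = re.compile(r'\$\.(\w+)="([^"]*)";')


def interpret(tag_text):
    """Turn the tag script into a {name: value} dict (toy parser, not a script engine)."""
    return dict(ASSIGNMENT.findall(tag_text))


def fill_form(form_fields, tag_text):
    """Assign each interpreted value to the form field of the same name, if any."""
    result = interpret(tag_text)
    for name in form_fields:
        if name in result:
            form_fields[name] = result[name]
    return form_fields


# A few of the field names from Listing 2, all initially unfilled.
form = {"email": None, "city": None, "phone": None}

# "my city" fills a single field.
fill_form(form, '$.city="Saturn";')

# "my profile" fills several fields at once; values with no matching
# field in this form (country here) are ignored.
fill_form(form, '$.email="sandwich@example.net";\n'
                '$.phone="921-555-2329";\n'
                '$.country="USA";')
```

After the second call every field in the form holds a value, even though the user spoke only two short commands.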
Listing 3 shows the XHTML source of the registration example. The VoiceXML form with id "entryForm" in Listing 2 is referenced as an XML Events handler through the ev:handler attribute placed on the <body> tag. Because the ev:event attribute is set to "load," the VoiceXML form is activated by the HTML "load" event.
Listing 3. X+V page with HTML Form
<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//VoiceXML Forum//DTD XHTML+Voice 1.2//EN"
"http://www.voicexml.org/specs/multimodal/x+v/12/dtd/xhtml+voice12.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:ev="http://www.w3.org/2001/xml-events"
xmlns:xv="http://www.voicexml.org/2002/xhtml+voice">
<head>
<title>Register Example</title>
<style type="text/css">
td { font-size: 14px; color: #990202; font-family: sans-serif; }
hr { height: 5px; color: #cacaca; border-style: groove; }
</style>
<!--sync tags synchronize HTML controls with voice fields -->
<xv:sync xv:input="email" xv:field="user/prefs.vxml#email"/>
<xv:sync xv:input="first_name"
xv:field="user/prefs.vxml#first_name"/>
<xv:sync xv:input="last_name"
xv:field="user/prefs.vxml#last_name"/>
<xv:sync xv:input="address" xv:field="user/prefs.vxml#street"/>
<xv:sync xv:input="city" xv:field="user/prefs.vxml#city"/>
<xv:sync xv:input="state" xv:field="user/prefs.vxml#state"/>
<xv:sync xv:input="zipcode"
xv:field="user/prefs.vxml#zipcode"/>
<xv:sync xv:input="phone" xv:field="user/prefs.vxml#phone"/>
</head>
<body ev:event="load"
ev:handler="user/prefs.vxml#entryForm">
<form name="xform" action="">
<table width="400" border="0">
<tbody>
<tr><td colspan="2"><hr/></td></tr>
<tr>
<td align="right">*Email Address:</td>
<td align="left">
<input name="email" type="text" size="30"/>
</td>
</tr>
<tr>
<td align="right">*Confirm Email Address:</td>
<td align="left">
<input name="confirm" type="text" size="30"/>
</td>
</tr>
<tr><td colspan="2" align="left">
Note: Your user I.D. is your email address</td>
</tr>
<tr>
<td align="right">*Choose a Password:</td>
<td align="left">
<input name="pass" type="password" size="10"/>
</td>
</tr>
<tr>
<td align="right">*Confirm Password:</td>
<td align="left">
<input name="pass2" type="password" size="10"/>
</td>
</tr>
<tr><td colspan="2" align="left">
The Password must be a minimum of 6 characters.
</td>
</tr>
<tr><td colspan="2"><hr/></td></tr>
<tr>
<td align="right">*First Name:</td>
<td align="left">
<input name="first_name" type="text" size="30"/>
</td>
</tr>
<tr>
<td align="right">*Last Name:</td>
<td align="left">
<input name="last_name" type="text" size="30"/>
</td>
</tr>
<tr>
<td align="right">*Address:</td>
<td align="left">
<input name="address" type="text" size="30"/>
</td>
</tr>
<tr>
<td align="right">*City:</td>
<td align="left">
<input name="city" type="text" size="15"/>
</td>
</tr>
<tr>
<td align="right">*State:</td>
<td align="left">
<select name="state">
<option value="AL">AL</option>
<option value="FL">FL</option>
<option value="GA">GA</option>
<option value="NJ">NJ</option>
<option value="NY">NY</option>
</select>
</td>
</tr>
<tr>
<td align="right">*Zip Code:</td>
<td align="left">
<input name="zipcode" type="text" size="4" maxlength="5"/>
</td>
</tr>
<tr><td colspan="2"><hr/></td></tr>
<tr>
<td align="right">*Phone:</td>
<td align="left">
<input name="area" type="text" size="2" maxlength="3"/>-
<input name="abc" type="text" size="2" maxlength="3"/>-
<input name="defg" type="text" size="3" maxlength="4"/>
<input name="phone" type="hidden"/>
<input name="submit" type="submit" style="display:none"/>
</td>
</tr>
<tr><td colspan="2"><hr/></td></tr>
</tbody>
</table>
</form>
</body>
</html>
When you say something that matches the auto-fill grammar, such as "my profile," the semantic interpretation values fill the VoiceXML fields, and each filled VoiceXML field updates the XHTML input field it is synchronized with through an X+V <sync> tag. The <sync> tags within the example X+V page are shown again in Listing 4.
Listing 4. X+V sync tags
<!--sync tags synchronize HTML controls with voice fields -->
<xv:sync xv:input="email" xv:field="user/prefs.vxml#email"/>
<xv:sync xv:input="first_name"
xv:field="user/prefs.vxml#first_name"/>
<xv:sync xv:input="last_name"
xv:field="user/prefs.vxml#last_name"/>
<xv:sync xv:input="address" xv:field="user/prefs.vxml#street"/>
<xv:sync xv:input="city" xv:field="user/prefs.vxml#city"/>
<xv:sync xv:input="state" xv:field="user/prefs.vxml#state"/>
<xv:sync xv:input="zipcode" xv:field="user/prefs.vxml#zipcode"/>
<xv:sync xv:input="phone" xv:field="user/prefs.vxml#phone"/>
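The chain from a VoiceXML field to an XHTML input runs through two names: the field's xv:id in Listing 2 and the fragment identifier in the xv:field reference in Listing 4. The sketch below models that lookup in Python for a few of the pairs. It is an illustration of the data flow only, not of how an X+V browser is implemented, and the function and variable names are assumptions.

```python
# VoiceXML field name -> xv:id, for a few of the fields in Listing 2.
VOICE_FIELDS = {
    "email": "email",
    "address_one": "street",
    "code": "zipcode",
}

# (XHTML input name, xv:field reference), for the matching tags in Listing 4.
SYNC = [
    ("email", "user/prefs.vxml#email"),
    ("address", "user/prefs.vxml#street"),
    ("zipcode", "user/prefs.vxml#zipcode"),
]


def sync_inputs(voice_values, voice_fields=VOICE_FIELDS, sync=SYNC):
    """Copy filled VoiceXML field values to the XHTML inputs they are synced with."""
    id_to_name = {xv_id: name for name, xv_id in voice_fields.items()}
    inputs = {}
    for input_name, field_ref in sync:
        xv_id = field_ref.rsplit("#", 1)[-1]  # fragment after '#'
        name = id_to_name.get(xv_id)
        if name in voice_values:
            inputs[input_name] = voice_values[name]
    return inputs
```

For example, a VoiceXML field named "address_one" with xv:id "street" ends up in the XHTML input named "address", even though the three names differ.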
The multimodal auto-fill Web service
The multimodal auto-fill Web service lets mobile device users enter and save their personal information using a PC connected to the Internet. The user preferences are stored securely on a Web site, and the auto-fill grammar for each user is generated from the information dynamically upon request. Figure 3 shows the steps from accessing the User Profile Dialog Application using a secure SSL connection to saving the user preferences in a remote database.
Figure 3. Creating user preferences with a PC
Sometime after saving your preferences, you securely log into the user profile application on Web server "A" with your mobile device. The application then stores a unique identifier on the device as a cookie.
From then on, whenever you access a multimodal application, the Web browser sends the cookie containing the unique identifier to the application. This identifier is forwarded to Web server "A," which maintains the user preferences database. The application presents an XHTML form and references a voice dialog for filling in the form that includes the auto-fill grammar. You fill in the XHTML form by saying a multimodal auto-fill command such as "my profile."
Figure 4 shows the interactions between the small device browser, application server, multimodal application, and the Web server "A" that maintains the user preferences database.
Figure 4. The multimodal auto-fill Web service
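One step in this flow, turning the identifier in the cookie into a reference to the user's grammar, can be sketched as follows. The cookie name is hypothetical, since the article does not say what the cookie is called, and the function name is likewise an assumption. Only the grammar URL and the "id" parameter come from Listing 2. The sketch uses Python's standard library and leaves out authentication and transport security.

```python
from http.cookies import SimpleCookie
from urllib.parse import urlencode

# Hypothetical cookie name; not specified in the article.
COOKIE_NAME = "autofill_id"

# Grammar URL used in Listing 2, without its query string.
GRAMMAR_BASE = "http://example.net/user/prefs.jsp"


def grammar_url_from_cookie(cookie_header):
    """Return the per-user grammar URL, or None if the identifier cookie is absent."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    if COOKIE_NAME not in cookie:
        return None
    user_id = cookie[COOKIE_NAME].value
    return GRAMMAR_BASE + "?" + urlencode({"id": user_id})
```

A multimodal application could place the returned URL in the src attribute of the <grammar> element when it serves the voice dialog, so the same VoiceXML form works for every user.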
Adding multimodal auto-fill to legacy applications
With the addition of an HTTP proxy, multimodal auto-fill can be added dynamically to a legacy Web application. The proxy would sit between your device and the Web application and convert HTML pages into X+V. A converted X+V page would reference the VoiceXML form as shown above. The form would remain unchanged from application to application; only the form’s grammar would be unique. X+V <sync> tags would tie the HTML input control names to the appropriate VoiceXML fields.
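As a rough illustration of the conversion step, the sketch below scans an HTML form for input controls and emits a <sync> tag for each one that has a known VoiceXML counterpart. It is a Python sketch under several assumptions: the class and function names are invented, the mapping from input name to xv:id is supplied by the caller, and password and submit controls are skipped on the assumption that they should not be voice-filled. A real proxy would also have to rewrite the document type, namespaces, and <body> handler.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import quoteattr


class InputCollector(HTMLParser):
    """Collect the name attribute of input and select controls."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in ("input", "select") and attributes.get("name"):
            # Assumption: password and submit controls are never auto-filled.
            if attributes.get("type") not in ("password", "submit"):
                self.names.append(attributes["name"])


def sync_tags(html, vxml_uri, input_to_id):
    """Return one xv:sync tag per form control that maps to a VoiceXML field."""
    collector = InputCollector()
    collector.feed(html)
    tags = []
    for name in collector.names:
        if name in input_to_id:
            field_ref = vxml_uri + "#" + input_to_id[name]
            tags.append("<xv:sync xv:input=%s xv:field=%s/>"
                        % (quoteattr(name), quoteattr(field_ref)))
    return tags
```

Given the form in Listing 3 and a mapping such as {"address": "street", "city": "city"}, the function would produce tags in the same shape as those in Listing 4.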
In conclusion
Multimodal auto-fill can easily be added as a service that assists you when entering personal information into a Web application. Part 2 will show how multimodal interaction can be used with Web services for finding resources, such as the nearest Italian restaurants, and making a selection.
About the author
Gerald McCobb has worked for IBM for over 14 years. He currently works in the embedded voice development group putting multimodal interaction into small devices. He is also IBM's representative to the W3C Multimodal Interaction Working Group.