In "Multimodal interaction and the mobile Web, Part 1," I introduced a typical scenario for a multimodal mobile application: a user who wants to use a cell phone to find a restaurant, view a menu, order a meal, and pay for it. Over the Internet, you would normally secure this type of transaction by having the user send a unique name and password over a secure (SSL or TLS) connection to the application. Unfortunately, user names and passwords are much less convenient for the wireless device user, being harder to enter and more frequently forgotten. It's also not a good idea to store such information in a mobile device's Web browser, because such devices are easily lost or stolen.
User authentication is an essential aspect of securing your multimodal interactions for mobile devices, and implementing it through a Web service is both easy and sensible. In this article, I'll introduce you to a user-configurable multimodal Web service that lets you securely authenticate users. Like other examples in this series, the user authentication service is written in the XHTML+Voice (X+V) multimodal markup language.
Multimodal user authentication
Multimodal user authentication is based on voice authentication, a biometric technology that analyzes a speaker's utterance to find his or her unique vocal characteristics. When a user first enrolls with the service, he or she speaks one or more paragraphs. The service analyzes the recording and stores the speaker's unique vocal characteristics as a voice print. Later, when the user is asked to be authenticated, the voice print is compared with a new recording. Current voice authentication technology is reliable, fast, and cost-effective, and several vendors now supply it commercially.
A multimodal authentication interface helps the user both when enrolling over the Internet with a voice authentication system and later when asked by an application for proof of identity. Enrollment is easier because the paragraphs to be spoken are presented as visual text. Unlike enrollment over the telephone, for example, the user doesn't have to remember the paragraphs, but can read them instead. Later, when the user is to be authenticated, he or she is asked to speak a random group of words or numbers, which are also presented visually. The words are randomly chosen each time the authentication page is presented so that a tape recording of the user's voice cannot be used to fool the application.
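The random challenge described above can be sketched in a few lines of Python. This is a minimal illustration, not the service's actual implementation: the function name `make_challenge` and the word pool are assumptions (only the first six cities appear in this article), and a real service would draw from a much larger vocabulary.

```python
import secrets

# Hypothetical word pool. The first six cities come from the login page
# shown in this article; the rest are made up for illustration.
CITY_POOL = [
    "Las Vegas", "Detroit", "Budapest", "Miami", "London", "Bismarck",
    "Boston", "Madrid", "Toronto", "Denver", "Oslo", "Seattle",
]


def make_challenge(pool, count=6):
    """Return `count` distinct words from `pool` in a random order."""
    if count > len(pool):
        raise ValueError("pool is smaller than the requested challenge")
    # SystemRandom draws from the operating system's entropy source,
    # so the next challenge is hard for an attacker to predict.
    return secrets.SystemRandom().sample(pool, count)


challenge = make_challenge(CITY_POOL)
print(" ".join(challenge))
```

Because the words and their order change on every page load, a recording captured during one login is unlikely to match the next challenge.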
A user can enroll from either a desktop PC or a small client device. The desktop PC is recommended because it is generally used in a more comfortable, safe, and quiet environment. The user first registers with the multimodal authentication service under a unique user identification (ID). After registering successfully, the user enrolls by speaking one or more paragraphs, which are recorded as audio data, for example in Pulse Code Modulation (PCM) format.
The audio data may be either stored first on the PC and then submitted with the enrollment Web page or streamed to the service while the user is speaking into the microphone. The voice authentication service extracts from the raw audio data a voice print that contains meaningful characteristics unique to the user's voice. The voice print is then stored in a database along with the user's supplied unique ID. Figure 1 shows the steps from accessing the enrollment page of the authentication service using a secure connection to saving the analyzed audio recording in a remote voice print database.
Figure 1. Creating a voice print with a PC
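The enrollment flow (extract a voice print from the audio, then store it under the user's ID) might look roughly like the following Python sketch. Everything here is an assumption made for illustration: the table layout, the function names, and especially `toy_extract_features`, which is only a stand-in for a vendor's real feature extractor and does not produce a usable voice print.

```python
import json
import sqlite3


def toy_extract_features(pcm_samples):
    """Stand-in for a real extractor (NOT a usable voice print).

    Returns two simple statistics of the signed PCM samples:
    mean absolute amplitude and zero-crossing rate.
    """
    n = len(pcm_samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_abs = sum(abs(s) for s in pcm_samples) / n
    crossings = sum(
        1 for a, b in zip(pcm_samples, pcm_samples[1:]) if (a < 0) != (b < 0)
    )
    return [mean_abs, crossings / (n - 1)]


def open_store():
    """Create an in-memory voice print table keyed by user ID."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE voice_prints ("
        "user_id TEXT PRIMARY KEY, features TEXT NOT NULL)"
    )
    return db


def enroll(db, user_id, pcm_samples, extract=toy_extract_features):
    """Extract features from the audio and save them under user_id."""
    features = extract(pcm_samples)
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT INTO voice_prints (user_id, features) VALUES (?, ?)",
            (user_id, json.dumps(features)),
        )
    return features


def load_voice_print(db, user_id):
    """Return the stored features for user_id, or None if not enrolled."""
    row = db.execute(
        "SELECT features FROM voice_prints WHERE user_id = ?", (user_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


db = open_store()
enroll(db, "acct-1001", [100, -200, 300, -100, 50])
print(load_voice_print(db, "acct-1001"))
```

The primary key on `user_id` enforces the requirement that each enrolled ID be unique; a second enrollment under the same ID raises `sqlite3.IntegrityError`.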

A user calling into a Web application to place an order (as in the restaurant example) would first be presented with the authentication service. The Web application needs the Internet address (URI) of that service: it could be included in a "cookie" sent from the user's Web browser, or the user could select it from a list of well-known authentication providers. Figure 2 shows the interactions between the small client device and the Web application, and between the Web application and the authentication service.
Figure 2. User authentication with a small device
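The two ways of locating the authentication service (a cookie from the browser, or a list of well-known providers) can be combined as in the Python sketch below. The cookie name `auth_service`, the provider URIs, and the function name are all hypothetical. Checking the cookie value against the known list is a design choice of this sketch, not something the article prescribes.

```python
from http.cookies import SimpleCookie
from urllib.parse import urlparse

# Hypothetical list of well-known authentication providers.
KNOWN_PROVIDERS = [
    "https://auth.example.com/voice",
    "https://verify.example.org/login",
]


def pick_auth_service(cookie_header, known=KNOWN_PROVIDERS,
                      cookie_name="auth_service"):
    """Return the provider URI named in the cookie, or None.

    None means the application should let the user choose a provider
    from the `known` list instead.
    """
    cookie = SimpleCookie()
    cookie.load(cookie_header or "")
    if cookie_name in cookie:
        uri = cookie[cookie_name].value
        # Accept only HTTPS URIs that are on the known-provider list, so
        # a forged cookie cannot redirect the login to an arbitrary host.
        if urlparse(uri).scheme == "https" and uri in known:
            return uri
    return None


header = "auth_service=https://auth.example.com/voice; theme=dark"
print(pick_auth_service(header))
```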

When the Web application accesses the remote authentication service, it receives a list of random words for the user to speak. Figure 3 shows an example user authentication Web page, login.xhtml, as presented by the Opera multimodal browser. The page asks the user to read a list of city names.
Figure 3. Login Web page with voice authentication

Listing 1 shows the X+V source of the authentication Web page. Note that the VoiceXML <record> tag performs the actual recording.
Listing 1. X+V user login form
<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events"
      xmlns:xv="http://www.voicexml.org/2002/xhtml+voice">
  <head>
    <title>Voice Verification Login</title>
    <style type="text/css">
      h4 { font-size: 18px; color: #990202; font-family: sans-serif; }
      td, p { font-size: 14px; color: #990202; font-family: sans-serif; }
      hr { height: 5px; color: #990202; border-style: groove; }
      p.box { border: 2px solid #0077ee; margin: 1px 1px 1px 1px;
              padding: 10px 12px 10px 12px; font-size: 16px; }
      input { border: 2px groove #990202; margin: 2px 4px 2px 4px;
              padding: 4px 6px 4px 6px; font-size: 16px; }
    </style>
    <!-- voice handler: speaks the instructions, records the user, and
         copies the recording into the hidden "record" form field -->
    <vxml:form id="voice_record">
      <vxml:record name="recording">
        <vxml:prompt timeout="10s" xv:src="#intro"/>
        <vxml:filled>
          <vxml:assign name="document.recordForm.record.value" expr="recording"/>
        </vxml:filled>
      </vxml:record>
    </vxml:form>
  </head>
  <!-- the voice handler starts as soon as the page has loaded -->
  <body id="mainbody" ev:event="load" ev:handler="#voice_record">
    <form id="recordForm" action="jsps/login.jsp" method="post"
          enctype="multipart/form-data">
      <h4>Application Login Instructions</h4>
      <p id="intro">Please enter your account number and speak the
        cities from left to right in the
        verification box. Thank you.
      </p>
      <table width="40%">
        <tbody>
          <tr><td colspan="2"> <hr/></td></tr>
          <tr><td>Account Number</td>
            <td><input type="text" name="accountno"/>
              <input type="file" name="record" id="record"
                     style="display:none"/>
            </td>
          </tr>
          <tr><td>Verification Box</td>
            <td><p class="box">
              Las Vegas Detroit Budapest Miami London Bismarck
            </p></td>
          </tr>
          <tr><td colspan="2"><hr/></td></tr>
          <tr><td colspan="2">
            <input style="border-style: outset" type="submit" value="Login"
                   name="Submit"/>
            </td>
          </tr>
        </tbody>
      </table>
    </form>
  </body>
</html>
The authentication service extracts the physical characteristics from the recording of the random words received from the user. It then compares these characteristics with the voice print stored in its database under the supplied user ID. If the two sets of physical characteristics match, the Web application is informed that the user can proceed to the next step of the transaction. Otherwise, the user is denied access to the application.
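The match-or-deny decision can be illustrated with a small Python sketch. The article does not say how the service scores a match, so the cosine similarity measure, the 0.95 threshold, and the three-element feature vectors below are all assumptions chosen for illustration. Commercial voice authentication engines use their own, more sophisticated scoring.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def verify(stored_print, sample_features, threshold=0.95):
    """Return True if the new sample is close enough to the stored print."""
    return cosine_similarity(stored_print, sample_features) >= threshold


# Hypothetical feature vectors, made up for this example.
enrolled = [0.9, 0.1, 0.4]
print(verify(enrolled, [0.88, 0.12, 0.41]))  # similar sample: accepted
print(verify(enrolled, [0.1, 0.9, 0.2]))     # different sample: rejected
```

The threshold trades convenience against security: raising it rejects more impostors but also turns away more legitimate users whose voice varies from day to day.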
As you've seen in this and previous articles in this series, it's relatively simple to provide multimodal interaction to a Web application as a Web service. In this case, you've learned how multimodal user authentication works and seen the underlying X+V code for the voice-driven user login page.
| Description | Name | Size | Download method |
|---|---|---|---|
| Sample code from this article | wi-mobweb3.zip | 3KB | HTTP |
Learn
- Multimodal interaction and the mobile Web, Part 1: Multimodal auto-fill (developerWorks, November 2005): Extend a Web browser's auto-fill capabilities with voice interaction.
- Multimodal interaction and the mobile Web, Part 2: Simple searches with Find-It (developerWorks, December 2005): Enable voice access to a local search engine.
- Designing mobile Web services (developerWorks, January 2006): Learn more about crafting mobile Web services.
- VoiceXML Forum: Read the XHTML + Voice 1.2 and Mobile X+V 1.2 specifications.
- W3C.org: Home of the VoiceXML 2.0 specification, the Speech Recognition Grammar specification 1.0, the Semantic Interpretation for Speech Recognition 1.0 draft specification, and the XHTML 1.0 specification.
- W3C's Multimodal Interaction Activity page: Includes updates on the current situation, activity status, and working group information.
Get products and technologies
- Opera: A multimodal browser.
- IBM Multimodal technologies: Download IBM's multimodal browser and toolkit.
Discuss
- developerWorks blogs: Get involved in the developerWorks community.
