
Multimodal interaction and the mobile Web, Part 3: User authentication

Secure user authentication with voice and visual interaction

Gerald McCobb (mccobb@us.ibm.com), Advisory Software Engineer, IBM
Gerald McCobb has worked for IBM for over 14 years. He currently works in the embedded voice development group putting multimodal interaction into small devices. He is also IBM's representative to the W3C Multimodal Interaction Working Group.

Summary:  User authentication is an essential feature of transactional applications, including those for the mobile Web. See how you can create a multimodal user authentication service for use by mobile device applications.


Date:  10 Jan 2006
Level:  Intermediate

In "Multimodal interaction and the mobile Web, Part 1," I introduced a typical scenario for a multimodal mobile application: a user who wants to use a cell phone to find a restaurant, view a menu, order a meal, and pay for it. Over the Internet, you would normally secure this type of transaction by having the user send a unique name and password over a secure (SSL or TLS) connection to the application. Unfortunately, user names and passwords are much less convenient for the wireless device user, being harder to enter and more frequently forgotten. It's also not a good idea to store such information in a mobile device's Web browser, because such devices are easily lost or stolen.

User authentication is an essential aspect of securing your multimodal interactions for mobile devices, and implementing it through a Web service is both easy and sensible. In this article, I'll introduce you to a user-configurable multimodal Web service that lets you securely authenticate users. Like other examples in this series, the user authentication service is written in the XHTML+Voice (X+V) multimodal markup language.

Multimodal user authentication

Why Web services?

Creating multimodal interactions as configurable Web services frees you from being concerned with the general problem of adding multimodal interaction to applications, while allowing you to develop simple solutions that enhance user experience. A multimodal user authentication service makes it easy for users to be authenticated, even when using small wireless devices to access Web applications. As a result, users are more likely to use their small devices to run Web applications, and will at last take advantage of the high-bandwidth networks being deployed by wireless carriers.

Multimodal user authentication is based on voice authentication, a biometric technology that analyzes a speaker's utterance to find his or her unique vocal characteristics. When a user first enrolls with the service, he or she speaks one or more paragraphs. The recording is analyzed, and the speaker's unique vocal characteristics are stored as a voice print. Later, when the user needs to be authenticated, the stored voice print is compared with a new recording. Current voice authentication technology is reliable, fast, and cost-effective, and several vendors now supply it commercially.

A multimodal authentication interface helps the user both when enrolling over the Internet with a voice authentication system and later when asked by an application for proof of identity. Enrollment is easier because the paragraphs to be spoken are presented as visual text. Unlike enrollment over the telephone, for example, the user doesn't have to remember the paragraphs, but can read them instead. Later, when the user is to be authenticated, he or she is asked to speak a random group of words or numbers, which are also presented visually. The words are randomly chosen each time the authentication page is presented so that a tape recording of the user's voice cannot be used to fool the application.
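The random selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word list, the `ChallengeStore` class, and the function names are hypothetical and are not part of the article's sample code. The sketch draws the words from an OS-backed random source and lets each challenge be answered only once, so that a recording of an earlier session does not match a later challenge.

```python
import secrets

# Hypothetical word list; a real service would draw from a much larger vocabulary.
CITY_NAMES = [
    "Las Vegas", "Detroit", "Budapest", "Miami", "London", "Bismarck",
    "Oslo", "Denver", "Lisbon", "Cairo", "Toronto", "Madrid",
]

_rng = secrets.SystemRandom()  # OS-backed randomness rather than a seedable PRNG


def new_challenge(words=CITY_NAMES, count=6):
    """Return `count` distinct words in random order for the user to speak."""
    if count > len(words):
        raise ValueError("count exceeds the size of the word list")
    return _rng.sample(words, count)


class ChallengeStore:
    """Remembers outstanding challenges so each one can be answered only once."""

    def __init__(self):
        self._pending = {}

    def issue(self):
        """Create a challenge and return (challenge_id, words)."""
        challenge_id = secrets.token_urlsafe(16)
        words = new_challenge()
        self._pending[challenge_id] = words
        return challenge_id, words

    def consume(self, challenge_id):
        """Return the expected words, or None if the ID is unknown or already used."""
        return self._pending.pop(challenge_id, None)
```

A login page would display the words returned by `issue()` and send the challenge ID back with the recording; the service would then call `consume()` before comparing voices.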


The authentication service

A user can enroll from either a desktop PC or a small client device. The desktop PC is recommended because it is generally in a more comfortable, safe, and quiet environment. The user first registers with the multimodal authentication service under a unique user identification (ID). After registering successfully, the user enrolls by speaking one or more paragraphs, which are recorded as audio data, for example in Pulse Code Modulation (PCM) format.

The audio data may be either stored first on the PC and then submitted with the enrollment Web page or streamed to the service while the user is speaking into the microphone. The voice authentication service extracts from the raw audio data a voice print that contains meaningful characteristics unique to the user's voice. The voice print is then stored in a database along with the user's supplied unique ID. Figure 1 shows the steps from accessing the enrollment page of the authentication service using a secure connection to saving the analyzed audio recording in a remote voice print database.
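The enrollment steps can be pictured with the following Python sketch. Everything in it is an assumption made for illustration: the table layout, the function names, and above all `extract_voice_print`, which is a toy stand-in (a histogram of sample magnitudes) and not a real speaker-recognition algorithm. It assumes 16-bit signed little-endian mono PCM input.

```python
import json
import sqlite3
import struct


def extract_voice_print(pcm_bytes, bins=8):
    """STAND-IN feature extractor: a normalized histogram of sample magnitudes.

    A real voice authentication engine derives speaker-specific acoustic
    features; this toy version only keeps the example self-contained.
    """
    count = len(pcm_bytes) // 2
    if count == 0:
        raise ValueError("no audio samples")
    samples = struct.unpack("<%dh" % count, pcm_bytes[: count * 2])
    histogram = [0] * bins
    for sample in samples:
        # abs(sample) ranges over 0..32768; map it to a bin index 0..bins-1.
        index = min(abs(sample) * bins // 32768, bins - 1)
        histogram[index] += 1
    return [n / count for n in histogram]


def open_store(path=":memory:"):
    """Open (or create) the hypothetical voice print database."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS voice_prints ("
        "user_id TEXT PRIMARY KEY, voice_print TEXT NOT NULL)"
    )
    return db


def enroll(db, user_id, pcm_bytes):
    """Extract a voice print and store it under the user's unique ID."""
    voice_print = extract_voice_print(pcm_bytes)
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT OR REPLACE INTO voice_prints (user_id, voice_print) "
            "VALUES (?, ?)",
            (user_id, json.dumps(voice_print)),
        )
    return voice_print


def load_voice_print(db, user_id):
    """Return the stored voice print for user_id, or None if not enrolled."""
    row = db.execute(
        "SELECT voice_print FROM voice_prints WHERE user_id = ?", (user_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Only the derived voice print is kept in this sketch, not the raw audio, which mirrors the description above.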


Figure 1. Creating a voice print with a PC

A user who accesses a Web application to place an order (as in the restaurant example) is first presented with the authentication service. The Internet address (URI) of the service could be included in a "cookie" sent from the user's Web browser, or the user could select it from a list of well-known authentication providers. Figure 2 shows the interactions between the small client device and the Web application, and between the Web application and the authentication service.
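As a rough sketch of how a Web application might read the service URI from a cookie, consider the Python below. The cookie name, the provider URIs, and the function name are all hypothetical. The sketch also makes a design choice the article does not prescribe: it accepts a cookie value only if it uses HTTPS and appears in the application's own list of well-known providers, and otherwise returns `None` so that the application can fall back to asking the user to choose.

```python
from http.cookies import SimpleCookie
from urllib.parse import urlsplit

# Hypothetical list of well-known providers the application is willing to use.
KNOWN_PROVIDERS = {
    "https://auth.example.com/voice",
    "https://voiceid.example.net/login",
}
COOKIE_NAME = "auth_service"  # hypothetical cookie name


def auth_service_from_cookie(cookie_header, known=KNOWN_PROVIDERS):
    """Return the provider URI named in the cookie, or None if absent or untrusted."""
    cookie = SimpleCookie()
    cookie.load(cookie_header or "")
    morsel = cookie.get(COOKIE_NAME)
    if morsel is None:
        return None
    uri = morsel.value
    # Accept only HTTPS URIs that the application already knows about.
    if urlsplit(uri).scheme != "https" or uri not in known:
        return None
    return uri
```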


Figure 2. User authentication with a small device

When the Web application accesses the remote authentication service, it receives a list of random words to present to the user to speak. Figure 3 shows an example user authentication Web page, login.xhtml, as presented by the Opera multimodal browser. The page asks the user to read a list of city names.


Figure 3. Login Web page with voice authentication

The X+V user login page

Listing 1 shows the X+V source of the authentication Web page. Note that the VoiceXML <record> tag performs the actual recording.


Listing 1. X+V user login form
<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml" 
      xmlns:vxml="http://www.w3.org/2001/vxml" 
      xmlns:ev="http://www.w3.org/2001/xml-events"
      xmlns:xv="http://www.voicexml.org/2002/xhtml+voice">
<head><title>Voice Verification Login</title>

  <style type="text/css">
h4 { font-size: 18px; color: #990202; font-family: sans-serif; }
td, p { font-size: 14px; color: #990202; font-family: sans-serif; }
hr { height: 5px; color: #990202; border-style: groove; }
p.box { border: 2px solid #0077ee; margin: 1px 1px 1px 1px; 
  padding: 10px 12px 10px 12px; font-size: 16px; }
input { border: 2px groove #990202; margin: 2px 4px 2px 4px; 
  padding: 4px 6px 4px 6px; font-size: 16px; }
  </style>

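  <!-- voice handler: activated by the load event on the body element -->
  <vxml:form id="voice_record">
     <!-- record captures the user's speech in the variable "recording" -->
     <vxml:record name="recording">
       <vxml:prompt timeout="10s" xv:src="#intro"/>
       <vxml:filled>
         <!-- copy the captured audio into the hidden file input for submission -->
         <vxml:assign name="document.recordForm.record.value" expr="recording"/>
       </vxml:filled>
     </vxml:record>
  </vxml:form>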

</head>
<body id="mainbody" ev:event="load" ev:handler="#voice_record">
 
  <form id="recordForm" action="jsps/login.jsp" method="post" 
    enctype="multipart/form-data">
    <h4>Application Login Instructions</h4>
    <p id="intro">Please enter your account number and speak the 
    cities from left to right in the
       verification box.  Thank you.
    </p>
    <table width="40%">
      <tbody>
        <tr><td colspan="2"><hr/></td></tr>
        <tr><td>Account Number</td>
            <td><input type="text" name="accountno"/>
                <input type="file" name="record" id="record"
                  style="display:none"/>
            </td>
        </tr>
        <tr><td>Verification Box</td>
            <td><p class="box">
Las Vegas Detroit Budapest Miami London Bismarck
            </p></td>
        </tr>
        <tr><td colspan="2"><hr/></td></tr>
        <tr><td colspan="2">
              <input style="border-style: outset" type="submit" value="Login"
                name="Submit"/>
            </td>
        </tr>
      </tbody>
    </table>
  </form>
</body>
</html>

The authentication service extracts the physical characteristics from the recording of the random words received from the user. These characteristics are compared with the voice print stored in its database under the supplied user ID. If the two sets of physical characteristics match, the Web application is informed that the user can proceed to the next step of the transaction. Otherwise, the user is denied access to the application.
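The match-or-deny decision can be illustrated with a short Python sketch. It assumes, purely for illustration, that a voice print is a list of numbers and that "match" means a cosine similarity at or above a threshold; the function names, the dictionary standing in for the voice print database, and the threshold value of 0.85 are all hypothetical. Commercial engines use their own scoring methods.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors differ in length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def authenticate(stored_prints, user_id, features, threshold=0.85):
    """Return True only when the user is enrolled and the features are close enough."""
    enrolled = stored_prints.get(user_id)
    if enrolled is None:
        return False  # unknown user ID: deny access
    return cosine_similarity(enrolled, features) >= threshold
```

For example, with `prints = {"acct-1001": [0.9, 0.1, 0.4]}`, a call such as `authenticate(prints, "acct-1001", [0.88, 0.12, 0.41])` succeeds, while an unknown account number is denied regardless of the audio. Where the threshold is set trades false rejections against false acceptances, so in practice it would need to be tuned rather than fixed as it is here.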


In conclusion

As you've seen in this and previous articles in this series, it's relatively simple to add multimodal interaction to a Web application by way of a Web service. In this case, you've learned how multimodal user authentication works and seen the underlying X+V code for the voice-driven user login page.



Download

Description: Sample code from this article
Name: wi-mobweb3.zip
Size: 3KB
Download method: HTTP


