
Multimodal interaction and the mobile Web, Part 2: Simple searches with Find-It

How to enable voice access to the Yahoo! local search engine

Marc White (whitemar@us.ibm.com), Advisory Software Engineer, IBM
Marc White is an Advisory Software Engineer for IBM in Boca Raton, Florida. He is a core member of the multimodal development team, which uses XHTML+Voice to voice-enable Web pages for IBM Web browser partners. Marc joined IBM in 1999 to help port the IBM ViaVoice dictation product to Macintosh.

Summary:  Adding multimodal interaction to mobile applications as a Web service enhances your experience by making it easy to fill in forms, locate information, make selections, and access secure applications upon verification. This article, Part 2 of a series, covers the IBM Multimodal Find-It application, which was developed using Extensible HyperText Markup Language (XHTML)+Voice and provides both voice and traditional access to Yahoo! local search engine results.


Date:  06 Dec 2005
Level:  Intermediate

Searching for information on the Web has become a fast and easy process with the help of powerful search engines like Yahoo! and Google. Seeing your search results (for, say, a local business) plotted on your PC screen makes it easy to determine locations and driving directions.

Accessing search results on a mobile device, on the other hand, proves more difficult. Even as companies tailor their Web applications to look more visually appealing on small devices, there is still the issue of data input. The lack of a full-sized keyboard makes typing into text-input controls awkward and slow. Handwriting recognition with a stylus makes the task easier, but it is not always convenient and, in most cases, requires you to use both hands (one to hold the phone; the other to write on the screen).

A multimodal browser helps solve this problem by adding another input mode: voice. Voice recognition allows for a more natural interaction with the device and can be used in single-handed or even hands-free mode.

Multimodal Find-It

IBM Find-It was developed for multimodal-enabled browsers using the XHTML+Voice (X+V) markup language. The application lets you access search engine results from the Yahoo! local search APIs using multiple modes of interaction. For example, if you have a traditional Web browser, you can perform queries using the keyboard or stylus; however, if you have a multimodal browser, you can also perform queries by voice. Imagine looking at an HTML form on your mobile device and being able to fill in each text field with one statement: "Show me pizza restaurants in Boca Raton, FL."


Figure 1. The Find-It index page

The index page shown in Figure 1 features a simple HTML form with two fields. In the first field, you enter your search request. In the second field, you enter the city, state, or zip code. Because this is a multimodal Web page, you can enter the text by typing, by writing with a stylus, or simply by speaking. With X+V, you accomplish this by embedding a VoiceXML form within the HTML, as Listing 1 shows.


Listing 1. The index page voice form
<vxml:form id="vForm">
  <vxml:grammar src="grm/index.jsgf"/>
  <vxml:block>What are you trying to find?</vxml:block>
  <vxml:field name="stuff">
    <vxml:filled>
      <vxml:assign name="document.xform.hStuff.value" expr="stuff"/>
    </vxml:filled>
  </vxml:field>
  <vxml:field name="where">
    <vxml:filled>
       <vxml:assign name="document.xform.hWhere.value" expr="where"/>
    </vxml:filled>
  </vxml:field>
</vxml:form>

To determine what you can say in a voice form, you need to write the grammar for it. Listing 2 shows the active grammar for the index page. I have removed several items from the <stuff> rule and omitted the <city> and <state> rules to save space, but this should give you a good idea of what can be said.


Listing 2. The index page grammar
#JSGF V1.0 iso-8859-1;
grammar findit;
public <findit> = [<stuffwords>] <stuff> { $.stuff = $stuff }
                  [<wherewords>] <where> { $.where = $where } |
                  [<stuffwords>] <stuff> { $.stuff = $stuff } |
                  [<wherewords>] <where> { $.where = $where };
<stuffwords> = (find | show [me] | locate | search [for]) ;
<wherewords> = (at | in | near [by] | around) ;
<where> = <zipcode> { $=$zipcode } | <city> { $=$city } |
          <city> <state> { $=$city+", "+$state } ;
<stuff> = (Pizza | Chinese | Italian | Barbeque | American | Thai |
           Mexican | Spanish | Indian | Japanese | Steak | Seafood |
           Steak Houses | Restaurants | [Sports] Bars | [Irish] Pubs |
           Banks | [Amusement] Parks | Golf [Courses] | ATMs) ;
<zipcode> = <d1><d2><d3><d4><d5> { $=$d1+$d2+$d3+$d4+$d5 };
<d1> = <digits> {$=$digits};
<d2> = <digits> {$=$digits};
<d3> = <digits> {$=$digits};
<d4> = <digits> {$=$digits};
<d5> = <digits> {$=$digits};
<digits> = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ;
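
To make the semantic tags in Listing 2 more concrete, here is a minimal Python sketch that imitates what the grammar returns for a few utterances. It is not how a speech engine works internally: it matches plain text with a regular expression, uses only a handful of the <stuff> entries, and fills in the omitted <city> and <state> rules with hypothetical values. The name interpret is likewise an assumption made for this illustration.

```python
import re

# Optional lead-in words, taken from <stuffwords> and <wherewords>.
STUFF_WORDS = r"(?:find|show(?: me)?|locate|search(?: for)?)"
WHERE_WORDS = r"(?:at|in|near(?: by)?|around)"

# A small subset of the <stuff> rule, lowercased for matching.
STUFF = r"(?P<stuff>steak houses|golf courses|golf|pizza|restaurants|banks|atms)"

# <zipcode>: five single digits, spoken one at a time.
ZIPCODE = r"(?P<zip>\d(?: \d){4})"

# Hypothetical stand-ins for the omitted <city> and <state> rules.
CITY = r"(?P<city>boca raton|austin)"
STATE = r"(?P<state>florida|texas)"

WHERE = rf"(?:{ZIPCODE}|{CITY}(?: {STATE})?)"
STUFF_PART = rf"(?:{STUFF_WORDS} )?{STUFF}"
WHERE_PART = rf"(?:{WHERE_WORDS} )?{WHERE}"

# Stuff only, where only, or stuff followed by where.
UTTERANCE = re.compile(rf"(?:{STUFF_PART})?(?:(?:^| ){WHERE_PART})?")


def interpret(utterance):
    """Return the stuff/where values the grammar would assign, or None."""
    text = " ".join(utterance.lower().split())
    match = UTTERANCE.fullmatch(text)
    if not text or not match:
        return None
    result = {}
    if match["stuff"]:
        result["stuff"] = match["stuff"]
    if match["zip"]:
        # Mirrors $=$d1+$d2+$d3+$d4+$d5: the digits are concatenated.
        result["where"] = match["zip"].replace(" ", "")
    elif match["city"]:
        # Mirrors $=$city or $=$city+", "+$state.
        where = match["city"]
        if match["state"]:
            where += ", " + match["state"]
        result["where"] = where
    return result


print(interpret("Show me pizza in Boca Raton Florida"))
print(interpret("near 3 3 4 3 1"))
```

The first call fills both fields, while the second fills only the location, which corresponds to the last alternative of the public <findit> rule.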

After you have entered your search by voice, typing, or both, the browser submits the form to the Web server, which then returns the results page. Figure 2 shows the results for the query: "Find Golf Courses in Austin, Texas".


Figure 2. The results page for "Golf Courses in Austin, TX"

To generate this results page, the server must go through a few steps. First, it extracts the query and location fields from the submitted form and uses them to construct an HTTP request to the Yahoo! local search engine. This returns a set of Extensible Markup Language (XML) data, which describes the returned results as well as how many more results are available on the server. The server then applies an XSLT stylesheet to the result set to build the HTML results table.
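
The first two steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the endpoint URL, the parameter names, and the XML element and attribute names are placeholders rather than the actual Yahoo! local search interface, and the sample response is made up. The XSLT step is left out because the Python standard library has no XSLT processor.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint and parameter names; consult the search
# provider's API documentation for the real ones.
SEARCH_ENDPOINT = "http://search.example.com/localSearch"


def build_search_url(query, location, start=1, results=10, app_id="my-app-id"):
    """Build the HTTP request URL from the two submitted form fields."""
    params = {
        "appid": app_id,
        "query": query,
        "location": location,
        "start": start,
        "results": results,
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)


# Made-up response in the general shape the article describes: a result
# set that also reports how many results exist on the server.
SAMPLE_RESPONSE = """<ResultSet totalResultsAvailable="42" firstResultPosition="1">
  <Result><Title>Hancock Golf Course</Title></Result>
  <Result><Title>Lions Municipal Golf Course</Title></Result>
</ResultSet>"""


def parse_results(xml_text):
    """Return the result titles and the number of results still available."""
    root = ET.fromstring(xml_text)
    titles = [result.findtext("Title") for result in root.findall("Result")]
    total = int(root.get("totalResultsAvailable"))
    first = int(root.get("firstResultPosition"))
    remaining = total - (first - 1) - len(titles)
    return titles, remaining


print(build_search_url("Golf Courses", "Austin, TX"))
print(parse_results(SAMPLE_RESPONSE))
```

The remaining count is what lets the results page decide whether to offer links to further pages of results.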

The voice parts of the X+V page are also generated with an XSLT stylesheet. This results in a dynamically generated grammar, shown in Listing 3, which allows the user to select one item by its number in the list or by business title.


Listing 3. The dynamic results page grammar
#JSGF V1.0;
grammar results;
public <results> = [select][(go | jump) to] <items> { $=$items };
       <items> = (1 | Hancock Golf Course) {$='link-id-1'} | 
        (2 | Lions Municipal Golf Course) {$='link-id-2'} | 
        (3 | Golf 512) {$='link-id-3'} | 
        (4 | Riverside Golf Course) {$='link-id-4'} | 
        (5 | Golf Links) [the] (home | web) page {$='home-id-5'} | 
        (5 | Golf Links) {$='link-id-5'} | 
        (6 | Morris Williams Golf Course) {$='link-id-6'} | 
        (7 | Peter Pan Mini-Golf) {$='link-id-7'} | 
        (8 | Butler Park Pitch & Putt) {$='link-id-8'} | 
        (9 | Jimmy Clay Golf Course) {$='link-id-9'} | 
        (10 | Roy Kizer Golf Course) {$='link-id-10'} ;
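
Find-It produces this grammar with an XSLT stylesheet. Purely to illustrate the mapping from a result list to grammar alternatives, here is a rough Python equivalent. The function name build_results_grammar and the (title, has_home_page) input format are assumptions made for this sketch, and a real generator would also need to handle characters in business names that have special meaning in JSGF.

```python
def build_results_grammar(results):
    """Build a JSGF grammar like Listing 3.

    `results` is a list of (title, has_home_page) tuples, in the order
    they appear in the results table.
    """
    alternatives = []
    for number, (title, has_home_page) in enumerate(results, start=1):
        spoken = f"({number} | {title})"
        if has_home_page:
            # Extra alternative that jumps to the business's own Web page.
            alternatives.append(
                f"{spoken} [the] (home | web) page {{$='home-id-{number}'}}"
            )
        # Every item can be selected by its number or by its title.
        alternatives.append(f"{spoken} {{$='link-id-{number}'}}")

    lines = [
        "#JSGF V1.0;",
        "grammar results;",
        "public <results> = [select][(go | jump) to] <items> { $=$items };",
        "<items> = " + " |\n          ".join(alternatives) + " ;",
    ]
    return "\n".join(lines)


print(build_results_grammar([
    ("Hancock Golf Course", False),
    ("Golf Links", True),
]))
```

Each semantic value is the ID of a link in the results table, so recognizing either the number or the title leads to the same element.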
		 		  

Another part of the grammar provides voice access to the navigation controls at the top and bottom of the table, as Listing 4 shows.


Listing 4. Navigation controls grammar
first      [(page|set|results)] { $='nav-id-f' } |
previous   [(page|set|results)] { $='nav-id-p' } |
next       [(page|set|results)] { $='nav-id-n' } |
last       [(page|set|results)] { $='nav-id-l' } |
new search                      { $='nav-id-i' } ;
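
The article does not show how the links behind these navigation IDs are built. One plausible approach, sketched below in Python, is to compute the start position each link would request from the search engine. The function name nav_start_positions, the 1-based positions, and the fixed page size are all assumptions for this example; "new search" is omitted because it simply returns to the index page.

```python
def nav_start_positions(first, per_page, total):
    """Map each navigation ID to the first result position of its page.

    first    -- position of the first result on the current page (1-based)
    per_page -- number of results shown per page
    total    -- total number of results available on the server
    """
    # Start position of the final page, for example 41 when total is 42
    # and ten results are shown per page.
    last_start = ((total - 1) // per_page) * per_page + 1 if total > 0 else 1
    return {
        "nav-id-f": 1,
        "nav-id-p": max(1, first - per_page),
        "nav-id-n": min(last_start, first + per_page),
        "nav-id-l": last_start,
    }


# On the second page of 42 results, ten per page:
print(nav_start_positions(first=11, per_page=10, total=42))
```

Clamping the previous and next positions keeps the links valid on the first and last pages, so saying "next page" on the last page simply reloads it.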

At this point, you can click an item in the list or say its number or name, and the server returns the details page, shown in Figure 3.


Figure 3. The details page for "Golf Links"

The HTML and voice form of the details page are also generated using XSLT stylesheets. Listing 5 shows the voice form with inline grammar; this allows you to use voice to navigate the page's links.


Listing 5. The details page voice form with grammar
<form xmlns="http://www.w3.org/2001/vxml" id="vForm">
  <block>Golf Links</block>
  <grammar><![CDATA[
    #JSGF V1.0;
    grammar nav;
    public <nav> = [select][(go | jump) to] <item> { $.nav=$item };
      <item> = new search {$='nav-id-i'} | 
               back to results {$='nav-id-b'} |
               (business|Golf Links) [u r l][web|home]
               [page|site] {$='home-id-b'} |
               Yahoo Details [web|home][page] {$='home-id-y'} |
               Yahoo Map [web|home][page] {$='home-id-m'} ;
  ]]></grammar>
  <field name="nav">
    <filled>
      <value expr="document.getElementById(nav).click()"/>
      <clear namelist="nav"/>
    </filled>
  </field>
</form>


In conclusion

As search companies provide access to their search engines through public APIs, Web authors can use X+V to add another mode of input to these engines: voice. Using voice to access information on the Web provides an important and convenient alternative to traditional modes of input, especially for users of small mobile devices. Part 3 of this series will cover a multimodal Web service that securely authenticates the user by matching speech input against a stored voice print.


