Searching for information on the Web has become a fast and easy process with the help of powerful search engines like Yahoo! and Google. Seeing your search results (for, say, a local business) plotted on your PC screen makes it convenient to determine locations and driving directions.
Accessing search results on a mobile device, on the other hand, proves more difficult. Even as companies tailor their Web applications to look more visually appealing on small devices, there is still the issue of data input. The lack of a full-sized keyboard makes typing into text-input controls awkward and slow. Handwriting recognition with a stylus eases the task, but it is not always convenient and, in most cases, requires you to use both hands (one to hold the phone; the other to write on the screen).
A multimodal browser helps solve this problem by adding another input mode: voice. Voice recognition allows for a more natural interaction with the device and can be used in one-handed or even hands-free mode.
IBM Find-It was developed for multimodal-enabled browsers using the XHTML+Voice (X+V) markup language. The application lets you access search engine results from the Yahoo! local search APIs using multiple modes of interaction. For example, if you have a traditional Web browser, you can perform queries using the keyboard or stylus; however, if you have a multimodal browser, you can also perform queries by voice. Imagine looking at an HTML form on your mobile device and being able to fill in each text field with one statement: "Show me pizza restaurants in Boca Raton, FL."
Figure 1. The Find-It index page

The index page shown in Figure 1 features a simple HTML form with two fields. In the first field, you enter your search request. In the second field, you enter the city, state, or zip code. Because this is a multimodal Web page, you can enter the text by typing, by writing with a stylus, or simply by speaking. With X+V, you accomplish this by embedding a VoiceXML form within the HTML, as Listing 1 shows.
Listing 1. The index page voice form
<vxml:form id="vForm">
<vxml:grammar src="grm/index.jsgf"/>
<vxml:block>What are you trying to find?</vxml:block>
<vxml:field name="stuff">
<vxml:filled>
<vxml:assign name="document.xform.hStuff.value" expr="stuff"/>
</vxml:filled>
</vxml:field>
<vxml:field name="where">
<vxml:filled>
<vxml:assign name="document.xform.hWhere.value" expr="where"/>
</vxml:filled>
</vxml:field>
</vxml:form>
To determine what you can say in a voice form, you need to write the grammar for it. Listing 2 shows the active grammar for the index page. I have removed several items from the <stuff> rule and omitted the <city> and <state> rules to save space, but this should give you a good idea of what can be said.
Listing 2. The index page grammar
#JSGF V1.0 iso-8859-1;
grammar findit;
public <findit> = [<stuffwords>] <stuff> { $.stuff = $stuff }
[<wherewords>] <where> { $.where = $where } |
[<stuffwords>] <stuff> { $.stuff = $stuff } |
[<wherewords>] <where> { $.where = $where };
<stuffwords> = (find | show [me] | locate | search [for]) ;
<wherewords> = (at | in | near [by] | around) ;
<where> = <zipcode> { $=$zipcode } | <city> { $=$city } |
<city> <state> { $=$city+", "+$state } ;
<stuff> = (Pizza | Chinese | Italian | Barbeque | American | Thai |
Mexican | Spanish | Indian | Japanese | Steak | Seafood |
Steak Houses | Restaurants | [Sports] Bars | [Irish] Pubs |
Banks | [Amusement] Parks | Golf [Courses] | ATMs) ;
<zipcode> = <d1><d2><d3><d4><d5> { $=$d1+$d2+$d3+$d4+$d5 };
<d1> = <digits> {$=$digits};
<d2> = <digits> {$=$digits};
<d3> = <digits> {$=$digits};
<d4> = <digits> {$=$digits};
<d5> = <digits> {$=$digits};
<digits> = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ;
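To make the slot filling concrete, here is a rough Python sketch of how an utterance maps to the stuff and where slots. It is not how the speech recognizer works: it only mimics the top-level rule with regular expressions, it does not check words against the <stuff>, <city>, or <zipcode> vocabularies, and the function name interpret is my own invention for illustration.

```python
import re

# Prefix words copied from the <stuffwords> and <wherewords> rules in Listing 2.
STUFF_WORDS = r"(?:find|show(?: me)?|locate|search(?: for)?)"
WHERE_WORDS = r"(?:at|in|near(?: by)?|around)"

# Location only, for example "near by 33431".
WHERE_ONLY = re.compile(rf"^{WHERE_WORDS}\s+(?P<where>.+)$", re.IGNORECASE)

# Optional prefix, the thing to find, then an optional location that is
# introduced by one of the <wherewords>.
STUFF_THEN_WHERE = re.compile(
    rf"^(?:{STUFF_WORDS}\s+)?(?P<stuff>.+?)"
    rf"(?:\s+{WHERE_WORDS}\s+(?P<where>.+))?$",
    re.IGNORECASE,
)


def interpret(utterance):
    """Return the 'stuff' and 'where' slots for a spoken query (hypothetical helper)."""
    text = utterance.strip()
    match = WHERE_ONLY.match(text)
    if match:
        return {"stuff": "", "where": match.group("where")}
    match = STUFF_THEN_WHERE.match(text)
    if match is None:
        return {"stuff": "", "where": ""}
    return {"stuff": match.group("stuff"), "where": match.group("where") or ""}
```

Unlike the real grammar, this sketch cannot recognize a bare location such as "Boca Raton" without a leading word like "in", because it has no list of cities to compare against.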
After you have entered your search by voice, typing, or both, the browser submits the form to the Web server, which then returns the results page. Figure 2 shows the results for the query: "Find Golf Courses in Austin, Texas".
Figure 2. The results page for "Golf Courses in Austin, TX"

To generate this results page, the server must go through a few steps. First, it extracts the query and location fields from the submitted form and uses them to construct an HTTP request to the Yahoo! local search engine. This returns a set of Extensible Markup Language (XML) data, which describes the returned results as well as how many more results are available on the server. The server then applies an XSLT stylesheet to the result set to build the HTML results table.
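The first two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the endpoint URL, the parameter names, the element and attribute names, and the sample response are all placeholders that I made up, so check the search API documentation for the real ones. The XSLT step is left out.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint and parameter names; substitute the values documented
# for the search API you actually call.
SEARCH_URL = "http://search.example.com/localSearch"


def build_search_url(query, location, start=1, results=10):
    """Build the HTTP GET URL from the two submitted form fields."""
    params = {"query": query, "location": location,
              "start": start, "results": results}
    return SEARCH_URL + "?" + urlencode(params)


# Invented response, for illustration only; the real schema may differ.
SAMPLE_RESPONSE = """<ResultSet totalResultsAvailable="42" firstResultPosition="1">
  <Result><Title>Hancock Golf Course</Title></Result>
  <Result><Title>Lions Municipal Golf Course</Title></Result>
</ResultSet>"""


def parse_results(xml_text):
    """Return the result titles and how many more results remain on the server."""
    root = ET.fromstring(xml_text)
    titles = [r.findtext("Title", default="") for r in root.findall("Result")]
    total = int(root.get("totalResultsAvailable", "0"))
    first = int(root.get("firstResultPosition", "1"))
    remaining = total - (first - 1) - len(titles)
    return titles, remaining
```

Calling build_search_url("Golf Courses", "Austin, TX") yields a URL whose query string is URL-encoded, and parse_results(SAMPLE_RESPONSE) returns the two titles along with the count of results not yet shown.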
The voice parts of the X+V page are also generated with an XSLT stylesheet. This results in a dynamically generated grammar, shown in Listing 3, which allows the user to select one item by its number in the list or by business title.
Listing 3. The dynamic results page grammar
#JSGF V1.0;
grammar results;
public <results> = [select][(go | jump) to] <items> { $=$items };
<items> = (1 | Hancock Golf Course) {$='link-id-1'} |
(2 | Lions Municipal Golf Course) {$='link-id-2'} |
(3 | Golf 512) {$='link-id-3'} |
(4 | Riverside Golf Course) {$='link-id-4'} |
(5 | Golf Links) [the] (home | web) page {$='home-id-5'} |
(5 | Golf Links) {$='link-id-5'} |
(6 | Morris Williams Golf Course) {$='link-id-6'} |
(7 | Peter Pan Mini-Golf) {$='link-id-7'} |
(8 | Butler Park Pitch & Putt) {$='link-id-8'} |
(9 | Jimmy Clay Golf Course) {$='link-id-9'} |
(10 | Roy Kizer Golf Course) {$='link-id-10'} ;
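Find-It builds this grammar with an XSLT stylesheet. The Python sketch below is a stand-in for that stylesheet, not a copy of it, and it shows only the shape of the transformation from a list of titles to a grammar like Listing 3. The function name build_results_grammar is hypothetical. The sketch skips the extra home-page alternatives (such as home-id-5) and does not escape titles that contain JSGF special characters.

```python
def build_results_grammar(titles, first_number=1):
    """Render a JSGF grammar like Listing 3 from a list of business titles."""
    alternatives = []
    for number, title in enumerate(titles, start=first_number):
        # Each item can be selected by its list number or by its title.
        alternatives.append(f"({number} | {title}) {{$='link-id-{number}'}}")
    lines = [
        "#JSGF V1.0;",
        "grammar results;",
        "public <results> = [select][(go | jump) to] <items> { $=$items };",
        "<items> = " + " |\n".join(alternatives) + " ;",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_results_grammar(["Hancock Golf Course", "Golf 512"]))
```

Because the semantic tag for each item is the id of the matching link, the page can act on a recognized utterance by looking up that element, without keeping a separate table of titles.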
Another part of the grammar provides voice access to the navigation controls at the top and bottom of the table, as Listing 4 shows.
Listing 4. Navigation controls grammar
first [(page|set|results)] { $='nav-id-f' } |
previous [(page|set|results)] { $='nav-id-p' } |
next [(page|set|results)] { $='nav-id-n' } |
last [(page|set|results)] { $='nav-id-l' } |
new search { $='nav-id-i' } ;
At this point, you can click on or announce an item in the list and the server returns the details page, shown in Figure 3.
Figure 3. The details page for "Golf Links"

The HTML and voice form of the details page are also generated using XSLT stylesheets. Listing 5 shows the voice form with inline grammar; this allows you to use voice to navigate the page's links.
Listing 5. The details page voice form with grammar
<form xmlns="http://www.w3.org/2001/vxml" id="vForm">
<block>Golf Links</block>
<grammar><![CDATA[
#JSGF V1.0;
grammar nav;
public <nav> = [select][(go | jump) to] <item> { $.nav=$item };
<item> = new search {$='nav-id-i'} |
back to results {$='nav-id-b'} |
(business|Golf Links) [u r l][web|home]
[page|site] {$='home-id-b'} |
Yahoo Details [web|home][page] {$='home-id-y'} |
Yahoo Map [web|home][page] {$='home-id-m'} ;
]]></grammar>
<field name="nav">
<filled>
<value expr="document.getElementById(nav).click()"/>
<clear namelist="nav"/>
</filled>
</field>
</form>
As companies provide access to their search engines through public APIs, Web authors can use X+V to add another mode of input to these engines: voice. Using voice to access information on the Web provides an important and convenient alternative to traditional modes of input, especially for users of small mobile devices. Part 3 of this series will cover a multimodal Web service that securely authenticates the user by matching speech input against a stored voice print.
Download
- Demo: IBM Multimodal Find-It
Learn
- Read the XHTML + Voice 1.2 and Mobile X+V 1.2 Specifications on the VoiceXML Forum Web site.
- From W3C.org, you can get the VoiceXML 2.0 Specification, the Speech Recognition Grammar Specification 1.0, the Semantic Interpretation for Speech Recognition 1.0 Draft Specification, and the XHTML 1.0 Specification.
- Read more about the W3C's Multimodal Interaction Activity.
Get products and technologies
- Download the Multimodal browser and toolkit from the IBM Multimodal technologies site.
- Download Yahoo! APIs from Yahoo! Developer Network.
