As mobile devices gain in popularity, more and more Web applications are being directed toward mobile users. However, traditional small-screen input methods, such as the keyboard and stylus, don't always make it easy to navigate and fill in a form. Multimodal technology, which includes speech-recognition and speech-synthesis technology, aims to resolve this problem.
Multimodal access is the ability to combine multiple modes or channels in the same interaction or session. The methods of input can include speech recognition, a keyboard, a touch screen, and a stylus. Depending on the situation and the device, a combination of input modes can make a small device easier to use. For example, in a Web browser on a PDA, you could select items by tapping or by providing spoken input. Similarly, you could use voice or a stylus to enter information into a field. With multimodal technology, information on the device can be both displayed and spoken.
Multimodal applications using XML+Voice (X+V) offer a natural migration path from today's VoiceXML-based voice applications and XHTML-based visual applications to a single application that can serve both of these environments as well as multimodal ones. Developers who have Web application experience should not find multimodal applications difficult to develop. The architecture of a multimodal application is similar to that of a Web application: the browser sends a request to the server, and the server responds with pages (see Figure 1).
Figure 1. Multimodal application architecture
The difference is that multimodal applications generate X+V pages, while traditional Web applications generate HTML pages. The browser must therefore support X+V pages, including parsing and displaying them. It also must run the call flow defined by VoiceXML, recognize the user's voice input, synthesize the audio, and play the audio to the user. Numerous multimodal browsers exist in the market, some of which offer free trial versions (see Resources).
More and more people are traveling to large international cities as both tourists and workers, often even setting up temporary or permanent residence. However, language can be a challenge; if these foreigners are not able to speak the local language, they can have problems communicating with the local people.
The IBM WebSphere Translation Server provides multilingual translation between English and French, Italian, Spanish, Portuguese, Chinese, Japanese, and Korean. Integrating this translation ability with mobile devices can help English speakers in their daily lives. The dictionary can even be customized to improve the quality of the translation.
Let's say Tom from the United States visits China, and he can't speak Chinese. When he leaves the airport, he wants to go to the Great Wall Hotel. However, the taxi driver can't understand English. Fortunately, the multimodal translator gives Tom support.
First, Tom opens an X+V page on his mobile phone, and the multimodal page receives his speech input. After the browser recognizes what Tom says, it submits the result to the application server, and the application server sends a request to the translation server. After the application server receives the translated Chinese text, it generates a new X+V page that contains the translated text and sends the page to Tom's mobile phone. The multimodal browser synthesizes the Chinese text and plays it to the taxi driver. In the same way, the application helps Tom understand what the taxi driver is saying.
Multimodal application development is similar to traditional Web application development. The difference is that with general Web applications, the page sent from the server to the client is an HTML page, and with multimodal applications, the page sent from the server to the client is an X+V page. In my implementation of this system, I used Eclipse 3.1.1 as the development tool, and I deployed the application on an Apache Geronimo 1.0 server.
The example application's simple interface includes a text box and a Submit button. Figure 2 shows how the system interface appears on the multimodal browser, NetFront.
Figure 2. Multimodal translator interface
After the browser receives this page, users can fill in the form with speech. For example, Tom can say, "Please take me to the Great Wall Hotel." The multimodal browser recognizes the speech input and fills in the text box. VoiceXML is embedded in the XHTML page to enable Automatic Speech Recognition (ASR) functions, as Listing 1 shows.
Listing 1. VoiceXML in the XHTML page
<vxml:form id="vForm">
<vxml:property name="confidencelevel" value="0.0"/>
<vxml:block> <%=output %></vxml:block>
<vxml:field name="input">
<vxml:grammar>
<![CDATA[
#JSGF V1.0;
grammar english;
public <english> = [<greeting>] (<take> | <how> | <near>);
<greeting> = Hello | Good (morning | afternoon | day | evening |night)
| yes please | Thank you very much |My pleasure| Excuse me |Help;
<take> = please take me to [the] <place> ;
<how> = How do I get to the <place> ;
<near> = Is there [a] <place> near by ;
<place> = Great Wall hotel | Holiday hotel | hotel | Summer Palace | Forbidden city
| airport | station |bus station |metro station | subway station | police station
|post office |baker |bank |bar |bus stop | cafe |cake shop |hospital;
]]>
</vxml:grammar>
<vxml:filled>
<vxml:assign name="document.xform.input.value" expr="input"/>
</vxml:filled>
</vxml:field>
</vxml:form>
The grammar defines what speech input from users the system can understand. The example is simple, but you can customize the grammar to meet other scenarios and requirements. Unfortunately, no matter how perfect the grammar is, it's impossible for the system to understand users' speech input correctly 100 percent of the time. However, users can still use a keyboard or stylus to enter or correct the text. This showcases the advantage of multimodal systems, because they provide a convenient and natural voice interface but are not limited to voice input and output. When speech recognition doesn't work well, keyboard and handwriting recognition are effective supplements.
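To get a feel for which utterances the grammar in Listing 1 accepts, the rules can be approximated with a regular expression. The following is only a rough sketch, not a JSGF parser, and it says nothing about how a speech recognizer scores audio. The class name GrammarSketch is hypothetical, only a subset of the greeting and place alternatives is listed, and the sketch assumes that the optional greeting may precede any of the three sentence patterns and that "the" is optional in the "please take me to" pattern.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper: approximates a subset of the Listing 1 grammar with a regex.
// It is not a JSGF parser and does not model speech recognition confidence.
class GrammarSketch {
    // Subsets of <greeting> and <place>; extend them to mirror your own grammar.
    private static final String GREETING =
            "(?:hello|excuse me|good (?:morning|afternoon|day|evening|night))";
    private static final String PLACE =
            "(?:great wall hotel|holiday hotel|hotel|forbidden city|airport|bank|hospital)";

    // Assumptions: the greeting is optional before any pattern,
    // and "the" is optional in the <take> pattern.
    private static final Pattern UTTERANCE = Pattern.compile(
            "(?:" + GREETING + " )?(?:"
            + "please take me to (?:the )?" + PLACE
            + "|how do i get to the " + PLACE
            + "|is there (?:a )?" + PLACE + " near by"
            + ")");

    /** Returns true if the text, ignoring case and punctuation, fits the approximated grammar. */
    static boolean accepts(String text) {
        String normalized = text.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z ]", " ")   // replace punctuation with spaces
                .trim()
                .replaceAll("\\s+", " ");     // collapse runs of whitespace
        return UTTERANCE.matcher(normalized).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts("Please take me to the Great Wall Hotel.")); // prints true
        System.out.println(accepts("Where is the nearest pharmacy?"));          // prints false
    }
}
```

An utterance that falls outside the grammar, such as the second one, is where the keyboard or stylus fallback becomes useful.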
After the system recognizes the user's speech input, it asks whether the text is correct. The user can say "Yes" or "OK," or can click the Translate button, to submit an HTTP request to the application server. (If the text is incorrect, the user can refresh the page and speak again, or edit the text with a keyboard or stylus.) The input text travels as a parameter in the HTTP request. The application server receives the request, extracts the text, and passes it to a JavaBean component, referred to here as the translation bean. The translation bean sends the translation request to the translation server, which translates the text to Chinese, and receives the result. Meanwhile, the application records the dialog.
The main functions of the translation bean are communicating with WebSphere Translation Server, invoking the Java API provided by WebSphere Translation Server, and accomplishing the translation between English and Chinese. At the same time, the JavaBean component makes a dialog record of the translation text in both Chinese and English. The record is shown to users. The JavaServer Pages (JSP) code, <jsp:useBean id="translate" class="ibm.Translate" scope="session"/>, shows that the life cycle of the bean is session, so it records the text in one dialog.
WebSphere Translation Server provides two sets of APIs -- one that supports Java code and one that supports C. I used the Java API in the example (see Listing 2).
Listing 2. JavaBean constructor code
public Translate(){
record_cn = new ArrayList();
record_en = new ArrayList();
try{
service_cnen = (LTinterface) LTengine.GetService("wtsserver.ibm.com", "cnen");
service_encn = (LTinterface) LTengine.GetService("wtsserver.ibm.com", "encn");
}catch (Throwable t){
t.printStackTrace();
System.out.println("no service available");
}
}
The constructor of the translation bean initializes the translation services for both directions, Chinese to English and English to Chinese. You can replace the host name of the WebSphere Translation Server, wtsserver.ibm.com, with the server's IP address. The record_cn and record_en lists save the dialog record. Listing 3 shows the trans function of the JavaBean component, which translates the text and records it.
Listing 3. JavaBean code receives the translation result from the translation server
public String trans(String lang, String input) {
LTinterface service = null;
String answer = null;
if (lang.compareTo("cnen")==0)
service = service_cnen;
else
service = service_encn;
try{
handle=service.jltBeginTranslation("*format=text");
answer=service.jltTranslate(handle,input);
service.jltEndTranslation(handle);
}catch (Exception e){
e.printStackTrace();
}
if (lang.compareTo("cnen")==0){
record_cn.add(0, input);
record_en.add(0, answer);
}
else{
record_en.add(0, input);
record_cn.add(0, answer);
}
return answer;
}
In the implementation, the JavaBean component chooses the LTinterface instance according to the language type; invokes jltBeginTranslation, jltTranslate, and jltEndTranslation to accomplish the text translation; and then records the text.
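The begin, translate, end sequence can be tried without a translation server by hiding the service behind a small interface. The sketch below is an illustration under stated assumptions: TranslationSession and TranslateSketch are hypothetical names, the integer handle type is a guess, and none of this is the WebSphere Translation Server API. It also shows one possible refinement over Listing 3: the handle is released in a finally block, and an exchange is recorded only when a translation actually comes back.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; it mimics the begin/translate/end shape of Listing 3
// but does not use the real WebSphere Translation Server classes.
class TranslateSketch {
    /** Stand-in for the translation service (assumed shape, not the real API). */
    interface TranslationSession {
        int begin(String options);
        String translate(int handle, String text);
        void end(int handle);
    }

    private final TranslationSession service;
    private final List<String> sourceRecord = new ArrayList<>();
    private final List<String> targetRecord = new ArrayList<>();

    TranslateSketch(TranslationSession service) {
        this.service = service;
    }

    /** Translates the input and records the pair only when a result comes back. */
    String trans(String input) {
        int handle = service.begin("*format=text");
        String answer;
        try {
            answer = service.translate(handle, input);
        } finally {
            service.end(handle); // release the handle even if translate throws
        }
        if (answer != null) {
            sourceRecord.add(0, input);   // newest entry first, as in Listing 3
            targetRecord.add(0, answer);
        }
        return answer;
    }

    List<String> sourceRecord() { return sourceRecord; }
    List<String> targetRecord() { return targetRecord; }

    public static void main(String[] args) {
        // Canned service: tags the text instead of contacting a server.
        TranslationSession fake = new TranslationSession() {
            public int begin(String options) { return 1; }
            public String translate(int handle, String text) { return "[zh] " + text; }
            public void end(int handle) { }
        };
        TranslateSketch bean = new TranslateSketch(fake);
        System.out.println(bean.trans("Please take me to the Great Wall Hotel"));
    }
}
```

Passing the service in through the constructor keeps the recording logic testable on a machine that has no translation server installed.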
This JavaBean component has another function, getrecord, which returns the dialog record. Download the sample code to see the detailed implementation of this function. The bean is used in JSP pages. After the user enters text and submits the request, the request is forwarded to a new JSP page. The JSP page uses the JavaBean component and receives the translated text, as Listing 4 shows.
Listing 4. JSP page invokes the JavaBean component and gets the translation result
<%
String input = request.getParameter("input");
String output = null;
if (input == null || input.trim().compareTo("")==0)
output = "";
else
output = translate.trans("encn",input);
%>
With the code <%=translate.getrecord("cn")%>, the application shows the dialog text recorded in this session. Figure 3 shows the new page.
Figure 3. Translation result page
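The article does not show how getrecord assembles the dialog text, so the following is only a guess at one plausible shape, not the implementation in the sample code. It assumes the record is kept newest first, as Listing 3 does with add(0, ...), and that entries are joined with a line-break tag for display. DialogRecordSketch and its method names are hypothetical, and the Chinese strings are written as Unicode escapes to avoid source-encoding problems.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a dialog record; the getrecord implementation in the
// downloadable sample code may differ.
class DialogRecordSketch {
    private final List<String> recordCn = new ArrayList<>();
    private final List<String> recordEn = new ArrayList<>();

    /** Stores one exchange at the front of each list, so the newest entry comes first. */
    void add(String english, String chinese) {
        recordEn.add(0, english);
        recordCn.add(0, chinese);
    }

    /** Returns the "cn" record, or the English record for any other value, joined with <br/>. */
    String getRecord(String lang) {
        List<String> record = "cn".equals(lang) ? recordCn : recordEn;
        return String.join("<br/>", record);
    }

    public static void main(String[] args) {
        DialogRecordSketch dialog = new DialogRecordSketch();
        dialog.add("Hello", "\u4f60\u597d");      // "ni hao"
        dialog.add("Thank you", "\u8c22\u8c22");  // "xie xie"
        System.out.println(dialog.getRecord("en")); // prints Thank you<br/>Hello
    }
}
```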
The JSP page contains the same VoiceXML code as shown in Listing 1:
<vxml:block> <%=output %></vxml:block>
This code lets the multimodal browser synthesize and play the translated text, so the taxi driver can see and hear what Tom says (in Chinese). Similarly, the taxi driver can speak Chinese, and the application translates the speech to English and plays it for Tom.
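Because the translated text is written straight into an XML-based page, characters such as & or < in the text could leave the page malformed. One way to guard against that is to escape the text before it is embedded. The helper below is a minimal sketch and is not part of the article's sample code; the name XmlEscapeSketch is hypothetical, and a production application would more likely rely on an escaping facility from its Web framework.

```java
// Hypothetical helper, not part of the sample code: escapes text before it is
// written into an XML-based page such as an X+V page.
class XmlEscapeSketch {
    /** Replaces the five XML special characters with their predefined entities. */
    static String escapeXml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String output = "Fish & chips <now>";
        System.out.println("<vxml:block>" + escapeXml(output) + "</vxml:block>");
        // prints <vxml:block>Fish &amp; chips &lt;now&gt;</vxml:block>
    }
}
```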
Multimodal applications can enrich the interaction between humans and machines. With the help of a multimodal translator, a machine can both receive a user's speech input and deliver information to the user with speech. Using the simple API provided by WebSphere Translation Server, the application can accomplish translation between languages, and you can communicate effectively with those who speak a different language.
| Description | Name | Size | Download method |
|---|---|---|---|
| Sample code for this article | translate.war | 6KB | HTTP |
Learn
- XHTML+Voice profile 1.2 specification: Learn more about this specification.
- Voice Extensible Markup Language (VoiceXML) Version 2.0 specification: Read about this specification on W3C.
- "Multimodal interaction and the mobile Web" series: Check out "Part 1: Multimodal auto-fill" (developerWorks, November 2005), "Part 2: Simple searches with Find-It" (developerWorks, December 2005), and "Part 3: User authentication" (developerWorks, January 2006).
- WebSphere Translation Server: Learn more about the WebSphere Translation Server.
Get products and technologies
- IBM Multimodal site: Download the Multimodal Browser and the Multimodal Toolkit from the Multimodal site.
Discuss
- developerWorks blogs: Participate and get involved in the developerWorks community.




