Improve mobile communication with a multimodal translator

Web apps provide voice and visual translation services

Qi KeKe (qikeke@cn.ibm.com), Staff Software Engineer, IBM
Qi KeKe has worked for IBM for more than five years. He currently works with voice technology in the IBM China Development Laboratory (CDL) and is part of the Pervasive Computing (PvC) Voice and Radio Frequency Identification (RFID) Solution team for customer support and service.

Summary:  Adding multimodal interaction to mobile applications makes it easier to interact with a device: you can provide input for translation by voice and receive the results both on screen and as speech. This article introduces an automatic language translator created with IBM® WebSphere® Translation Server that can run on mobile phones or Personal Digital Assistants (PDAs).

Date:  03 Oct 2006
Level:  Introductory
Also available in:   Chinese


As mobile devices gain in popularity, more and more Web applications are being directed toward mobile users. However, traditional small-screen input methods, such as the keyboard and stylus, don't always make it easy to navigate and fill in a form. Multimodal technology, which includes speech-recognition and speech-synthesis technology, aims to resolve this problem.

Multimodal access is the ability to combine multiple modes or channels in the same interaction or session. The methods of input can include speech recognition, a keyboard, a touch screen, and a stylus. Depending on the situation and the device, a combination of input modes can make a small device easier to use. For example, in a Web browser on a PDA, you could select items by tapping or by providing spoken input. Similarly, you could use voice or a stylus to enter information into a field. With multimodal technology, information on the device can be both displayed and spoken.

Multimodal applications using XHTML+Voice (X+V) offer a natural migration path from today's VoiceXML-based voice applications and XHTML-based visual applications to a single application that can serve both of these environments as well as multimodal ones. For developers who have Web application experience, it's not difficult to develop multimodal applications. The architecture of multimodal applications is similar to that of Web applications: the browser sends a request to the server, and the server responds with pages (see Figure 1).


Figure 1. Multimodal application architecture

The difference is that multimodal applications generate X+V pages, while traditional Web applications generate HTML pages. Multimodal applications therefore need a browser that supports X+V: it must parse and display the pages, run the call flow defined by VoiceXML, recognize the user's voice input, synthesize the audio, and play the audio to the user. Numerous multimodal browsers exist in the market, some of which offer free trial versions (see Resources).

WebSphere Translation Server

More and more people are traveling to large international cities as both tourists and workers, often even setting up temporary or permanent residence. However, language can be a challenge: visitors who cannot speak the local language can have problems communicating with the local people.

The IBM WebSphere Translation Server provides multilingual translation between English and French, Italian, Spanish, Portuguese, Chinese, Japanese, and Korean, in both directions. Integrating this translation ability with mobile devices can help English speakers in their daily lives. The dictionary can even be customized to improve the quality of the translation.


Multimodal translator

Let's say Tom from the United States visits China, and he can't speak Chinese. When he leaves the airport, he wants to go to the Great Wall Hotel. However, the taxi driver can't understand English. Fortunately, the multimodal translator gives Tom support.

First, Tom accesses an X+V page with his mobile phone, and the multimodal page receives his speech input. After the application recognizes what Tom says, it submits the result to the application server, and the application server sends a request to the translation server. After the application server receives the translated Chinese text, it generates a new X+V page that contains the translated text and sends the page to Tom's mobile phone. The multimodal browser synthesizes the Chinese text and plays it to the taxi driver. Similarly, the application helps Tom understand what the taxi driver is saying.
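The round trip described above can be sketched in plain Java. This is a minimal illustration rather than the article's code: the TranslationFlow class, the Translator interface, and the page fragment are hypothetical names, and a stub stands in for the translation server.

```java
// Hypothetical sketch of the translation round trip; none of these names come from the sample code.
class TranslationFlow {

    /** Stand-in for the call that the application server makes to the translation server. */
    interface Translator {
        String translate(String direction, String text);
    }

    private final Translator translator;

    TranslationFlow(Translator translator) {
        this.translator = translator;
    }

    /**
     * Handles one request: takes the recognized text submitted by the browser,
     * asks the translator for the target-language text, and returns the fragment
     * that the next X+V page would speak.
     */
    String handleRequest(String recognizedText) {
        if (recognizedText == null || recognizedText.trim().isEmpty()) {
            return "<vxml:block></vxml:block>"; // nothing to translate yet
        }
        String translated = translator.translate("encn", recognizedText.trim());
        // A real page would also escape XML special characters in the translated text.
        return "<vxml:block>" + translated + "</vxml:block>";
    }

    public static void main(String[] args) {
        // Stub translator: a real deployment would call the translation server here.
        Translator stub = (direction, text) -> "[" + direction + "] " + text;
        TranslationFlow flow = new TranslationFlow(stub);
        System.out.println(flow.handleRequest("Please take me to the Great Wall Hotel"));
    }
}
```

Keeping the translator behind a small interface means the page-generation logic can be exercised without a running translation server.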

Multimodal application development is similar to traditional Web application development; the difference is that the server sends X+V pages to the client instead of HTML pages. For this system, I used Eclipse 3.1.1 as the development tool and deployed the application on an Apache Geronimo 1.0 server.

System interface

The example application's simple interface includes a text box and a Submit button. Figure 2 shows how the system interface appears on the multimodal browser, NetFront.


Figure 2. Multimodal translator interface

After the browser receives this page, users can fill in the form with speech. For example, Tom can say, "Please take me to the Great Wall Hotel." The multimodal browser recognizes the speech input and fills in the text box. VoiceXML is embedded in the XHTML page to enable Automatic Speech Recognition (ASR) functions, as Listing 1 shows.


Listing 1. VoiceXML in the XHTML page
      
<vxml:form id="vForm">
  <vxml:property name="confidencelevel" value="0.0"/>
  <vxml:block> <%=output %></vxml:block>		
  <vxml:field name="input">
    <vxml:grammar>
      <![CDATA[
      #JSGF V1.0;
      grammar english;
      public <english> = [<greeting>] (<take> | <how> | <near>);
      <greeting> = Hello | Good (morning | afternoon | day | evening |night)
        | yes please | Thank you very much |My pleasure| Excuse me |Help;
      <take> = please take me to <place> ;
      <how> = How do I get to the <place> ;
      <near> = Is there [a] <place> near by ;
      <place> = Great Wall hotel | Holiday hotel | hotel | Summer Palace | Forbidden city
        | airport | station |bus station |metro station | subway station | police station 
        |post office |baker |bank |bar |bus stop | cafe |cake shop |hospital;
      ]]>
    </vxml:grammar>
    <vxml:filled>
      <vxml:assign name="document.xform.input.value" expr="input"/>
    </vxml:filled>
  </vxml:field>
</vxml:form>
    

The grammar defines what speech input from users the system can understand. The example is simple, but you can customize the grammar to meet other scenarios and requirements. Unfortunately, no matter how perfect the grammar is, it's impossible for the system to understand users' speech input correctly 100 percent of the time. However, users can still use a keyboard or stylus to enter or correct the text. This showcases the advantage of multimodal systems, because they provide a convenient and natural voice interface but are not limited to voice input and output. When speech recognition doesn't work well, keyboard and handwriting recognition are effective supplements.
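When you customize a grammar, it can help to check typed sentences against it before testing with speech. The sketch below is only an approximation under stated assumptions: the GrammarCheck class is hypothetical, the regular expression is a hand-written rendering of the take, how, and near rules with a subset of the places from Listing 1, and it is not a JSGF parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper: approximates part of the Listing 1 grammar with a regular expression.
class GrammarCheck {

    // A subset of the <place> alternatives from Listing 1, lowercased.
    private static final String PLACE =
        "(great wall hotel|holiday hotel|hotel|airport|station|bus station"
        + "|post office|bank|hospital)";

    // Hand-written equivalents of the <take>, <how>, and <near> rules.
    private static final Pattern SENTENCE = Pattern.compile(
        "(please take me to " + PLACE + ")"
        + "|(how do i get to the " + PLACE + ")"
        + "|(is there( a)? " + PLACE + " near by)");

    /** Returns true if the sentence, ignoring case and punctuation, matches one of the rules. */
    static boolean isCovered(String sentence) {
        String normalized = sentence.toLowerCase()
            .replaceAll("[^a-z ]", "")   // drop punctuation and digits
            .trim()
            .replaceAll(" +", " ");      // collapse repeated spaces
        return SENTENCE.matcher(normalized).matches();
    }

    public static void main(String[] args) {
        System.out.println(isCovered("Is there a bank near by?"));
        System.out.println(isCovered("Please take me to the Great Wall Hotel."));
    }
}
```

A strict text match like this one accepts "Please take me to Great Wall hotel" but not the same sentence with "the" before the place name, because the take rule has no optional article. Checks of this kind can surface such gaps before users run into them.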

After the system recognizes the user's speech input, it asks the user to confirm that the text is correct. The user can say "Yes" or "OK," or can click the Translate button, to submit an HTTP request to the application server. (If the text is incorrect, the user can refresh the page and speak again, or the user can edit the text with a keyboard or stylus.) The input text is sent as a parameter of the HTTP request. The application server receives the request, extracts the text, and passes it to a JavaBean component, known as the translation bean. The bean sends the translation request to the translation server and receives the Chinese result. The bean also records the dialog.

Translation bean

The translation bean communicates with WebSphere Translation Server by invoking the Java API that the server provides, and it performs the translation between English and Chinese. The bean also keeps a dialog record of the text in both Chinese and English, which is shown to users. The JavaServer Pages (JSP) declaration <jsp:useBean id="translate" class="ibm.Translate" scope="session"/> gives the bean session scope, so one bean instance records the whole dialog for a session.

WebSphere Translation Server provides two sets of APIs -- one that supports Java code and one that supports C. I used the Java API in the example (see Listing 2).


Listing 2. JavaBean constructor code
public Translate(){
    // Dialog records, newest entry first, one list per language
    record_cn = new ArrayList();
    record_en = new ArrayList();
    try{
        // One service per direction: Chinese to English and English to Chinese
        service_cnen = (LTinterface) LTengine.GetService("wtsserver.ibm.com", "cnen");
        service_encn = (LTinterface) LTengine.GetService("wtsserver.ibm.com", "encn");
    }catch (Throwable t){
        t.printStackTrace();
        System.out.println("no service available");
    }
}
      

The constructor of the translation bean initializes the translation services for both directions, Chinese to English and English to Chinese. You can replace the host name of WebSphere Translation Server, wtsserver.ibm.com, with the server's IP address. The record_cn and record_en lists hold the dialog record. Listing 3 shows the trans function of the JavaBean component, which translates the text and records it.


Listing 3. JavaBean code receives the translation result from the translation server
public String trans(String lang, String input) {
    // Pick the service for the requested direction: "cnen" or "encn"
    boolean chineseToEnglish = "cnen".equals(lang);
    LTinterface service = chineseToEnglish ? service_cnen : service_encn;
    String answer = null;
    try{
        // handle is declared elsewhere in the bean and identifies the translation session
        handle=service.jltBeginTranslation("*format=text");
        answer=service.jltTranslate(handle,input);
        service.jltEndTranslation(handle);
    }catch (Exception e){
        e.printStackTrace();
    }
    // Record both sides of the exchange, newest first
    if (chineseToEnglish){
        record_cn.add(0, input);
        record_en.add(0, answer);
    }
    else{
        record_en.add(0, input);
        record_cn.add(0, answer);
    }
    return answer;
}
      

In the implementation, the JavaBean component chooses the LTinterface instance according to the language type; invokes jltBeginTranslation, jltTranslate, and jltEndTranslation to accomplish the text translation; and then records the text.
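In Listing 3, the call that ends the translation is skipped if the translate call throws an exception. One common way to harden such a begin, translate, end sequence is to end the session in a finally block. The sketch below shows the pattern under assumptions: TranslationSession is a hypothetical stand-in for LTinterface, and the method names, the int handle type, and the checked exceptions are illustrative, so the real WebSphere Translation Server signatures may differ.

```java
// Hypothetical sketch of a begin/translate/end sequence that always releases the session.
class SafeTranslate {

    /** Stand-in for the translation service; not the real WebSphere Translation Server interface. */
    interface TranslationSession {
        int begin(String options) throws Exception;
        String translate(int handle, String text) throws Exception;
        void end(int handle) throws Exception;
    }

    /** Returns the translated text, or null if the service failed. */
    static String translate(TranslationSession service, String input) {
        int handle;
        try {
            handle = service.begin("*format=text");
        } catch (Exception e) {
            return null; // could not open a session, so there is nothing to release
        }
        try {
            return service.translate(handle, input);
        } catch (Exception e) {
            return null; // translation failed; the caller decides what to show
        } finally {
            try {
                service.end(handle); // runs on success and on failure
            } catch (Exception ignored) {
                // nothing more to do if closing the session fails
            }
        }
    }
}
```

With this structure, a caller can also check for a null result before adding the exchange to the dialog record.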

This JavaBean component has another function, getrecord, which is used to get the dialog record. You can get the source code to see the detailed implementation of this function. This bean is used in JSP pages. After the user inputs and submits a request, it is forwarded to a new JSP page. The JSP page uses the JavaBean component and receives the translated text, as Listing 4 shows.


Listing 4. JSP page invokes the JavaBean component and gets the translation result
<%
  String input = request.getParameter("input");
  String output = null;
  if (input == null || input.trim().compareTo("")==0)
      output = "";
  else
      output = translate.trans("encn",input);
%>
      

With the code <%=translate.getrecord("cn")%>, the application shows the dialog text recorded in this session. Figure 3 shows the new page.


Figure 3. Translation result page

The JSP page contains the same VoiceXML code as shown in Listing 1:

<vxml:block> <%=output %></vxml:block>

This code lets the multimodal browser synthesize and play the translated text, so the taxi driver can see and hear what Tom says (in Chinese). Similarly, the taxi driver can speak Chinese, and the application translates the speech to English and plays it for Tom.
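The sample code remains the reference for how getrecord builds the dialog record shown on the page. As an illustration only, the sketch below shows one way a session-scoped record could be kept newest first and rendered for the page, escaping markup characters so that recognized or translated text cannot break the XHTML. The DialogRecord class, the br separator, and the escaping step are assumptions, not details taken from the article's bean.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a dialog record; not the implementation in the sample code.
class DialogRecord {
    private final List<String> english = new ArrayList<>();
    private final List<String> chinese = new ArrayList<>();

    /** Stores one exchange at the front of each list, so the newest entry comes first. */
    void add(String englishText, String chineseText) {
        english.add(0, englishText);
        chinese.add(0, chineseText);
    }

    /** Renders the record for one language ("cn", otherwise English) as XHTML-safe lines. */
    String render(String lang) {
        List<String> lines = "cn".equals(lang) ? chinese : english;
        StringBuilder out = new StringBuilder();
        for (String line : lines) {
            out.append(escape(line)).append("<br/>");
        }
        return out.toString();
    }

    /** Escapes the characters that are special in XML text content. */
    static String escape(String text) {
        if (text == null) {
            return ""; // a failed translation is shown as an empty line
        }
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

The same escape step would apply to any text placed inside the vxml:block element, because the X+V page must stay well-formed XML.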


Conclusion

Multimodal applications can enrich the interaction between humans and machines. With the help of a multimodal translator, a machine can both receive a user's speech input and deliver information to the user with speech. Using the simple API provided by WebSphere Translation Server, the application can accomplish translation between languages, and you can communicate effectively with those who speak a different language.



Download

Description                     Name            Size    Download method
Sample code for this article    translate.war   6KB     HTTP


Resources

Get products and technologies

  • IBM Multimodal site: Download the Multimodal Browser and the Multimodal Toolkit from the Multimodal site.
