Applications for personal computers and small devices are rapidly evolving to meet the market demand for alternatives to keyboard-, keypad-, and stylus-based interaction. Alternative modes of interaction include voice and digital pen, and may be used either separately or combined with other modes. A cell phone user, for example, might get flight information by speaking into the phone's receiver, saying "Show me all flights from Boston to New York on December 23." In response, the application would show a list of flights on the cell phone screen, and the user could then pick one of the flights either by speaking or using the stylus.
The W3C Multimodal Interaction (MMI) Working Group has been at work since 2002 on a standard framework for developing such applications. Recently, the group published a new version of its Multimodal Architecture and Interfaces working draft. While this document is only a working draft, it is on track to becoming a W3C recommendation, and the MMI Working Group has made a lot of progress toward this goal.
This first article in a three-part series provides an overview of the MMI Working Group's Multimodal Architecture in its current form. I discuss both interesting features of the architecture and some of the challenges it poses to Web developers. In Part 2, the most speculative of the three articles, I discuss a number of XML markup languages (both W3C recommendations and early working drafts) in terms of how they might be used as markup interfaces to a Multimodal Architecture implementation. In Part 3 I describe a Web service implementation of the Multimodal Architecture and explain how the implementation would overcome most of the challenges previously described.
The Multimodal Architecture specifies a runtime framework and one or more distributed modality components, which communicate with the runtime framework through a life-cycle events API. As shown in Figure 1, the architecture consists of the following parts:
- Runtime framework
- The runtime framework controls communication between components and provides runtime support for the interaction manager, delivery context, and data component.
- Delivery context
- Static and dynamic device properties, environmental conditions, and user preferences are stored in the delivery context. These properties can then be queried and dynamically updated. The W3C Device Independence Working Group is working on standardizing the interface to the delivery context.
- Interaction manager
- The interaction manager sends and receives all messages between the runtime framework and the modality components, and queries and updates the data component as needed. Because it maintains the dialog flow, current state, and public data, it essentially contains the multimodal application. The MMI Working Group is currently working on standardizing an interaction manager language. This language, SCXML, or State Chart XML, is an XML implementation of David Harel's Statecharts (see Resources).
- Data component
- The data component contains the public data model for the multimodal application.
- Modality components
- Modality components perform tasks such as recognizing spoken input, displaying images, and running a markup language, such as VoiceXML, XHTML, or SVG. Because the modality components talk directly only with the interaction manager and can share data only through the event-based MMI life-cycle API, they are "loosely coupled." The user interaction is not required to be implemented by a markup language; interaction may be based on Java, C#, C++, or another programming language.
Figure 1. The Multimodal Architecture
The following events are sent between a modality component and the runtime framework:
- NewContextRequest
- A modality component sends this optional event to the runtime framework to request a new context. The runtime framework responds with a NewContextResponse event.
- Prepare
- The runtime framework sends this optional event to a modality component so the component can pre-load resources in preparation for being invoked with Start. The modality component responds with a PrepareResponse event.
- Start
- When the runtime framework invokes a modality component, the component responds with a StartResponse event.
- Done
- A modality component sends this event to the runtime framework when it is done processing.
- Cancel
- When the runtime framework cancels the processing of a modality component, the component responds with a CancelResponse event.
- Pause
- When the runtime framework pauses the processing of a modality component, the component responds with a PauseResponse event.
- Resume
- When the runtime framework resumes the processing of a modality component, the component responds with a ResumeResponse event.
- Data
- Either the runtime framework or the modality component can send data to the other. This event is the interface for all inter-modality communication, such as synchronizing focus with the interaction manager as mediator.
- ClearContext
- The runtime framework uses a ClearContext event to clear a context that was shared with a modality component.
- StatusRequest
- Either the runtime framework or a modality component can request current status from the other. A StatusResponse event is sent in response to the request.
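The pairing of events and responses listed above can be summarized as a small lookup table. The following Python sketch is illustrative only: the `Sender` enum, the `LIFE_CYCLE_EVENTS` table, and the two helper functions are hypothetical names chosen for this example. The working draft defines the events themselves, not this representation.

```python
from enum import Enum
from typing import Dict, Optional, Tuple


class Sender(Enum):
    MODALITY_COMPONENT = "modality component"
    RUNTIME_FRAMEWORK = "runtime framework"
    EITHER = "either side"


# Event name -> (who sends it, response event named in the list above, if any).
# This table is a reading of the list in this article, not a normative schema.
LIFE_CYCLE_EVENTS: Dict[str, Tuple[Sender, Optional[str]]] = {
    "NewContextRequest": (Sender.MODALITY_COMPONENT, "NewContextResponse"),
    "Prepare": (Sender.RUNTIME_FRAMEWORK, "PrepareResponse"),
    "Start": (Sender.RUNTIME_FRAMEWORK, "StartResponse"),
    "Done": (Sender.MODALITY_COMPONENT, None),
    "Cancel": (Sender.RUNTIME_FRAMEWORK, "CancelResponse"),
    "Pause": (Sender.RUNTIME_FRAMEWORK, "PauseResponse"),
    "Resume": (Sender.RUNTIME_FRAMEWORK, "ResumeResponse"),
    "Data": (Sender.EITHER, None),
    "ClearContext": (Sender.RUNTIME_FRAMEWORK, None),
    "StatusRequest": (Sender.EITHER, "StatusResponse"),
}


def expected_response(event_name: str) -> Optional[str]:
    """Return the response event paired with event_name, or None if the list names none."""
    return LIFE_CYCLE_EVENTS[event_name][1]


def may_send(sender: Sender, event_name: str) -> bool:
    """Check whether the given side is the one described as sending the event."""
    allowed = LIFE_CYCLE_EVENTS[event_name][0]
    return allowed is Sender.EITHER or allowed is sender
```

A dispatcher built on such a table could, for instance, reject a `Start` event arriving from a modality component, because `may_send(Sender.MODALITY_COMPONENT, "Start")` is false.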
Because responses must be received in a timely manner, the network protocol that delivers MMI life-cycle events must be reliable and private, ensure delivery in proper order, and have authenticated endpoints.
Foundations of the architecture
The W3C Multimodal Architecture is derived primarily from two sources: the Galaxy Communicator and the Model-View-Controller (MVC) architectural pattern. The Galaxy Communicator is an open-source project sponsored by the US Defense Advanced Research Projects Agency (DARPA). It is a distributed hub-and-spokes architecture, wherein servers that process text-to-speech, perform speech recognition, or manage a dialog are at the ends of the spokes. Messages sent between the servers are routed through the central hub, which manages all data and communication.
The well-known MVC design pattern partitions an application into three parts: model, view, and controller. Generally in this architecture the model encapsulates the data and data access methods; the view is the user interface; and the controller processes events from the user interface. The controller updates the model in response to events and the view is updated in response to changes in the model.
The W3C Multimodal Architecture is analogous to the Galaxy Communicator in that the Communicator's hub corresponds to the Multimodal Architecture's runtime framework. Likewise, the Communicator's servers correspond to the Multimodal Architecture's modality, delivery context, and data components. The Multimodal Architecture is like the MVC pattern in that the interaction manager is the controller; the data component and delivery context make up the model; and the modality components are the views.
Communication and event processing
Based on these two architectures, the W3C Multimodal Architecture's runtime framework behaves as both a communication hub and an event processor. Modality components must send all events and data to the framework for processing and fair and timely routing to other components. Figure 2 shows the interaction manager behaving as the controller in an MVC architecture. After the user clicks on the Submit button, the submit data is sent to the interaction manager (1). The interaction manager submits the data to the Web server (2). The Web server's reply to the interaction manager is processed (3). The data model is updated and the updated data is sent simultaneously to the HTML and voice modality components (4). As a result the voice component sends audio data to the client's device to be rendered as text-to-speech, or TTS (5).
Figure 2. The interaction manager handles all data transactions
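The flow in Figure 2 can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the class names, the `on_data` and `submit` methods, and the use of a plain callable as a stand-in for the Web server. The sketch also delivers data to the components one after another, whereas the figure describes the delivery as simultaneous.

```python
from typing import Callable, Dict, List


class ModalityComponent:
    """A view: it only receives data pushed to it by the interaction manager."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.received: List[Dict[str, str]] = []

    def on_data(self, data: Dict[str, str]) -> None:
        # Keep a copy so later model changes do not alter what was delivered.
        self.received.append(dict(data))


class InteractionManager:
    """The controller: the only party that talks to the Web server."""

    def __init__(self, web_server: Callable[[Dict[str, str]], Dict[str, str]]) -> None:
        self.web_server = web_server  # hypothetical stand-in for an HTTP round trip
        self.data_model: Dict[str, str] = {}
        self.components: List[ModalityComponent] = []

    def register(self, component: ModalityComponent) -> None:
        self.components.append(component)

    def submit(self, form_data: Dict[str, str]) -> None:
        # Step 1 has happened: a component handed its submit data to us.
        reply = self.web_server(form_data)  # steps 2 and 3: submit, process reply
        self.data_model.update(reply)       # step 4: update the data model ...
        for component in self.components:   # ... and push it to every view
            component.on_data(self.data_model)
```

Because the HTML and voice components never contact the Web server themselves, both always see the same version of the public data.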
In any implementation of the Multimodal Architecture, events and data relevant to other modality components should be sent directly to the interaction manager, including forms submitted to the Web server. As explained by the MMI Working Group in the December 11, 2006 version of the Multimodal Architecture and Interfaces working draft:
It is therefore good application design practice to divide data into two logical classes: private data, which is of interest only to a given modality component, and public data, which is of interest to the Interaction Manager or to more than one Modality Component. Private data may be managed as the Modality Component sees fit, but all modification of public data, including submission to back end servers, should be entrusted to the Interaction Manager.
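One way to picture this split between private and public data is a modality component that keeps private values to itself and hands every public change to the interaction manager. The sketch below is only an illustration under assumed names: `InteractionManagerStub`, `VoiceComponent`, `PUBLIC_KEYS`, and the example keys are all hypothetical and do not come from the working draft.

```python
from typing import Dict


class InteractionManagerStub:
    """Owns all public data; a real interaction manager would also notify other components."""

    def __init__(self) -> None:
        self.public_data: Dict[str, str] = {}

    def update_public(self, key: str, value: str) -> None:
        self.public_data[key] = value


class VoiceComponent:
    # Hypothetical: which keys count as public is an application design decision.
    PUBLIC_KEYS = frozenset({"departure_city"})

    def __init__(self, interaction_manager: InteractionManagerStub) -> None:
        self._im = interaction_manager
        self._private: Dict[str, str] = {}

    def set_value(self, key: str, value: str) -> None:
        if key in self.PUBLIC_KEYS:
            # Public data is never modified locally; it is entrusted to the manager.
            self._im.update_public(key, value)
        else:
            # Private data is managed as the component sees fit.
            self._private[key] = value

    def private_value(self, key: str) -> str:
        return self._private[key]
```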
Distribution across multiple devices
Another interesting feature of the Multimodal Architecture is that it allows an application to be shared simultaneously among multiple devices. This feature is enabled by the "Russian doll" configuration of multimodal components, which allows a runtime framework to communicate with another runtime framework as if it were a multimodal component.
For an application to span multiple devices it would first have to allow any device to start the application, and then allow the same public data to be shared among all data models. In other words, an update to the application running on one device would be broadcast to all the other devices. An example of such an application is a shared calendar that helps several users find a convenient date and time for a meeting. Figure 3 shows how an application can be shared between two devices according to the Russian doll configuration.
Figure 3. Multiple devices can simultaneously share the same application
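The Russian doll idea amounts to giving a runtime framework the same interface as a modality component, so that one framework can be nested inside another. The Python sketch below shows that shape under assumed names (`Component`, `Leaf`, `RuntimeFramework`, `receive`); it only propagates updates downward from the outer framework and leaves out the upward path an update would take when it originates on the nested device.

```python
from typing import Dict, List


class Component:
    """Common interface: anything that can receive a public data update."""

    def receive(self, key: str, value: str) -> None:
        raise NotImplementedError


class Leaf(Component):
    """An ordinary modality component on some device."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.seen: Dict[str, str] = {}

    def receive(self, key: str, value: str) -> None:
        self.seen[key] = value


class RuntimeFramework(Component):
    """A framework that looks like a component to an enclosing framework."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.data_model: Dict[str, str] = {}
        self.children: List[Component] = []

    def add(self, child: Component) -> None:
        self.children.append(child)

    def receive(self, key: str, value: str) -> None:
        # Update the local data model, then broadcast to everything nested inside.
        self.data_model[key] = value
        for child in self.children:
            child.receive(key, value)
```

In a shared-calendar scenario, the outer framework would hold one device's components plus the second device's framework as just another child.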
Downsides of the Multimodal Architecture
An important aspect of the Multimodal Architecture is that each modality component runs its own markup language document. As a result, when the modality components reside on the same client, the client has to download and distribute multiple documents one or more times during the life of an application. The following two figures show why this is not practical.
Figure 4 shows a configuration where the interaction manager and data model reside on the remote server and an XHTML and a voice modality component both reside on the client. In this example the same Web server supplies both XHTML and VoiceXML documents to their respective components. The XHTML component fetches the first page of the application (1) and sends a NewContextRequest to the server (2) to start an instance of the interaction manager with a new context. The interaction manager runs a new SCXML document and, when instructed by the SCXML, sends a Start event (3) to the voice component. The Start event may include the VoiceXML document to run; otherwise the voice component has to fetch the document separately (4).
Figure 4. Client components connected separately to a server
Because all access to the markup documents run by each modality component happens through the interaction manager, all data received by one modality component has to be routed to the interaction manager and back to the client -- including cookies and other header information. This makes no sense when both modality components are running in the same process. Fetching cached documents must also be coordinated through the network. Every time the user hits the Back button, for example, the client has to send a request to the remote interaction manager to get the previous document from each modality component.
Another performance bottleneck
Figure 5 shows a configuration where the interaction manager, data model, and the XHTML and voice modality components reside on the same device. Here the runtime framework is entirely contained on the client. After the XHTML component fetches the first XHTML page of a multimodal application, it calls the local interaction manager's NewContextRequest API (1). The runtime framework fetches the SCXML document (2) and invokes the interaction manager. The interaction manager then runs the document and, when instructed by the SCXML, calls the voice component's Start event. The Start call may include the VoiceXML document to run; otherwise the voice component has to fetch the document separately (5).
Figure 5. Client components connected through IM to server
The issue with this configuration is once again performance. Only after the XHTML component fetches and parses the first page of an application and the SCXML document has been fetched and is running, and the VoiceXML document has also been fetched and is running, is the text-to-speech prompt presented to the user. Because the network is generally the performance bottleneck, the user could experience a noticeable (and therefore unacceptable) delay between seeing the XHTML page and hearing the text-to-speech. This delay would be alleviated if the SCXML document could be embedded within the XHTML document. The SCXML could then be passed to the interaction manager with the NewContextRequest API call. (Note that the above scenario is only for the sake of example: Either the XHTML or the voice component could retrieve the first page.)
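The effect of embedding the SCXML can be made concrete by counting the sequential fetches that must finish before the prompt is heard. The sketch below is a deliberately crude model with assumed names and simplifications: it ignores parsing time, treats every fetch as one network round trip of equal duration, and takes the round-trip time as a parameter instead of assuming any measured value.

```python
from typing import List


def fetches_before_prompt(scxml_embedded: bool, voicexml_in_start: bool) -> List[str]:
    """List the documents fetched over the network, in order, before TTS can play."""
    fetches = ["XHTML page"]
    if not scxml_embedded:
        # Without embedding, the runtime framework fetches the SCXML separately.
        fetches.append("SCXML document")
    if not voicexml_in_start:
        # If Start does not carry the VoiceXML, the voice component fetches it.
        fetches.append("VoiceXML document")
    return fetches


def delay_before_prompt(
    round_trip_seconds: float, scxml_embedded: bool, voicexml_in_start: bool
) -> float:
    """Estimate the delay as one round trip per sequential fetch (a simplification)."""
    return round_trip_seconds * len(fetches_before_prompt(scxml_embedded, voicexml_in_start))
```

Under this model, embedding the SCXML in the XHTML page removes one of the three sequential fetches, and delivering the VoiceXML with the `Start` call removes another.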
In light of these issues I hope that the W3C will work on standardizing how XML representing different modalities can be combined into a single document. This is necessary for the above configuration of local modality components to be practical.
Questions for the MMI Working Group
In addition to performance issues, the MMI Working Group faces a number of other challenges that may or may not be addressed in a future release of the Multimodal Architecture and Interfaces specification. I'll conclude with an overview of these open questions.
- Another refactoring of the Web?
- For legacy applications to take advantage of the Multimodal Architecture, all communication, including form submissions to the Web server, must be replaced with Data events to the interaction manager. Submissions could be forwarded to the interaction manager from the Web server, but wouldn't that make the Web server another modality component? As for Ajax (Asynchronous JavaScript and XML), it follows that the XMLHttpRequest object must contain the Data event so the interaction manager can fetch the dynamic XML data for all the modality components.
- Multimodal component capabilities?
- If an application requires a voice component that understands French (for example), how can it query the component's capabilities to find out whether the component does? If the information is stored in the delivery context, how can the application access it? An interface to the delivery context is needed to resolve this issue.
- Generic data?
- The generic Data event is the only interface available to markup authors; the other life-cycle events in the Multimodal Architecture concern the operation (starting, stopping, pausing, and so on) of a modality component as a software component. Unfortunately, the Data event is not interoperable from end to end unless the author knows, at every endpoint, how it is formatted upon receipt and how to format it for sending. Furthermore, any one of the components could change the data format arbitrarily at any time!
At the client, the contents of a received Data event must be mapped to the events and data fields (for example, the Document Object Model, or DOM, for an XHTML client) instantiated on the client. This requires a scripting language such as JavaScript in order to inspect and handle the Data content.
- Generic black boxes?
- Modality components are black boxes so that the data they maintain can be encapsulated. However, along with encapsulation of data there must be an API that represents the behavior associated with the private data, because that behavior can be accessed only through the API. Unfortunately, the Data event API represents an indeterminate behavior; successive Data calls may update one small piece of data or the entire data model maintained by the component.
The problem is that the W3C Multimodal Interaction Working Group wants to support a wide variety of modality components, some of which may not have an XML language interface or a DOM. Unable to standardize what data may be encapsulated by the modality components, the Working Group cannot specify the associated behavior, either. Consequently, application developers will have to deal with the data format problem.
- Multimodal markup?
- Before an XHTML component can send a NewContextRequest event to the interaction manager, the request must be indicated somehow within the XHTML page. It is up to the W3C Multimodal Interaction Working Group to standardize how the NewContextRequest is indicated and how SCXML markup can be embedded within the XHTML. Markup authors also need to know how to send an event to the interaction manager as well as how to listen for and handle messages sent by the interaction manager.
- A multimodal protocol?
- The multimodal life-cycle events envisioned by the Multimodal Architecture could be sent from a Web browser through HTTP or the Ajax XMLHttpRequest object, and some applications could send and receive events through SIP if a SIP stack is available. However, Web browsers today do not support the receipt of asynchronous messages that would be "pushed" by the interaction manager. For this, we need a multimodal protocol that currently does not exist, although several candidates have been proposed within the IETF (see Resources).
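The "generic data" question above is easier to see with a concrete handler. The following Python sketch shows the kind of private agreement a sender and receiver would need in order to exchange `Data` events. All of it is assumed for illustration: the JSON envelope, the `format` tag `"field-updates/1"`, the `updates` key, and the function and exception names are hypothetical, and the working draft defines none of them. A plain dictionary stands in for the client's DOM fields.

```python
import json
from typing import Dict


class UnknownDataFormat(Exception):
    """Raised when a Data payload does not use the format this client expects."""


# Hypothetical format tag agreed on privately by the application's authors.
EXPECTED_FORMAT = "field-updates/1"


def apply_data_event(payload: str, fields: Dict[str, str]) -> None:
    """Map the contents of a received Data event onto the client's fields."""
    message = json.loads(payload)
    if message.get("format") != EXPECTED_FORMAT:
        # Nothing in the architecture stops a sender from changing its format,
        # so the receiver has to check and fail loudly.
        raise UnknownDataFormat(str(message.get("format")))
    for name, value in message["updates"].items():
        if name in fields:  # ignore updates for fields this client does not have
            fields[name] = value
```

If any component changes its payload format without the others knowing, the receiver in this sketch can only detect the mismatch and raise an error; it has no way to recover the intended meaning.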
The W3C Multimodal Architecture is a distributed architecture that defines a runtime framework, a life-cycle API, and any number of loosely coupled modality components. This article has shown that the architecture is primarily geared toward server-side interaction management, with the modality components distributed over multiple clients and servers. It has also described some of the challenges facing the MMI Working Group and briefly explained how these challenges will affect developers seeking to build Web applications using the architecture.
In the next article in this series I will discuss the XML markup languages that will address at least some of these challenges. For example, one specification (a W3C Working Draft last updated in 2003) proposes a set of intent-based events that may someday be added to the multimodal life-cycle API. SCXML (on track to becoming a W3C recommendation) is central to the Multimodal Architecture and the next generation of VoiceXML, known as V3. It may also become more generally important to server-side Web development. I'll discuss the potential impact of both of these languages, and more, in the next article in this series.
Learn
- Multimodal interaction and the mobile Web, Part 1: Multimodal auto-fill (Gerald McCobb, developerWorks, November 2005): Extend a Web browser's auto-fill capabilities with voice interaction.
- Multimodal interaction and the mobile Web, Part 2: Simple searches with Find-It (Marc White, developerWorks, December 2005): Enable voice access to a local search engine.
- Multimodal interaction and the mobile Web, Part 3: User authentication (Gerald McCobb, developerWorks, January 2006): Secure user authentication with voice and visual interaction.
- Designing mobile Web services (Shu Fang Ru, developerWorks, January 2006): Learn more about crafting mobile Web services.
- VoiceXML Forum: Read the XHTML + Voice 1.2 and Mobile X+V 1.2 specifications.
- W3C's Voice Browser Activity page: Home of the State Chart XML (SCXML) working draft; the VoiceXML 2.0 recommendation; the Speech Recognition Grammar specification 1.0; and the Semantic Interpretation for Speech Recognition 1.0 draft recommendation.
- W3C's Multimodal Interaction Activity page: Home of the Multimodal Architecture and Interfaces working draft; the Multimodal Application Developer Feedback working group note; and the EMMA: Extensible MultiModal Annotation markup language working draft.
- The Internet Engineering Task Force (IETF): Home of the Distributed Multimodal Synchronization Protocol and the Widget Description Exchange Service (WIDEX) Internet drafts.
Get products and technologies
- Opera: A multimodal browser.
Discuss
- developerWorks blogs: Get involved in the developerWorks community.
Gerald McCobb has worked for IBM for over 15 years. He currently works in WebSphere Business Activity Monitor development. He is also IBM's representative to the W3C Multimodal Interaction Working Group.