The W3C Multimodal Architecture, Part 1: Overview and challenges

What you should know about the emerging architecture for distributed multimodal applications

The W3C Multimodal Interaction Working Group has been refining its proposal for a Multimodal Architecture since 2002. In this first article in a three-part series, Gerald McCobb of IBM presents an overview of the group's progress. Get an early look at the emerging architecture and learn about the challenges Web developers should consider when deciding whether to implement it.


Gerald McCobb, Advisory Software Engineer, IBM

Gerald McCobb has worked for IBM for over 15 years. He currently works in WebSphere Business Activity Monitor development. He is also IBM's representative to the W3C Multimodal Interaction Working Group.

08 May 2007

Also available in Japanese

Applications for personal computers and small devices are rapidly evolving to meet the market demand for alternatives to keyboard-, keypad-, and stylus-based interaction. Alternative modes of interaction include voice and digital pen, and may be used either separately or combined with other modes. A cell phone user, for example, might get flight information by speaking into the phone's receiver, saying "Show me all flights from Boston to New York on December 23." In response, the application would show a list of flights on the cell phone screen, and the user could then pick one of the flights either by speaking or using the stylus.

The W3C Multimodal Interaction (MMI) Working Group has been at work since 2002 on a standard framework for developing such applications. Recently, the group published a new version of its Multimodal Architecture and Interfaces working draft. While this document is only a working draft, it is on track to becoming a W3C recommendation, and the MMI Working Group has made a lot of progress toward this goal.

This first article in a three-part series provides an overview of the MMI Working Group's Multimodal Architecture in its current form. I discuss both interesting features of the architecture and some of the challenges it poses to Web developers. In Part 2, the most speculative of the three articles, I discuss a number of XML markup languages (both W3C recommendations and early working drafts) in terms of how they might be used as markup interfaces to a Multimodal Architecture implementation. In Part 3 I describe a Web service implementation of the Multimodal Architecture and explain how the implementation would overcome most of the challenges previously described.

Why a distributed architecture?

Although the addition of another modal interface to an application should make that application easier to use, the processing and resource requirements are often too great for a small client device. A cell phone, for example, does not have the resources to run a local speech recognition system, especially if the application requires a grammar containing the names and addresses of everyone in Los Angeles, California.

For this reason, the additional modal interface running on a small client device will in most cases be distributed. That is, it will reside on a remote server and communicate with the device through a network protocol such as HTTP or SIP. Accordingly, the W3C Multimodal Architecture supports distributed modality components and defines a component life-cycle API for remote messaging between modality components and a server-side interaction manager. At the same time, implementations with the modality components running on the client, while allowed by the Multimodal Architecture, are not practical for Web applications, as this article explains.

The Multimodal Architecture

The Multimodal Architecture specifies a runtime framework and one or more distributed modality components, which communicate with the runtime framework through a life-cycle events API. As shown in Figure 1, the architecture consists of the following parts:

Runtime framework
The runtime framework controls communication between components and provides runtime support for the interaction manager, delivery context, and data component.
Delivery context
Static and dynamic device properties, environmental conditions, and user preferences are stored in the delivery context. These properties can then be queried and dynamically updated. The W3C Device Independence Working Group is working on standardizing the interface to the delivery context.
Interaction manager
The interaction manager sends and receives all messages between the runtime framework and the modality components, and queries and updates the data component as needed. Because it maintains the dialog flow, current state, and public data, it essentially contains the multimodal application. The MMI Working Group is currently standardizing an interaction manager language: SCXML, or State Chart XML, an XML implementation of David Harel's Statecharts (see Resources).
Data component
The data component contains the public data model for the multimodal application.
Modality components
Modality components perform tasks such as recognizing spoken input, displaying images, and running a markup language, such as VoiceXML, XHTML, or SVG. Because the modality components talk directly only with the interaction manager and can share data only through the event-based MMI life-cycle API, they are "loosely coupled." The user interaction is not required to be implemented by a markup language; interaction may be based on Java, C#, C++, or another programming language.
Figure 1. The Multimodal Architecture
A diagram of the W3C Multimodal Architecture
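The dialog flow that the interaction manager maintains is, in SCXML terms, a Harel state chart: a set of states, event-driven transitions, and actions fired on each transition. The idea can be sketched as a minimal event-driven state machine. This is an illustrative Python sketch of the state-chart concept, not an SCXML interpreter; the states, events, and actions below are all hypothetical.

```python
# Minimal state-chart sketch: states, event-driven transitions,
# and an action fired whenever a transition is taken.

class StateChart:
    def __init__(self, initial, transitions):
        # transitions: {(state, event): (next_state, action)}
        self.state = initial
        self.transitions = transitions
        self.log = []

    def fire(self, event):
        key = (self.state, event)
        if key not in self.transitions:
            return self.state  # unhandled events are ignored
        next_state, action = self.transitions[key]
        if action:
            self.log.append(action)
        self.state = next_state
        return self.state

# A toy flight-search dialog: idle -> listening -> confirming -> done.
dialog = StateChart("idle", {
    ("idle", "start"):           ("listening",  "play-prompt"),
    ("listening", "recognized"): ("confirming", "show-flights"),
    ("confirming", "selected"):  ("done",       "book-flight"),
})

dialog.fire("start")
dialog.fire("recognized")
dialog.fire("selected")
print(dialog.state, dialog.log)
```

SCXML adds much more than this (nested and parallel states, a data model, executable content), but the core of what the interaction manager runs is this kind of chart.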

The life-cycle events API

The following events are sent between a modality component and the runtime framework:

NewContextRequest
A modality component sends this optional event to the runtime framework to request a new context. The runtime framework responds with a NewContextResponse event.
Prepare
The runtime framework sends this optional event to a modality component so the component can pre-load resources in preparation for being invoked with Start. The modality component responds with a PrepareResponse event.
Start
When the runtime framework invokes a modality component with this event, the component responds with a StartResponse event.
Done
A modality component sends this event to the runtime framework when it is done processing.
Cancel
When the runtime framework cancels the processing of a modality component with this event, the component responds with a CancelResponse event.
Pause
When the runtime framework pauses the processing of a modality component with this event, the component responds with a PauseResponse event.
Resume
When the runtime framework resumes the processing of a modality component with this event, the component responds with a ResumeResponse event.
Data
Either the runtime framework or the modality component can send data to the other with this event. It is the interface for all inter-modality communication, such as synchronizing focus with the interaction manager as mediator.
ClearContext
The runtime framework sends this event to clear a context that was shared with a modality component.
Status
Either the runtime framework or a modality component can request current status from the other with this event. A StatusResponse event is sent in response to the request.

Because responses must be received in a timely manner, the network protocol that delivers MMI life-cycle events must be reliable and private, ensure delivery in proper order, and have authenticated endpoints.
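The pairing above, where most requests require a matching response while a few events are one-way notifications, can be sketched as a simple lookup. This is an illustrative Python sketch only: the event names follow the working draft as described above, but the plain-dict message shape and the `respond` helper are hypothetical, not part of the specification.

```python
# Sketch of request/response pairing in the MMI life-cycle API.
# The dict-based message format here is purely illustrative.

PAIRED = {          # request -> expected response
    "NewContextRequest": "NewContextResponse",
    "Prepare":           "PrepareResponse",
    "Start":             "StartResponse",
    "Cancel":            "CancelResponse",
    "Pause":             "PauseResponse",
    "Resume":            "ResumeResponse",
    "Status":            "StatusResponse",
}
ONE_WAY = {"Done", "Data", "ClearContext"}  # no response described above

def respond(event):
    """Return the response a well-behaved endpoint should send, or None."""
    name = event["name"]
    if name in ONE_WAY:
        return None
    return {"name": PAIRED[name], "context": event["context"]}

reply = respond({"name": "Start", "context": "ctx-1"})
print(reply)  # {'name': 'StartResponse', 'context': 'ctx-1'}
```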

Foundations of the architecture

The W3C Multimodal Architecture is derived primarily from two sources: the Galaxy Communicator and the Model-View-Controller (MVC) architectural pattern. The Galaxy Communicator is an open-source project sponsored by the US Defense Advanced Research Projects Agency (DARPA). It is a distributed hub-and-spoke architecture in which servers that perform text-to-speech, speech recognition, or dialog management sit at the ends of the spokes. Messages sent between the servers are routed through the central hub, which manages all data and communication.

The well-known MVC design pattern partitions an application into three parts: model, view, and controller. Generally in this architecture the model encapsulates the data and data access methods; the view is the user interface; and the controller processes events from the user interface. The controller updates the model in response to events and the view is updated in response to changes in the model.
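The MVC division of labor described above can be sketched in a few lines: the controller updates the model in response to events, and every view is refreshed from the model rather than from another view. This is a generic illustration of the pattern, not part of the W3C architecture; all class and field names are hypothetical.

```python
# Minimal MVC sketch: controller -> model -> views.

class Model:
    def __init__(self):
        self.data = {}
        self.views = []

    def update(self, key, value):
        self.data[key] = value
        for view in self.views:        # views react to model changes
            view.render(self.data)

class View:
    def __init__(self, name):
        self.name = name
        self.last_rendered = None

    def render(self, data):
        self.last_rendered = dict(data)

class Controller:
    def __init__(self, model):
        self.model = model

    def handle(self, event):
        # e.g. a form field changed in one view
        self.model.update(event["field"], event["value"])

model = Model()
html_view, voice_view = View("html"), View("voice")
model.views = [html_view, voice_view]
Controller(model).handle({"field": "city", "value": "Boston"})
print(voice_view.last_rendered)  # both views now see {'city': 'Boston'}
```

Note that the two views never talk to each other directly; this is exactly the property the Multimodal Architecture preserves when the modality components play the role of views.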

The W3C Multimodal Architecture is analogous to the Galaxy Communicator in that the Communicator's hub is the Multimodal Architecture's runtime framework. Likewise, the Communicator's servers are the Multimodal Architecture's modality, delivery context, and data components. The Multimodal Architecture is like the MVC pattern in that the interaction manager is the controller; the data component and delivery context comprise the model; and the modality components are the views.

Communication and event processing

Based on these two architectures, the W3C Multimodal Architecture's runtime framework behaves as both a communication hub and an event processor. Modality components must send all events and data to the framework for processing and fair and timely routing to other components. Figure 2 shows the interaction manager behaving as the controller in an MVC architecture. After the user clicks on the Submit button, the submit data is sent to the interaction manager (1). The interaction manager submits the data to the Web server (2). The Web server's reply to the interaction manager is processed (3). The data model is updated and the updated data is sent simultaneously to the HTML and voice modality components (4). As a result the voice component sends audio data to the client's device to be rendered as text-to-speech, or TTS (5).

Figure 2. The interaction manager handles all data transactions
The interaction manager seen as a controller and hub
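The five numbered steps in Figure 2 can be sketched as a single routing function on the interaction manager. This is an illustrative Python sketch under stated assumptions: the Web server is faked, and the function and parameter names are hypothetical, not drawn from the specification.

```python
# The interaction manager as hub: receive a submit, forward it to the
# web server, update the shared model, and push the result to every
# modality component.

def handle_submit(form_data, web_server, data_model, components):
    # (1) the submit arrives from a modality component;
    # (2) the interaction manager forwards it to the web server
    reply = web_server(form_data)
    # (3) the reply is processed and the data model is updated
    data_model.update(reply)
    # (4) the update is pushed to every modality component;
    # (5) the voice component will render its copy as TTS downstream
    delivered = []
    for name, deliver in components.items():
        deliver(dict(data_model))
        delivered.append(name)
    return delivered

fake_server = lambda form: {"flights": ["BOS-NYC 08:00", "BOS-NYC 12:30"]}
received = {}
components = {
    "html":  lambda data: received.setdefault("html", data),
    "voice": lambda data: received.setdefault("voice", data),
}
model = {}
print(handle_submit({"from": "Boston", "to": "New York"},
                    fake_server, model, components))
```

The point of the sketch is the fan-out in step 4: both modality components receive the same update from the same shared model, which is what keeps the visual and voice views synchronized.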

In any implementation of the Multimodal Architecture, events and data relevant to other modality components should be sent directly to the interaction manager, including forms submitted to the Web server. As explained by the MMI Working Group in the December 11, 2006 version of the Multimodal Architecture and Interfaces working draft:

It is therefore good application design practice to divide data into two logical classes: private data, which is of interest only to a given modality component, and public data, which is of interest to the Interaction Manager or to more than one Modality Component. Private data may be managed as the Modality Component sees fit, but all modification of public data, including submission to back end servers, should be entrusted to the Interaction Manager.
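The design practice quoted above can be sketched as follows. Private state stays inside the component and generates no traffic; any change to public data is sent to the interaction manager, which owns the shared model. This is an illustrative Python sketch; the class, event, and field names are hypothetical.

```python
# Sketch of the public/private data split the working draft recommends.

class ModalityComponent:
    def __init__(self, send_to_im):
        self._private = {"scroll_position": 0}   # never leaves the component
        self._send_to_im = send_to_im

    def scroll(self, pos):
        self._private["scroll_position"] = pos   # private: no event generated

    def set_public(self, key, value):
        # Public data is not written locally; the change is entrusted
        # to the interaction manager, which maintains the shared model.
        self._send_to_im({"name": "Data", "update": {key: value}})

shared_model, events = {}, []

def interaction_manager(event):
    events.append(event["name"])
    shared_model.update(event["update"])

component = ModalityComponent(interaction_manager)
component.scroll(120)                   # private: no IM traffic
component.set_public("city", "Boston")  # public: routed through the IM
print(shared_model, events)  # {'city': 'Boston'} ['Data']
```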

Distribution across multiple devices

Another interesting feature of the Multimodal Architecture is that it allows an application to be shared simultaneously among multiple devices. This feature is enabled by the "Russian doll" configuration of modality components, which allows a runtime framework to communicate with another runtime framework as if the latter were a modality component.

For an application to span multiple devices it would first have to allow any device to start the application, and then allow the same public data to be shared among all data models. In other words, an update to the application running on one device would be broadcast to all the other devices. An example of such an application is a shared calendar that helps several users find a convenient date and time for a meeting. Figure 3 shows how an application can be shared between two devices according to the Russian doll configuration.

Figure 3. Multiple devices can simultaneously share the same application
A diagram showing how multiple devices can share the same application.
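The sharing arrangement in Figure 3 can be sketched as a parent framework that treats each device's runtime framework as a nested modality component and broadcasts every public update to all of them. This is an illustrative Python sketch; the class names and the calendar payload are hypothetical.

```python
# "Russian doll" sharing sketch: each device's framework exposes the
# same receive() interface a modality component would, so the parent
# framework can broadcast updates to all devices uniformly.

class DeviceFramework:
    def __init__(self, name):
        self.name = name
        self.calendar = {}

    def receive(self, update):      # looks like a modality component
        self.calendar.update(update)

class ParentFramework:
    def __init__(self):
        self.devices = []

    def update_calendar(self, update):
        # An update made on any one device is broadcast to every
        # device, so all the local data models stay in agreement.
        for device in self.devices:
            device.receive(update)

parent = ParentFramework()
phone, laptop = DeviceFramework("phone"), DeviceFramework("laptop")
parent.devices = [phone, laptop]
parent.update_calendar({"2007-05-15": "team meeting"})
print(laptop.calendar)  # {'2007-05-15': 'team meeting'}
```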

Downsides of the Multimodal Architecture

An important aspect of the Multimodal Architecture is that each modality component runs its own markup language document. As a result, when the modality components reside on the same client, the client has to download and distribute multiple documents one or more times during the life of an application. The following two figures show why this is not practical.

Figure 4 shows a configuration where the interaction manager and data model reside on the remote server and an XHTML and a voice modality component both reside on the client. In this example the same Web server supplies both XHTML and VoiceXML documents to their respective components. The XHTML component fetches the first page of the application (1) and sends a NewContextRequest to the server (2) to start an instance of the interaction manager with a new context. The interaction manager runs a new SCXML document and, when instructed by the SCXML, sends a Start event (3) to the voice component. The Start event may include the VoiceXML document to run; otherwise the voice component has to fetch the document separately (4).

Figure 4. Client components connected separately to a server
A client browser with components connected separately to the server

Because all access to the markup documents run by each modality component happens through the interaction manager, all data received by one modality component has to be routed to the interaction manager and back to the client -- including cookies and other header information. This makes no sense when both modality components are running in the same process. Fetching cached documents must also be coordinated through the network. Every time the user hits the Back button, for example, the client has to send a request to the remote interaction manager to get the previous document from each modality component.

Another performance bottleneck

Figure 5 shows a configuration where the interaction manager, data model, and the XHTML and voice modality components reside on the same device. Here the runtime framework is entirely contained on the client. After the XHTML component fetches the first XHTML page of a multimodal application, it calls the local interaction manager's NewContextRequest API (1). The runtime framework fetches the SCXML document (2) and invokes the interaction manager (3). The interaction manager then runs the document and, when instructed by the SCXML, calls the voice component's Start event (4). The Start call may include the VoiceXML document to run; otherwise the voice component has to fetch the document separately (5).

Figure 5. Client components connected through IM to server
Client components connected through IM to server

The issue with this configuration is once again performance. Only after the XHTML component fetches and parses the first page of an application and the SCXML document has been fetched and is running, and the VoiceXML document has also been fetched and is running, is the text-to-speech prompt presented to the user. Because the network is generally the performance bottleneck, the user could experience a noticeable (and therefore unacceptable) delay between seeing the XHTML page and hearing the text-to-speech. This delay would be alleviated if the SCXML document could be embedded within the XHTML document. The SCXML could then be passed to the interaction manager with the NewContextRequest API call. (Note that the above scenario is only for the sake of example: Either the XHTML or the voice component could retrieve the first page.)

In light of these issues I hope that the W3C will work on standardizing how XML representing different modalities can be combined into a single document. This is necessary for the above configuration of local modality components to be practical.

Questions for the MMI Working Group

In addition to performance issues, the W3C Multimodal Architecture poses a number of other challenges that the MMI Working Group may or may not address in a future release of the Multimodal Architecture and Interfaces specification. I'll conclude with an overview of questions facing the Working Group.

Another refactoring of the Web?
For legacy applications to take advantage of the Multimodal Architecture, all communication, including form submissions to the Web server, must be replaced with Data events to the interaction manager. Submissions could be forwarded to the interaction manager from the Web server, but wouldn't that make the Web server just another modality component? As for Ajax (Asynchronous JavaScript and XML), it follows that the XMLHttpRequest object must carry the Data event so that the interaction manager can fetch the dynamic XML data for all the modality components.
Multimodal component capabilities?
If an application requires a voice component that understands French (for example), how can it query the component's capabilities to find out whether it does? If the information is stored in the delivery context, how can the application access it? A standard interface to the delivery context is needed to resolve this issue.
Generic data?
The generic Data event is the only interface available to markup authors; the other life-cycle events in the Multimodal Architecture concern the operation (starting, stopping, pausing, and so on) of a modality component as a software component. Unfortunately, the Data event is not interoperable from end to end unless the author knows, at every endpoint, how the data is formatted upon receipt and how it must be formatted for sending. Furthermore, any one of the components could arbitrarily change the data format at any time!

At the client, the contents of a received Data event must be mapped to the events and data fields (for example, the Document Object Model, or DOM, for an XHTML client) instantiated on the client. This requires a scripting language such as JavaScript in order to inspect and handle the Data content.
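The mapping problem can be sketched as follows. A real XHTML client would do this in JavaScript against the DOM; the sketch below uses Python for consistency with the other examples in this article, and the event shape and field names are hypothetical. The point is what happens when sender and receiver disagree about the format: the mismatched field is silently dropped.

```python
# Sketch of mapping a received Data event's payload onto the fields
# the client has instantiated, reporting any keys that don't match.

def apply_data_event(event, form_fields):
    """Copy received values into matching client fields; report misses."""
    unmatched = []
    for key, value in event["payload"].items():
        if key in form_fields:
            form_fields[key] = value
        else:
            unmatched.append(key)  # format drift: endpoints disagree
    return unmatched

fields = {"from_city": "", "to_city": "", "date": ""}
event = {"name": "Data",
         "payload": {"from_city": "Boston", "to_city": "New York",
                     "depart": "2007-12-23"}}  # sender renamed "date"
print(apply_data_event(event, fields))  # ['depart'] -- value is lost
print(fields["date"])                   # still empty
```

Nothing in the architecture detects this mismatch; as the next question explains, that burden falls on the application developer.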
Generic black boxes?
Modality components are black boxes so that the data they maintain can be encapsulated. Along with encapsulation of data, however, there must be an API that represents the behavior associated with the private data, because that behavior can be accessed only through the API. Unfortunately, the Data event API represents indeterminate behavior; successive Data calls may update one small piece of data or the entire data model maintained by the component.

The problem is that the W3C Multimodal Interaction Working Group wants to support a wide variety of modality components, some of which may not have an XML language interface or a DOM. Unable to standardize what data may be encapsulated by the modality components, the Working Group cannot specify the associated behavior either. Consequently, application developers will have to deal with the data format problem themselves.
Multimodal markup?
Before an XHTML component can send a NewContextRequest event to the interaction manager, the request must somehow be indicated within the XHTML page. It is up to the W3C Multimodal Interaction Working Group to standardize how the NewContextRequest is indicated and how SCXML markup can be embedded within the XHTML. Markup authors also need to know how to send an event to the interaction manager as well as how to listen for and handle messages sent by the interaction manager.
A multimodal protocol?
The multimodal life-cycle events envisioned by the Multimodal Architecture could be sent from a Web browser through HTTP or the Ajax XMLHttpRequest object, and some applications could send and receive events through SIP if a SIP stack is available. However, Web browsers today do not support the receipt of asynchronous messages that would be "pushed" by the interaction manager. For this, we need a multimodal protocol that currently does not exist, although several candidates have been proposed within the IETF (see Resources).

In conclusion

The W3C Multimodal Architecture is a distributed architecture that defines a runtime framework, a life-cycle API, and a set of loosely coupled modality components. This article showed that the architecture is primarily dedicated to server-side interaction management, with the modality components distributed over multiple clients and servers. The article has also revealed some of the challenges facing the MMI Working Group and briefly explained how these challenges will affect developers seeking to build Web applications using the architecture.

In the next article in this series I will discuss the XML markup languages that will address at least some of these challenges. For example, one specification (a W3C Working Draft last updated in 2003) proposes a set of intent-based events that may someday be added to the multimodal life-cycle API. SCXML (on track to becoming a W3C recommendation) is central to the Multimodal Architecture and the next generation of VoiceXML, known as V3. It may also become more generally important to server-side Web development. I'll discuss the potential impact of both of these languages, and more, in the next article in this series.



Get products and technologies

  • Opera: A multimodal browser.


