The speech-technology industry took its first step toward the adoption of a Web programming model by standardizing VoiceXML, Version 2.0. First-generation voice-enabled Web applications were mostly built of static VoiceXML pages.
The next step is a move to complex applications deployed on standard Web servers and implemented through programs that deliver dynamically generated VoiceXML markup. Bringing speech-enabled Web applications into the mainstream means adopting uniform programming models for creating and deploying them.
On the visual Web, creating sophisticated user interaction is mediated by component libraries that ease the generation of complex HTML pages. The move to dynamically generated VoiceXML requires similar component libraries that capture best practices in Voice User Interface (VUI) design. Mainstreaming voice access to the Web changes today's practice of developing entire speech applications to a model where voice access is achieved by replacing the visual view layer with a high-quality VUI. In this model, you develop Web applications using standard application frameworks such as Struts; you achieve voice access by creating appropriate views that are assembled from a set of reusable and configurable components. You need to create such components within a framework that encourages interoperability across components to help unify the speech applications market.
Whereas the visual Web can rely on a persistent visual display backed by error-free user input, the speech medium is temporal and nonpersistent. Speech interaction is characterized by a sequence of turns in which requests or pieces of information are spoken alternately by the system and by the user. Although it is advancing at a fast pace, speech-recognition technology is still error-prone and needs to be backed up by confirmation, correction, and reprompting. With prepackaged dialog components, Web developers can more efficiently handle these aspects of conversational interaction and ease the overall task of speech enablement.
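The confirm-and-reprompt behavior described above can be pictured with a small Java sketch. It is only an illustration under stated assumptions: the class name, the `NextMove` values, and both thresholds are hypothetical, and a real component would tune such values per grammar and per engine.

```java
/** Hypothetical sketch: picks the next dialog move from a recognition confidence score. */
final class TurnPolicySketch {

    enum NextMove { ACCEPT, CONFIRM, REPROMPT }

    // Illustrative thresholds only (assumptions, not values from any real engine).
    static final double ACCEPT_AT = 0.80;
    static final double CONFIRM_AT = 0.45;

    /** Maps a confidence score in [0, 1] to the next move in the dialog. */
    static NextMove decide(double confidence) {
        if (confidence >= ACCEPT_AT) {
            return NextMove.ACCEPT;
        }
        if (confidence >= CONFIRM_AT) {
            return NextMove.CONFIRM;
        }
        return NextMove.REPROMPT;
    }

    /** Produces the text of the system's next turn for what was heard. */
    static String systemTurn(String heard, double confidence) {
        switch (decide(confidence)) {
            case ACCEPT:
                return "Got it: " + heard + ".";
            case CONFIRM:
                return "Did you say " + heard + "?";
            default:
                return "Sorry, I did not catch that. Please repeat.";
        }
    }

    public static void main(String[] args) {
        System.out.println(systemTurn("Boston", 0.92)); // high confidence: accept
        System.out.println(systemTurn("Austin", 0.60)); // medium confidence: confirm
        System.out.println(systemTurn("", 0.10));       // low confidence: reprompt
    }
}
```

Packaging this kind of policy inside a component is what lets a Web developer reuse it without having to design the turn-taking logic for each field.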
For effective use by nonspeech specialists, speech components must embed much of the specific knowledge that enables the creation of high-quality speech interfaces. Thus, you must incorporate grammars, prompts, confirmation, and correction strategies into these components. You must also ensure that the components are sufficiently configurable to allow reuse within a wide range of applications. Finally, you should be able to put together sophisticated components from simpler ones.
The Reusable Dialog Component (RDC) framework embodies all of these features. RDCs are interoperable components within the J2EE and JSP framework that offer a means to bring speech-specific knowledge to Web developers. Each RDC component is composed of a data model, speech-specific assets like grammars and prompts, configuration files, and the dialog logic needed to collect a piece of information. The VoiceXML that performs the VUI is generated by the component implementation. A developer writes an application by instantiating these components and specifying their run-time behaviors through component attributes and configuration files. The data model is where components store the values collected from the user interaction; components also handle data validation and normalization.
Component data models are implemented as Java beans. Each component implements a set of tasks including data collection, confirmation, validation and disambiguation. Component authors can provide custom implementations for all of these tasks. Atomic RDCs collect simple data values such as a time, the name of a place, or an alphanumeric string; you can put atoms together to form composite RDCs. You can also aggregate composite and atomic RDCs to form more complex components. The resulting composite RDCs are structured in the same way as atomic ones: they have a data model, implement sets of tasks, and include speech-specific assets, and their behavior is specified by attributes and configuration files. The framework provides a container tag to facilitate the construction of composite RDCs. The container implementation invokes a pluggable dialog-management strategy that controls the activation of the constituent RDCs. The framework provides a default directed-dialog strategy that a developer can override.
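To make the structure above concrete, here is a minimal Java sketch of a bean-style data model, an atomic component, and a container that delegates to a pluggable dialog strategy. It is not the RDC API: every class, interface, and method name below is hypothetical, and the directed strategy shown simply visits children in declaration order as one plausible reading of a directed dialog.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of atomic and composite dialog components; not the RDC API. */
final class DialogComponentSketch {

    /** Bean-style data model: holds the value collected from the caller. */
    static final class FieldModel {
        private String value;
        private boolean confirmed;

        public String getValue() { return value; }
        public void setValue(String value) { this.value = value; }
        public boolean isConfirmed() { return confirmed; }
        public void setConfirmed(boolean confirmed) { this.confirmed = confirmed; }
    }

    /** Contract shared by atomic and composite components. */
    interface Component {
        String id();
        boolean isDone();
    }

    /** Atomic component: done once its single value is collected and confirmed. */
    static final class Atom implements Component {
        private final String id;
        final FieldModel model = new FieldModel();

        Atom(String id) { this.id = id; }

        public String id() { return id; }
        public boolean isDone() { return model.getValue() != null && model.isConfirmed(); }
    }

    /** Pluggable strategy that chooses which child the container activates next. */
    interface DialogStrategy {
        /** Returns the next child to activate, or null when nothing is left to collect. */
        Component next(List<Component> children);
    }

    /** Directed dialog: visit children in declaration order, skipping finished ones. */
    static final class DirectedDialog implements DialogStrategy {
        public Component next(List<Component> children) {
            for (Component child : children) {
                if (!child.isDone()) {
                    return child;
                }
            }
            return null;
        }
    }

    /** Composite component: same contract as an atom, built from child components. */
    static final class Container implements Component {
        private final String id;
        private final List<Component> children = new ArrayList<>();
        private final DialogStrategy strategy;

        Container(String id, DialogStrategy strategy) {
            this.id = id;
            this.strategy = strategy;
        }

        Container add(Component child) {
            children.add(child);
            return this;
        }

        public String id() { return id; }
        public boolean isDone() { return strategy.next(children) == null; }

        /** The child that the strategy would activate now, or null if all are done. */
        Component active() { return strategy.next(children); }
    }

    public static void main(String[] args) {
        Atom from = new Atom("from");
        Atom to = new Atom("to");
        Container trip = new Container("trip", new DirectedDialog()).add(from).add(to);

        System.out.println(trip.active().id()); // first unfinished child
        from.model.setValue("Boston");
        from.model.setConfirmed(true);
        System.out.println(trip.active().id()); // moves on once "from" is done
    }
}
```

Because `Container` implements the same `Component` contract as `Atom`, a container can itself be added to another container, which mirrors how composites and atoms aggregate into more complex components. Swapping in a different `DialogStrategy` changes the activation order without touching the children.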
Remove the cost from development of speech solutions
Building on standardized programming models creates the opportunity to develop mainstream tools for speech enablement. This section outlines the roadmap for how IBM sees today's world of speech-oriented applications and the evolution toward a world where speech enablement is just another aspect of overall application development.
Adopt the Web programming model for voice interaction
The speech-technology industry took one of its first steps toward integrating voice interaction with mainstream applications when it adopted VoiceXML and the associated Web programming model built around HTTP and distributed resources that are identified through URLs. This adoption allowed the speech-technology industry to move away from speech applications written as executable programs that link directly to the underlying speech engines. Today you can develop voice applications using standards-compliant VoiceXML, Version 2.0, which avoids tying the final application to any specific vendor's engine application programming interfaces (APIs).
From static to dynamic VoiceXML
To continue this evolution means creating Web applications that emit standards-compliant VoiceXML. This follows the same evolutionary pattern as seen on the visual Web; static HTML pages have been replaced over time by server-side Web application frameworks that emit HTML. The creation of standardized Web programming models that abstract the details of back-end integration, as well as the underlying business logic that determines the transitions among different stages in an application, has facilitated server-side deployment of Web applications. These standardized models help developers integrate user tasks into ever-larger applications. As the speech-enabled Web evolves in an analogous manner, voice application development moves from today's voice-specific programming model and associated tools to one in which voice interaction is authored as a specialized view that binds to a common underlying Web application.
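As a rough illustration of the move from static pages to server-generated markup, the following Java sketch builds a small VoiceXML 2.0 page from values supplied at run time. The class and method names, the sample prompt, and the grammar path are all hypothetical; in a J2EE application this rendering would more likely live in a JSP page or tag than in hand-written string building.

```java
/** Hypothetical sketch: builds a VoiceXML 2.0 page from run-time data on the server. */
final class VoiceXmlViewSketch {

    /** Escapes the characters that are unsafe in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Renders one form with a single field; all three arguments vary per request. */
    static String renderField(String fieldName, String prompt, String grammarUri) {
        StringBuilder out = new StringBuilder();
        out.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        out.append("<vxml version=\"2.0\" xmlns=\"http://www.w3.org/2001/vxml\">\n");
        out.append("  <form id=\"").append(escape(fieldName)).append("Form\">\n");
        out.append("    <field name=\"").append(escape(fieldName)).append("\">\n");
        out.append("      <prompt>").append(escape(prompt)).append("</prompt>\n");
        out.append("      <grammar src=\"").append(escape(grammarUri))
           .append("\" type=\"application/srgs+xml\"/>\n");
        out.append("    </field>\n");
        out.append("  </form>\n");
        out.append("</vxml>\n");
        return out.toString();
    }

    public static void main(String[] args) {
        // The prompt and grammar location are placeholders chosen for this example.
        System.out.print(renderField("city", "Which city are you leaving from?",
                "grammars/city.grxml"));
    }
}
```

The point of the sketch is that the prompt and grammar are parameters rather than fixed text, so the same view code can serve many stages of an application while the business logic stays in the underlying Web application.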
Tools for speech-enabling Web applications can integrate seamlessly with mainstream Web application tools. An example is the Struts builder available within the IBM WebSphere® Studio Application Developer tool, as shown in Figure 1.
Figure 1. The Struts builder
With this Struts builder, speech specialists can focus on the task of creating high-quality voice user interaction without having to develop the complete application. These VUI components can incorporate best practices of VUI design and help ensure that speech-enabling Web applications do not sacrifice the quality of the user experience. Finally, during this transition period, you can still integrate existing speech-enabled applications created within today's voice-centric programming models into the overall application flow by using the underlying Web framework defined by HTTP. As an example, a voice-enabled financial portal created by binding a VUI to an underlying Web application might choose to invoke a pre-existing speech bank application through a URL, or more generally, as a Web service. (Struts allows the separation of the presentation layer from the underlying application flow. To produce the voice view, you can voice-enable Struts applications by replacing visual-view JSP pages with RDC-based JSP pages.)
The goal: Drive cost out of voice applications
When the transition to speech-enabling Web applications is complete, IBM expects the overall cost of voice-enablement to be significantly reduced from today's levels. Each link in the overall end-to-end value chain of speech application deployment can focus on a specific core competency.
Value propositions and business opportunities
Next, we outline how the mainstreaming of speech solutions by adding speech-enablement to the overall portfolio of Web technologies creates new business opportunities for different segments of the speech industry. The end-to-end value chain that makes up the creation, deployment, and delivery of voice applications comprises several parts. At present, vendors play in more than one part of this value chain -- some of them in at least two or three neighboring sectors. IBM's long-term goal is to help each class of vendors focus on their particular core competencies, while relying on interoperability that comes from using standards.
The momentum behind VoiceXML Version 2.0 has created exponential growth in the software industry, and IBM expects this trend to be enhanced by the speech-enablement of J2EE Web applications using a standardized programming model that provides robust access while controlling overall total cost of ownership (TCO). The ability of the mainstream Web programmer to generate high-quality VUIs expressed in VoiceXML can significantly enhance the value of robust VoiceXML browsers.
A standardized deployment environment based on the widely used and tested J2EE Web application architecture helps control the overall cost of hosting and maintaining speech-enabled applications.
Speech-recognition and text-to-speech (TTS) engines
J2EE Web developers can leverage the evolution of speech technologies to deliver on-demand spoken access to Web services. This can increase market demand for speech technology, which can become part of the standard assets for Web applications. Engine vendors might be enticed to add advanced functionality and technological improvements to their core technologies to support advanced requirements defined by component creators and Web developers.
Tools that are consistent with interoperable components encourage developers to create libraries of speech-enabling building blocks. These libraries can lead to rapid application development (RAD) and free developers to focus on more-sophisticated user interactions.
Enterprises and service providers
As developers bring speech to standard J2EE Web applications, using the widely available skill set of J2EE and JSP Web development, they can add spoken access to businesses quickly and cost-effectively, and help control TCO.
As developers create dynamic voice access to Web applications and services, a standardized Web-programming model and associated tools help reduce the cost of developing on-demand voice-enabled solutions. Speech-enabling J2EE applications through JSP technology and the use of dialog components can create demand for application development services based on this standard programming model.
Speech-recognition technology is mature, and mainstream deployment of speech solutions can drive down costs in key areas like customer care. To reduce the cost of creating, managing, and deploying mainstream speech applications, developers must build on standardized Web-programming models. This can turn speech-enablement into yet another access channel to mainstream Web applications. Enabling this evolution without sacrificing the overall quality of the user experience requires packaging speech-interaction expertise into standardized components that can be integrated into mainstream Web development environments.
- To learn more about speech technology, visit ibm.com/pervasive.
- Get the reference implementation of the supporting framework for the RDC effort, available as open source through the Apache Jakarta Taglibs project.
- Read the press release regarding IBM's donation to Apache and Eclipse.
- Learn more about the VoiceToolkit for WebSphere Studio, and then download it and try it out.
- Peruse WebSphere's Voice zone.
- Read this still-educational white paper on "Extending and Enterprise with IBM WebSphere Voice Application Access" (PDF file, November 2002).
- Visit these valuable resources on developerWorks:
- The Web Architecture zone specializes in articles covering various Web-based solutions.





