Integrating third-party text to speech services

IBM® Voice Gateway supports speech adapters that integrate third-party speech synthesis (text-to-speech) services in place of the IBM® Text to Speech service. The adapters are separate Docker containers that you deploy together with Voice Gateway. They act as proxies between Voice Gateway and the third-party speech service.

Voice Gateway provides the following options for integrating third-party speech services:

Voice Gateway Text to Speech Adapter: The Text to Speech Adapter currently supports the Google Text to Speech API for synthesizing audio from text. By using the Google Text to Speech API, you can choose additional voices for a self-service agent. Available in Version 1.0.0.7a and later.

Text to Speech Adapter

Text to Speech Adapter architecture

When you use the IBM Text to Speech service, the Media Relay container in Voice Gateway uses a WebSocket connection to route data to Text to Speech. With the Text to Speech Adapter, you can instead connect your Voice Gateway deployment to a third-party text to speech provider. The Media Relay opens its WebSocket connection to the Text to Speech Adapter rather than to Text to Speech. The adapter then uses a gRPC connection to the third-party text to speech service to synthesize audio and returns the audio to the Media Relay, which streams it to the caller.

In the following example, the Text to Speech adapter connects Voice Gateway with the Google Text to Speech Beta service.

Figure: Text to Speech Adapter architecture. The Text to Speech Adapter acts as an intermediary between the Voice Gateway Media Relay and a third-party text to speech service, such as the Google Text to Speech Beta service.
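
The proxy role described above can be illustrated with a minimal Python sketch. It is not the adapter's actual implementation. The class name `TextToSpeechAdapterSketch` and the function `fake_third_party_synthesize` are hypothetical, and the WebSocket and gRPC transports are replaced by plain function calls so that only the request and response flow is shown.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical type: a backend takes text and yields audio chunks.
SynthesizeFn = Callable[[str], Iterable[bytes]]


def fake_third_party_synthesize(text: str) -> Iterator[bytes]:
    """Stand-in for the gRPC call to a third-party service (assumption).

    Emits one fake "audio" chunk per word so the flow can be traced.
    """
    for word in text.split():
        yield word.encode("utf-8")


class TextToSpeechAdapterSketch:
    """Illustrative proxy only, not the real adapter."""

    def __init__(self, synthesize: SynthesizeFn) -> None:
        self._synthesize = synthesize

    def handle_request(self, text: str) -> Iterator[bytes]:
        """Forward text from the Media Relay side to the backend and
        relay each audio chunk back as soon as it is produced."""
        for chunk in self._synthesize(text):
            yield chunk


# The Media Relay side would send text and consume the returned chunks.
adapter = TextToSpeechAdapterSketch(fake_third_party_synthesize)
chunks = list(adapter.handle_request("Hello from Voice Gateway"))
print(chunks)
```

Because the backend is passed in as a function, a different provider could be swapped in without changing the side that faces the Media Relay, which is the point of placing an adapter between the two.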

Deploying the Text to Speech Adapter

The Voice Gateway Text to Speech Adapter is packaged as a separate Docker image that you configure and deploy along with the core SIP Orchestrator and Media Relay images. Before you deploy the Text to Speech Adapter, deploy a basic Voice Gateway instance as described in Getting started with Voice Gateway. Then, learn more about how to add the Text to Speech Adapter to your deployment in the following pages:

Configuring the Text to Speech Adapter

To set up the Text to Speech Adapter, you can define the following types of configuration:

Deployment configuration, which defines the Text to Speech Adapter container and is specified as Docker environment variables. For more information, see Text to Speech Adapter environment variables.
JSON configuration, which you can specify to separately configure multiple tenants within a single Voice Gateway environment. For more information, see Setting up a multi-tenant environment.
Dynamic configuration, which enables you to change settings during a call by specifying API actions and state variables in the Watson Assistant dialog node response. For more information, see Programming self-service agents using the Voice Gateway API.
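
As a rough illustration of how the three configuration types could combine, the following Python sketch merges them into one set of effective settings. Everything in it is an assumption made for the example: the `EXAMPLE_TTS_ADAPTER_` prefix, the `voice` and `language` setting names, and the precedence order (dynamic over JSON over deployment) are hypothetical and are not taken from the Voice Gateway documentation. See the linked pages for the real variable names and behavior.

```python
import json
from typing import Any, Dict, Mapping, Optional

# Hypothetical prefix used only for this sketch.
ENV_PREFIX = "EXAMPLE_TTS_ADAPTER_"


def from_environment(environ: Mapping[str, str]) -> Dict[str, Any]:
    """Collect deployment settings from prefixed environment variables."""
    return {
        key[len(ENV_PREFIX):].lower(): value
        for key, value in environ.items()
        if key.startswith(ENV_PREFIX)
    }


def resolve_settings(
    environ: Mapping[str, str],
    tenant_json: str,
    dynamic: Optional[Mapping[str, Any]] = None,
) -> Dict[str, Any]:
    """Merge the three layers; later layers override earlier ones.

    The precedence order here is an assumption for illustration.
    """
    settings = from_environment(environ)      # deployment configuration
    settings.update(json.loads(tenant_json))  # per-tenant JSON configuration
    settings.update(dynamic or {})            # per-call dynamic configuration
    return settings


env = {
    "EXAMPLE_TTS_ADAPTER_VOICE": "voice-a",
    "EXAMPLE_TTS_ADAPTER_LANGUAGE": "en-US",
    "PATH": "/usr/bin",  # unrelated variables are ignored
}
tenant = '{"voice": "voice-b"}'
settings = resolve_settings(env, tenant, {"language": "es-ES"})
print(settings)
```

In a real container, `os.environ` would be passed as the first argument. The sketch takes a plain mapping so that it can be run without changing the process environment.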