You can access the capabilities of the Speech to Text service via a WebSocket interface, an HTTP Representational State Transfer (REST) interface, or an asynchronous HTTP interface. Several Software Development Kits (SDKs) are also available to simplify application development in various languages and environments. The following sections provide an overview of application development with the service.
The Speech to Text service offers three programming interfaces for transcribing speech to text:
The WebSocket interface provides a single version of the recognize method for transcribing audio. The interface offers an efficient implementation, low latency, and high throughput over a full-duplex connection.
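As a minimal sketch of how a client might drive the WebSocket recognize method, the following Python builds the JSON text messages that frame a recognition request. The "action", "content-type", and "interim_results" field names follow the service's WebSocket protocol as commonly documented, but treat them as assumptions to confirm against the API reference; the connection and audio-streaming steps are omitted.

```python
import json

def build_start_message(content_type="audio/l16;rate=16000", **params):
    """Build the JSON text message that opens a recognition request over
    the WebSocket interface. Extra keyword arguments become additional
    recognition parameters in the same message."""
    message = {"action": "start", "content-type": content_type}
    message.update(params)
    return json.dumps(message)

# After sending the start message, the client streams binary audio
# chunks over the same connection and finishes with a stop message.
stop_message = json.dumps({"action": "stop"})

start = build_start_message(interim_results=True)
```

Because the connection is full-duplex, the client can keep sending audio while the service returns interim transcription results on the same socket.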
The HTTP REST interface provides HTTP POST versions of the recognize method that transcribe audio with or without establishing a session with the service. The methods let you send audio via the body of the request or as multipart form data that consists of one or more audio files. Additional methods of the interface let you establish and maintain sessions with the service and obtain information about supported languages and models.
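The sessionless flavor of the REST interface can be sketched with Python's standard library: recognition parameters travel as query parameters and the audio as the request body. The host name below is illustrative only (use the endpoint from your service credentials), and authentication is omitted; the request is constructed but not sent.

```python
import urllib.parse
import urllib.request

# Placeholder payload; a real application would read raw audio bytes
# from a file such as audio.flac.
audio_data = b"..."

# Recognition parameters go in the query string; the audio goes in the
# request body with a matching Content-Type header.
params = urllib.parse.urlencode({"model": "en-US_BroadbandModel"})
url = ("https://stream.watsonplatform.net/speech-to-text/api"
       "/v1/recognize?" + params)

request = urllib.request.Request(
    url,
    data=audio_data,
    headers={"Content-Type": "audio/flac"},
    method="POST",
)
# urllib.request.urlopen(request) would send it once credentials
# (for example, basic authentication) are added.
```

For multipart requests, the audio files and a JSON metadata part would instead be encoded as multipart form data in the body.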
The asynchronous HTTP interface provides a POST recognitions method for transcribing audio. Additional methods of the interface enable you to register a callback URL to which the service sends job status and optional results, or to check the status of jobs and retrieve results manually. The interface uses HMAC-SHA1 signatures based on a user-specified secret to provide authentication and data integrity for callback notifications sent over the HTTP protocol.
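The HMAC-SHA1 signing scheme can be verified on the callback side with Python's standard library alone. This is a sketch under the assumption that the service signs the raw notification payload with your secret and sends the signature base64-encoded; confirm the exact header name and encoding against the API reference.

```python
import base64
import hashlib
import hmac

def signature_for(secret: bytes, payload: bytes) -> str:
    """Compute the base64-encoded HMAC-SHA1 signature of a callback
    notification payload using the user-specified secret."""
    digest = hmac.new(secret, payload, hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")

def is_authentic(secret: bytes, payload: bytes, received: str) -> bool:
    """Check a received signature against a locally computed one.
    compare_digest avoids leaking information through timing."""
    return hmac.compare_digest(signature_for(secret, payload), received)
```

A callback server would recompute the signature over the exact bytes it received and reject any notification whose signature does not match.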
While the various recognition methods share many common capabilities, you might specify the same parameter as a request header, a query parameter, or a parameter of a JSON object depending on the interface and method you are using. For more information about the service's features, see Input features and parameters and Output features and parameters.
The Speech to Text service provides a beta customization interface for creating custom language models for recognition of US English and Japanese speech. The service's base vocabulary contains many words that are used in everyday conversation, but it can lack knowledge of terms that are associated with specific domains. The customization interface allows you to improve the accuracy of speech recognition for particular domains. You can create custom language models that expand the service's base vocabulary with terminology specific to domains such as medicine and law.
You can use a custom language model with any of the interfaces described in the previous section. For more information about creating and using custom models, see Using customization and Additional customization methods.
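Because a custom language model works with any of the interfaces, selecting it typically amounts to adding one identifier to the recognition request. The sketch below assumes the parameter is named customization_id and is passed as a query parameter on the REST recognize method; verify both against Using customization.

```python
import urllib.parse

def recognize_url(base, model, customization_id=None):
    """Build a recognize URL that optionally selects a custom language
    model via the (assumed) customization_id query parameter."""
    query = {"model": model}
    if customization_id is not None:
        query["customization_id"] = customization_id
    return base + "/v1/recognize?" + urllib.parse.urlencode(query)

url = recognize_url(
    "https://stream.watsonplatform.net/speech-to-text/api",  # illustrative host
    "en-US_BroadbandModel",
    customization_id="abc-123",  # hypothetical model ID
)
```

On the WebSocket interface, the same identifier would instead accompany the connection or start request rather than a per-call URL.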
The Speech to Text service supports a number of SDKs to simplify the development of speech applications. The SDKs are available for many popular programming languages and platforms, including Node.js, Java, Python, and Apple® iOS. For a complete list of SDKs and information about using them, see Using Watson SDKs. All SDKs are available from the watson-developer-cloud namespace on GitHub.
The service also makes available the following sample applications to help you get started:
You can access an application that uses the Node.js SDK at the speech-to-text-nodejs repository.
You can access a Python client that interacts with the service through its WebSocket interface at the speech-to-text-websockets-python repository.
For mobile development, in addition to the Watson Developer Cloud SDK for Apple® iOS, you can use the Watson Speech SDK for the Google Android™ platform. Both SDKs support authenticating by using either your Bluemix® service credentials or an authentication token.
Like all Watson services, the Speech to Text service supports two typical programming models: Direct interaction, in which the client (for example, a web browser or an Android or iOS native app) streams audio to the service directly; and relaying via a proxy, in which the client and service exchange all data (requests, audio, and results) through a proxy application that resides in Bluemix®. With the HTTP interface, you can use either of the programming models; with the WebSocket interface, you must use direct communication.
For more information about working with Watson Developer Cloud services and Bluemix, see the following:
For an introduction to working with Watson services and Bluemix, see Getting started with Watson Developer Cloud and Bluemix.
For a language-independent introduction to developing Watson services applications in Bluemix, see Developing Watson applications with Bluemix.
For information about the two programming models available for developing Watson applications, see Programming models for Watson services:
With relaying via a proxy, the client relies on a proxy server that resides in Bluemix to communicate with the service; it passes all requests through the proxy application. This model uses only service credentials to authenticate with the service; see Obtaining credentials for Watson services.
With direct interaction, the client uses the proxy application in Bluemix only to obtain an authentication token for the service, after which it communicates directly with the service. Direct interaction uses service credentials only to obtain a token; see Using tokens with Watson services.
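The token step of the direct-interaction model can be sketched as a single authenticated HTTP request that the proxy makes on the client's behalf. The authorization endpoint and the url query parameter shown here are assumptions based on the historical Watson token service; confirm them against Using tokens with Watson services. The request is only constructed, not sent.

```python
import base64
import urllib.parse
import urllib.request

def token_request(username, password,
                  service_url="https://stream.watsonplatform.net/speech-to-text/api"):
    """Build the (assumed) authorization request that exchanges service
    credentials for a token. The proxy holds the credentials; the client
    receives only the token and then talks to the service directly."""
    query = urllib.parse.urlencode({"url": service_url})
    url = ("https://stream.watsonplatform.net/authorization"
           "/api/v1/token?" + query)
    credentials = base64.b64encode(
        f"{username}:{password}".encode()).decode("ascii")
    return urllib.request.Request(
        url, headers={"Authorization": "Basic " + credentials})

req = token_request("example-user", "example-pass")  # placeholder credentials
```

Keeping the credentials on the proxy and handing the client only a short-lived token is the point of this model: the browser or mobile app never sees the service credentials.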
For information about controlling the default request logging that is performed for all Watson services, see Controlling request logging for Watson services.
Converting speech to text is a difficult problem. Consider the following general guidelines when you use the Speech to Text service in your applications:
Speech recognition can be very sensitive to input audio quality. When you experiment with the demo application or build your own application that uses the service, ensure that the input audio quality is as good as possible. To obtain the best possible accuracy, use a close, speech-oriented microphone (such as a headset) whenever possible, and adjust the microphone settings if necessary. Avoid using a laptop's built-in microphone.
Choosing the correct model is important. For most supported languages, the service supports two models: broadband and narrowband. IBM® recommends that you use the broadband model for responsive, real-time applications and the narrowband model for offline decoding of telephone speech. For more information about the models and the sampling rates they support, see Languages and models.
Detecting end of speech in the input audio stream is configurable. By default, the service stops transcription at the first pause, which is denoted by a half-second of non-speech (typically silence), or when the stream terminates. For information about directing the service to transcribe the entire audio stream until the stream terminates, see Continuous transmission.
Conversion of speech to text may not be perfect. Tremendous progress has been made over the last several years. Today, speech recognition technology is successfully used in many domains and applications. However, in addition to audio quality, speech recognition systems are sensitive to nuances of human speech, such as regional accents and differences in pronunciation, and may not always successfully transcribe audio input.
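The model-selection guideline above can be reduced to a small helper that picks a model from the audio sampling rate. The 8 kHz threshold for telephone-quality audio and the model naming pattern (for example, en-US_BroadbandModel) are assumptions drawn from Languages and models; check that reference for the rates each model actually supports.

```python
def model_for(language: str, sample_rate_hz: int) -> str:
    """Choose between the narrowband model (telephone-rate audio, about
    8 kHz) and the broadband model (higher sampling rates) for a given
    language, using the assumed <language>_<Band>Model naming pattern."""
    band = "Narrowband" if sample_rate_hz <= 8000 else "Broadband"
    return f"{language}_{band}Model"
```

Matching the model to the audio's actual sampling rate matters: decoding 8 kHz telephone audio with a broadband model (or vice versa) degrades accuracy.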