Overview for developers

You can access the capabilities of the Speech to Text service via a WebSocket interface, an HTTP Representational State Transfer (REST) interface, or an asynchronous HTTP interface. Several Software Development Kits (SDKs) are also available to simplify application development in various languages and environments. The following sections provide an overview of application development with the service.

Programming with the service

The Speech to Text service offers three programming interfaces for transcribing speech to text:

  • The WebSocket interface provides a single version of the recognize method for transcribing audio. The interface offers an efficient implementation with low latency and high throughput over a full-duplex connection.

  • The HTTP REST interface provides HTTP POST versions of the recognize method that transcribe audio with or without establishing a session with the service. The methods let you send audio via the body of the request or as multipart form data that consists of one or more audio files. Additional methods of the interface let you establish and maintain sessions with the service and obtain information about supported languages and models.

  • The asynchronous HTTP interface provides a non-blocking POST recognitions method for transcribing audio. Additional methods of the interface let you register a callback URL to which the service sends job status and, optionally, results, or let you check the status of jobs and retrieve results manually. The interface uses HMAC-SHA1 signatures that are based on a user-specified secret to provide authentication and data integrity for callback notifications sent over HTTP; a sketch of verifying such a signature follows this list.
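As an illustration of that last point, the following is a minimal sketch of how a callback server might verify a notification. It assumes that the service sends a base64-encoded HMAC-SHA1 digest of the request body in an X-Callback-Signature header; treat the header name and encoding as assumptions to confirm against the asynchronous interface documentation.

    import base64
    import hashlib
    import hmac

    # The secret that was specified when the callback URL was registered.
    SECRET = b"my-user-specified-secret"

    def signature_is_valid(body: bytes, signature_header: str) -> bool:
        """Recompute the HMAC-SHA1 digest of the notification body and
        compare it with the signature that the service sent (assumed to
        arrive base64-encoded in an X-Callback-Signature header)."""
        expected = base64.b64encode(
            hmac.new(SECRET, body, hashlib.sha1).digest()
        ).decode("ascii")
        # Constant-time comparison avoids leaking timing information.
        return hmac.compare_digest(expected, signature_header)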

While the various recognition methods share many capabilities, you might need to specify the same parameter as a request header, a query parameter, or a parameter of a JSON object, depending on the interface and method that you use. For more information about the service's features, see Input features and parameters and Output features and parameters.
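For example, the sessionless HTTP POST recognize method accepts the audio format as a Content-Type request header and the model as a query parameter, whereas the WebSocket interface passes most of the same parameters as fields of a JSON message. The following sketch uses the Python requests library against the REST interface; the endpoint URL and credentials are placeholders.

    import requests

    # Placeholder endpoint and Bluemix service credentials.
    URL = "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize"
    AUTH = ("{username}", "{password}")

    with open("audio-file.wav", "rb") as audio:
        response = requests.post(
            URL,
            auth=AUTH,
            headers={"Content-Type": "audio/wav"},     # format as a request header
            params={"model": "en-US_BroadbandModel"},  # model as a query parameter
            data=audio,                                # audio in the request body
        )
    print(response.json())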

Using the customization interface

The Speech to Text service provides a beta customization interface for creating custom language models for recognition of US English and Japanese speech. The service's base vocabulary contains many words that are used in everyday conversation, but it can lack knowledge of terms that are associated with specific domains. The customization interface lets you improve recognition accuracy for such domains by creating custom language models that expand the base vocabulary with terminology specific to fields such as medicine and law.

You can use a custom language model with any of the interfaces described in the previous section. For more information about creating and using custom models, see Using customization and Additional customization methods.
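As a sketch of the typical flow, the following creates an empty custom language model through the REST interface. The customizations path, the JSON field names, and the customization_id query parameter are assumptions drawn from the customization interface; see Using customization for the exact methods.

    import requests

    BASE = "https://stream.watsonplatform.net/speech-to-text/api/v1"
    AUTH = ("{username}", "{password}")

    # Create an empty custom model that is based on the US English
    # broadband base model (field names are assumptions).
    create = requests.post(
        BASE + "/customizations",
        auth=AUTH,
        json={
            "name": "Medical dictation model",
            "base_model_name": "en-US_BroadbandModel",
            "description": "Custom model for medical terminology",
        },
    )
    customization_id = create.json()["customization_id"]
    print("Created custom model:", customization_id)

After you add corpora or custom words and train the model, you would reference its ID on recognition requests, for example as a customization_id query parameter on the recognize method.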

Using Software Development Kits

The Speech to Text service supports a number of SDKs to simplify the development of speech applications. The SDKs are available for many popular programming languages and platforms, including Node.js, Java, Python, and Apple® iOS. For a complete list of SDKs and information about using them, see Using Watson SDKs. All SDKs are available from the watson-developer-cloud namespace on GitHub.
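For example, a basic recognition request with the Python SDK reduces to a few lines. This sketch assumes the watson-developer-cloud package from PyPI; method and parameter names vary between SDK versions, so treat them as illustrative.

    from watson_developer_cloud import SpeechToTextV1

    # Bluemix service credentials (placeholders).
    speech_to_text = SpeechToTextV1(
        username="{username}",
        password="{password}",
    )

    # Transcribe a WAV file with the US English broadband model.
    with open("audio-file.wav", "rb") as audio:
        result = speech_to_text.recognize(
            audio,
            content_type="audio/wav",
            model="en-US_BroadbandModel",
        )
    print(result)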

The service also makes sample applications available to help you get started.

For mobile development, in addition to the Watson Developer Cloud SDK for Apple® iOS, you can use the Watson Speech SDK for the Google Android™ platform. Both SDKs support authentication by using either your Bluemix® service credentials or an authentication token; the sketch that follows shows one way to obtain a token.
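The following sketch shows how a server-side component might obtain a token to hand to a mobile or browser client so that your service credentials never leave the server. The authorization endpoint and its url query parameter are assumptions based on the Watson token service; confirm them in the authentication documentation.

    import requests

    # Assumed Watson token service endpoint and the service URL for
    # which the token grants access.
    TOKEN_URL = "https://stream.watsonplatform.net/authorization/api/v1/token"
    SERVICE_URL = "https://stream.watsonplatform.net/speech-to-text/api"

    response = requests.get(
        TOKEN_URL,
        auth=("{username}", "{password}"),  # Bluemix service credentials
        params={"url": SERVICE_URL},
    )
    token = response.text  # short-lived token to pass to the client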

Learning more about application development

Like all Watson services, the Speech to Text service supports two typical programming models:

  • Direct interaction, in which the client (for example, a web browser or a native Android or iOS app) streams audio directly to the service.

  • Relay via a proxy, in which the client and the service exchange all data (requests, audio, and results) through a proxy application that resides in Bluemix® (a minimal sketch follows).

With the HTTP interface, you can use either programming model; with the WebSocket interface, you must use direct interaction.
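A relay proxy can be as simple as an endpoint that forwards the client's audio to the service and returns the transcription, so that service credentials stay on the server. This sketch uses the Flask and requests libraries as stand-ins; the /transcribe route and the endpoint URL are illustrative, not part of the service API.

    import requests
    from flask import Flask, Response, request

    app = Flask(__name__)

    STT_URL = "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize"
    AUTH = ("{username}", "{password}")  # credentials stay on the server

    @app.route("/transcribe", methods=["POST"])
    def transcribe():
        # Forward the client's audio to the service and relay the result.
        upstream = requests.post(
            STT_URL,
            auth=AUTH,
            headers={"Content-Type": request.content_type},
            data=request.get_data(),
        )
        return Response(
            upstream.content,
            status=upstream.status_code,
            mimetype="application/json",
        )

    if __name__ == "__main__":
        app.run(port=8080)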

For more information about working with Watson Developer Cloud services and Bluemix, see the Watson Developer Cloud documentation.

Considerations for application development

Converting speech to text is a difficult problem. Consider the following general points when you use the Speech to Text service in your applications:

  • Speech recognition can be very sensitive to input audio quality. When you experiment with the demo application or build an application of your own that uses the service, try to ensure that the input audio quality is as good as possible. To obtain the best possible accuracy, use a close, speech-oriented microphone (such as a headset) whenever possible, and adjust the microphone settings if necessary. Avoid using a laptop's built-in microphone.

  • Choosing the correct model is important. For most supported languages, the service supports two models: broadband and narrowband. IBM® recommends that you use the broadband model for responsive, real-time applications and the narrowband model for offline decoding of telephone speech. For more information about the models and the sampling rates they support, see Languages and models.

  • Detecting the end of speech in the input audio stream is configurable. By default, the service stops transcription at the first pause, which is defined as a half-second of non-speech (typically silence), or when the stream terminates. For information about directing the service to transcribe the entire audio stream until the stream terminates, see Continuous transmission; the sketch that follows this list shows this option together with model selection.

  • Conversion of speech to text might not be perfect. Speech recognition technology has made tremendous progress over the past several years and is used successfully in many domains and applications. However, in addition to audio quality, speech recognition systems are sensitive to nuances of human speech, such as regional accents and differences in pronunciation, and might not always transcribe audio input successfully.
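Tying the model and end-of-speech considerations together, the following sketch requests the narrowband model for telephone audio and asks the service to transcribe the entire stream rather than stopping at the first pause. The boolean continuous query parameter is an assumption based on the behavior described under Continuous transmission; the endpoint and credentials are placeholders.

    import requests

    URL = "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize"
    AUTH = ("{username}", "{password}")

    with open("call-recording.wav", "rb") as audio:
        response = requests.post(
            URL,
            auth=AUTH,
            headers={"Content-Type": "audio/wav"},
            params={
                "model": "en-US_NarrowbandModel",  # telephone-bandwidth speech
                "continuous": "true",  # assumed: transcribe until the stream ends
            },
            data=audio,
        )
    print(response.json())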