About Speech to Text

The Speech to Text service was updated on April 10, 2017. The service now supports speaker labels for the en-US_BroadbandModel, es-ES_BroadbandModel, and ja-JP_BroadbandModel models; the audio/webm audio format; and a new unregister_callback method for its asynchronous interface. For more information, see the Release notes.
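
As a brief, hedged sketch of the new method, the following Python example unregisters a callback URL from the asynchronous interface. The endpoint URL, the credential placeholders, and the callback URL are assumptions for illustration; check the API reference for the exact values.

    import requests

    # Hypothetical service credentials and callback URL -- substitute your own.
    USERNAME = "your-service-username"
    PASSWORD = "your-service-password"
    URL = "https://stream.watsonplatform.net/speech-to-text/api"

    # Remove a previously registered callback URL from the asynchronous interface.
    response = requests.post(
        URL + "/v1/unregister_callback",
        params={"callback_url": "https://example.com/stt_results"},
        auth=(USERNAME, PASSWORD),
    )
    response.raise_for_status()  # a 200 response means the URL was unregistered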

Important deprecation notices: Effective May 2017, IBM is deprecating the following functionality: the no-op feature that lets you keep a session alive by avoiding a session timeout, and the continuous parameter that is available with all recognition requests. For more information about the planned changes, see Important deprecation notices in the release notes.

The IBM® Speech to Text service provides an Application Programming Interface (API) that lets you add speech transcription capabilities to your applications. To transcribe the human voice accurately, the service leverages machine intelligence to combine information about grammar and language structure with knowledge of the composition of the audio signal. The service continuously returns and retroactively updates the transcription as more speech is heard.

Overview for developers introduces the three interfaces provided by the service: a WebSocket interface, an HTTP REST interface, and an asynchronous HTTP interface. It also introduces the HTTP customization interface (beta) for creating custom language models for US English and Japanese. In addition, it provides information about the SDKs that are available for working with the service and about application development with Watson services in general.
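
As a minimal sketch of the HTTP interface, the following Python example posts a FLAC file to the recognize method with one-shot delivery. The service URL and the credential placeholders are assumptions based on the documented conventions; consult the API reference for the exact endpoint and parameters.

    import requests

    # Hypothetical credentials -- substitute the values from your service instance.
    USERNAME = "your-service-username"
    PASSWORD = "your-service-password"
    URL = "https://stream.watsonplatform.net/speech-to-text/api"

    # One-shot delivery: send an entire FLAC file in a single request.
    with open("audio-file.flac", "rb") as audio:
        response = requests.post(
            URL + "/v1/recognize",
            params={"model": "en-US_BroadbandModel"},
            headers={"Content-Type": "audio/flac"},
            data=audio,
            auth=(USERNAME, PASSWORD),
        )

    # Each final result reports its best transcript first.
    for result in response.json().get("results", []):
        print(result["alternatives"][0]["transcript"])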

The WebSocket, HTTP, and asynchronous HTTP interfaces share many common features for transcribing speech to text. The interfaces support the following input features:

  • Languages: Supports Brazilian Portuguese, French, Japanese, Mandarin Chinese, Modern Standard Arabic, Spanish, UK English, and US English.
  • Models: For most languages, supports both broadband (for audio that is sampled at a minimum rate of 16 kHz) and narrowband (for audio that is sampled at a minimum rate of 8 kHz) models.
  • Audio formats: Transcribes Free Lossless Audio Codec (FLAC), Linear 16-bit Pulse-Code Modulation (PCM), Waveform Audio File Format (WAV), Ogg format with the Opus or Vorbis codec, Web Media (WebM) format with the Opus or Vorbis codec, mu-law (or u-law) audio data, or basic audio.
  • Audio transmission: Lets the client pass as much as 100 MB of audio to the service as a continuous stream of data chunks or as a one-shot delivery that passes all of the data at one time; a streaming sketch follows this list. With streaming, the service enforces various timeouts to preserve resources.
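
For the streamed alternative, passing a generator to the requests library sends the audio as chunked data rather than in one shot. This is a sketch under the same endpoint assumptions as the earlier example; the chunk size and file name are arbitrary.

    import requests

    USERNAME = "your-service-username"   # hypothetical placeholders
    PASSWORD = "your-service-password"
    URL = "https://stream.watsonplatform.net/speech-to-text/api"

    def audio_chunks(path, chunk_size=8192):
        """Yield the audio file in small chunks; requests then uses
        chunked transfer encoding instead of one-shot delivery."""
        with open(path, "rb") as audio:
            while True:
                chunk = audio.read(chunk_size)
                if not chunk:
                    return
                yield chunk

    response = requests.post(
        URL + "/v1/recognize",
        headers={"Content-Type": "audio/flac"},
        data=audio_chunks("audio-file.flac"),
        auth=(USERNAME, PASSWORD),
    )
    print(response.json())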

The interfaces also support the following output features; a combined request example follows the list:

  • Speaker labels (beta): Recognizes different speakers from audio in US English, Spanish, or Japanese. This feature provides a transcription that labels each speaker's contributions to a multi-participant conversation.
  • Keyword spotting (beta): Identifies spoken phrases from the audio that match specified keyword strings with a user-defined level of confidence. This feature is especially useful when individual words or topics from the input are more important than the full transcription. For example, it can be used with a customer support system to determine how to route or categorize a customer request.
  • Word alternatives (beta), confidence, and timestamps: Reports alternative words that are acoustically similar to the words that it transcribes, confidence levels for each of the words that it transcribes, and timestamps for the start and end of each word.
  • Maximum alternatives and interim results: Returns alternative and interim transcription results. The former provide different possible hypotheses; the latter represent interim hypotheses as the transcription progresses. In both cases, the service indicates final results in which it has the greatest confidence.
  • Profanity filtering: Censors profanity from US English transcriptions by default. You can use the filtering to sanitize the service's output.
  • Smart formatting (beta): Converts dates, times, numbers, phone numbers, and currency values in final transcripts of US English audio into more readable, conventional forms.
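
As a hedged sketch of how several of these output features can be combined in a single request, the example below sets the corresponding query parameters (keywords, keywords_threshold, word_alternatives_threshold, timestamps, smart_formatting, and speaker_labels). The parameter names follow the service's API reference, but verify them, and the endpoint assumptions, against the current documentation.

    import requests

    USERNAME = "your-service-username"   # hypothetical placeholders
    PASSWORD = "your-service-password"
    URL = "https://stream.watsonplatform.net/speech-to-text/api"

    params = {
        "model": "en-US_BroadbandModel",
        "keywords": "refund,cancel order",   # keyword spotting (beta)
        "keywords_threshold": 0.5,           # minimum confidence for a keyword match
        "word_alternatives_threshold": 0.9,  # word alternatives (beta)
        "timestamps": "true",                # start/end time of each word
        "smart_formatting": "true",          # smart formatting (beta)
        "speaker_labels": "true",            # speaker labels (beta)
    }

    with open("audio-file.flac", "rb") as audio:
        response = requests.post(
            URL + "/v1/recognize",
            params=params,
            headers={"Content-Type": "audio/flac"},
            data=audio,
            auth=(USERNAME, PASSWORD),
        )
    print(response.json())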

For information about the pricing plans available for the service, see the Speech to Text service in Bluemix®.

Use cases

The Speech to Text service can be used in any application that accepts speech or audio files as input and produces text as output. Examples of such applications include:

  • Voice control of applications, embedded devices, vehicle accessories, and so on
  • Transcribing meetings and conference calls
  • Dictating email messages, notes, and so on

For examples of the service in action, see:

  • A quick demo of the Speech to Text service that lets you transcribe text from streaming audio input or from a file that you upload.
  • Applications in Watson Developer Cloud Starter Kits that demonstrate the Speech to Text service.
  • An Application Starter Kit that uses the Speech to Text and AlchemyLanguage services to demonstrate live audio analysis.
  • The IBM Watson blog post Getting robots to listen: Using Watson's Speech to Text service that shows how to use the service's WebSocket interface with Python to transcribe speech from audio. The post provides a thorough tutorial that demonstrates the steps and the code involved; a compact WebSocket sketch follows this list.
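
In the same spirit as that post, here is a compact sketch of the WebSocket flow with the third-party websocket-client package: open the connection, send a start message, stream the audio as binary frames, and finish with a stop message. The token-based authentication step is elided and the token value is a placeholder; treat the connection details as assumptions to verify against the API reference.

    import json
    from websocket import create_connection  # pip install websocket-client

    # A watson-token (obtained from the authorization service) is assumed here.
    TOKEN = "your-watson-token"
    URL = ("wss://stream.watsonplatform.net/speech-to-text/api/v1/recognize"
           "?watson-token=" + TOKEN)

    ws = create_connection(URL)

    # Start the recognition request and ask for interim results.
    ws.send(json.dumps({
        "action": "start",
        "content-type": "audio/flac",
        "interim_results": True,
    }))

    # Stream the audio as binary messages.
    with open("audio-file.flac", "rb") as audio:
        while True:
            chunk = audio.read(4096)
            if not chunk:
                break
            ws.send_binary(chunk)

    # Signal that no more audio is coming, then read responses: the service
    # first confirms that it is listening, then returns interim and final
    # results as JSON text messages.
    ws.send(json.dumps({"action": "stop"}))
    while True:
        message = json.loads(ws.recv())
        print(message)
        if message.get("results") and message["results"][0].get("final"):
            break
    ws.close()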

Finally, the Speech to Text FAQ answers questions that are commonly asked by users of the service.

Questions and feedback

We are always looking to improve and learn from your experience with our services:

  • Ask programming-related questions in the Watson forums on Stack Overflow.
  • Submit comments or ask product-related questions about this service in the Watson forum on dW Answers.
  • Read general posts about Watson services that are written by IBM researchers, developers, and other experts on the Watson blog.