Using the WebSocket interface

The WebSocket interface of the Speech to Text service is the most natural way for a client to interact with the service. It has a number of advantages over the HTTP interface:

  • The WebSocket interface, unlike the REST interface, provides a single-socket, full-duplex communication channel. The interface lets the client send requests and audio to the service and receive results over a single connection in an asynchronous fashion.

  • It provides a much simpler and more powerful programming experience. The service can send event-driven responses to the client's messages, eliminating the need for the client to poll the server.

  • It reduces latency. Recognition results arrive faster because the service sends them directly to the client.

  • It reduces network utilization. The WebSocket protocol is very lightweight, and it requires only a single connection to perform live recognition. When you use sessions with the REST interface, by contrast, you need at least four connections to achieve the same results.

  • It enables audio to be streamed directly from browsers (HTML5 WebSocket clients) to the service.

The WebSocket interface uses the recognize method to establish a connection with the service. It then relies on text and binary messages sent over the persistent connection to initiate and manage recognition requests. (If your application needs to call the models method, you must use the HTTP interface.)

For information about the steps you need to follow to use the WebSocket interface, see Making a recognition request. Subsequent sections describe WebSocket return codes and present an Example WebSocket session that shows example messages exchanged between the client and the service.

The snippets of example code in the following sections are written in JavaScript and are based on the HTML5 WebSocket API. For more information about the WebSocket protocol, see the Internet Engineering Task Force (IETF) Request for Comments (RFC) 6455.

Making a recognition request

The recognition request and response cycle comprises the following steps:

  1. Opening a connection and passing credentials

  2. Initiating a recognition request

  3. Sending audio and receiving recognition results

  4. Ending a recognition request

  5. Keeping a connection alive

  6. Closing a connection

When the client sends data to the service, it must pass all JSON messages as text messages and all audio data as binary messages.

Opening a connection and passing credentials

The Speech to Text service uses the WebSocket Secure (WSS) protocol to make the recognize method available at the following endpoint:

wss://stream.watsonplatform.net/speech-to-text/api/v1/recognize

A WebSocket client calls this method with the following parameters to establish an authenticated connection with the service.

  • watson-token (query parameter, string, optional): Passes a valid authentication token instead of passing the service credentials with the call. You can instead use the X-Watson-Authorization-Token header to pass the token, but you must pass a token in one of these two ways. You pass a token only to establish an authenticated connection with the service. Once you establish a connection, you can keep it alive indefinitely; as long as the connection remains open, you do not need to pass the token with subsequent calls. For more information, see Using tokens with Watson services.

  • model (query parameter, string, optional): Specifies the language and model to be used for transcription. If you do not specify a model, the service uses the en-US_BroadbandModel model by default. For more information, see Languages and models.

  • customization_id (query parameter, string, optional): Specifies the Globally Unique Identifier (GUID) of a custom language model that is to be used for all requests sent over the connection. The base language model of the custom model must match the value of the model parameter. By default, no custom model is used. For more information, see Custom language models.

  • x-watson-learning-opt-out (query parameter, boolean, optional): Specifies whether the service logs requests and results sent over the connection, which it does by default to improve the service for future users. Specify true or 1 to prevent the service from logging the data. You can also opt out of request logging by passing a value of true with the X-Watson-Learning-Opt-Out request header; for more information, see Controlling request logging for Watson services.
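
For example, the client might first obtain a token from the Watson authorization service and then pass it as the watson-token query parameter. The following is a minimal sketch only; it assumes the authorization endpoint described in Using tokens with Watson services and a browser that supports the Fetch API. Replace {username} and {password} with your service credentials.

// Sketch: request an authentication token for the Speech to Text service.
// The endpoint is the one described in Using tokens with Watson services.
fetch('https://stream.watsonplatform.net/authorization/api/v1/token' +
      '?url=https://stream.watsonplatform.net/speech-to-text/api', {
   headers: { 'Authorization': 'Basic ' + btoa('{username}:{password}') }
})
   .then(function(response) { return response.text(); })
   .then(function(token) {
      // Use the token as the watson-token query parameter of the recognize method
   });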

The following snippet of JavaScript code opens a connection with the service. The call to the recognize method passes the watson-token and model query parameters, the latter to direct the service to use the Spanish broadband model. The code then defines the event listeners (onOpen, onClose, and so on) that respond to events from the service once the connection is established.

// Authentication token obtained as described in Using tokens with Watson services
var token = '<authentication-token>';
// Pass the token and the Spanish broadband model as query parameters
var wsURI = 'wss://stream.watsonplatform.net/speech-to-text/api/v1/recognize?watson-token=' +
  token + '&model=es-ES_BroadbandModel';
var websocket = new WebSocket(wsURI);
// Define the event listeners that respond to events from the service
websocket.onopen = function(evt) { onOpen(evt) };
websocket.onclose = function(evt) { onClose(evt) };
websocket.onmessage = function(evt) { onMessage(evt) };
websocket.onerror = function(evt) { onError(evt) };

Initiating a recognition request

To initiate a recognition request, the client sends a text message to the service over the established connection. The client must send this message before it sends any audio to be transcribed. The message must include the following two parameters.

  • action (string, required): The value must be start to begin the request. Later sections describe the other possible values (stop and no-op).

  • content-type (string, required): The format (MIME type) of the audio data:
      • audio/flac for Free Lossless Audio Codec (FLAC)
      • audio/l16 for Linear 16-bit Pulse-Code Modulation (PCM)
      • audio/wav for Waveform Audio File Format (WAV)
      • audio/ogg;codecs=opus for Ogg format files that use the opus codec
      • audio/mulaw for mu-law (or u-law) audio data
      • audio/basic for basic audio files
    For more information, see Audio formats.

The message can also include optional parameters to specify additional aspects of how the request is to be processed and the information that is to be returned. For more information, see Input features and parameters and Output features and parameters. For detailed information about the interface, see the API reference. Note that a language model and a custom language model can be specified only as query parameters of the WebSocket URL.

The following snippet of JavaScript code sends initialization parameters for the recognition request over the WebSocket connection. The calls are included in the onOpen function defined for the client to ensure that they are sent only after the connection is established.

function onOpen(evt) {
   var message = {
      'action': 'start',
      // audio/l16 (PCM) requires an explicit sampling rate, here 22,050 samples per second
      'content-type': 'audio/l16;rate=22050'
   };
   websocket.send(JSON.stringify(message));
}
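
A start message can also combine the required parameters with optional ones. The following sketch is illustrative only; it adds the interim_results and inactivity_timeout parameters, both of which appear elsewhere in this documentation, to the same request:

function onOpen(evt) {
   var message = {
      'action': 'start',
      'content-type': 'audio/l16;rate=22050',
      // Optional: ask the service to return interim hypotheses
      'interim_results': true,
      // Optional: allow up to 60 seconds of audio without detected speech
      'inactivity_timeout': 60
   };
   websocket.send(JSON.stringify(message));
}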

If it receives the request successfully, the service returns the following text message:

{"state": "listening"}

If you specify an invalid query parameter or JSON field as part of the input for a recognition request, the JSON that is returned by the service includes a warnings field that describes and lists each invalid argument. The request succeeds despite the warnings.

Sending audio and receiving recognition results

After it sends the initial start message, the client can begin sending the audio data to the service. The client does not need to wait for the service to respond to the start message. The service returns the results of the transcription asynchronously in the same format as it returns results for the HTTP API.

The client must send the audio as binary data. The client can stream a maximum of 100 MB of audio data over a connection. The WebSocket interface imposes a maximum frame size of 4 MB. The client can set the maximum frame size to less than 4 MB or, if that is not practical, set the maximum message size to less than 4 MB and send the audio data as a sequence of messages.

The following snippet of JavaScript code sends audio data to the service as a binary message (blob):

websocket.send(blob);
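
If the audio is too large for a single frame, the client can send it as a sequence of binary messages, as described above. The following is a minimal sketch that assumes the audio is available as a single large Blob; the 2 MB chunk size is illustrative:

// Send a large blob as a sequence of binary messages, keeping each
// message safely under the 4 MB frame limit
function sendAudioInChunks(websocket, blob) {
   var CHUNK_SIZE = 2 * 1024 * 1024; // 2 MB per message
   for (var offset = 0; offset < blob.size; offset += CHUNK_SIZE) {
      websocket.send(blob.slice(offset, offset + CHUNK_SIZE));
   }
}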

The following snippet receives recognition hypotheses that the service returns asynchronously. The results are handled in the onMessage function defined for the client.

function onMessage(evt) {
   console.log(evt.data);
}
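
In practice, the client typically parses each text message to distinguish state messages from recognition results and errors. One possible sketch, based on the message formats shown in this section:

function onMessage(evt) {
   var message = JSON.parse(evt.data);
   if (message.state === 'listening') {
      // The service is ready to receive audio or another request
   } else if (message.results) {
      // Log the best hypothesis from the latest result
      console.log(message.results[0].alternatives[0].transcript);
   } else if (message.error) {
      console.error('Error from service: ' + message.error);
   }
}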

Ending a recognition request

When it is done sending audio data to the service, the client must signal the end of the binary transmission to the service in one of two ways:

  • By sending a JSON text message with the action parameter set to the value stop:

    {"action": "stop"}
  • By sending an empty binary message, one in which the specified blob contains no data:

    websocket.send(new Blob([]));

After it returns the final result for the transcription to the client, the service returns another {"state":"listening"} message to the client. This message indicates that the service is ready to receive another recognition request. Before sending another request, the client must already have signaled the end of transmission for the previous request as just described. Otherwise, the service returns no new results.

By default, if the client sends subsequent recognition requests for additional audio data over the same WebSocket connection, the service continues to use the parameters sent with the previous start message. To change the parameters for subsequent requests within the same connection, the client sends another start request with the desired parameters after it receives the final recognition result and {"state":"listening"} message from the service.
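
For example, after it receives the final result and the {"state":"listening"} message, the client might send a new start message to change the audio format for the next request (the values shown are illustrative):

// Reconfigure the connection for the next recognition request
websocket.send(JSON.stringify({
   'action': 'start',
   'content-type': 'audio/flac'
}));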

Keeping a connection alive

The service terminates the session and closes the connection if the inactivity or session timeout is reached, as described in Timeouts. The inactivity timeout occurs if audio is being sent by the client but the service detects no speech. The inactivity timeout is 30 seconds by default; you can use the inactivity_timeout parameter to specify a different value.

The session timeout occurs if the service receives no data from the client or sends no interim results for 30 seconds. You cannot change the length of this timeout. However, the client can extend the session by sending the service a JSON text message with the action parameter set to the value no-op:

{"action": "no-op"}

This message touches the session and resets the timeout to keep the connection alive.
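
For example, a client that expects gaps between recognition requests might send a no-op message on a timer. The following sketch uses a 25-second interval, chosen to stay under the 30-second session timeout; the interval is illustrative:

// Touch the session every 25 seconds to keep the connection alive
var keepAlive = setInterval(function() {
   websocket.send(JSON.stringify({'action': 'no-op'}));
}, 25000);
// Call clearInterval(keepAlive) before closing the connection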

Closing a connection

When the client is done interacting with the service, it should close the WebSocket connection. It should close the connection only after it receives all results; once the connection is closed, the client can no longer use it to send requests or receive results. The connection eventually times out and closes if it is not closed explicitly. The following snippet of JavaScript code closes an open connection:

websocket.close();

WebSocket return codes

The service can send the following return codes to the client over the WebSocket connection:

  • 1000 indicates normal closure of the connection, meaning that the purpose for which the connection was established has been fulfilled.

  • 1002 indicates that the service is closing the connection due to a protocol error.

  • 1006 indicates that the connection was closed abnormally.

  • 1009 indicates that the frame size exceeded the 4 MB limit.

  • 1011 indicates that the service is terminating the connection because it encountered an unexpected condition that prevents it from fulfilling the request.

If the connection closes with an error, the service sends the client an informative message of the form {"error": "Specific error message"} before it closes the connection. For more information about WebSocket return codes, see IETF RFC 6455.
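
The client can inspect these codes in the onClose listener that it defined when it opened the connection. A minimal sketch:

function onClose(evt) {
   if (evt.code === 1000) {
      console.log('Connection closed normally');
   } else {
      // Abnormal closure; evt.reason may carry the service's explanation
      console.error('Connection closed with code ' + evt.code + ': ' + evt.reason);
   }
}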

Example WebSocket session

The following exchange shows an example WebSocket session between a client and the Speech to Text service. Each line indicates whether the message is sent from the Client or returned by the Server. The messages are shown in three separate pieces to make it easier to follow, but all three pieces represent a single session with the service. The example focuses on the exchange of messages and does not reflect opening and closing the connection.

In the first example, the client sends audio that contains the string Name the Mayflower. The client sends the audio in two chunks in PCM (audio/l16) format, for which it indicates the required sampling rate. Note that the client does not wait for the {"state":"listening"} response from the service to begin sending the audio data. Sending the data immediately reduces latency because the audio is available to the service as soon as it is ready to handle a recognition request.

Client>> {"action": "start", "content-type": "audio/l16;rate=22050"}
Client>> <audio data chunk>
Server<< {"state": "listening"}
Client>> <audio data chunk>
Client>> {"action": "stop"}
Server<< {"results": [{"alternatives": [{"transcript": "name the mayflower "}],"final": true}],"result_index": 0}
Server<< {"state":"listening"}

In the second example, the client sends audio that contains the string Second audio transcript. The client sends the audio in a single binary message and uses the same parameters that it specified for the first request.

Client>> <audio data chunk>
Client>> {"action": "stop"}
Server<< {"results": [{"alternatives": [{"transcript": "second audio transcript "}],"final": true}],"result_index": 0}
Server<< {"state":"listening"}

In the third example, the client again sends audio that contains the string Name the Mayflower. As with the first example, the client sends the audio in two chunks in PCM format. This time, the client asks the server to send interim results for the transcription.

Client>> {"action": "start", "content-type": "audio/l16;rate=22050", "interim_results": true}
Server<< {"state":"listening"}
Client>> <audio data chunk>
Server<< {"results": [{"alternatives": [{"transcript": "name "}],"final": false}],"result_index": 0}
Server<< {"results": [{"alternatives": [{"transcript": "name may "}],"final": false}],"result_index": 0}
Client>> <audio data chunk>
Client>> {"action": "stop"}
Server<< {"results": [{"alternatives": [{"transcript": "name may flour "}],"final": false}],"result_index": 0}
Server<< {"results": [{"alternatives": [{"transcript": "name the mayflower "}],"final": true}],"result_index": 0}
Server<< {"state":"listening"}