
Getting robots to listen: Using Watson’s Speech to Text service


Overview

This is the third article in a series of posts documenting how a team of six interns used IBM Watson to program robots to play poker.

In the previous article, we introduced the Watson services available to developers and showed how to interact with them through Watson Developer Cloud. In this article, we'll show how we used the Speech to Text service to extract speech from audio.

The Speech to Text service

IBM Watson's Speech to Text service takes an audio stream as input and returns the speech that was detected as output. It has a few extra features as well, including profanity filtering, formatting, word confidence, and more.


There are three ways to interact with the Speech to Text service: via WebSockets, via a session-based REST API, and via a sessionless REST API. The REST APIs can be used through Watson Developer Cloud. We'll cover how to use Speech to Text via Watson Developer Cloud and via WebSockets.

Setting up

Before we write any code, we need to create a Speech to Text service in Bluemix to interact with. Once you’ve made the service, you’ll authenticate with it and then you can use it to extract speech from audio.

Navigate to bluemix.net, log in, and go to the Dashboard. From the Dashboard, click Use Services or APIs:

[Screenshot: Bluemix dashboard]

Scroll down and find the Watson Speech to Text service, and click View More:

[Screenshot: Speech to Text in the services catalog]

Click Create to create the service:

[Screenshot: Creating the Speech to Text service]

Once the service has been created, click Service Credentials:

[Screenshot: The newly created service]

On this page are your credentials, which you will use to authenticate with the Speech to Text service:

[Screenshot: Service credentials]
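The credentials are a small JSON block. The values shown here are placeholders, but the shape should look roughly like this:

{
  "url": "https://stream.watsonplatform.net/speech-to-text/api",
  "username": "your username",
  "password": "your password"
}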

Now that the service is created, you can also see it by clicking Dashboard. Your Speech to Text service is ready to go!

[Screenshot: Dashboard showing the new Speech to Text service]

Using Speech to Text with Watson Developer Cloud

Watson Developer Cloud makes it dead simple to interact with your new Speech to Text service. Make sure you have watson_developer_cloud installed, then make a new file, stt.py. You can check that the SDK is installed correctly by importing the Speech to Text client:

 

from watson_developer_cloud import SpeechToTextV1

If there are no errors when you run the file, then you have installed Watson Developer Cloud correctly.
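If the import fails, the SDK probably isn't installed yet. At the time of writing it is published on PyPI as watson-developer-cloud (note the hyphens), so installing it should look something like this:

~$ sudo pip install watson-developer-cloud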

IBM has several SDKs available to easily interact with the Watson services. We’re using the Python SDK. Sometimes it’s helpful to browse the source code, so here’s a link to the GitHub repository.

Now let's extract some speech from an audio file. I recorded myself saying "the quick red fox jumps over the lazy brown dog" and saved it as clip.wav in the same directory as stt.py. Let's use Watson Developer Cloud to recognize the audio in that clip:

 

from watson_developer_cloud import SpeechToTextV1
import json

stt = SpeechToTextV1(username="your username", password="your password")

audio_file = open("clip.wav", "rb")

print json.dumps(stt.recognize(audio_file, content_type="audio/wav"), indent=2)

This code calls the recognize() function, which sends the audio file over and receives a response containing a transcript of the speech Watson found in the audio clip. Here are the results:

 

{
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.973, 
          "transcript": "the quick red fox jumps over the lazy brown dog "
        }
      ], 
      "final": true
    }
  ], 
  "result_index": 0
}

With 97.3% confidence, Watson got "the quick red fox jumps over the lazy brown dog". Cool!
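The extra features mentioned earlier (profanity filtering, formatting, word confidence, and so on) are exposed as optional parameters on the same recognize() call. The exact keyword argument names depend on your SDK version, so treat the ones below as an illustrative sketch and check the SDK documentation if the call rejects them:

from watson_developer_cloud import SpeechToTextV1
import json

stt = SpeechToTextV1(username="your username", password="your password")

audio_file = open("clip.wav", "rb")

# Parameter names assumed from the service's documented options; verify
# them against your version of the SDK.
result = stt.recognize(audio_file, content_type="audio/wav",
                       word_confidence=True,    # per-word confidence scores
                       timestamps=True,         # start/end times for each word
                       profanity_filter=False)  # don't censor profanity

print json.dumps(result, indent=2)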

The disadvantage of using the Speech to Text REST API via Watson Developer Cloud is latency. In our project, we can't afford to spend a few seconds saving an audio file to disk, uploading it to Watson, and waiting for a response. Instead, we need to stream audio directly to Watson and get responses back as we go. To solve this problem, Watson offers another way to use the Speech to Text service: WebSockets.

Using Speech to Text with WebSockets

WebSocket is a protocol that allows two-way communication between two hosts over a single connection. By using WebSocket, we can stream audio to the Watson Speech to Text service while also getting responses back at the same time.

To begin using WebSocket in Python, first we must install a package called ws4py (short for WebSocket-for-Python):

 

~$ sudo pip install ws4py

To connect to the Speech to Text service through a WebSocket, we will be using the ws4py.client.threadedclient.WebSocketClient class. Specifically, we will create a subclass of WebSocketClient and override some of its functions. To begin, let’s create some function stubs:

 

from ws4py.client.threadedclient import WebSocketClient

class SpeechToTextClient(WebSocketClient):
    def __init__(self):
        pass

    def opened(self):
        pass

    def received_message(self, message):
        pass

stt_client = SpeechToTextClient()

Here’s what those function stubs do: opened() is called when communication through the WebSocket becomes available and received_message() is called when we receive a message from the other host. We can also use the WebSocketClient.send() function to send a message to the other host.

Let’s start communicating with the Speech to Text service. To begin, we send a message containing the JSON string {"action": "start"}. This tells the Speech to Text service to start listening for audio (docs).

 

from ws4py.client.threadedclient import WebSocketClient
import base64, time

class SpeechToTextClient(WebSocketClient):
    def __init__(self):
        ws_url = "wss://stream.watsonplatform.net/speech-to-text/api/v1/recognize"

        username = "your username"
        password = "your password"
        auth_string = "%s:%s" % (username, password)
        base64string = base64.encodestring(auth_string).replace("\n", "")

        try:
            WebSocketClient.__init__(self, ws_url,
                headers=[("Authorization", "Basic %s" % base64string)])
            self.connect()
        except:
            print "Failed to open WebSocket."

    def opened(self):
        self.send('{"action": "start", "content-type": "audio/l16;rate=16000"}')

    def received_message(self, message):
        print message

stt_client = SpeechToTextClient()
time.sleep(3)
stt_client.close()

 

If you run this code, you should get {"state": "listening"} as output. In opened(), we tell the Speech to Text service what format our audio will be streamed in by including "content-type": "audio/l16;rate=16000" in the start message. The service replies with the listening state message, which arrives in the received_message function, where we print it. Now we know that we are successfully communicating with the service.
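The start action has a matching stop action: according to the Speech to Text WebSocket docs, sending {"action": "stop"} tells the service that no more audio is coming, so it can return any final results. The example below simply closes the connection when we're done, but a more graceful shutdown might send the stop message first, along the lines of:

self.send('{"action": "stop"}')  # let the service return final results before we close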

Now we need to stream audio to the Watson service and get back what speech was detected in the audio. Our robots run Linux, so we have access to the arecord command. We’ll use that to retrieve and send audio data.

 

from ws4py.client.threadedclient import WebSocketClient
import base64, json, ssl, subprocess, threading, time

class SpeechToTextClient(WebSocketClient):
    def __init__(self):
        ws_url = "wss://stream.watsonplatform.net/speech-to-text/api/v1/recognize"

        username = "your username"
        password = "your password"
        auth_string = "%s:%s" % (username, password)
        base64string = base64.encodestring(auth_string).replace("\n", "")

        self.listening = False

        try:
            WebSocketClient.__init__(self, ws_url,
                headers=[("Authorization", "Basic %s" % base64string)])
            self.connect()
        except:
            print "Failed to open WebSocket."

    def opened(self):
        self.send('{"action": "start", "content-type": "audio/l16;rate=16000"}')
        self.stream_audio_thread = threading.Thread(target=self.stream_audio)
        self.stream_audio_thread.start()

    def received_message(self, message):
        message = json.loads(str(message))
        if "state" in message:
            if message["state"] == "listening":
                self.listening = True
        print "Message received: " + str(message)

    def stream_audio(self):
        while not self.listening:
            time.sleep(0.1)

        reccmd = ["arecord", "-f", "S16_LE", "-r", "16000", "-t", "raw"]
        p = subprocess.Popen(reccmd, stdout=subprocess.PIPE)

        while self.listening:
            data = p.stdout.read(1024)

            try: self.send(bytearray(data), binary=True)
            except ssl.SSLError: pass

        p.kill()

    def close(self):
        self.listening = False
        self.stream_audio_thread.join()
        WebSocketClient.close(self)

try:
    stt_client = SpeechToTextClient()
    raw_input()
finally:
    stt_client.close()

 

Now, if you run this code and say something into your microphone, you'll get output like this:

 

Message received: {u'state': u'listening'}
Recording raw data 'stdin' : Signed 16 bit Little Endian, Rate 16000 Hz, Mono
Message received: {u'results': [{u'alternatives': [{u'confidence': 0.82, u'transcript': u'hello '}], u'final': True}], u'result_index': 0}
Message received: {u'state': u'listening'}

Awesome! Watson detected your speech!
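For our poker project we only care about the transcript string itself, not the whole JSON envelope. Here is a minimal sketch of pulling it out inside received_message(), using the field names from the results shown above:

    def received_message(self, message):
        message = json.loads(str(message))
        if "state" in message:
            if message["state"] == "listening":
                self.listening = True
            return
        # Grab the top transcript from the first result, if there is one.
        results = message.get("results", [])
        if results and results[0].get("alternatives"):
            transcript = results[0]["alternatives"][0]["transcript"]
            print "Heard: " + transcript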

If you have any questions, feel free to post on DeveloperWorks Answers or join the Slack community (get an invite here). In the next article, we'll talk about how we use the output of Speech to Text to figure out what people are trying to say, using the Watson Natural Language Classifier.


More stories

A recap: here’s what you missed at this year’s BoxWorks

October 14, 2019 | AI for the Enterprise

At BoxWorks 2019, we were able to showcase the IBM and Box partnership, along with how it works and what’s in store for the future. ...read more


AIconics names IBM Watson Discovery Best Innovator in Natural Language Processing

June 20, 2019 | AI for the Enterprise, Discovery and Exploration

On June 11, the world’s only independently judged enterprise AI awards – the AIconics – named Watson Discovery the winner for “Best Innovation in NLP.” Natural Language Processing is the area of computer science and AI that governs the interaction between computers and human languages. Specifically, NLP concerns how computers process and analyze unstructured natural language data. ...read more


IBM Watson Assistant gets smarter and faster, making customer service a breeze

June 20, 2019 | AI for the Enterprise, Conversational Services

We're excited to announce new Watson Assistant features that are designed to change the way businesses interact with their users. Watson Assistant not only helps answer customer questions quickly and accurately, but it also ensures that employees are empowered to do their jobs efficiently. ...read more