Customizing Watson Speech Library for Embed

To demonstrate basic usage of both services, this example uses Text to Speech to synthesize an audio file and then passes that file through Speech to Text to recognize the utterances. Customization is shown for both services.

For details on STT customization features and support for next generation models, view the STT API docs. Note that acoustic model customization is not supported.

For details on TTS customization features and support, view the TTS API docs. Note that custom pronunciation is supported but not voice transformation.

Dependencies

  1. S3 Compatible Storage

    An S3-compatible storage service must exist that supports HMAC (access key and secret key) credentials. Watson Speech requires one bucket that it can read objects from and write objects to. At install time, the bucket is populated with stock models and additional training data (to facilitate customization for some of the recent STT models). This additional data demands more storage and longer loading times. The bucket also stores customization artifacts, including custom training data and trained models.

  2. PostgreSQL Database

    A PostgreSQL database is required to manage metadata related to customization.

  3. Kubernetes Cluster

    The Speech services are assumed to be running in a Kubernetes cluster. The commands below take advantage of the kubectl proxy command to route traffic to the services installed in the cluster.

  4. Installs of Watson Text to Speech and Watson Speech to Text Libraries for Embed

    Installing the Speech Embed services with customization requires setting a number of configurations. To make the installation easier, there are Helm charts provided on GitHub at IBM/ibm-watson-embed-charts. For details on how to install see the STT Run with Helm page and the TTS Run with Helm page.

Customization Example

  1. Start a local proxy server to route requests to the services installed in the cluster:

    kubectl proxy
    
  2. Create a new Text to Speech customization

    You create a customization for a specific language, not for a specific voice. A customization can be used with any voice for its specified language. Omit the language parameter to use the default language, en-US.

    Note that a header must be passed for customization requests. The header key is X-Watson-UserInfo and the required value is bluemix-instance-id=$UUID where $UUID is formatted as a string like xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.

    To facilitate copy-pasting the commands below, export the following variables:

    export NAMESPACE=<your-namespace>
    export INSTALL_NAME=<your-install-name-used-with-helm-chart>
    export INSTANCE_ID="00000000-0000-0000-0000-000000000000"
    

    Note that Kubernetes service names have a character limit, so if you used a nameOverride during installation, the service name in the URL below (the part after INSTALL_NAME) may need to be changed.

    curl -X POST "http://localhost:8001/api/v1/namespaces/${NAMESPACE}/services/https:${INSTALL_NAME}-ibm-watson-tts-embed-customization:https/proxy/text-to-speech/api/v1/customizations" \
      --header "x-watson-userinfo: bluemix-instance-id=${INSTANCE_ID}" \
      --header "Content-Type: application/json" \
      --data '{"name":"MyCustomModel", "language":"en-US", "description": "First example custom language model with acronym translations"}'
    
    {"customization_id": "0fbee6df-7b4a-40b9-a6bc-9ccdcd42fb42"}
    

    Extract the customization id, for example:

    export CUSTOMIZATION_ID="0fbee6df-7b4a-40b9-a6bc-9ccdcd42fb42"
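    If you want to extract the id programmatically rather than copy-pasting it, a small Python helper can pull it out of the JSON response. This is a minimal sketch; the response shape is taken from the example output above:

```python
import json

def extract_customization_id(response_body: str) -> str:
    """Parse the JSON returned by the POST /customizations call
    and return its customization_id field."""
    return json.loads(response_body)["customization_id"]

# Using the example response shown above:
body = '{"customization_id": "0fbee6df-7b4a-40b9-a6bc-9ccdcd42fb42"}'
print(extract_customization_id(body))  # → 0fbee6df-7b4a-40b9-a6bc-9ccdcd42fb42
```

    The same extraction can be done on the command line with a tool such as jq, if it is available.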
    
  3. View the list of customizations

    curl "http://localhost:8001/api/v1/namespaces/${NAMESPACE}/services/https:${INSTALL_NAME}-ibm-watson-tts-embed-customization:https/proxy/text-to-speech/api/v1/customizations" \
      --header "x-watson-userinfo: bluemix-instance-id=${INSTANCE_ID}"
    
  4. Update your model with custom word-translation pairs

    curl -X POST "http://localhost:8001/api/v1/namespaces/${NAMESPACE}/services/https:${INSTALL_NAME}-ibm-watson-tts-embed-customization:https/proxy/text-to-speech/api/v1/customizations/${CUSTOMIZATION_ID}/words" \
      --header "x-watson-userinfo: bluemix-instance-id=${INSTANCE_ID}" \
      --header "Content-Type: application/json" \
      --data '{"words": [
        {"word": "NCAA", "translation": "N C double A"},
        {"word": "iPhone", "translation": "I phone"},
        {"word": "BTW", "translation": "By the way"},
        {"word": "NYSE", "translation": "New York Stock Exchange"},
        {"word": "TTS", "translation": "Text to Speech"}
      ]}'
    
      {}  # an empty JSON document indicates success
    

    View the customization model. You should see the list of word-translation pairs.

    curl "http://localhost:8001/api/v1/namespaces/${NAMESPACE}/services/https:${INSTALL_NAME}-ibm-watson-tts-embed-customization:https/proxy/text-to-speech/api/v1/customizations/${CUSTOMIZATION_ID}" \
      --header "x-watson-userinfo: bluemix-instance-id=${INSTANCE_ID}"
    
  5. Use the updated model in a /synthesize call

    curl "http://localhost:8001/api/v1/namespaces/${NAMESPACE}/services/https:${INSTALL_NAME}-ibm-watson-tts-embed-runtime:https/proxy/text-to-speech/api/v1/synthesize?customization_id=$CUSTOMIZATION_ID" \
      --header "x-watson-userinfo: bluemix-instance-id=${INSTANCE_ID}" \
      --header "Content-Type: application/json" \
      --data '{"text":"This is a simple test of the IBM TTS product. My favorite team reached the NCAA tournament’s final four. I’m thinking of getting a new iPhone next week. BTW. Companies listed in NYSE are showing mixed results."}' \
      --header "Accept: audio/wav" \
      --output tts-result.wav
    

    You can play the output on a Mac:

    afplay tts-result.wav
    

    To see any errors returned while synthesizing the audio, remove the --output flag.

  6. Send the synthesized audio through Speech-to-Text

    curl "http://localhost:8001/api/v1/namespaces/${NAMESPACE}/services/https:${INSTALL_NAME}-ibm-watson-stt-embed-runtime:https/proxy/speech-to-text/api/v1/recognize" \
    --header "Content-Type: audio/wav" \
    --data-binary @tts-result.wav
    

    Notice that the transcript does not recognize the acronyms, producing "i b m" and "nc double a", and it doesn't format "iphone" as iPhone.
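    To inspect the transcript programmatically, the JSON returned by /recognize can be walked with a short Python helper. This is a sketch assuming the standard STT response shape (results → alternatives → transcript); the sample fragment below is illustrative, not actual service output:

```python
import json

def transcripts(recognize_response: str) -> list:
    """Collect the top-alternative transcript from each result block
    of a /recognize JSON response."""
    doc = json.loads(recognize_response)
    return [r["alternatives"][0]["transcript"] for r in doc.get("results", [])]

# Illustrative response fragment (not the actual output of the call above):
sample = '{"results": [{"alternatives": [{"transcript": "my favorite team reached the nc double a tournament"}]}]}'
print(transcripts(sample))
```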

  7. Create a Speech to Text custom model to recognize these acronyms

    curl -X POST "http://localhost:8001/api/v1/namespaces/${NAMESPACE}/services/https:${INSTALL_NAME}-ibm-watson-stt-embed-customization:https/proxy/speech-to-text/api/v1/customizations" \
      --header "x-watson-userinfo: bluemix-instance-id=${INSTANCE_ID}" \
      --header "Content-Type: application/json" \
      --data '{
        "name":"MyCustomModel",
        "description": "First example language model with custom words",
        "base_model_name": "en-US_Multimedia"
      }'
    
    {"customization_id": "5859a77f-3329-4cdf-948c-28279cd8530b"}
    
    export STT_CUSTOMIZATION_ID="5859a77f-3329-4cdf-948c-28279cd8530b"
    
  8. View the list of customizations

    curl "http://localhost:8001/api/v1/namespaces/${NAMESPACE}/services/https:${INSTALL_NAME}-ibm-watson-stt-embed-customization:https/proxy/speech-to-text/api/v1/customizations" \
      --header "x-watson-userinfo: bluemix-instance-id=${INSTANCE_ID}"
    

    Notice that the model you just created has status Pending. This means the model has been created but is waiting for training data to be added and analyzed.

  9. Update the model with custom word-sound pairs

    curl "http://localhost:8001/api/v1/namespaces/${NAMESPACE}/services/https:${INSTALL_NAME}-ibm-watson-stt-embed-customization:https/proxy/speech-to-text/api/v1/customizations/${STT_CUSTOMIZATION_ID}/words" \
      -X POST \
      --header "x-watson-userinfo: bluemix-instance-id=${INSTANCE_ID}" \
      --header "Content-Type: application/json" \
      --data '{"words": [
        {"word": "IBM", "sounds_like": ["I B M"]},
        {"word": "NCAA", "sounds_like": ["N C double A", "NC double A"]},
        {"word": "iPhone", "sounds_like": ["i phone", "iphone"]},
        {"word": "BTW", "sounds_like": ["by the way"]},
        {"word": "NYSE", "sounds_like": ["New York Stock Exchange"]},
        {"word": "TTS", "sounds_like": ["Text to Speech"]}
      ]}'
    

    View the customization model:

    curl "http://localhost:8001/api/v1/namespaces/${NAMESPACE}/services/https:${INSTALL_NAME}-ibm-watson-stt-embed-customization:https/proxy/speech-to-text/api/v1/customizations/${STT_CUSTOMIZATION_ID}" \
      --header "x-watson-userinfo: bluemix-instance-id=${INSTANCE_ID}"
    

    Notice that the status of the model is Ready. This means the model has data but must be trained before it becomes Available.

  10. Train the model

    curl -X POST "http://localhost:8001/api/v1/namespaces/${NAMESPACE}/services/https:${INSTALL_NAME}-ibm-watson-stt-embed-customization:https/proxy/speech-to-text/api/v1/customizations/${STT_CUSTOMIZATION_ID}/train" \
      --header "x-watson-userinfo: bluemix-instance-id=${INSTANCE_ID}"
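    Training runs asynchronously: the model moves from Ready to Training and then to Available. A small polling helper can wait for that transition. This is a sketch, not part of the product: fetch_status stands in for a GET on the customization URL shown above and is injected as a callable so the loop itself stays testable; the lowercase status values follow the STT API (pending, ready, training, available, failed):

```python
import time
from typing import Callable

def wait_until_available(fetch_status: Callable[[], str],
                         interval: float = 10.0,
                         timeout: float = 600.0) -> str:
    """Poll fetch_status() until it reports a terminal state
    ('available' or 'failed'), sleeping `interval` seconds between
    checks, for at most `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in ("available", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError("model did not finish training in time")

# Example with a stubbed status sequence instead of real HTTP calls:
states = iter(["training", "training", "available"])
print(wait_until_available(lambda: next(states), interval=0.0))  # → available
```

    In practice, fetch_status would issue the same GET request used in the "View the customization model" step and read the status field from the JSON response.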
    
  11. Use the model in a /recognize call

    curl "http://localhost:8001/api/v1/namespaces/${NAMESPACE}/services/https:${INSTALL_NAME}-ibm-watson-stt-embed-runtime:https/proxy/speech-to-text/api/v1/recognize?customization_id=${STT_CUSTOMIZATION_ID}" \
      --header "x-watson-userinfo: bluemix-instance-id=${INSTANCE_ID}" \
      --header "Content-Type: audio/wav" \
      --data-binary @tts-result.wav