Freedom and flexibility with Speech-to-Text

By and Bhavik Shah | 4 minute read | December 4, 2017

Key points:
– IBM Watson’s Speech-to-Text service helps you go deeper than out-of-the-box solutions allow by providing the tooling and functionality to train Watson to learn the language of your business
– New language model customization, customization weighting and acoustic model customization features provide the flexibility you need to create effective solutions for your unique domain needs

Your business is unique. It has its own language and modes of operation that require customizable solutions to help your people leverage their industry expertise and deliver value. When it comes to speech-to-text solutions, an out-of-the-box service isn’t enough. Your business needs the freedom and flexibility to create solutions that account for your unique industry and domain needs. Consider the following examples:

  • A call center that needs to transcribe thousands of recorded conversations between customers and agents to identify and analyze common call patterns and issues
  • A medical service that wants to create a dictation application for doctors to more readily capture and log patient diagnoses and treatment notes
  • A retail company that is looking to extend its sales engagement with customers through an online conversational application

Each of these scenarios involves words, phrases, terms, product names and domain-specific jargon that an out-of-the-box service might not fully understand. The acoustic environment also varies greatly from one scenario to the next, and those variations degrade transcription accuracy unless they are properly accounted for.

A baseline speech recognition service covers everyday vernacular and assumes that your audio files will be crisply recorded without interference. IBM Watson’s Speech-to-Text service goes further by providing the tooling and functionality to train Watson to learn your business. What makes Watson your best choice?

Language model customization

The ability to tailor our base language models to suit your specific domain terminology is enormously powerful, because no one knows the language of your business better than you. That language is one of your differentiating factors in the market, so having a language model customization interface at your fingertips allows you to train the Watson Speech-to-Text service to recognize your business language and style more accurately.

Language model customization weighting

Beta release: 10/2/2017

You can get even more precise in customizing your language model by setting how much weight the Watson Speech-to-Text service gives to the words in your custom model (such as product names or specific terminology that is spoken frequently in your business), as opposed to words already in the service’s base vocabulary. You can set this weight when you train the model, and you can also set it for each speech recognition request as needed. This setting is optional, but depending on your speech recognition needs, this level of language tuning could greatly improve the accuracy of your application.
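As a rough illustration of per-request weighting, the sketch below builds the query string for a recognition request. It assumes the parameter names `customization_id` and `customization_weight`, a valid weight range of 0.0 to 1.0 and a default of 0.3; these come from our reading of the service's public API and should be checked against the current documentation. The helper `build_recognize_query` and the model ID are hypothetical, and the code only builds a string without calling the service.

```python
from urllib.parse import urlencode


def build_recognize_query(customization_id: str,
                          customization_weight: float = 0.3) -> str:
    """Build a query string that selects a custom language model and its weight.

    Parameter names, range and default are assumptions about the service's
    API, not values confirmed by this article.
    """
    # Assumed valid range: 0.0 (ignore custom words) to 1.0 (favor them most).
    if not 0.0 <= customization_weight <= 1.0:
        raise ValueError("customization_weight must be between 0.0 and 1.0")
    params = {
        "customization_id": customization_id,
        "customization_weight": customization_weight,
    }
    return urlencode(params)


# Hypothetical model ID; a higher weight favors the custom vocabulary.
print(build_recognize_query("my-custom-model-id", 0.7))
```

A higher weight makes sense when your audio is dominated by domain terms; a lower one is safer when everyday speech is mixed in.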

Acoustic model customization

Beta release: 10/2/2017

As a complement to our language model customization interface, we now also offer a custom model interface that addresses the acoustic side of your business, helping you go even further in tailoring the service to your needs. Think of it as fine-tuning ‘the ear’ of our service, by training Watson to adapt to your specific acoustic environment (like the ambient noise in your call center) and speaker styles (like voice pitch, volume and pace). You can even create and train an acoustic model by providing just the audio files, without the need for corresponding transcriptions.

In addition to these customization capabilities, the service offers the following speech-to-text tooling features:

  • Over eight hours of audio can be supported per recognition request
    IBM Watson’s Speech-to-Text service supports eight audio file formats at varying compressions. The maximum audio length that can be supported in a single speech recognition request is approximately 8 hours and 40 minutes.
  • Speaker labels (beta)
    This feature provides a transcription that labels each speaker’s contributions to a multi-participant conversation, available for audio in U.S. English, Spanish and Japanese.
  • Keyword spotting (beta)
    Identify spoken phrases from the audio that match specified keyword strings with a user-defined level of confidence. This feature is especially useful when individual words or topics from the input are more important than the full transcription. For example, it can be used with a customer support system to determine how to route or categorize a customer request.
  • Word alternatives (beta), confidence and timestamps
    Report alternative words that are acoustically similar to the words being transcribed, confidence levels for each of the words, and timestamps for the start and end of each word.
  • Maximum alternatives and interim results
    Return alternative and interim transcription results. The former provide different possible hypotheses; the latter represent interim hypotheses as the transcription progresses. In both cases, the service marks as final the results in which it has the greatest confidence.
  • Profanity filtering
    Censor profanity from U.S. English transcriptions by default. You can use the filtering to sanitize the service’s output.
  • Smart formatting (beta)
    Convert dates, times, numbers, phone numbers and currency values in final transcripts of U.S. English audio into more readable, conventional forms.
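To make the feature list more concrete, here is a minimal sketch of how several of these options might be combined in one recognition request, and how keyword spotting results could drive the customer support routing mentioned above. The parameter names (`keywords`, `keywords_threshold`, `word_confidence`, `timestamps`, `max_alternatives`, `smart_formatting`, `speaker_labels`) reflect our understanding of the service's API, and the shape of the keyword results (a mapping from keyword to a list of matches with a `confidence` field) is an assumption; verify both against the current documentation. The helpers, sample data and route names are hypothetical, and nothing here contacts the service.

```python
from urllib.parse import urlencode


def build_feature_query(keywords, keywords_threshold=0.5):
    """Build a query string enabling several recognition features (assumed names)."""
    params = {
        "keywords": ",".join(keywords),            # phrases to spot in the audio
        "keywords_threshold": keywords_threshold,  # minimum confidence for a match
        "word_confidence": "true",                 # per-word confidence levels
        "timestamps": "true",                      # start and end time of each word
        "max_alternatives": 3,                     # alternative transcription hypotheses
        "smart_formatting": "true",                # readable dates, numbers, currency
        "speaker_labels": "true",                  # label each speaker's contributions
    }
    return urlencode(params)


def route_request(keywords_result, routes, default="general"):
    """Return the route of the spotted keyword with the highest confidence."""
    best_route, best_confidence = default, 0.0
    for keyword, matches in keywords_result.items():
        if keyword not in routes:
            continue
        for match in matches:
            if match["confidence"] > best_confidence:
                best_route, best_confidence = routes[keyword], match["confidence"]
    return best_route


# Hypothetical keyword results in the assumed response shape.
sample_result = {
    "refund": [{"confidence": 0.82, "start_time": 3.1, "end_time": 3.6}],
    "upgrade": [{"confidence": 0.61, "start_time": 9.4, "end_time": 9.9}],
}
routes = {"refund": "billing", "upgrade": "sales"}

print(build_feature_query(["refund", "upgrade"]))
print(route_request(sample_result, routes))
```

Routing on the single most confident keyword keeps the example short; a production system would likely weigh several matches and fall back to the full transcript when nothing is spotted.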

Learn how to customize your language models and get started with Watson Speech-to-Text.
