Automatic Speech Recognition – Are All Tests Comparable?
– Access to appropriate domain data is the dominant factor in determining speech recognition performance. For this reason, Watson offers a cloud-based API with a general model and the option to customize it.
– One system’s general speech recognition model shouldn’t be compared to other speech models that have been pre-customized with data that is not universally accessible.
– An important point of distinction is Watson’s ability to allow businesses to opt to build their own customized model versus contributing their data to a central database. This allows the client to maintain control of their critical private and proprietary information.
Automatic speech recognition, the ability to identify words and phrases in spoken language and convert them to text in real time, provides nearly endless opportunities for the people who use these AI systems, from improving customer satisfaction and enabling remote communication between doctors and patients to improving accessibility for people who are deaf or blind. Platforms and applications built on automatic speech recognition are only as good as the system’s understanding of language, and the way this understanding is measured. Achieving human parity, meaning an error rate on par with that of a human listening to two people in conversation, has long remained a significant industry challenge – as has measuring it consistently.
For accurate comparison of systems, training and testing must be consistent
First, training and testing must be consistent, especially on highly customized data sets. Consider this: if you suffered from migraines, would you seek medical advice from a neurologist or a podiatrist? While both are skilled doctors, they are experts in differing domains, with a neurologist trained specifically on the language of the brain. Would it be accurate to judge the podiatrist against the neurologist in this instance? Similarly, for speech recognition systems and the models they apply to be truly comparable, both systems must be trained and tested on the same data. Using proprietary data on which one system is already highly trained while the other is not can greatly skew outcomes.
For this reason, benchmark data sets for speech recognition, including the SWITCHBOARD corpus that IBM regularly reports on, have been considered the standard controlled test sets for automatic speech recognition for twenty years and counting. Another industry corpus, known as “CallHome,” is also available for benchmark testing and is widely used for determining a system’s word error rate (WER). By using SWITCHBOARD or CallHome, systems are tested against the exact same data set, eliminating additional variables that may skew findings.
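The word error rate figures reported on these corpora come from a simple, standardized calculation: count the minimum number of word substitutions, insertions, and deletions needed to turn a system’s transcript into the reference transcript, then divide by the number of words in the reference. A minimal sketch (the function name and example sentences are illustrative, not from any benchmark):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein edit distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("weather" -> "whether") in a five-word reference:
print(wer("what is the weather today", "what is the whether today"))  # 0.2
```

Because the metric depends entirely on the reference transcripts, two WER figures are only comparable when they are computed against the same test set – which is exactly why fixed corpora like SWITCHBOARD and CallHome matter.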
Access to domain data is a critical factor in measuring performance
Second, it is inaccurate and misleading to compare one system’s general model against other speech models that have been pre-customized with data that are not universally accessible. Access to appropriate domain data is the dominant factor in determining speech recognition performance. For this reason, Watson offers a cloud-based API with a general model and the option to customize it. An important point of distinction is the Watson system’s ability to allow businesses to opt to build their own customized model versus contributing their data to a central database. This allows the client to maintain control of their critical private and proprietary information.
Rather than providing a highly specialized data set upon which businesses build their applications, Watson’s differentiation is that it provides the tools and services customers can use to build recognition capabilities in their respective domains against their own data sets. For example, Invoca uses the Watson Speech Recognition API to drive intelligent marketing insights from phone call data. Another example is Mizuho Bank in Japan, which uses the Watson Speech Recognition API to provide call center agents with relevant information in real time, so they are better prepared to respond to customers.
As AI technology becomes more mainstream in its applications across industries and disciplines, the capability for systems to understand natural language and interpret it accurately will be monumentally important. It will be the responsibility of all organizations involved in the advancement of AI to appropriately train and test systems against recognized, standardized methodologies in an ethical, accountable and impactful way.
Learn more about Watson Speech to Text
Watson Speech to Text converts audio voice into written text. Use the Speech to Text API to transcribe calls in a contact center, identify what is being discussed, determine when to escalate calls, and understand content from multiple speakers. You can also use Speech to Text to create voice-controlled applications – and even customize the model to improve accuracy for the language and content you care about most, such as product names, sensitive subjects, or names of individuals.
Learn more about Watson’s Speech-to-Text API and sign up for a 30-day free trial.
(This post was co-authored by Michael Picheny, George Saon and Bhavik Shah)