The science of tutoring Watson to understand Korean

Born and raised in India, Abe Ittycheriah has had a fairly cosmopolitan upbringing. He has also lived in Mexico City and in St. John’s, Newfoundland; he went to high school in New Mexico and earned his PhD in electrical engineering at Rutgers University in New Jersey. While Abe confesses that he isn’t multilingual, he is fluent in computer languages such as C++ and Perl. Today, with almost 20 years at IBM Research working on speech, question answering, and machine translation technologies, he is part of a team using those computer language skills to help teach Watson Korean, having already schooled Watson in Japanese, French, Italian, Spanish, Brazilian Portuguese, and Arabic.

A formula for teaching Watson a new language

Example of breaking down a sentence in Korean into tokens that will allow Watson to process the text

How does Watson expand its dictionary of foreign languages? “Our pattern is simple. There are many people and linguists who know about a language and there’s also a lot of resources on the web. Wikipedia is available in lots of languages so we collect and curate this to use as training data,” said Abe.  “Then we do a sequence of language processing steps to make it easier for the computer to process the text. With many of these languages, including with Korean, words are morphologically rich. Native Korean speakers glue concepts together within a word, but we will separate them out for the computer so the context is correct for those words.  Additionally, pronouns are often not explicitly spelled out in the written language.”

Processing a language is the all-important first step for Watson: its results feed the algorithms that underpin APIs such as Watson Natural Language Classifier. The text of a sentence is dissected and presented to the algorithms as tokens. In the case of Korean, the sentence “엄마사랑해요” (“Mom, I love you”) is broken down into tokens that make it easier for a machine to match with similar statements. A nuance of Korean is that the pronoun “I” is implied, so the tokens in this example are “mom”, “love”, “do”, and a marker of politeness. In another example, the question “왓슨은 무엇입니까?” (“What is Watson?”) yields the tokens “Watson”, a topic marker, “what”, “be”, and a token indicating that a question is being asked.

Example of how “Mom, I Love You!” in Korean is tokenized for the computer to understand
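
The token breakdown described above can be sketched in a few lines of Python. This is only a toy illustration, not Watson's actual pipeline: the `LEXICON` table, the gloss labels (`POLITE`, `TOPIC`, `QUESTION`), and the greedy longest-match strategy are all assumptions made for the example, and the lexicon is hand-written to cover just the two sentences from the article.

```python
# Toy lexicon (hypothetical): surface form -> list of glosses.
# Hand-written to cover only the two example sentences; a real system
# would learn this kind of analysis from large amounts of training data.
LEXICON = {
    "엄마": ["mom"],
    "사랑": ["love"],
    "해": ["do"],
    "요": ["POLITE"],
    "왓슨": ["Watson"],
    "은": ["TOPIC"],
    "무엇": ["what"],
    "입니까": ["be", "QUESTION"],  # "be" fused with a question ending
}

MAX_LEN = max(len(form) for form in LEXICON)
PUNCTUATION = "?!.,"


def tokenize(sentence):
    """Greedy longest-match segmentation of a sentence against LEXICON."""
    tokens = []
    i = 0
    while i < len(sentence):
        ch = sentence[i]
        if ch.isspace() or ch in PUNCTUATION:
            i += 1
            continue
        # Try the longest candidate first, then shrink.
        for length in range(min(MAX_LEN, len(sentence) - i), 0, -1):
            piece = sentence[i:i + length]
            if piece in LEXICON:
                tokens.extend(LEXICON[piece])
                i += length
                break
        else:
            tokens.append("UNK")  # character not covered by the toy lexicon
            i += 1
    return tokens


print(tokenize("엄마사랑해요"))
print(tokenize("왓슨은 무엇입니까?"))
```

Note that no token is produced for “I” in the first sentence: the pronoun is simply absent from the Korean text, which is the nuance the article points out.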

Going from a natural language sentence to a series of tokens the machine understands requires a lot of training data. It also requires a lot of modeling to help Watson match words that have many synonyms. For example, there are many ways to use an adjective such as ‘blue’, and the models measure the distance between words and sentences so that Watson can tell from context whether the word refers to the color blue or to a person’s mood.
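
As a rough illustration of measuring distance between a word's context and its possible meanings, the sketch below scores the words around ‘blue’ against two small word lists using cosine similarity. Everything here is an assumption for the example: the `SENSE_PROFILES` lists are invented, and a production system would use models learned from training data rather than hand-picked words.

```python
import math
from collections import Counter

# Hypothetical, hand-picked context words for two senses of "blue".
SENSE_PROFILES = {
    "color": Counter(["sky", "paint", "color", "red", "green", "wall"]),
    "mood": Counter(["feel", "feeling", "sad", "mood", "lonely", "cheer"]),
}


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def guess_sense(sentence, target="blue"):
    """Return (best sense, all scores) for `target` given its sentence."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    context = Counter(w for w in words if w and w != target)
    scores = {
        sense: cosine(context, profile)
        for sense, profile in SENSE_PROFILES.items()
    }
    return max(scores, key=scores.get), scores


print(guess_sense("She painted the wall blue and green")[0])
print(guess_sense("I feel blue and lonely today")[0])
```

When no context word matches either list, every score is zero and the function falls back to the first sense in the table, which is one reason real systems need far richer models than this.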

Over the next several months, Abe and his colleagues will help Watson master Korean, paving the way for innovators in this market to create cognitive services that can interact more naturally with native Korean speakers. In a market that has one of the world’s highest smartphone penetration rates, Watson’s latest conversational capabilities could bring advanced mobile concierge services to users, improve how consumers engage with businesses through their websites and call centers, and even help introduce new form factors such as customer-service robots.
