The science of tutoring Watson to understand Korean

Born and raised in India, Abe Ittycheriah has had a fairly cosmopolitan upbringing. He has also lived in Mexico City and in St. John’s, Newfoundland; he went to high school in New Mexico and earned his PhD in Electrical Engineering at Rutgers University in New Jersey. While Abe confesses that he isn’t multilingual, he is fluent in computer languages such as C++ and Perl, among others. Today, with almost 20 years at IBM Research working on speech, question answering, and machine translation technologies, he is part of a team using those computer language skills to help teach Watson Korean, having already schooled Watson in Japanese, French, Italian, Spanish, Brazilian Portuguese, and Arabic.

A formula for teaching Watson a new language

Example of breaking down a sentence in Korean into tokens that will allow Watson to process the text

How does Watson expand its dictionary of foreign languages? “Our pattern is simple. There are many people and linguists who know about a language and there’s also a lot of resources on the web. Wikipedia is available in lots of languages so we collect and curate this to use as training data,” said Abe.  “Then we do a sequence of language processing steps to make it easier for the computer to process the text. With many of these languages, including with Korean, words are morphologically rich. Native Korean speakers glue concepts together within a word, but we will separate them out for the computer so the context is correct for those words.  Additionally, pronouns are often not explicitly spelled out in the written language.”

Processing a language is the all-important first step for Watson: its results feed the algorithms that underpin APIs such as Watson Natural Language Classifier. The text of a sentence is dissected and presented to the algorithms as tokens. In the case of Korean, the sentence “엄마사랑해요” (“Mom, I love you”) is broken down into tokens that make it easier for a machine to match it with similar statements. A nuance of Korean is that the pronoun “I” is implied rather than written, so the tokens in this example are “mom”, “love”, “do”, and a marker indicating politeness. In another example, the question “왓슨은 무엇입니까?” (“What is Watson?”) yields the tokens “Watson”, a topic marker, “what”, “be”, and a token indicating that a question is being asked.

Example of how “Mom, I Love You!” in Korean is tokenized for the computer to understand
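
To make the token breakdown concrete, here is a minimal sketch in Python that reproduces the two examples from this post. It is not Watson’s tokenizer: the tiny lexicon, the gloss labels (POLITE, TOPIC, QUESTION), and the greedy longest-match strategy are all assumptions made for illustration, and a real system would learn this segmentation from training data.

```python
import unicodedata

# Hypothetical toy lexicon: surface string -> glosses of the morphemes it contains.
# Entries cover only the two example sentences from the post.
_RAW_LEXICON = {
    "엄마": ["mom"],
    "사랑": ["love"],
    "해요": ["do", "POLITE"],        # verb "do" plus a polite ending
    "왓슨": ["Watson"],
    "은": ["TOPIC"],                 # topic marker
    "무엇": ["what"],
    "입니까": ["be", "QUESTION"],    # copula plus a question ending
}
# Normalize keys so lookups work however the Hangul was encoded.
LEXICON = {unicodedata.normalize("NFC", k): v for k, v in _RAW_LEXICON.items()}
MAX_LEN = max(len(key) for key in LEXICON)


def tokenize(sentence):
    """Greedy longest-match segmentation over the toy lexicon."""
    text = unicodedata.normalize("NFC", sentence)
    for mark in "?!.,":
        text = text.replace(mark, " ")
    tokens = []
    for chunk in text.split():
        i = 0
        while i < len(chunk):
            # Try the longest candidate first, then shrink.
            for size in range(min(MAX_LEN, len(chunk) - i), 0, -1):
                piece = chunk[i:i + size]
                if piece in LEXICON:
                    tokens.extend(LEXICON[piece])
                    i += size
                    break
            else:
                # Nothing matched: emit the character as an unknown token.
                tokens.append("UNK:" + chunk[i])
                i += 1
    return tokens


print(tokenize("엄마사랑해요"))
print(tokenize("왓슨은 무엇입니까?"))
```

Note that the implied pronoun “I” never appears in the output, because it is not present in the written sentence.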

Going from a natural language sentence to a series of tokens the machine understands requires a lot of training data. It also requires a lot of modeling to help Watson match words that have many synonyms or several meanings. For example, a word such as ‘blue’ can be used in many ways, and the models help measure the distance between words and sentences so that Watson can tell from context whether the word refers to the color blue or to a person’s mood.
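
One common way to measure distance between words is to represent each word as a vector and compare vectors with cosine similarity. The sketch below applies that idea to the ‘blue’ example. The three-dimensional vectors, the sense labels, and the `pick_sense` helper are invented for illustration; the post does not say which distance measure or model Watson actually uses.

```python
import math

# Hypothetical vectors, made up for this example.
# Dimensions loosely stand for (color-ness, emotion-ness, object-ness).
VECTORS = {
    "blue/color": (0.9, 0.1, 0.2),
    "blue/mood": (0.1, 0.9, 0.1),
    "sky": (0.8, 0.1, 0.6),
    "paint": (0.9, 0.0, 0.5),
    "feeling": (0.1, 0.9, 0.0),
    "lonely": (0.0, 1.0, 0.1),
}


def cosine(a, b):
    """Cosine similarity: near 1.0 for similar directions, near 0.0 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def average(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return tuple(sum(column) / count for column in zip(*vectors))


def pick_sense(word, context_words):
    """Return the sense of `word` closest to the averaged context, or None."""
    known = [VECTORS[w] for w in context_words if w in VECTORS]
    senses = [key for key in VECTORS if key.startswith(word + "/")]
    if not known or not senses:
        return None  # nothing to compare against
    context = average(known)
    return max(senses, key=lambda sense: cosine(VECTORS[sense], context))


print(pick_sense("blue", ["sky", "paint"]))       # blue/color
print(pick_sense("blue", ["feeling", "lonely"]))  # blue/mood
```

In practice the vectors would be learned from large amounts of text and have hundreds of dimensions, but the comparison step works the same way.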

Over the next several months, Abe and his colleagues will help Watson master Korean, paving the way for innovators in this market to create cognitive services that can interact more naturally with native Korean speakers. In a market that has one of the world’s highest smartphone penetration rates, Watson’s latest conversational capabilities could bring advanced mobile concierge services to users, improve how consumers engage with businesses through their websites and call centers, and even help introduce new form factors such as customer-service-oriented robots.
