Cross-lingual text mining

Discovering knowledge from large volumes of multilingual text data just got easier with new text mining technology from IBM Research. Using globally distributed databases, this cross-lingual text mining technology developed by the research team in Tokyo allows users to search through – and find value in – data written in a language they don’t understand.

Knowledge Discovery

For example, manufacturers selling products in the U.S., Europe and Asia could quickly identify defects or complaints from tens of thousands of customer contact reports that call center operators record in local customer languages. The cross-lingual text mining technology extracts context from the portions of text that the user wishes to analyze and translates it into the user’s preferred language. It then analyzes the data and returns results, highlighting irregularities such as defects or complaints that previously went unnoticed because of language barriers.

“Finding accurate translation pairs (to match one language to another) was a challenge in developing the technology. Often, notes taken by call center operators are truncated or not grammatically correct,” said Tetsuya Nasukawa, a senior technical staff member at IBM Research – Tokyo.

“The terms being analyzed may not be defined in general translation dictionaries. So, this text mining compares how each concept is expressed in the textual database of the source’s native language – and in the textual database of the requested foreign language to determine the translation pairs.”
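To make the idea of context-based translation pairs concrete, here is a minimal Python sketch. It is not IBM's implementation: the toy German and English reports, the small `seed_dict` of already-known word pairs, and the function names are all assumptions made for illustration. The sketch counts which words appear alongside an unknown source term, maps those context words into the target language through the seed dictionary, and then picks the target candidate whose own context looks most similar.

```python
from collections import Counter
from math import sqrt


def context_vector(term, documents):
    """Count the words that appear in the same document as `term`."""
    counts = Counter()
    for doc in documents:
        words = doc.lower().split()
        if term in words:
            counts.update(w for w in words if w != term)
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_translation(term, source_docs, target_docs, seed_dict, candidates):
    """Rank target-language candidates by how similar their contexts are."""
    source_context = context_vector(term, source_docs)
    # Project the source context into the target language using known pairs only.
    projected = Counter()
    for word, count in source_context.items():
        if word in seed_dict:
            projected[seed_dict[word]] += count
    scores = {c: cosine(projected, context_vector(c, target_docs)) for c in candidates}
    return max(scores, key=scores.get), scores


# Hypothetical call center notes (toy data, not from the article).
source_docs = [
    "akku wird heiss beim laden",
    "akku laden dauert lange",
    "bildschirm flackert",
]
target_docs = [
    "battery gets hot while charging",
    "battery charging takes long",
    "screen flickers",
]
# Hypothetical seed dictionary of word pairs that are already known.
seed_dict = {"laden": "charging", "heiss": "hot", "flackert": "flickers"}

best, scores = best_translation(
    "akku", source_docs, target_docs, seed_dict, ["battery", "screen"]
)
print(best, scores)
```

In this toy data, "akku" and "battery" both occur next to words about charging and heat, so "battery" scores higher than "screen". A real system would need far larger databases and more careful handling of truncated or ungrammatical notes.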

To go beyond a search tool and deliver a technique that extracts valuable information from any language domain, which users can apply to trend analysis, claim processing and other fields, the team in Tokyo combined two technologies. The first is TAKMI (Text Analysis and Knowledge Mining), which finds noteworthy features, trends and important issues without requiring anyone to read all of the data. The second is an additional technology that extracts translation pairs from any language domain.

Last year, IBM’s text mining research team received the Field Innovation Award from The Japanese Society for Artificial Intelligence in recognition of its pioneering text mining research and development effort.
