Transliteration as an ETL job using InfoSphere DataStage Java stages and ICU4J

Integrating a Java transliteration module with the Java Transformer stage of DataStage

From the developerWorks archives

Bharath Devaraju

Date archived: January 13, 2017 | First published: June 16, 2011

With the ever-growing importance of data quality in growth markets, there is an immediate need to cleanse dirty, unstructured data. One challenge in this exercise is that a single country can have multiple languages, which makes linguistic data hard to handle effectively. For example, in India, the official language differs from state to state, and data is available in both English and local languages, which compounds the problem of data consistency. This article describes how to bring about consistency during the transliteration process and how to use IBM® InfoSphere® Information Server DataStage® to prepare linguistic data as part of an extract, transform, and load (ETL) scenario.

This content is no longer being updated or maintained. The full article is provided "as is" in a PDF file. Given the rapid evolution of technology, some steps and illustrations may have changed.


