Transliteration as an ETL job using InfoSphere DataStage Java stages and ICU4J

Integrating Java transliteration module with the Java transformer stage of DataStage

Bharath Kumar Devaraju has worked with IBM since 2009, and is currently working on data as a service as part of the IBM cloud computing initiative. He is also a QualityStage certified solution developer. He has worked extensively on customer POCs and assists in pre-sales activities for growth markets.

Summary:  As data quality grows in importance in growth markets, there is an immediate need to cleanse dirty, unstructured data. One challenge is that a single country can use several languages, which makes linguistic data hard to handle consistently. In India, for example, each state has its own official language and data is available in both English and local languages, which compounds the problem of data consistency. This article describes how to bring about consistency through transliteration, and how to use IBM® InfoSphere® Information Server DataStage® to prepare linguistic data as part of an extract, transform, and load (ETL) scenario.

Date:  16 Jun 2011
Level:  Intermediate

Introduction

In a growth market region, the first challenge any cleansing vendor or solution provider encounters is inconsistency in the languages and dialects of the available data. The first step is therefore transliteration, which brings consistency to the data before cleansing begins. A cleansing solution usually involves a large amount of data, so this exercise is typically undertaken during data warehouse projects and fits best in an ETL operation. IBM DataStage offers several Java™ stages and the tr4j library to help you develop Java programs and integrate them with ETL jobs. The tr4j library comes bundled with the DataStage installer.

ICU4J (International Components for Unicode for Java) is an open source Java library, widely used by software vendors to provide globalization and Unicode support.

This article shows the steps for developing the transliteration Java program using icu4j and tr4j libraries, and integrating it with the Java transformer stage of DataStage.

Prerequisites

In order to follow the instructions in this article, you will need the following software:

  • InfoSphere DataStage and Information Services Director (ISD) 8.5
  • ICU4J (see Resources section for a link)

In addition, you will need to have basic-level skills with designing and running ETL jobs from DataStage designer, as well as intermediate-level skills with Java programming. The input file should be encoded in UTF-8 or UTF-16 format.


Designing an ETL job using Java transformer stage

The first step is to design an ETL job which reads input from a source file, and has a Java transformer stage to perform transformations. The destination can be a file, a database, or any other processing step.

Perform the following steps to design the job:

  1. Create a new parallel job in the DataStage designer.
  2. From the palette, choose the required job stages. This example uses a sequential file stage for both the source and the destination, with a Java transformer stage between them.
  3. For the input file stage, configure the metadata and location of the source file. Double-click the file stage, and in the Stage tab choose NLS map. Here you must also choose the encoding of the input file. For this example, choose UTF-8 as shown in Figure 1.

    Figure 1. Specifying encoding type of input file
    Screen showing Sequential_File_0 as stage name and UTF-8 as map name

  4. Repeat these steps for the output file stage. In the end your job should look like Figure 2.

    Figure 2. Transliteration ETL job design
    Job shows Sequential_File_0 on left, going through DSLink3 to Java_Transformer_1, then through DSLink4 to Sequential_File_2


Java program to perform transliteration using the ICU4J and tr4j libraries

ICU4J provides a set of classes for globalization support. The main classes that support transliteration are Transliterator and Normalizer:

  • Transliterator: This class provides a transliterate() function that converts a string of characters from one script to another. The function is stateless, so it does not retain information from previous calls. Before calling transliterate(), obtain a Transliterator instance by passing an identifier made of the source and target names separated by a dash (-). ICU identifiers are normally script names, for example: Transliterator.getInstance("Devanagari-Latin");
  • Normalizer: This class provides functions to normalize the output of the transliterate() function into a composed or decomposed form. For example, the Latin character A-acute (Á) is a single code point (U+00C1) in composed form, and two code points in decomposed form: the base letter A (U+0041) followed by a combining acute accent (U+0301).
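
The two normalization forms are easy to see in a few lines of code. The following is a minimal sketch, not part of the original sample. It assumes the JDK's java.text.Normalizer as a stand-in for the ICU4J Normalizer so that it runs without extra libraries, and the class name NormalizationForms is made up for illustration.

```java
import java.text.Normalizer;

final class NormalizationForms {

    /** Returns the composed (NFC) form: base letter and accent merged where possible. */
    static String compose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    /** Returns the decomposed (NFD) form: base letter followed by its combining marks. */
    static String decompose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String aAcute = "\u00C1"; // LATIN CAPITAL LETTER A WITH ACUTE

        // Composed form: a single code point, U+00C1.
        System.out.println(compose(aAcute).length());   // prints 1

        // Decomposed form: U+0041 (A) followed by U+0301 (combining acute accent).
        System.out.println(decompose(aAcute).length()); // prints 2
    }
}
```

Both strings look the same on screen, which is why a cleansing job should settle on one form before it compares or matches values.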

For a Java program to be embedded in the Java stages of DataStage, it should be in the format specified in Listing 1. The process function should contain all the processing logic to be performed by the Java stage.


Listing 1. Format for Java program to be embedded within Java transformer stage

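// Package name as found in tr4j.jar; verify it against your installation.
import com.ascentialsoftware.jds.Stage;

public class <classname> extends Stage {

    // Called once before any rows are processed.
    public void initialize() {
        trace("TableSource.initialize");
    }

    // Called once after processing ends.
    public void terminate() {
        trace("TableSource.terminate");
    }

    // Contains the processing logic of the stage.
    public int process() {
        .....
    }
}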

Listing 2 shows the sample program that performs transliteration using the icu4j and tr4j libraries. In this sample, input in any language is transliterated to the Latin alphabet and then normalized. Input rows are read in UTF-8 format.


Listing 2. Actual transliteration operation written within process function

import com.ibm.icu.text.Normalizer;
import com.ibm.icu.text.Transliterator;
// Package name as found in tr4j.jar; verify it against your installation.
import com.ascentialsoftware.jds.Row;

// Created once and reused for every row; "Any-Latin" accepts input in any script.
private final Transliterator toLatin = Transliterator.getInstance("Any-Latin");

// Reduces accented Latin letters to their base letters.
public String toBaseCharacters(final String sText) {
    if (sText == null || sText.length() == 0)
        return sText;

    // Decompose first, so that each accented letter becomes a base letter
    // followed by its combining marks.
    final String sDecomposed = Normalizer.normalize(sText, Normalizer.NFKD);
    final StringBuilder sb = new StringBuilder(sDecomposed.length());
    for (int i = 0; i < sDecomposed.length(); i++) {
        final char c = sDecomposed.charAt(i);
        // Keep every character except the combining marks.
        if (Character.getType(c) != Character.NON_SPACING_MARK) {
            sb.append(c);
        }
    }
    return sb.toString();
}

public int process() {
    try {
        Row inputRow = readRow();

        if (inputRow == null) {
            return OUTPUT_STATUS_END_OF_DATA;
        }

        boolean reject      = false;
        int     columnCount = inputRow.getColumnCount();
        Row     outputRow   = createOutputRow();

        for (int columnNumber = 0; columnNumber < columnCount; columnNumber++) {
            String value = inputRow.getValueAsString(columnNumber, "UTF-8");

            // Reject rows that have a missing value or contain an asterisk.
            if ((value == null) || (value.indexOf('*') >= 0)) {
                reject = true;
                outputRow.setValueAsString(columnNumber, value);
            } else {
                // Transliterate to Latin script, then strip the accents.
                outputRow.setValueAsString(
                        columnNumber,
                        toBaseCharacters(toLatin.transliterate(value)));
            }
        }

        if (reject) {
            rejectRow(outputRow);
        } else {
            writeRow(outputRow);
        }
        ......
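
The accent-stripping step that Listing 2 applies after transliteration can be tried outside DataStage before the class is deployed. The following is a minimal sketch under stated assumptions: it uses the JDK's java.text.Normalizer in place of the ICU4J Normalizer so that it runs without extra libraries, it does not call the Transliterator at all, and the class name BaseCharacters is made up for illustration.

```java
import java.text.Normalizer;

final class BaseCharacters {

    /**
     * Decomposes the text (NFKD) and drops the combining marks, so that an
     * accented Latin letter is reduced to its base letter. Characters that
     * have no decomposition pass through unchanged.
     */
    static String toBaseCharacters(String text) {
        if (text == null || text.isEmpty()) {
            return text;
        }
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFKD);
        StringBuilder sb = new StringBuilder(decomposed.length());
        decomposed.codePoints()
                .filter(cp -> Character.getType(cp) != Character.NON_SPACING_MARK)
                .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Romanized text with macrons, as a Latin transliteration might produce it.
        System.out.println(toBaseCharacters("namast\u0113")); // prints namaste
        System.out.println(toBaseCharacters("dill\u012B"));   // prints dilli
    }
}
```

Normalizing the whole string and filtering by character type keeps characters that have no decomposition intact, which matters when the input mixes accented letters with punctuation or digits.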



Integrating Java program with Java transformer stage of the ETL Job

In the properties page of Java transformer stage, provide the details shown in Table 1.


Table 1. Various properties of Java transformer stage to be set

Property name            Values to be inserted
Classpath                The path of the ICU4J JAR file, and the path of the folder that contains the package of the compiled Java program (in this example, /opt)
Transformer class name   The class name of the transliteration Java program

Figure 3 shows the screen where these properties are set.


Figure 3. Java Transformer Stage properties page
Screen shows input fields for stage name, transformer class name, and user's classpath

After this step you are all set to launch your transliteration ETL job. Compile and run the job. You can also use the output from the job as input to cleansing operations.


Exceptional conditions

Be aware of the following exceptional conditions which may occur:

  • If the job completes but the output file does not contain the transliterated output, check whether the input file was saved in an encoding other than UTF-8 or UTF-16, for example UCS. If so, use an appropriate editor to change the encoding of the input file.
  • The DataStage tr4j library provides functions, such as error() and info(), that facilitate debugging by logging messages to the DataStage Director.
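
The encoding check in the first item can also be done programmatically before the job runs. The following is a minimal sketch that uses only the JDK. The class name EncodingCheck and its method names are made up for illustration, and the check is a heuristic: a file that decodes cleanly as UTF-8 is very likely UTF-8, but the sketch does not identify other encodings beyond looking for a UTF-16 byte order mark.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

final class EncodingCheck {

    /** Returns true if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Returns true if the data starts with a UTF-16 byte order mark (FE FF or FF FE). */
    static boolean hasUtf16Bom(byte[] data) {
        return data.length >= 2
                && ((data[0] == (byte) 0xFE && data[1] == (byte) 0xFF)
                 || (data[0] == (byte) 0xFF && data[1] == (byte) 0xFE));
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java EncodingCheck <input-file>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("Valid UTF-8: " + isValidUtf8(data));
        System.out.println("UTF-16 byte order mark: " + hasUtf16Bom(data));
    }
}
```

If the first check fails and the second succeeds, the file is probably UTF-16, and the NLS map of the input stage should be set to match it.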

Conclusion

This article has shown how to perform transliteration with the ICU4J library in InfoSphere Information Server DataStage, along with the configuration settings that the job requires. Transliteration is a key component in solving multilingual data challenges, and it provides a common ground on which you can build a predictable standardization rule set.


Resources

Learn

Get products and technologies

Discuss

