Transliteration as an ETL job using InfoSphere DataStage Java stages and ICU4J

Integrating a Java transliteration module with the Java transformer stage of DataStage

As data quality grows in importance in growth markets, there is an immediate need to cleanse dirty, unstructured data. One challenge in this exercise is that a country can have multiple languages, which makes linguistic data hard to handle effectively. For example, in India the official language differs from state to state, and data is available in both English and local languages, which compounds the problem of data consistency. This article describes how to use transliteration to bring about consistency, and how to use IBM® InfoSphere® Information Server DataStage® to prepare linguistic data as part of an extract, transform, and load (ETL) scenario.


Bharath Kumar Devaraju (bhdevara@in.ibm.com), Software Engineer, IBM

Bharath Kumar Devaraju has worked with IBM since 2009 and is currently working on InfoSphere Streams toolkit development. He is a QualityStage and DataStage certified solution developer. He has worked extensively on customer POCs and assisted in pre-sale activities for growth markets.



16 June 2011


Introduction

In a growth market region, the first challenge any cleansing vendor or solution provider encounters is inconsistency in the dialects and linguistics of the available data. The first step is therefore transliteration, which brings consistency to the data before any data cleansing activities begin. The amount of data involved in a cleansing solution is usually large, so this exercise is typically undertaken during data warehouse projects and is best performed as part of an ETL operation. IBM DataStage offers various Java™ stages and the tr4j library to assist in developing Java programs and integrating them with ETL jobs. The tr4j library comes bundled with the DataStage installer.

ICU4J (International Components for Unicode for Java) is an open source Java library that is widely used by software vendors to provide globalization and Unicode support.

This article shows the steps for developing the transliteration Java program using the ICU4J and tr4j libraries, and for integrating it with the Java transformer stage of DataStage.

Prerequisites

In order to follow the instructions in this article, you will need the following software:

  • InfoSphere DataStage and Information Services Director (ISD) 8.5
  • ICU4J (see Resources section for a link)

In addition, you will need to have basic-level skills with designing and running ETL jobs from DataStage designer, as well as intermediate-level skills with Java programming. The input file should be encoded in UTF-8 or UTF-16 format.


Designing an ETL job using Java transformer stage

The first step is to design an ETL job which reads input from a source file, and has a Java transformer stage to perform transformations. The destination can be a file, a database, or any other processing step.

Perform the following steps to design the job:

  1. Create a new parallel job in the DataStage designer.
  2. From the palette, choose the required job stages. This example uses a sequential file stage as both source and destination, with a Java transformer stage between them.
  3. For the input file stage, configure the metadata and location of the source file. Double click the file stage, and in the Stage tab choose NLS map. Here you must also choose the type of encoding for the input file. For this example choose UTF-8 as shown in Figure 1.
    Figure 1. Specifying encoding type of input file
    Screen showing Sequential_File_0 as stage name and UTF-8 as map name
  4. Repeat these steps for the output file stage. In the end your job should look like Figure 2.
    Figure 2. Transliteration ETL job design
    Job shows Sequential_File_0 on left, going through DSLink3 to Java_Transformer_1, then through DSLink4 to Sequential_File_2

Java program to perform transliteration using the ICU4J and tr4j libraries

ICU4J provides a set of classes for regionalization support. The main classes that support transliteration are Transliterator and Normalizer:

  • Transliterator: This class provides a transliterate() function that converts a string of characters from one script to another. The transliterate() function is stateless and thus does not retain information from previous calls. Before using it, you must obtain a Transliterator instance by passing an ID that names the source and the target, separated by a hyphen (-). For example: Transliterator.getInstance("Hindi-Latin");
  • Normalizer: This class provides functions to normalize the output of the transliteration function into a composed or decomposed form. For example, the Latin character A-acute (Á) is a single code point in composed form, and two code points in decomposed form: the base letter A followed by a combining acute accent.
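The difference between the composed and decomposed forms can be seen in a few lines of Java. The following is a minimal sketch that uses the JDK's own java.text.Normalizer rather than the ICU4J Normalizer class, so that it runs without any extra JAR files. The class name NormalizationForms and its method names are made up for this illustration.

```java
import java.text.Normalizer;

final class NormalizationForms {
    /** Returns the composed (NFC) form of the text. */
    static String compose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    /** Returns the decomposed (NFD) form of the text. */
    static String decompose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String aAcute = "\u00C1"; // A-acute as one precomposed code point

        // Composed form: one code point.
        System.out.println(compose(aAcute).length());

        // Decomposed form: base letter A (U+0041) plus combining acute (U+0301).
        String decomposed = decompose(aAcute);
        System.out.println(decomposed.length());
        System.out.println(decomposed.charAt(0));
    }
}
```

Because the decomposed form keeps the base letter as a separate character, it is the convenient starting point for reducing accented Latin output to plain base characters.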

For a Java program to be embedded in the Java stages of DataStage, it should follow the format shown in Listing 1. The process() method should contain all the processing logic to be performed by the Java stage.

Listing 1. Format for Java program to be embedded within Java transformer stage
public class <classname> extends Stage {
    public void initialize() {
        trace("TableSource.initialize");
    }

    public void terminate() {
        trace("TableSource.terminate");
    }

    public int process() {
        .....
    }
}

Listing 2 shows a sample program that performs transliteration using the ICU4J and tr4j libraries. In this sample, input in any script is transliterated to the Latin alphabet and then normalized. Input rows are read in UTF-8 format.

Listing 2. Actual transliteration operation written within process function
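// Imports needed at the top of the class file. Stage and Row ship in tr4j.jar
// (package name as found in DataStage 8.x; verify it against your installation).
import com.ascentialsoftware.jds.Row;
import com.ascentialsoftware.jds.Stage;
import com.ibm.icu.text.Normalizer;
import com.ibm.icu.text.Transliterator;

// Created once and reused for every row, instead of once per call.
private final Transliterator t = Transliterator.getInstance("Any-Latin");

// Reduces accented letters to their base characters.
public String toBaseCharacters(final String sText) {
    if (sText == null || sText.length() == 0)
        return sText;

    // Decompose the whole string so that each accent becomes a separate
    // combining mark that follows its base character.
    final String decomposed = Normalizer.normalize(sText, Normalizer.NFKD);
    final StringBuilder sb = new StringBuilder(decomposed.length());
    for (int i = 0; i < decomposed.length(); ) {
        final int codePoint = decomposed.codePointAt(i);
        i += Character.charCount(codePoint);
        // Drop combining marks; keep every other character unchanged.
        if (Character.getType(codePoint) != Character.NON_SPACING_MARK)
            sb.appendCodePoint(codePoint);
    }
    return sb.toString();
}

// Each call reads and handles one input row.
public int process() {
    try {
        Row inputRow = readRow();
        if (inputRow == null) {
            return OUTPUT_STATUS_END_OF_DATA;
        }

        boolean reject      = false;
        int     columnCount = inputRow.getColumnCount();
        Row     outputRow   = createOutputRow();

        for (int columnNumber = 0; columnNumber < columnCount; columnNumber++) {
            String value = inputRow.getValueAsString(columnNumber, "UTF-8");

            // A missing value or an asterisk marks the whole row for rejection;
            // the offending value is passed through unchanged.
            if ((value == null) || (value.indexOf('*') >= 0)) {
                reject = true;
                outputRow.setValueAsString(columnNumber, value);
            } else {
                outputRow.setValueAsString(
                    columnNumber, toBaseCharacters(t.transliterate(value)));
            }
        }

        if (reject) {
            rejectRow(outputRow);
        } else {
            writeRow(outputRow);
        }
        ......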
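
The row handling in the process() function of Listing 2 can be tried outside DataStage. The following is a minimal sketch in plain Java that does not use tr4j: rows are modelled as String arrays, the write and reject links are modelled as lists, and the transliteration step is replaced by any function passed to the constructor. The class RowRouter and all of its members are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

final class RowRouter {
    // Stand-ins for the output link and the reject link of the stage.
    final List<String[]> written = new ArrayList<>();
    final List<String[]> rejected = new ArrayList<>();

    // Stand-in for the transliterate-and-normalize step.
    private final UnaryOperator<String> transform;

    RowRouter(UnaryOperator<String> transform) {
        this.transform = transform;
    }

    /** Handles one row; returns true if it was written, false if rejected. */
    boolean process(String[] inputRow) {
        String[] outputRow = new String[inputRow.length];
        boolean reject = false;
        for (int i = 0; i < inputRow.length; i++) {
            String value = inputRow[i];
            if (value == null || value.indexOf('*') >= 0) {
                // Same rule as Listing 2: null or '*' rejects the whole row,
                // and the value is copied through unchanged.
                reject = true;
                outputRow[i] = value;
            } else {
                outputRow[i] = transform.apply(value);
            }
        }
        (reject ? rejected : written).add(outputRow);
        return !reject;
    }

    public static void main(String[] args) {
        // Upper-casing stands in for transliteration in this demo.
        RowRouter router = new RowRouter(String::toUpperCase);
        router.process(new String[] {"delhi", "mumbai"});
        router.process(new String[] {"pune", "n*a"});
        System.out.println(router.written.size() + " written, "
            + router.rejected.size() + " rejected");
    }
}
```

Note that a rejected row still has its remaining valid columns transformed, because the loop does not stop at the first bad value. This mirrors the behavior of the listing.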

Integrating Java program with Java transformer stage of the ETL Job

In the properties page of Java transformer stage, provide the details shown in Table 1.

Table 1. Various properties of Java transformer stage to be set
Property name             Value to be inserted
Classpath                 The path of the icu4j JAR file, plus the path of the folder that contains the package of the deployed Java program (/opt in this example)
Transformer class name    The class name of the transliteration Java program

Figure 3 shows the screen where these properties are set.

Figure 3. Java Transformer Stage properties page
Screen shows input fields for stage name, transformer class name, and user's classpath

After this step you are all set to launch your transliteration ETL job. Compile and run the job. You can also use the output from the job as input to cleansing operations.


Exceptional conditions

Be aware of the following exceptional conditions which may occur:

  • If the job completes but the output file does not contain the transliterated output, check whether the input file was saved in an encoding other than UTF-8 or UTF-16, for example UCS. If so, use an appropriate editor to change the encoding of the input file.
  • To help with debugging, the DataStage tr4j library provides functions, such as error() and info(), that log messages to the DataStage Director.
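
The encoding check in the first item can also be done programmatically before the job runs. The following is a minimal sketch, using only the JDK, that reports whether a file's bytes decode as strict UTF-8. The class name Utf8Check and the method isValidUtf8 are made up for this illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

final class Utf8Check {
    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] data) {
        try {
            // REPORT makes the decoder throw instead of silently
            // substituting a replacement character for bad input.
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java Utf8Check <input-file>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(isValidUtf8(data) ? "valid UTF-8" : "not valid UTF-8");
    }
}
```

A file that fails this check is certainly not UTF-8. A file that passes is only probably UTF-8, because some byte sequences in other encodings also happen to be well-formed UTF-8.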

Conclusion

This article showed how to perform transliteration using the ICU4J library within an InfoSphere Information Server DataStage job, along with the configuration settings that the job requires. Transliteration is a key component in solving multilingual data challenges, and it provides a common ground on which you can build a standardization rule set with predictable results.

Resources

Learn

Get products and technologies

Discuss
