In growth markets, the first challenge that any data cleansing vendor or
solution provider encounters is that the available data is written in
inconsistent languages, dialects, and scripts. Before you begin data
cleansing, use transliteration to bring the data into a consistent form.
Cleansing solutions usually involve large volumes of data, so
transliteration is typically undertaken as part of a data warehouse
project and is best performed during an ETL operation. IBM DataStage
offers Java™ stages and the tr4j library to help you develop Java
programs and integrate them with ETL jobs. The tr4j library comes
bundled with the DataStage installer.
ICU4J (International Components for Unicode for Java) is an open source
library that is widely used by software vendors to provide globalization
and Unicode support.
This article shows how to develop a transliteration Java program using
the icu4j and tr4j libraries, and how to integrate it with the Java
transformer stage of DataStage.
To follow the instructions in this article, you need the following software:
- InfoSphere DataStage and Information Services Director (ISD) 8.5
- ICU4J (see the Resources section for a link)
In addition, you will need to have basic-level skills with designing and running ETL jobs from DataStage designer, as well as intermediate-level skills with Java programming. The input file should be encoded in UTF-8 or UTF-16 format.
Designing an ETL job using Java transformer stage
The first step is to design an ETL job which reads input from a source file, and has a Java transformer stage to perform transformations. The destination can be a file, a database, or any other processing step.
Perform the following steps to design the job:
- Create a new parallel job in the DataStage designer.
- From the palette, choose the required job stages. For example, use a sequential file stage as both the source and the destination, with a Java transformer stage between them.
- For the input file stage, configure the metadata and location of the
source file. Double-click the file stage and, in the
Stage tab, choose NLS map. Then
choose the encoding of the input file. For this
example, choose UTF-8, as shown in Figure 1.
Figure 1. Specifying encoding type of input file
- Repeat these steps for the output file stage. In the end your job
should look like Figure 2.
Figure 2. Transliteration ETL job design
Java program to perform transliteration using the ICU4J and tr4j libraries
ICU4J provides a set of classes for globalization support. The main
classes that support transliteration are Transliterator and Normalizer,
which are described below.
- Transliterator: This class provides a transliterate() function that
converts a string of characters from one language script to
another. The transliterate() function is stateless, so it does not
retain information from previous calls. Before using
transliterate(), you must obtain a Transliterator
instance by providing the source language and destination language,
separated by a hyphen (-). For example:
Transliterator.getInstance("Hindi-Latin");
- Normalizer: This class provides functions that normalize the output of the transliteration function into a composed or decomposed form. For example, the Latin character A-acute (Á) is a single code point in composed form, or two code points (A followed by a combining acute accent) in decomposed form.
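The difference between the composed and decomposed forms can be checked with a few lines of Java. The following is a minimal sketch, not the article's code: it uses the JDK's java.text.Normalizer in place of the ICU4J Normalizer class so that it runs without additional JAR files, and the class name NormalizationForms and its two helper methods are hypothetical.

```java
import java.text.Normalizer;

public class NormalizationForms {
    // Returns the number of Unicode code points in the composed (NFC) form.
    public static int composedLength(String s) {
        String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
        return nfc.codePointCount(0, nfc.length());
    }

    // Returns the number of Unicode code points in the decomposed (NFD) form.
    public static int decomposedLength(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        return nfd.codePointCount(0, nfd.length());
    }

    public static void main(String[] args) {
        String aAcute = "\u00C1"; // A-acute stored as a single code point
        System.out.println(composedLength(aAcute));   // 1
        System.out.println(decomposedLength(aAcute)); // 2: A + U+0301
    }
}
```

The decomposed form is the useful one when the goal is to separate a base letter from its diacritical marks.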
For a Java program to be embedded in the Java stages of DataStage, it should be in the format specified in Listing 1. The process function should contain all the processing logic to be performed by the Java stage.
Listing 1. Format for Java program to be embedded within Java transformer stage
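// Stage is the tr4j base class. The package name below is an assumption
// based on the Java Pack samples; confirm it against your tr4j JAR file.
import com.ascentialsoftware.jds.Stage;

public class <classname> extends Stage {
    // Set up resources for the stage.
    public void initialize() {
        trace("TableSource.initialize");
    }
    // Release resources held by the stage.
    public void terminate() {
        trace("TableSource.terminate");
    }
    // Contains all the processing logic performed by the Java stage.
    public int process() {
        .....
    }
}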
Listing 2 shows a sample program that performs transliteration using the
icu4j and tr4j
libraries. In this sample, input in any language is
transliterated to the Latin alphabet and then normalized. Input rows are read in
UTF-8 format.
Listing 2. Actual transliteration operation written within process function
import com.ibm.icu.text.Normalizer;
import com.ibm.icu.text.Transliterator;

// Reduces each character to its base character by decomposing it (NFKD)
// and keeping only the first character of the decomposition, which drops
// any diacritical marks that follow the base letter.
public String toBaseCharacters(final String sText) {
    if (sText == null || sText.length() == 0)
        return sText;
    final int iSize = sText.length();
    final StringBuilder sb = new StringBuilder(iSize);
    for (int i = 0; i < iSize; i++) {
        String sLetter = String.valueOf(sText.charAt(i));
        sLetter = Normalizer.normalize(sLetter, Normalizer.NFKD);
        // Append the base character itself rather than its first UTF-8
        // byte, so characters outside ASCII are not corrupted.
        sb.append(sLetter.charAt(0));
    }
    return sb.toString();
}

public int process() {
    try {
        // do {
        // "Any-Latin" transliterates text from any supported script to Latin.
        Transliterator t = Transliterator.getInstance("Any-Latin");
        Row inputRow = readRow();
        if (inputRow == null) {
            // No more input rows.
            return OUTPUT_STATUS_END_OF_DATA;
        }
        boolean reject = false;
        int columnCount = inputRow.getColumnCount();
        Row outputRow = createOutputRow();
        for (int columnNumber = 0; columnNumber < columnCount; columnNumber++) {
            String value = inputRow.getValueAsString(columnNumber, "UTF-8");
            // Reject the row if any column is null or contains an asterisk.
            if ((value == null) || (value.indexOf('*') >= 0)) {
                reject = true;
                outputRow.setValueAsString(columnNumber, value);
            } else {
                // Transliterate to Latin, then reduce to base characters.
                outputRow.setValueAsString(
                    columnNumber, toBaseCharacters(t.transliterate(value)));
            }
        }
        if (reject) {
            rejectRow(outputRow);
        } else {
            writeRow(outputRow);
        }
        ......
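The reduction to base characters can also be tried outside DataStage. The following is a minimal sketch under stated assumptions: it uses the JDK's java.text.Normalizer instead of ICU4J so that it runs standalone, it normalizes the whole string and removes combining marks with a regular expression rather than working character by character as Listing 2 does, and the names BaseCharacters and stripMarks are hypothetical.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

public class BaseCharacters {
    // Matches Unicode combining marks (general category M),
    // such as U+0301 COMBINING ACUTE ACCENT.
    private static final Pattern MARKS = Pattern.compile("\\p{M}+");

    // Decomposes the text (NFKD) and removes the combining marks,
    // leaving only the base characters.
    public static String stripMarks(String text) {
        if (text == null || text.isEmpty()) {
            return text;
        }
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFKD);
        return MARKS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        // a-macron, s-acute, a-macron
        System.out.println(stripMarks("\u0101\u015B\u0101")); // asa
        // o-with-stroke has no decomposition, so it is kept unchanged
        System.out.println(stripMarks("\u00F8"));
    }
}
```

Romanized output often carries macrons, dots, and accents on Latin letters, which is why a step like this typically follows transliteration when plain base letters are wanted for matching.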
Integrating Java program with Java transformer stage of the ETL Job
In the properties page of Java transformer stage, provide the details shown in Table 1.
Table 1. Various properties of Java transformer stage to be set
| Property name | Value to enter |
|---|---|
| Classpath | The path to the icu4j JAR file, and the path of the folder that contains the deployed Java program package (in this example, /opt) |
| Transformer class name | The class name of the transliteration Java program |
Figure 3 shows the screen where these properties are set.
Figure 3. Java Transformer Stage properties page
After this step you are all set to launch your transliteration ETL job. Compile and run the job. You can also use the output from the job as input to cleansing operations.
Be aware of the following exceptional conditions which may occur:
- If the job completes but the output file does not contain the transliterated output, check whether the input file was saved in an encoding other than UTF-8 or UTF-16, for example UCS. If so, use an appropriate editor to change the encoding of the input file.
- The DataStage tr4j library provides functions that facilitate debugging by logging messages to the DataStage Director, for example error() and info().
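The encoding of an input file can be checked with a short program before the job is run. The following is a minimal sketch that is not part of the original job: it reports whether a file decodes as strict UTF-8, using only JDK classes. The class name Utf8Check and the method isValidUtf8 are hypothetical, and the file path is taken from the command line.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf8Check {
    // Returns true if the bytes form a well-formed UTF-8 sequence.
    public static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently
        // replacing malformed input.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java Utf8Check <input-file>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(isValidUtf8(bytes) ? "valid UTF-8" : "not valid UTF-8");
    }
}
```

A passing result does not prove that the file is UTF-8, because some files in other encodings also happen to be well-formed UTF-8 byte sequences. A failing result, however, is a clear sign that the encoding selected in the NLS map does not match the file.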
This article showed how to perform transliteration with the ICU4J
library in InfoSphere Information Server DataStage, along with the
configuration settings that are required.
Transliteration is a key component in solving multilingual data
challenges, and it provides common ground on which to build a
predictable standardization rule set.
Resources
Learn
- Learn more about the ICU library and ICU4J at the ICU home page.
- Get more information about DataStage in the IBM Redbooks® publication IBM InfoSphere DataStage Data Flow and Job Design.
- Get the resources you need in the Information Management area on developerWorks to advance your skills on a wide variety of IBM Information Management products.
- Visit the InfoSphere area on developerWorks to read articles and tutorials, access forums and documentation, and connect to other resources to expand your InfoSphere skills.
- Learn more about Information Management at the developerWorks Information Management zone. Find technical documentation, how-to articles, education, downloads, product information, and more.
- Follow developerWorks on Twitter.
- Watch developerWorks on-demand demos ranging from product installation and setup demos for beginners, to advanced functionality for experienced developers.
Get products and technologies
- Find new releases ready for download in the downloads section of the ICU site.
- Learn how you can use InfoSphere DataStage and InfoSphere QualityStage on the Amazon EC2 Web Service.
- Build your next development project with IBM trial software, available for download directly from developerWorks, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.
Discuss
- Participate in the discussion forum.
- Check out the developerWorks blogs and get involved in the developerWorks community.

Bharath Kumar Devaraju has worked with IBM since 2009, and is currently working on data as a service as part of the IBM cloud computing initiative. He is also a QualityStage certified solution developer. He has worked extensively on customer POCs and assists in pre-sales activities for growth markets.




