Real-time transliteration using InfoSphere Streams custom Java operator and ICU4J

Integrating a Java transliteration module with a custom Java operator of InfoSphere Streams

With the growing importance of Internet monitoring and sentiment analysis, there is an immediate need to identify patterns (that is, to perform text analytics) in big data. One challenge is that a single country can have several languages, and text analytics rules are not available for all of them. In India, for example, each state has its own official language, and data is available in both English and local languages. This article describes how to use transliteration to bring consistency to such data, and how to use IBM® InfoSphere® Streams® to prepare linguistic data and apply text analytics or pattern recognition logic.


Bharath Kumar Devaraju (bhdevara@in.ibm.com), Software Engineer, IBM China

Bharath Kumar Devaraju has worked with IBM since 2009 and is currently working on InfoSphere Streams toolkit development. He is a QualityStage and DataStage certified solution developer. He has worked extensively on customer POCs, and assisted in pre-sales activities for growth markets.



13 December 2012


Introduction

In growth market regions, the first challenge a solution provider encounters is inconsistency in the dialects and languages of the available data. Because these regions have multiple official languages alongside English, regional-language tokens are mixed with English tokens. The first step, therefore, is to transliterate the data so that it is consistent before you proceed with processing or text analytics.

Analytics on transliterated data gives more uniform and consistent results because all of the data is in one predetermined script. This article describes the steps for performing real-time transliteration using an InfoSphere Streams custom Java operator and the ICU4J library. IBM InfoSphere Streams supports real-time analytics through toolkits and adapters that let you connect to various sources, exchange data with them, and operate on that data in real time. The high-level implementation architecture of real-time transliteration is shown in Figure 1.

Figure 1. Real-time transliteration high-level solution diagram
realtime transliteration solution architecture

Prerequisites

  • Business prerequisites: You should have basic skills in designing and running Streams Processing Language (SPL) application jobs in InfoSphere Streams, and intermediate skills in Java programming. The source data must be encoded in UTF-8 or UTF-16.
  • Software prerequisites: InfoSphere Streams 2.0 or later, and the ICU4J library.

Creating the transliteration custom Java operator

Perform the following steps to create the transliteration custom Java operator.

  1. Set up the Streams Studio environment for Java operator development as described in the Streams Information Center.
  2. Once the environment has been set up, write the transliteration logic using the ICU4J library inside the Java operator. Import the ICU4J library JAR into your project workspace. The structure of a primitive Java operator in SPL is shown in Listing 1.
    Listing 1. Format of the Java operator in InfoSphere Streams
    public synchronized void initialize(OperatorContext context);
                            
    public void process(StreamingInput<Tuple> inputStream, Tuple tuple);
                            
    public void processPunctuation(StreamingInput<Tuple> inputStream,
                StreamingData.Punctuation marker);
                            
    public void allPortsReady();
                            
    public void shutdown();
  3. Put the operator logic inside the process method. Listing 2 shows sample code.
    Listing 2. Sample code for performing transliteration using Java operator
    // These methods belong inside the operator class. They need
    // com.ibm.icu.text.Normalizer and com.ibm.icu.text.Transliterator,
    // plus the Streams operator API types used below.

    // Reduces each character to its base form: the character is decomposed
    // (NFKD) and only the first char of the result is kept, which drops
    // combining marks such as accents.
    public String toBaseCharacters(final String sText) {
            if (sText == null || sText.length() == 0)
                    return sText;

            final int iSize = sText.length();
            final StringBuilder sb = new StringBuilder(iSize);
            for (int i = 0; i < iSize; i++) {
                    String sLetter = String.valueOf(sText.charAt(i));
                    sLetter = Normalizer.normalize(sLetter, Normalizer.NFKD);
                    // Keep only the base character of the decomposition.
                    sb.append(sLetter.charAt(0));
            }
            return sb.toString();
    }

    public final synchronized void process(final StreamingInput<Tuple> input,
                    final Tuple tuple) throws Exception {
            try {
                    OperatorContext ctxt = getOperatorContext();

                    // Build the transliterator ID, for example "Any-Latin",
                    // from the two operator parameters.
                    Transliterator t = Transliterator.getInstance(
                                    ctxt.getParameterValues("sourceLanguage").get(0) + "-"
                                    + ctxt.getParameterValues("destLanguage").get(0));
                    StreamingOutput<OutputTuple> output = getOutput(0);
                    OutputTuple outputTuple = output.newTuple();

                    // Read the source attribute; the name "inp" must match the
                    // input stream schema.
                    String value = tuple.getString("inp");

                    if (value == null) {
                            throw new Exception("Input is null");
                    }
                    outputTuple.setString("TransliteratedText",
                                    toBaseCharacters(t.transliterate(value)));
                    output.submit(outputTuple);
            } catch (Exception e) {
                    // Errors, including the null-input case above, are
                    // swallowed here and the tuple is dropped.
            }
            .....

    The input text is read, transliterated, and then submitted to the output port. Note that the input should be read as an SPL ustring, which maps to String in the Java operator.
  4. Once the operator code is created and compiled, configure the operator model and make it available for use in SPL applications.
  5. For every new operator, a corresponding operator model is created where the entire configuration needs to be specified. A snippet of the operator model is shown in Figure 2.
    Figure 2. External library dependency
    Image shows the operator model, where for every new model, another is created.
  6. The values that need to be set for the various sections in the operator model are shown in Table 1.
Table 1. Custom Java operator model
Section | Property | Description | Value
Context -> Execution Settings | Class Name | The class name of the Java operator that contains the transliteration logic. | com.ibm.streams.transliteration
Context -> Libraries -> Library | LibPath | The absolute or relative path of the ICU4J library JAR. | ../../impl/lib/icu4j.jar
Context -> Libraries -> Library | LibPath | The absolute or relative path of the folder that contains the operator class file. | ../../impl/java/bin
Parameters -> Parameter | Name | A parameter of type rstring that takes the name of the source language. | sourceLanguage
Parameters -> Parameter | Name | A parameter of type rstring that takes the name of the language to transliterate to. | destLanguage

Using the custom Java operator in an InfoSphere Streams application using SPL

Consider a scenario where you want to understand what people are saying about your product by analyzing blog posts and social media community posts about it. Your customers have expressed their opinions in local dialects, and it is difficult to determine their sentiment because text analytics rules are not readily available for all dialects.

To overcome this challenge, you can transliterate the source before running the analytics. Listing 3 shows how transliteration is performed before text analytics. The Transliterate operator takes two parameters: the source language and the destination language. If the language of the source is known, set the source parameter accordingly. If the input language is unknown or the input contains multilingual tokens, it is safer to set it to Any.

Listing 3. Sample SPL application using the custom Java operator to perform transliteration
composite Main {

        graph
        // Fetch the blog content to be analyzed.
        stream <rstring LingualInput> SourceBlogs = InetSource ()
        {
                param
                        URIList: ["http://localblogs.com/fashionwear"];
                        initDelay: 5u;
                        incrementalFetch: true;
                        fetchIntervalSeconds: 60u;
        }

        // Transliterate the input to Latin script.
        stream <ustring TransliteratedText> TransliteratedOutput =
                Transliterate(SourceBlogs)
        {
                param
                        sourceLanguage: "Any";
                        destLanguage: "Latin";
        }

        // Extract the sentiment using the rules in the AQL file.
        stream <rstring text, rstring product, rstring sentiment, rstring sentiment_text>
                sentiment = TextExtractor(TransliteratedOutput)
        {
                param
                        AQLFile: "getsentiment.aql";
        }

        () as RssResults = FileSink(sentiment)
        {
                param
                        file: "output.txt";
                        format: csv;
                        hasDelayField: false;
        }

}
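
The way the two operator parameters combine into a single transliterator ID can be sketched in plain Java. This is only an illustration: the class name TransliteratorIds and the method buildId are hypothetical, and the fallback to Any for a missing source is an assumption based on the advice above, not part of the operator shown in Listing 2.

```java
class TransliteratorIds {
    // Joins the source and destination names into an ID of the form
    // "Source-Destination". A null or blank source falls back to "Any"
    // (assumed behavior for unknown or mixed-language input).
    static String buildId(String sourceLanguage, String destLanguage) {
        if (destLanguage == null || destLanguage.trim().isEmpty()) {
            throw new IllegalArgumentException("destLanguage is required");
        }
        String source = (sourceLanguage == null || sourceLanguage.trim().isEmpty())
                ? "Any"
                : sourceLanguage.trim();
        return source + "-" + destLanguage.trim();
    }

    public static void main(String[] args) {
        System.out.println(buildId("Hindi", "Latin")); // prints Hindi-Latin
        System.out.println(buildId(null, "Latin"));    // prints Any-Latin
    }
}
```

The resulting string is what the operator would pass to Transliterator.getInstance().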

Consider the following input read from a URL. A customer is expressing negative sentiment about a product of company xyz. The challenge is that the input is multilingual text.

मुझे xyz suit पसंद नहीं आया

Once transliterated, the input is in Latin script.

mujhe xyz suit pasanda nahin aya

The transliterated output gives you a common basis for running text analytics. In the text, pasand and nahin together express a negative sentiment. The input is passed to the Text Extractor operator along with the rules in an AQL file, which is configured to handle transliterated local-dialect keywords such as pasand and nahin. The output of Text Extractor is as follows.

mujhe xyz suit pasanda nahin aya, negative, nahin aaya

Thus, you can successfully extract the sentiment of a customer from a multilingual input.
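
The reduction from a romanization with diacritics to the plain form shown above can be sketched with the JDK alone. The sketch below follows the idea of toBaseCharacters in Listing 2, but it uses java.text.Normalizer from the JDK in place of the ICU4J Normalizer, an assumption made so that it runs without extra libraries. The class name BaseCharacters and the sample string are hypothetical; the sample is not actual ICU4J output.

```java
import java.text.Normalizer;

class BaseCharacters {
    // Decomposes each char (NFKD) and keeps only the first char of the
    // result, which drops combining marks such as a macron.
    static String toBaseCharacters(String text) {
        if (text == null || text.isEmpty()) {
            return text;
        }
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            String decomposed = Normalizer.normalize(
                    String.valueOf(text.charAt(i)), Normalizer.Form.NFKD);
            sb.append(decomposed.charAt(0));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "nah\u012Bn \u0101y\u0101" is "nahin aya" written with macrons
        // over the i and both a's.
        System.out.println(toBaseCharacters("nah\u012Bn \u0101y\u0101")); // prints nahin aya
    }
}
```

Working one char at a time keeps the sketch close to the listing, but it also means that characters outside the Basic Multilingual Plane are not handled as whole code points.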


Brief information about ICU4J

ICU4J provides a set of classes for Unicode and internationalization support. The main classes that support transliteration capabilities are Transliterator and Normalizer.

  • The Transliterator class provides a transliterate() method, which converts a string of characters from one script to another. The method is stateless, which means that it doesn't retain information from previous calls. Before calling transliterate(), you have to obtain a Transliterator instance by providing the source language and destination language, separated by a hyphen (-). For example: Transliterator.getInstance("Hindi-Latin");.
  • The Normalizer class provides functions to normalize text, such as the output of the transliteration function, into a composed or decomposed form. For example, a Latin character such as A-acute is a single code point (U+00C1) in composed form, or the letter A followed by a combining acute accent (U+0041 U+0301) in decomposed form.
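
The difference between composed and decomposed forms can be checked with a few lines of Java. This sketch uses java.text.Normalizer from the JDK rather than the ICU4J Normalizer, an assumption made to keep it free of external libraries. The class and method names are hypothetical.

```java
import java.text.Normalizer;

class NormalizationForms {
    // Canonical composition (NFC): combines base letter and mark where possible.
    static String composed(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Canonical decomposition (NFD): splits into base letter plus combining marks.
    static String decomposed(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String aAcute = "\u00C1"; // A-acute as one code point
        String nfd = decomposed(aAcute);
        System.out.println(nfd.length());                  // prints 2
        System.out.println(nfd.charAt(0));                 // prints A
        System.out.println(composed(nfd).equals(aAcute));  // prints true
    }
}
```

In decomposed form the base letter comes first, which is why keeping only the first character of a decomposition removes the accent.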

Some of the language conversions supported by ICU4J are as follows.

  • ASCII-Latin
  • Accents-Any
  • Amharic-Latin/BGN
  • Arabic-Latin
  • Bengali-Devanagari
  • Bengali-Latin
  • Kannada-Latin
  • Hindi-Latin
  • Telugu-Latin
  • Tamil-Latin

Exceptional conditions

InfoSphere Streams provides a debugger for real-time applications, known as Streams Debugger. Its commands and options let you trace and validate the output. You can find more information on debugging with the Streams Debugger in the Resources section.


Conclusion

This article described how to perform transliteration using the ICU4J library, and the configuration settings required to build the primitive Java operator. Transliteration is a key component in solving multilingual challenges, and it provides a common ground for running text analytics.

Resources

Learn

Get products and technologies

  • Build your next development project with IBM trial software, available for download directly from developerWorks.
  • Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.
  • You can find new releases ready for download in the ICU downloads section.
