Skip to main content

By clicking Submit, you agree to the developerWorks terms of use.

The first time you sign into developerWorks, a profile is created for you. Select information in your developerWorks profile is displayed to the public, but you may edit the information at any time. Your first name, last name (unless you choose to hide them), and display name will accompany the content that you post.

All information submitted is secure.

  • Close [x]

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerworks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

By clicking Submit, you agree to the developerWorks terms of use.

All information submitted is secure.

  • Close [x]

Simplify loading data from relational databases with WebSphere Information Integrator Omnifind Enterprise Search

Use the DataListener to access Informix and other relational database management systems

Paul Hill (hillpa@au.ibm.com), Technical Specialist, IBM
Prior to working at IBM, Paul was a pre-sales engineer at Informix, specializing in the implementation of DataBlade technology and the use of extended data types (especially spatial). In the last 4 years at IBM, Paul has broadened his focus to include other information management topics such as Enterprise Search and Content Management technologies and their integration into the enterprise.
Chris Chilcott (chlcottc@au.ibm.com), Technical Specialist, IBM
Chris has worked for more than 10 years in the pre-sales area for both Informix and IBM Information Management. He has had firsthand experience with many sites struggling with the issues of finding and analyzing information at the enterprise level.

Summary:  Walk through integrating Informix® with WebSphere® Information Integrator OmniFind Search (OmniFind) technology to gain a wider enterprise searching capability. This same approach can be applied to other relational database management systems in order to access the benefits that OmniFind offers.

Date:  26 Jan 2006
Level:  Introductory
Also available in:   Russian

Activity:  3542 views
Comments:  

Introduction

Relational database management systems (RDBMS) are good at storing and retrieving structured data. SQL provides mature mechanisms for defining and executing a search on structured data but performs poorly on data that is not strongly typed or is not easily indexed (for example: large portions of textual data and binary data). RDBMS today also provide for storing unstructured data and increasingly also semi-structured XML data. Datatypes like VARCHAR and CLOB allow you to store plain ASCII text, and BLOBs provide for storage of complex (proprietary) types of data, like .pdf, .doc, .ppt, .rtf, .xml and so on. SQL provides no mechanisms for defining and executing searches on these types of data.

In the past, some RDBMS attempted to overcome these limitations (to an extent) by providing text indexing 'engines' within the RDBMS code. For example, the Informix database supports the Excalibur Text DataBlade™, which is essentially a text indexing and search engine that plugs into the Informix database.

However, no matter how well an RDBMS supports text indexing and searching, that activity is still constrained to the scope of the data within the particular RDBMS itself. The RDBMS cannot index and support searching on the multitude of data outside the database that enterprises are dependent on, so RDBMS text index/search can never support an enterprise approach to the problem of finding fragments of information within the broader organization. This article uses OmniFind with the DataListener to show how both structured and unstructured data stored in a RDBMS can be 'pushed' to OmniFind. While useful in any context, this article is particularly useful where RDBMS have been used to store unstructured content and there is an interest in making that content available as part of an enterprise text indexing/search initiative.

The article uses, by way of example, an Informix database in which a table stores unstructured content in a BLOB datatype. You'll learn how to access the documents and push them across to OmniFind, where they are subsequently parsed and indexed. With some relatively minor modifications (primarily to the connection string) this code could be adapted to suit any RDBMS.

What is Enterprise Search?

Organizations today store information in a multitude of places across the IT environment. Databases, mail servers, local file systems, project repositories, Content Management systems, Web Content Management systems, and application servers such as CRM all hold key pieces of information and increasingly organizations are using tools capable of searching all of these disparate environments simultaneously. The complexity is further increased when the actual data itself is in a multitude of formats with varying degrees of organization and structure. The data may be stored in structured formats (database), unstructured (files on a file system), or semi-structured formats (XML).

User expectations for finding information (and quickly!) have increased with exposure to Web searching technologies capable of finding words or phrases in millions of Web pages in a matter of seconds through popular search engines such as Google or Yahoo.

Enterprise search technologies provide users with the ability to find relevant information anywhere within an organization using more than the simple word matching capabilities of Internet search. Examples of this include the use of semantic searching ( searching for 'fast car' may match records containing 'Porsche') and parametric searching (find all references occurring between 1990 and 1991).

Prior to the availability of enterprise searching technologies, many organizations attempted a more structured approach to both information storage and searching. The Relational Database and the more recently Object Relational or Universal databases were the generic repositories within these organizations. Data structures were formal and storage mechanisms were limited to the capabilities of the various technologies. This limited the capacity of managing content such as e-mail or documents. Searching was limited to the functionality available within the database, typically wild-card searching within the ANSI SQL standard. Informix provided more sophisticated extensible technology with its DataBlade functionality to allow users to develop their own searching mechanisms, such as Regular Expression searching or Excalibur Text Search.

These mechanisms provide an adequate solution in an environment where the searching is only required against the single database repository. However, when searching across the enterprise is required, the Informix database becomes another source and needs to be accessible from the OmniFind environment.

Using Omnifind with Informix 'out of the box'

Enterprise search engines generally follow a similar approach: they crawl through the data, parse the key data into some sort of common record format, index these records, and then provide the search application and user interface to access this index.

OmniFind implements crawlers to natively support a number of targets (Web, NNTP, file system, IBM Lotus® Domino®, Microsoft® Exchange, IBM Workplace® Web Content Management™) and crawlers that use connectors with Information Integrator (DB2®, Oracle®, Informix, and so on) or Information Integrator Content Edition (FileNet, Hummingbird, and more). See a full listing of supported data sources.

Configuring a crawler for Informix is achieved through defining nickname tables using the DB2 Control Center and then using the wizards in the OmniFind administration to add a DB2 data source.

This provides a general solution for adding Informix tables to OmniFind; however, there are four considerations that will become apparent for configurations of any complexity:

  • The support for datatypes that are outside the basic types supported by Information Integrator (Large Object Locator, SLOB, DataLinks) where content may be stored.
  • The granularity of the content type is to the table level rather than the row level. Documents that may have been stored in a multitude of formats (Microsoft Word, Microsoft PowerPoint, Adobe Acrobat, and so on) in the same table will be difficult to parse as the MIME type can only be assigned to a column as a whole.
  • A connector license is required for each database to be parsed and the software may be required to be installed and maintained on a number of remote servers to access the required databases. If you have extensive Informix skills and are looking to use OmniFind, the addition of DB2 skills to configure and manager Information Integrator may be expensive to develop.
  • Crawling by its very nature is a gathering and scanning activity designed to be performed on an infrequent basis so that the target system is not overly impacted by the activity. This characteristic creates the behavior of the target database being crawled on a nightly or weekly basis, which results in impacts on reporting and maintenance windows, and users potentially being unable to find current information.
  • Sometimes its very useful to have a utility that can be customized (programmed) to work with your particular environment.

The DataListener: Push rather than pull

Crawlers are a pull type of interface -- they attach to the target system, scan for changes, and pull the data back to OmniFind, ready for parsing.

The DataListener provides an interface to OmniFind where content may be pushed from the target system to OmniFind ready for the parser. This allows for only the relevant content as determined at the source system (Informix, in this case) to be pushed and made available for searching. When working with a database as the source, the data could be pushed on a transaction basis using triggers (and perhaps a small bladelet implementation) or scheduled for non-peak periods.

In environments where Informix provides a document or content repository, data can be pushed to OmniFind as required (either triggers or a batch process). The parser will parse the content as it is received, and the indexer can be scheduled to refresh every few minutes (the index refresh only builds a list of what has changed since the last time the index was rebuilt -- a delta -- rather than rebuilding the entire index).

The DataListener interface consists of Java™ APIs, and programmers with reasonably basic skills should be able to reshape the examples to suit their needs. When considering skills and project resources, the required Java skills would be similar to the requirements for customizing the OmniFind search interface. This minimizes the skills diversity to complete the project without the addition requirements of understanding and configuring the WebSphere Information Integrator environment.

Working sample using Informix and the DataListener

There are five key steps in putting together a Java-based utility program to move data from Informix to the DataListener:

  • Configure the JDBC™ environment for Informix and connect to the source database
  • Identify the rows and columns to select from Informix and execute an SQL Select to retrieve the data
  • Configure the DataListener environment for WebSphere Information Integrator OmniFind Edition
  • Push the data to OmniFind
  • Put it all together and test the results!

A small Java utility program is supplied with this article, which provides a working example of the above steps and has an XML configuration file to simplify the connections to OmniFind and Informix; establish the SELECT statement; and the required columns to push to OmniFind. It is called ids2omf and is referred to throughout this article.

This article was written and the examples were tested with the following software products and versions:
-Informix Dynamic Server V10.0UC2
-WebSphere Information Integrator OmniFind Edition V8.2.2
-WebSphere Application Server V5.1.1
-JDK 1.4.2


Figure 1. The pictorial representation of this architecture is:
overview

Step 1: JDBC connection to Informix database

The Java libraries required for developing connections to Informix are in the Informix ClientSDK software package. If your database is on a different host and your local development environment is a different platform (UNIX® database server, Linux® development environment), make sure to download the ClientSDK for the platform where you will be running the Java code.

Add the appropriate lib directory to your CLASSPATH environment variables from the ClientSDK (it will contain ifxjdbc.jar, ifxjdbcx.jar, ifxlang.jar, and so on).

The database server will only accept JDBC connections using the TCP interface. This may need to be configured for your environment and will require the /etc/hosts, onconfig, and sqlhosts files to be edited. See the Informix Infocenter documentation for more information

To create a JDBC connection to Informix, you will need a URL to identify the the target database and supply the appropriate user credentials; for example:

jdbc:informix-sqli://10.0.0.102:40000/stores_demo:INFORMIXSERVER=demo_on_tcp;user=
 informix;password=informix


An example of Java code to connect to Informix using JDBC:
    
...  
static Connection informixConnection = null;
  ...
  
  try {
  Class.forName("com.informix.jdbc.IfxDriver");
  } catch (Exception e) {
  System.out.println("FAILED: failed to load Informix JDBC driver.");
  }
  try {
  informixConnection = DriverManager.getConnection(jdbcURL);
  } catch (SQLException e) {
  System.out.println("FAILED: failed to connect!");
  System.out.println(jdbcURL);
  System.out.println(e.getMessage());
  e.printStackTrace();
  }
  try {
  informixConnection.setReadOnly(true);
  } catch (SQLException e) {
  e.printStackTrace();
  }
  
  The following code is required before the completion of the program to close the connection:
  try {
  informixConnection.close();
  } catch (SQLException e) {
  System.out.println("FAILED: failed to close the connection!");
  }
  System.out.println("Disconnect from source database.");
  

In the above example, on success the active connection is held within the "informixConnection" variable

Step 2: Identify the rows and columns to select from Informix (Select)

When pushing data to OmniFind, several pieces of information are required. This includes the actual content that is searchable and several pieces of metadata, including a unique resource identifier (URI) in order to uniquely identify the row for later retrieval. This URI typically needs a key record value within the table, something like an ID column. If the data being pushed is not plain textual data (Acrobat PDF files stored in a blob), then enough data also needs to be collected to establish the MIME type in order for the parser to execute correctly.

Using the CATALOG table from the STORES_DEMO database as an example, we might be interested in searching the catalog descriptions (cat_descr column). In order to uniquely identify each row we would need the catalog ID number(catalog_num), as well.

The select statement would look like: SELECT catalog_num, cat_descr FROM catalog


The Java code to execute the query is:
    
...
  // Execute the SELECT statement
  stmt = informixConnection.createStatement();
  ResultSet queryResults = stmt.executeQuery("SELECT catalog_num,cat_descr FROM catalog;");
  
  // Get the result set
  ResultSetMetaData queryResultsMetaData = queryResults.getMetaData();
  int numOfCols = queryResultsMetaData.getColumnCount();
  
  // For each row in the result set...
  while (queryResults.next()) {
  totalSelected = totalSelected + 1;
  
  // Build a URI for the row... table,id_column,id	
  uri = "informix://stores_demo/catalog/catalog_num/"+ queryResults.getString(1);	
  content_type = "text/plain";
  
  // Grab the content which is in the cat_descr column
  content = queryResults.getString("cat_descr").getBytes();
  
  // Push the result to the DataListener and keep count of successes
  ...
  
  } catch (Exception e) {
  e.printStackTrace();
  }
...
  

The above example provides a solid starting point for basic textual data. In the attached ids2omf.java file, the coding reflects the need to check the returned data type (a BLOB) and establish the correct MIME type based on the database record.

Step 3: Configure the DataListener environment for OmniFind (Configure)

The DataListener is started by default when OmniFind is started. However, the Listener is unable to accept connections without defining a user within the OmniFind environment and aligning the user credentials with a collection.

The DataListener configuration can be found under the System: DataListener (edit) menu. The scope of this user ID is only for connections to the DataListener within OmniFind.

Note: On this menu the "Collection Name" is used; however, the "Collection_ID" is actually required for the DataListener client code to operate correctly.


Figure 2. Configure DataListener screen
fig2

Checking the monitor menu for the DataListener shows requests waiting to be processed (queued) rather than requests already processed. This screen doesn't provide a lot of useful information when checking to see if your code works (when we get to testing it); however, it is worthwhile to check to make sure the Listener is actually operational.


Figure 3. The DataListener monitoring menu
fig3

Step 4: Develop the appropriate code to connect to the DataListener

The actual push is a two-step process. First, the metadata record needs to be populated, and second, the metadata record and content are actually pushed to the Listener. Note the actual content is handled twice, in the metadata record and in the actual push call.

Most of the metadata fields are not required; however, the content type is required if you intend to push non-text data such as PDF files or Word documents. If the correct content type is not passed, the parser will have problems and generally reject the document (in our experience, the Stellant parser hasn't been particularly successful in establishing the correct document type). Check the log files through the Web-based OmniFind Administration interface to find out if your content has been rejected or ignored by the parser.

The connection to the Listener is not a persistent connection; hence, no "connect, loop until all done, disconnect" coding is required.


Metadata object code
    
    ...
    // Build a default metadata object 
    DataSourceMetadata DSMetaData = DLDataPusher.createDataSourceMetadata(
    "informix",           // Datasource
    "ids2omf",            // Client application identifier
    "informix",           // Datasource name
    0,                    // score (unused)
    new Date(),           // Date 
    "",                   // Language 
    ",                   // Security access tokens
    "text/plain",         // Content type of content (MIME type)
    "",                   // Character set
    content );            // Actual content
  
  // call the push method
  try {
  DLResponse dlRes = DLDataPusher.pushData(hostname, port, clientID,
    passwd, uri, collectionID, DSMetaData, content);
  
  // check the result
  if (dlRes != null)
    if (dlRes.getCodeName() == "SUCCESS")
      recordsPushed = recordsPushed + 1;
    else {
      System.out.println("Problem pushing to OmniFind: URI:"+ uri + "Problem:" 
      + dlRes.toString());
    }
  } catch (Exception e) {
    e.printStackTrace();
  }
  ...

The Datasource, client application identifier, and datasource name fields can be populated with whatever information makes sense in your environment. These values are not required for OmniFind to work properly.

Step 5: Putting it all together

The attached ids2omf.java implements the functionality described in this article. To make the code more flexible, an addition has been made to read and parse an XML parameter file that defines the Informix connection parameters; SELECT statement and key columns; and the OmniFind parameters.

To compile this code, the following JAR files must be accessible on your CLASSPATH environment variable: siapi.jar, es.dl.client.jar, es.dl.server.jar, esapi.jar, ifxjdbc.jar, es.oss.jar, ifxtools.jar, ifxjdbcx.jar, ifxlang.jar, ifxlsupp.jar,ifxsqlj.jar


Running the example code
    
esadmin@linux:~/ids2omf> cat set_classpath.sh

export CLASSPATH=/opt/IBM/es/lib/siapi.jar:/opt/IBM/es/lib/es.dl.client.jar:
/opt/IBM/es/lib/es.dl.server.jar:/opt/IBM/es/lib/esapi.jar:/opt/IBM/es/lib/es.oss.jar:
/opt/IBM/informix/jdbc/lib/ifxjdbc.jar:/opt/IBM/informix/jdbc/lib/ifxtools.jar:
/opt/IBM/informix/jdbc/lib/ifxjdbcx.jar:/opt/IBM/informix/jdbc/lib/ifxlang.jar:
/opt/IBM/informix/jdbc/lib/ifxlsupp.jar:/opt/IBM/informix/jdbc/lib/ifxsqlj.jar:.

esadmin@linux:~/ids2omf> . ./set_classpath.sh

esadmin@linux:~/ids2omf> javac ids2omf.java

esadmin@linux:~/ids2omf> java -cp .:$CLASSPATH ids2omf sources.xml
-------------------------------------------------------------------------
Starting Informix to OmniFind utility
-------------------------------------------------------------------------
Input Parameter file:sources.xml
Informix URL: jdbc:informix-sqli://10.0.0.106:40000/stores_demo:
  INFORMIXSERVER=demo_on_tcp;user=informix;password=informix
OmniFind Target:10.0.0.106:6668:test:test:col_33762
Source selections in parameter file: 1
Connected to source database.
-------------------------------------------------------------------------
        Tables: cust_calls
        Where_clause: ""
        Key column:customer_num
        Data Column: call_descr
        Data Type Column: "text/plain"
Push complete:  Records selected:7, Records pushed:7
Disconnect from source database.
-------------------------------------------------------------------------

esadmin@linux:~/ids2omf>
  

Deprecated DataListener methods

The DataListener interface has been deprecated at Version 8.3 of WebSphere Information Integrator OmniFind Edition in favor of a more streamlined interface. The Java examples used in this article will still function as documented in Version 8.3; however, in future releases the Search and Index API calls should be used.

See Chapter 4 of the Programming Guide and API Reference V8.3 for more information.



Downloads

DescriptionNameSizeDownload method
Sample Java class file for this articleids2omf.java13KBHTTP
Sample XML file for this articlesources.xml2KBHTTP

Information about download methods


Resources

Learn

Get products and technologies

  • Build your next development project with IBM trial software, available for download directly from developerWorks.

About the authors

Prior to working at IBM, Paul was a pre-sales engineer at Informix, specializing in the implementation of DataBlade technology and the use of extended data types (especially spatial). In the last 4 years at IBM, Paul has broadened his focus to include other information management topics such as Enterprise Search and Content Management technologies and their integration into the enterprise.

Chris has worked for more than 10 years in the pre-sales area for both Informix and IBM Information Management. He has had firsthand experience with many sites struggling with the issues of finding and analyzing information at the enterprise level.

Report abuse help

Report abuse

Thank you. This entry has been flagged for moderator attention.


Report abuse help

Report abuse

Report abuse submission failed. Please try again later.


developerWorks: Sign in


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Select information in your developerWorks profile is displayed to the public, but you may edit the information at any time. Your first name, last name (unless you choose to hide them), and display name will accompany the content that you post.

Choose your display name

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


Rate this article

Comments

Help: Update or add to My dW interests

What's this?

This little timesaver lets you update your My developerWorks profile with just one click! The general subject of this content (AIX and UNIX, Information Management, Lotus, Rational, Tivoli, WebSphere, Java, Linux, Open source, SOA and Web services, Web development, or XML) will be added to the interests section of your profile, if it's not there already. You only need to be logged in to My developerWorks.

And what's the point of adding your interests to your profile? That's how you find other users with the same interests as yours, and see what they're reading and contributing to the community. Your interests also help us recommend relevant developerWorks content to you.

View your My developerWorks profile

Return from help

Help: Remove from My dW interests

What's this?

Removing this interest does not alter your profile, but rather removes this piece of content from a list of all content for which you've indicated interest. In a future enhancement to My developerWorks, you'll be able to see a record of that content.

View your My developerWorks profile

Return from help

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management, WebSphere
ArticleID=102585
ArticleTitle=Simplify loading data from relational databases with WebSphere Information Integrator Omnifind Enterprise Search
publish-date=01262006
author1-email=hillpa@au.ibm.com
author1-email-cc=
author2-email=chlcottc@au.ibm.com
author2-email-cc=

Tags

Help
Use the search field to find all types of content in My developerWorks with that tag.

Use the slider bar to see more or fewer tags.

For articles in technology zones (such as Java technology, Linux, Open source, XML), Popular tags shows the top tags for all technology zones. For articles in product zones (such as Info Mgmt, Rational, WebSphere), Popular tags shows the top tags for just that product zone.

For articles in technology zones (such as Java technology, Linux, Open source, XML), My tags shows your tags for all technology zones. For articles in product zones (such as Info Mgmt, Rational, WebSphere), My tags shows your tags for just that product zone.

Use the search field to find all types of content in My developerWorks with that tag. Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere). My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Try IBM PureSystems. No charge.

Special offers