Relational database management systems (RDBMS) are good at storing and retrieving structured data. SQL provides mature mechanisms for defining and executing a search on structured data but performs poorly on data that is not strongly typed or is not easily indexed (for example: large portions of textual data and binary data). RDBMS today also provide for storing unstructured data and increasingly also semi-structured XML data. Datatypes like VARCHAR and CLOB allow you to store plain ASCII text, and BLOBs provide for storage of complex (proprietary) types of data, like .pdf, .doc, .ppt, .rtf, .xml and so on. SQL provides no mechanisms for defining and executing searches on these types of data.
In the past, some RDBMS attempted to overcome these limitations (to an extent) by providing text indexing 'engines' within the RDBMS code. For example, the Informix database supports the Excalibur Text DataBlade™, which is essentially a text indexing and search engine that plugs into the Informix database.
However, no matter how well an RDBMS supports text indexing and searching, that activity is still constrained to the scope of the data within the particular RDBMS itself. The RDBMS cannot index and support searching on the multitude of data outside the database that enterprises are dependent on, so RDBMS text index/search can never support an enterprise approach to the problem of finding fragments of information within the broader organization. This article uses OmniFind with the DataListener to show how both structured and unstructured data stored in a RDBMS can be 'pushed' to OmniFind. While useful in any context, this article is particularly useful where RDBMS have been used to store unstructured content and there is an interest in making that content available as part of an enterprise text indexing/search initiative.
The article uses, by way of example, an Informix database in which a table stores unstructured content in a BLOB datatype. You'll learn how to access the documents and push them across to OmniFind, where they are subsequently parsed and indexed. With some relatively minor modifications (primarily to the connection string) this code could be adapted to suit any RDBMS.
Organizations today store information in a multitude of places across the IT environment. Databases, mail servers, local file systems, project repositories, Content Management systems, Web Content Management systems, and application servers such as CRM all hold key pieces of information and increasingly organizations are using tools capable of searching all of these disparate environments simultaneously. The complexity is further increased when the actual data itself is in a multitude of formats with varying degrees of organization and structure. The data may be stored in structured formats (database), unstructured (files on a file system), or semi-structured formats (XML).
User expectations for finding information (and quickly!) have increased with exposure to Web searching technologies capable of finding words or phrases in millions of Web pages in a matter of seconds through popular search engines such as Google or Yahoo.
Enterprise search technologies provide users with the ability to find relevant information anywhere within an organization using more than the simple word matching capabilities of Internet search. Examples of this include the use of semantic searching ( searching for 'fast car' may match records containing 'Porsche') and parametric searching (find all references occurring between 1990 and 1991).
Prior to the availability of enterprise searching technologies, many organizations attempted a more structured approach to both information storage and searching. The Relational Database and the more recently Object Relational or Universal databases were the generic repositories within these organizations. Data structures were formal and storage mechanisms were limited to the capabilities of the various technologies. This limited the capacity of managing content such as e-mail or documents. Searching was limited to the functionality available within the database, typically wild-card searching within the ANSI SQL standard. Informix provided more sophisticated extensible technology with its DataBlade functionality to allow users to develop their own searching mechanisms, such as Regular Expression searching or Excalibur Text Search.
These mechanisms provide an adequate solution in an environment where the searching is only required against the single database repository. However, when searching across the enterprise is required, the Informix database becomes another source and needs to be accessible from the OmniFind environment.
Using Omnifind with Informix 'out of the box'
Enterprise search engines generally follow a similar approach: they crawl through the data, parse the key data into some sort of common record format, index these records, and then provide the search application and user interface to access this index.
OmniFind implements crawlers to natively support a number of targets (Web, NNTP, file system, IBM Lotus® Domino®, Microsoft® Exchange, IBM Workplace® Web Content Management™) and crawlers that use connectors with Information Integrator (DB2®, Oracle®, Informix, and so on) or Information Integrator Content Edition (FileNet, Hummingbird, and more). See a full listing of supported data sources.
Configuring a crawler for Informix is achieved through defining nickname tables using the DB2 Control Center and then using the wizards in the OmniFind administration to add a DB2 data source.
This provides a general solution for adding Informix tables to OmniFind; however, there are four considerations that will become apparent for configurations of any complexity:
- The support for datatypes that are outside the basic types supported by Information Integrator (Large Object Locator, SLOB, DataLinks) where content may be stored.
- The granularity of the content type is to the table level rather than the row level. Documents that may have been stored in a multitude of formats (Microsoft Word, Microsoft PowerPoint, Adobe Acrobat, and so on) in the same table will be difficult to parse as the MIME type can only be assigned to a column as a whole.
- A connector license is required for each database to be parsed and the software may be required to be installed and maintained on a number of remote servers to access the required databases. If you have extensive Informix skills and are looking to use OmniFind, the addition of DB2 skills to configure and manager Information Integrator may be expensive to develop.
- Crawling by its very nature is a gathering and scanning activity designed to be performed on an infrequent basis so that the target system is not overly impacted by the activity. This characteristic creates the behavior of the target database being crawled on a nightly or weekly basis, which results in impacts on reporting and maintenance windows, and users potentially being unable to find current information.
- Sometimes its very useful to have a utility that can be customized (programmed) to work with your particular environment.
The DataListener: Push rather than pull
Crawlers are a pull type of interface -- they attach to the target system, scan for changes, and pull the data back to OmniFind, ready for parsing.
The DataListener provides an interface to OmniFind where content may be pushed from the target system to OmniFind ready for the parser. This allows for only the relevant content as determined at the source system (Informix, in this case) to be pushed and made available for searching. When working with a database as the source, the data could be pushed on a transaction basis using triggers (and perhaps a small bladelet implementation) or scheduled for non-peak periods.
In environments where Informix provides a document or content repository, data can be pushed to OmniFind as required (either triggers or a batch process). The parser will parse the content as it is received, and the indexer can be scheduled to refresh every few minutes (the index refresh only builds a list of what has changed since the last time the index was rebuilt -- a delta -- rather than rebuilding the entire index).
The DataListener interface consists of Java™ APIs, and programmers with reasonably basic skills should be able to reshape the examples to suit their needs. When considering skills and project resources, the required Java skills would be similar to the requirements for customizing the OmniFind search interface. This minimizes the skills diversity to complete the project without the addition requirements of understanding and configuring the WebSphere Information Integrator environment.
Working sample using Informix and the DataListener
There are five key steps in putting together a Java-based utility program to move data from Informix to the DataListener:
- Configure the JDBC™ environment for Informix and connect to the source database
- Identify the rows and columns to select from Informix and execute an SQL Select to retrieve the data
- Configure the DataListener environment for WebSphere Information Integrator OmniFind Edition
- Push the data to OmniFind
- Put it all together and test the results!
A small Java utility program is supplied with this article, which provides a working example of the above steps and has an XML configuration file to simplify the connections to OmniFind and Informix; establish the SELECT statement; and the required columns to push to OmniFind. It is called ids2omf and is referred to throughout this article.
This article was written and the examples were tested with the following software products and versions:
-Informix Dynamic Server V10.0UC2
-WebSphere Information Integrator OmniFind Edition V8.2.2
-WebSphere Application Server V5.1.1
-JDK 1.4.2
Figure 1. The pictorial representation of this architecture is:
Step 1: JDBC connection to Informix database
The Java libraries required for developing connections to Informix are in the Informix ClientSDK software package. If your database is on a different host and your local development environment is a different platform (UNIX® database server, Linux® development environment), make sure to download the ClientSDK for the platform where you will be running the Java code.
Add the appropriate lib directory to your CLASSPATH environment variables from the ClientSDK (it will contain ifxjdbc.jar, ifxjdbcx.jar, ifxlang.jar, and so on).
The database server will only accept JDBC connections using the TCP interface. This may need to be configured for your environment and will require the /etc/hosts, onconfig, and sqlhosts files to be edited. See the Informix Infocenter documentation for more information
To create a JDBC connection to Informix, you will need a URL to identify the the target database and supply the appropriate user credentials; for example:
jdbc:informix-sqli://10.0.0.102:40000/stores_demo:INFORMIXSERVER=demo_on_tcp;user= informix;password=informix |
An example of Java code to connect to Informix using JDBC:
...
static Connection informixConnection = null;
...
try {
Class.forName("com.informix.jdbc.IfxDriver");
} catch (Exception e) {
System.out.println("FAILED: failed to load Informix JDBC driver.");
}
try {
informixConnection = DriverManager.getConnection(jdbcURL);
} catch (SQLException e) {
System.out.println("FAILED: failed to connect!");
System.out.println(jdbcURL);
System.out.println(e.getMessage());
e.printStackTrace();
}
try {
informixConnection.setReadOnly(true);
} catch (SQLException e) {
e.printStackTrace();
}
The following code is required before the completion of the program to close the connection:
try {
informixConnection.close();
} catch (SQLException e) {
System.out.println("FAILED: failed to close the connection!");
}
System.out.println("Disconnect from source database.");
|
In the above example, on success the active connection is held within the "informixConnection" variable
Step 2: Identify the rows and columns to select from Informix (Select)
When pushing data to OmniFind, several pieces of information are required. This includes the actual content that is searchable and several pieces of metadata, including a unique resource identifier (URI) in order to uniquely identify the row for later retrieval. This URI typically needs a key record value within the table, something like an ID column. If the data being pushed is not plain textual data (Acrobat PDF files stored in a blob), then enough data also needs to be collected to establish the MIME type in order for the parser to execute correctly.
Using the CATALOG table from the STORES_DEMO database as an example, we might be interested in searching the catalog descriptions (cat_descr column). In order to uniquely identify each row we would need the catalog ID number(catalog_num), as well.
The select statement would look like:
SELECT catalog_num, cat_descr
FROM catalog
The Java code to execute the query is:
...
// Execute the SELECT statement
stmt = informixConnection.createStatement();
ResultSet queryResults = stmt.executeQuery("SELECT catalog_num,cat_descr FROM catalog;");
// Get the result set
ResultSetMetaData queryResultsMetaData = queryResults.getMetaData();
int numOfCols = queryResultsMetaData.getColumnCount();
// For each row in the result set...
while (queryResults.next()) {
totalSelected = totalSelected + 1;
// Build a URI for the row... table,id_column,id
uri = "informix://stores_demo/catalog/catalog_num/"+ queryResults.getString(1);
content_type = "text/plain";
// Grab the content which is in the cat_descr column
content = queryResults.getString("cat_descr").getBytes();
// Push the result to the DataListener and keep count of successes
...
} catch (Exception e) {
e.printStackTrace();
}
...
|
The above example provides a solid starting point for basic textual data. In the attached ids2omf.java file, the coding reflects the need to check the returned data type (a BLOB) and establish the correct MIME type based on the database record.
Step 3: Configure the DataListener environment for OmniFind (Configure)
The DataListener is started by default when OmniFind is started. However, the Listener is unable to accept connections without defining a user within the OmniFind environment and aligning the user credentials with a collection.
The DataListener configuration can be found under the System: DataListener (edit) menu. The scope of this user ID is only for connections to the DataListener within OmniFind.
Note: On this menu the "Collection Name" is used; however, the "Collection_ID" is actually required for the DataListener client code to operate correctly.
Figure 2. Configure DataListener screen
Checking the monitor menu for the DataListener shows requests waiting to be processed (queued) rather than requests already processed. This screen doesn't provide a lot of useful information when checking to see if your code works (when we get to testing it); however, it is worthwhile to check to make sure the Listener is actually operational.
Figure 3. The DataListener monitoring menu
Step 4: Develop the appropriate code to connect to the DataListener
The actual push is a two-step process. First, the metadata record needs to be populated, and second, the metadata record and content are actually pushed to the Listener. Note the actual content is handled twice, in the metadata record and in the actual push call.
Most of the metadata fields are not required; however, the content type is required if you intend to push non-text data such as PDF files or Word documents. If the correct content type is not passed, the parser will have problems and generally reject the document (in our experience, the Stellant parser hasn't been particularly successful in establishing the correct document type). Check the log files through the Web-based OmniFind Administration interface to find out if your content has been rejected or ignored by the parser.
The connection to the Listener is not a persistent connection; hence, no "connect, loop until all done, disconnect" coding is required.
Metadata object code
...
// Build a default metadata object
DataSourceMetadata DSMetaData = DLDataPusher.createDataSourceMetadata(
"informix", // Datasource
"ids2omf", // Client application identifier
"informix", // Datasource name
0, // score (unused)
new Date(), // Date
"", // Language
", // Security access tokens
"text/plain", // Content type of content (MIME type)
"", // Character set
content ); // Actual content
// call the push method
try {
DLResponse dlRes = DLDataPusher.pushData(hostname, port, clientID,
passwd, uri, collectionID, DSMetaData, content);
// check the result
if (dlRes != null)
if (dlRes.getCodeName() == "SUCCESS")
recordsPushed = recordsPushed + 1;
else {
System.out.println("Problem pushing to OmniFind: URI:"+ uri + "Problem:"
+ dlRes.toString());
}
} catch (Exception e) {
e.printStackTrace();
}
...
|
The Datasource, client application identifier, and datasource name fields can be populated with whatever information makes sense in your environment. These values are not required for OmniFind to work properly.
Step 5: Putting it all together
The attached ids2omf.java implements the functionality described in this article. To make the code more flexible, an addition has been made to read and parse an XML parameter file that defines the Informix connection parameters; SELECT statement and key columns; and the OmniFind parameters.
To compile this code, the following JAR files must be accessible on your CLASSPATH environment variable: siapi.jar, es.dl.client.jar, es.dl.server.jar, esapi.jar, ifxjdbc.jar, es.oss.jar, ifxtools.jar, ifxjdbcx.jar, ifxlang.jar, ifxlsupp.jar,ifxsqlj.jar
Running the example code
esadmin@linux:~/ids2omf> cat set_classpath.sh
export CLASSPATH=/opt/IBM/es/lib/siapi.jar:/opt/IBM/es/lib/es.dl.client.jar:
/opt/IBM/es/lib/es.dl.server.jar:/opt/IBM/es/lib/esapi.jar:/opt/IBM/es/lib/es.oss.jar:
/opt/IBM/informix/jdbc/lib/ifxjdbc.jar:/opt/IBM/informix/jdbc/lib/ifxtools.jar:
/opt/IBM/informix/jdbc/lib/ifxjdbcx.jar:/opt/IBM/informix/jdbc/lib/ifxlang.jar:
/opt/IBM/informix/jdbc/lib/ifxlsupp.jar:/opt/IBM/informix/jdbc/lib/ifxsqlj.jar:.
esadmin@linux:~/ids2omf> . ./set_classpath.sh
esadmin@linux:~/ids2omf> javac ids2omf.java
esadmin@linux:~/ids2omf> java -cp .:$CLASSPATH ids2omf sources.xml
-------------------------------------------------------------------------
Starting Informix to OmniFind utility
-------------------------------------------------------------------------
Input Parameter file:sources.xml
Informix URL: jdbc:informix-sqli://10.0.0.106:40000/stores_demo:
INFORMIXSERVER=demo_on_tcp;user=informix;password=informix
OmniFind Target:10.0.0.106:6668:test:test:col_33762
Source selections in parameter file: 1
Connected to source database.
-------------------------------------------------------------------------
Tables: cust_calls
Where_clause: ""
Key column:customer_num
Data Column: call_descr
Data Type Column: "text/plain"
Push complete: Records selected:7, Records pushed:7
Disconnect from source database.
-------------------------------------------------------------------------
esadmin@linux:~/ids2omf>
|
Deprecated DataListener methods
The DataListener interface has been deprecated at Version 8.3 of WebSphere Information Integrator OmniFind Edition in favor of a more streamlined interface. The Java examples used in this article will still function as documented in Version 8.3; however, in future releases the Search and Index API calls should be used.
See Chapter 4 of the Programming Guide and API Reference V8.3 for more information.
| Description | Name | Size | Download method |
|---|---|---|---|
| Sample Java class file for this article | ids2omf.java | 13KB | HTTP |
| Sample XML file for this article | sources.xml | 2KB | HTTP |
Information about download methods
Learn
-
" Introduction to WebSphere Information Integrator OmniFind Edition"
-
"Use Web services to access WebSphere Information Integrator OmniFind Edition
"
-
"Towards the next generation of enterprise search technology" by A. Z. Broder and A. C. Ciccolo, IBM Systems Journal. Volume 43, Number 3, 2004
-
"Manipulating Data with Regular Expressions"
-
"IBM Search and Index APIs (SIAPI) for WebSphere Information Integrator OmniFind Edition"
-
" Downloadable Bladelets and Demos for Informix"
Get products and technologies
-
Build your next development project with
IBM
trial software, available for download directly from developerWorks.
Prior to working at IBM, Paul was a pre-sales engineer at Informix, specializing in the implementation of DataBlade technology and the use of extended data types (especially spatial). In the last 4 years at IBM, Paul has broadened his focus to include other information management topics such as Enterprise Search and Content Management technologies and their integration into the enterprise.




