Crawling multiple structured JDBC database tables

You can configure the JDBC database crawler to join multiple structured tables that have the same key fields.

About this task

When you configure crawler properties for a JDBC database crawler, you can specify a plug-in for crawling multiple structured tables that are related to each other through key fields. Without this plug-in, rows in a database table are treated like individual documents and the values of the database columns are searchable as individual fields. With this plug-in, rows from multiple tables in a relational database that have the same key fields are joined and treated as a single document. The crawler adds data that it retrieves from the joined tables to the metadata for the original row of a database table. When a user searches the database, this additional data appears as additional fields when the document is displayed in the search results.
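Conceptually, the plug-in behaves like a SQL join over the key fields. The following sketch uses an in-memory SQLite database with hypothetical ORDERS and CUSTOMERS tables (the table and column names are illustrative, not part of the product configuration) to show how rows that share a key field are merged into one document whose fields come from both tables:

```python
import sqlite3

# Hypothetical tables that share the key field ID.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ORDERS   (ID INTEGER, ITEM TEXT);
    CREATE TABLE CUSTOMERS(ID INTEGER, NAME TEXT);
    INSERT INTO ORDERS    VALUES (1, 'printer'), (2, 'scanner');
    INSERT INTO CUSTOMERS VALUES (1, 'Alice'),   (2, 'Bob');
""")

# Without the plug-in, each row of ORDERS would be one document.
# With the plug-in, rows that share ID are joined, and the joined
# columns become additional metadata fields of a single document.
docs = []
for row in conn.execute(
        "SELECT O.ID, O.ITEM, C.NAME "
        "FROM ORDERS O JOIN CUSTOMERS C ON O.ID = C.ID "
        "ORDER BY O.ID"):
    docs.append({"ID": row[0], "ITEM": row[1], "NAME": row[2]})

print(docs[0])  # one document now carries fields from both tables
```
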

Data types that cannot be crawled
The crawler cannot crawl fields in the joined tables that contain these binary data types:
  • BLOB
  • CHARACTER FOR BIT DATA
  • VARCHAR FOR BIT DATA
  • LONG VARCHAR FOR BIT DATA
Limitations on the scope of the crawl space
The tables to be joined must be in the same relational database. You cannot join tables across databases.
If a table in a database is configured to be joined with other tables, this setting is universal for all crawlers in a collection that are enabled to use the plug-in. However, you can create multiple collections and configure separate crawlers to crawl different root tables and join different tables.
Restrictions on the use of other plug-ins
If you configure the crawler to use the plug-in for crawling multiple structured tables, you cannot associate any other plug-in with the crawler. For example, you cannot also specify a custom plug-in for applying business and security rules; a crawler can use only one plug-in.
Restrictions on number of tables, rows, fields, and keys
The maximum number of joined tables per database is five, and the sum of the rows in those tables must be less than 1,000,000. The maximum number of fields that can be read from a table is 10. Each relation joins tables on a single key pair; you cannot join tables by using multiple (composite) keys.
Ensuring that changes in joined tables are crawled
If the rows in a root table do not change between crawls, and the crawler is not configured to do a full crawl, the crawler ignores the unchanged rows. If rows in a table that is joined to the root table change, even though the root table does not, you need to do one of the following actions to ensure that the changes are detected and crawled:
  • Ensure that the root table in the target database has a timestamp field that is updated when a row in the root table changes or when rows in any of the joined subsidiary tables change. When you set up the JDBC database crawler, specify this timestamp field as the field that the crawler uses to determine whether changes in the tables occurred.
  • Specify that the crawler is to do a full crawl when you configure the crawler schedule. This option ensures that all of the tables are crawled each time regardless of whether any changes occurred.
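The first option is typically implemented with a database trigger that "touches" the root row whenever a joined row changes. The following sketch uses SQLite only to illustrate the idea; the table names, the LAST_MODIFIED column, and the trigger are hypothetical, and the trigger syntax for DB2, Oracle, or other databases differs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical root and joined tables that share the key field ID.
    CREATE TABLE ROOT_TABLE (ID INTEGER, TITLE TEXT,
                             LAST_MODIFIED TIMESTAMP DEFAULT CURRENT_TIMESTAMP);
    CREATE TABLE CHILD_TABLE (ID INTEGER, NOTE TEXT);
    INSERT INTO ROOT_TABLE (ID, TITLE) VALUES (1, 'doc one');
    INSERT INTO CHILD_TABLE VALUES (1, 'initial note');

    -- When a joined row changes, update the root row's timestamp so that
    -- an incremental crawl sees the root row as changed. A fixed value is
    -- used here for reproducibility; a real trigger would use the current
    -- timestamp.
    CREATE TRIGGER touch_root AFTER UPDATE ON CHILD_TABLE
    BEGIN
        UPDATE ROOT_TABLE SET LAST_MODIFIED = '2024-01-02 00:00:00'
        WHERE ID = NEW.ID;
    END;
""")

before = conn.execute("SELECT LAST_MODIFIED FROM ROOT_TABLE").fetchone()[0]
conn.execute("UPDATE CHILD_TABLE SET NOTE = 'edited note' WHERE ID = 1")
after = conn.execute("SELECT LAST_MODIFIED FROM ROOT_TABLE").fetchone()[0]
print(before != after)  # True: the root row now looks changed to the crawler
```
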

Procedure

To set up the JDBC database crawler to crawl multiple structured tables:

  1. In the administration console, create a JDBC database crawler. Select only the parent table to be crawled in the crawler configuration.
  2. Log in as the default Watson Content Analytics administrator on the crawler server.
  3. Copy the ES_INSTALL_ROOT/default_config/crawler_rdb_plugin.xml file to create the ES_NODE_ROOT/master_config/crawler_rdb_plugin.xml file.
  4. Edit the ES_NODE_ROOT/master_config/crawler_rdb_plugin.xml file with a text editor that supports XML files with UTF-8 character encoding, such as Notepad. When you edit the configuration file:
    • Enter changes that are appropriate for the database tables that the JDBC database crawler is to crawl.
    • Enter all column names of the database tables in uppercase letters.
    • Save all changes in UTF-8 encoding format.
    1. Edit the <Server DBURL="jdbc:db2://db_server_url:50000/SAMPLE"> element and replace jdbc:db2://db_server_url:50000/SAMPLE with the URL of the JDBC database to be crawled.
    2. If the database to be crawled is not a DB2® database, edit the <JDBCDriver>com.ibm.db2.jcc.DB2Driver</JDBCDriver> element and replace com.ibm.db2.jcc.DB2Driver with the appropriate JDBC driver. For example, for an Oracle database, enter oracle.jdbc.driver.OracleDriver. Be sure to specify the same driver that you specified when you created the crawler.
    3. Edit the <User>username</User> element and replace username with a user ID that has authority to access the database to be crawled.
    4. Edit the <Password Encryption="True">encrypted_password</Password> element and replace encrypted_password with an encrypted password for the specified user ID. You can copy the encrypted password from the ES_NODE_ROOT/master_config/col_collection_name.JDBC_crawler_name/jdbccrawler.xml file and paste it here. If the password does not need to be encrypted, replace Encryption="True" with Encryption="False", and replace encrypted_password with a plain text password.
    5. If you leave the <Delimiters Use="True"> element as is, multiple terms in a column are separated by comma characters (,) defined in the <Delimiter> element. Sets of terms per table are separated by semicolon characters (;) defined in the <SecondDelimiter> element. If you set <Delimiters Use="True"> to <Delimiters Use="False">, delimiter characters are not used and multiple metadata fields with the same field name are added as document metadata.
    6. If you use the <Delimiters Use="True"> element, edit the <Delimiter>,</Delimiter> element and the <SecondDelimiter>;</SecondDelimiter> element to specify the characters that you want to use as value separators.
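The effect of the two delimiters can be sketched as follows. The term values are hypothetical; the sketch only shows how terms within one table and sets of terms per table are flattened into a single field value when <Delimiters Use="True"> is in effect:

```python
# Hypothetical values gathered for one metadata field: each inner list
# holds the terms read from one joined table.
terms_per_table = [["red", "green"], ["small", "large"]]

delimiter = ","         # <Delimiter>: separates terms within one table
second_delimiter = ";"  # <SecondDelimiter>: separates the per-table sets

# With <Delimiters Use="True">, the values collapse into one field value.
flattened = second_delimiter.join(delimiter.join(t) for t in terms_per_table)
print(flattened)  # red,green;small,large

# With <Delimiters Use="False">, each term would instead be added as a
# separate metadata field that reuses the same field name.
```
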
    7. Edit the <RelationMap Root="DB2INST1.TABLE_0"> element and replace DB2INST1.TABLE_0 with the name of a root table that is to be crawled.
    8. Edit the <Relation Parent="DB2INST1.TABLE_0" ParentAlias="T0" ParentKey="ID" Child="DB2INST1.TABLE_1" ChildAlias="T1" ChildKey="ID"/> element.
      • Replace Parent="DB2INST1.TABLE_0" with the name of a table that is a parent in the relation.
      • Replace ParentAlias="T0" with an alias for the parent table. This alias must be unique and not duplicated in the crawler_rdb_plugin.xml file.
      • Replace ParentKey="ID" with the name of a column that is used as a key field in the relation.
      • Replace Child="DB2INST1.TABLE_1" ChildAlias="T1" ChildKey="ID" with information about a child table to be crawled.
      This structure defines how the tables are to be joined. For example, the following relationship map specifies that a root table named DB2INST1.TABLE_A is to be crawled. Tables DB2INST1.TABLE_B and DB2INST1.TABLE_C are joined under the condition DB2INST1.TABLE_A.ID=DB2INST1.TABLE_B.ID AND DB2INST1.TABLE_B.ID=DB2INST1.TABLE_C.ID.
      <RelationMap Root="DB2INST1.TABLE_A">
      <Relation Parent="DB2INST1.TABLE_A" ParentAlias="TA" ParentKey="ID" 
        Child="DB2INST1.TABLE_B" ChildAlias="TB" ChildKey="ID"/> 
      <Relation Parent="DB2INST1.TABLE_B" ParentAlias="TB" ParentKey="ID" 
        Child="DB2INST1.TABLE_C" ChildAlias="TC" ChildKey="ID"/>
      </RelationMap>
      Restriction: All references to table names and field names in the database are case-sensitive. Specify them in the same case as defined in the database.
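      The way a relationship map translates into a join condition can be sketched as follows. The tuples mirror the <Relation> elements in the example above; the helper is hypothetical and only shows how each parent/child key pair contributes one equality to the join:

```python
# Each relation is (parent_table, parent_key, child_table, child_key),
# taken from the <Relation> elements in the example relationship map.
relations = [
    ("DB2INST1.TABLE_A", "ID", "DB2INST1.TABLE_B", "ID"),
    ("DB2INST1.TABLE_B", "ID", "DB2INST1.TABLE_C", "ID"),
]

# Build the join condition that the relationship map describes.
condition = " AND ".join(
    f"{p}.{pk}={c}.{ck}" for p, pk, c, ck in relations
)
print(condition)
```
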
    9. Repeat step 4.h to create <Relation> elements for all relations that join tables from a root table.
    10. Edit the <Target TableAlias="T1"> element and replace TableAlias="T1" with a ChildAlias value that you defined in step 4.h.
    11. Edit the <Field Name="ID" FieldName="ID_1" Enabling="True" Searchable="True" FieldSearchable="True" IsContent="True"/> element.
      • Replace Name="ID" with the name of a column in the documents to be crawled.
      • Replace FieldName="ID_1" with the name of a metadata field in the documents to be crawled. This value is used as the display name for the column in the administration console and the search results.
      • Replace Enabling="True" with "False" if this column is not to be included in the document metadata.
      • Replace Searchable="True" with "False" to prevent users from searching this column with a free text query.
      • Replace FieldSearchable="True" with "False" to prevent users from searching this column by the column name.
      • Replace IsContent="True" with "False" to indicate that the column does not contain searchable content. If you specify Searchable="True" and IsContent="True", then the value of the column is used to detect duplicate documents. The value also becomes part of the dynamic document summary in the search results.
    12. Repeat step 4.k to create <Field> elements for all of the columns that are to be crawled.
    13. Repeat steps 4.j and 4.k to create <Target> and <Field> elements for all of the child tables that are referenced in the relationship map (<RelationMap>).
    14. Repeat steps 4.g through 4.m to create multiple relationship maps for multiple root tables.
    15. Repeat steps 4.a through 4.n to configure a relationship map for another database.
  5. Configure the crawler to use the plug-in:
    1. Open the administration console, edit a collection, and select the Crawl page.
    2. In the administration console, select the crawler that you created and click Configure > Crawler properties.
    3. In the Plug-in class name field, type the name of the plug-in for crawling multiple structured tables:

      com.ibm.es.plugin.rdb.RDBPlugin

    4. In the Plug-in class path field, type the fully qualified paths for the plug-in and the JDBC drivers that are used by the plug-in. For example, the path for the JDBC driver for a DB2 database on a Windows system might be:

      C:\Program Files\IBM\es\lib\plugin_rdb.jar;C:\Program Files\IBM\SQLLIB\java\db2jcc.jar;C:\Program Files\IBM\SQLLIB\java\db2jcc_license_cu.jar

      On an AIX® or Linux system, the path might be:

      /opt/IBM/es/lib/plugin_rdb.jar:/opt/IBM/SQLLIB/java/db2jcc.jar:/opt/IBM/SQLLIB/java/db2jcc_license_cu.jar

  6. Create index fields for the child table:
    1. In the Parse and Index pane, click Configure > Index fields.
    2. Click Create Index Field and create an index field that you defined in the crawler_rdb_plugin.xml file as part of step 4.k. Repeat this step to create other index fields.
  7. To deploy the crawler_rdb_plugin.xml file to the system configuration, restart the Watson Content Analytics system:

    esadmin system stop
    esadmin system start