Building an Enhanced Information Retrieval Solution

Retrieve and Rank's ability to find the best response to a natural language query in a large set of documents makes it a natural companion to the Document Conversion service, which processes formatted documents such as PDFs, Microsoft Word documents, and HTML pages. With the addition of several new developer tools, you can now build an end-to-end, machine-learning-enhanced information retrieval solution that ingests documents from various repositories directly into Retrieve and Rank without writing any code.

Enhanced information retrieval allows you to:

  • Scale your information retrieval solution to hundreds of thousands of documents
  • Focus on getting relevant results rather than document collection and formatting
  • Manage multiple Watson Developer Cloud services from one set of tools without using the Bluemix user interface

Enhanced Information Retrieval workflow

We've made it easy to create this end-to-end solution by using these tools:

  • Kale is a command line tool that helps you quickly create, configure, and manage the Watson services you need to gather documents and query them: Retrieve and Rank and Document Conversion. You will need your Bluemix ID and password to use Kale.

  • Data Crawler is a command line tool that helps you take your documents from the repositories where they reside (for example, file shares, databases, or Microsoft SharePoint®) and push them to the cloud to automatically create a Retrieve and Rank index.

  • Additionally, there is a web-based tool designed to help users get started with Retrieve and Rank and Document Conversion quickly, train a ranker, and easily evaluate improvements from one ranker to the next.

    Note: If you have created a custom schema for your service, this user interface will not work in its current iteration.

Prerequisites and Requirements

Prerequisite:

  • You need a Bluemix account. If you don't already have one, you can create it here.

System requirements:

  • Kale:

    • Java Runtime Environment version 6 or higher. Version 8 or higher is recommended for optimum security.
    • Kale can be run on any operating system that can run the above JRE. For a list of operating systems Kale has been validated on, see https://github.com/IBM-Watson/kale.
  • Data Crawler:

    • Java Runtime Environment version 8 or higher

      Note: Your JAVA_HOME environment variable must be set correctly, or not be set at all, in order to run the Crawler.

    • Red Hat Enterprise Linux 6 or 7, or Ubuntu Linux 15 or 16. For optimal performance, it is strongly recommended that the Data Crawler run on its own Linux instance, whether that is a virtual machine, a container, or dedicated hardware.

    • Minimum 2 GB RAM on the Linux system

What to do next

Download and install Kale

Downloading and installing Kale

Kale is a command line tool that helps you quickly create and configure the Document Conversion and Retrieve and Rank services.

To download and install it:

  1. Click here to download and save the kale-x.y.z-standalone.jar file to the location you prefer.

  2. Set up a command-line alias (shortcut) for Kale.

    Use the following as an example. <install-directory> is a variable that represents the pathname of the installation directory.

    kale='java -jar <install-directory>/kale-x.y.z-standalone.jar'
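
    For example, on Linux or macOS you could make the alias permanent by adding it to your shell profile and then checking that it works. The installation path below is only a placeholder; use your own location and the actual version number of the downloaded JAR file.

    alias kale='java -jar /opt/kale/kale-x.y.z-standalone.jar'
    kale help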
    

Logging into Kale

Note: The command kale help {name of command} will display the help for that command.

Note: Enter all commands in lower case. You do not need to enter the curly brackets ({}).

Note: If you are in the Kale working directory and switch to another directory, you will need to login to Kale again.

  1. Open a terminal or command line window.

  2. Enter kale login.

  3. Kale displays the endpoint for the APIs it will use. The default endpoint for Watson services in the public version of Bluemix is https://api.ng.bluemix.net. If the displayed endpoint is correct, press Return. If not, enter the desired endpoint and press Return.

  4. When prompted, enter your Username and Password. Use the same login information you use for Bluemix.

    Note: IBM employees may have a Bluemix password that differs from their IBM password. Alternatively, if an IBM employee has single sign-on enabled, Kale will prompt you to open a URL to obtain a passcode you can use to log in.

After you have logged in, the Current Environment values will display:

  • User
  • Endpoint
  • Organization
  • Space

What to do next

Creating and Configuring Services with Kale

Creating and configuring services with Kale

Kale helps you quickly create, configure, and manage the Watson services you need to gather documents and query them. There are several commands available in Kale, but if you'd like to run all of the commands in one step, you can use the kale assemble command.

In a single operation, this command creates:

  • A Document Conversion service
  • A Retrieve and Rank service
  • A Solr cluster
  • A Solr collection

It also sets the language of the collection.

Note: Enter kale help for a list of all the Kale commands. Enter kale help {name of command} for the help for that command.

To create the services, cluster, and collection without setting the cluster size, enter:

kale assemble {base-name} {language}

Note: If {cluster-size} is not specified, kale assemble will use the "free" 300 MB cluster when creating the Solr cluster. You can only create one "free" cluster. A new space will be created to store the components.

Note: For information on cluster pricing, see Sizing your Retrieve and Rank Cluster and the Pick a plan page on Bluemix.

  • {base-name} will be the name used for your services. It does not need to contain a hyphen, but it must not include spaces.
  • {language} will be the language of your document collection. Available languages are: english, german, spanish, arabic, brazilian, french, italian, japanese. (The languages are listed in lower case because they must be entered that way.)

To create the services, collection, and a cluster of a specific size, enter:

kale assemble {base-name} {language} {cluster-size}

Services can be provisioned using the Premium plan by setting the premium flag.

kale assemble {base-name} {language} --premium

Note: Premium provisioning is not currently available for the Retrieve and Rank service. Even if the premium flag is set, Retrieve and Rank will be provisioned using the Standard plan.
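
For example, to create both services, a "free" Solr cluster, and an English-language collection in one step, using a hypothetical base name of watson-docs, you would enter:

kale assemble watson-docs english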

After you have run kale assemble, enter kale list to display the details:

  • document_conversion_service: name
  • retrieve_and_rank_service: name
  • cluster: name
  • solr_configuration: language
  • collection: name

Note:

If kale assemble fails at any stage, it rolls back to the original state by deleting the services and space it created. One reason kale assemble could fail is that you have specified a base-name that duplicates an existing service, space, or cluster.

If you prefer to use individual Kale commands (instead of kale assemble), see Additional Kale Commands.

What to do next

Test the conversion of a few documents with Kale

Additional Kale commands

If you prefer to use individual Kale commands (instead of kale assemble) to create your Document Conversion service, Retrieve and Rank service, Solr cluster, Solr collection, and set the language of the collection, you can follow these steps.

Note: Enter kale help for a list of all the Kale commands. Enter kale help {name of command} for the help for that command.

Note: If you do not know the names of your organizations or spaces, enter kale list organizations or kale list spaces.

Select an organization (if you only have one organization in Bluemix, skip this step):

  1. Enter kale list organizations to see the organizations that are available to you.
  2. Enter kale select organization {organization} to select an organization. You cannot create a new organization with Kale.

Select or create a space (if you only have one space in Bluemix, skip this step):

  1. Enter kale list spaces to see the spaces you have available.
  2. Enter kale select space {space} to select a space. Alternatively, to create a new space, enter kale create space {space_name}; if you create a new space, it is automatically selected.

Use Kale to create instances of two Watson services: Document Conversion and Retrieve and Rank.

  1. The Document Conversion service takes your documents and converts them into a format the Retrieve and Rank service can use. Create an instance of the service as follows:

    kale create document_conversion {service_name}
    
  2. The Retrieve and Rank service provides search capabilities. Create an instance of the service as follows:

    kale create retrieve_and_rank {service_name}
    

After your Retrieve and Rank service instance is created, you need to do a small bit of configuration.

  1. Create a cluster. A Solr cluster manages your search collections. Enter kale create cluster {cluster_name} --wait. The command displays a spinning "…" indicator until the cluster is created.

    Creating a cluster can take some time. To check the status of the cluster creation, enter kale list services. If the cluster is created, the message will be status: READY. If the cluster is not yet ready, the message will be status: NOT_AVAILABLE.

  2. Create a Solr configuration. A Solr configuration identifies how to index your documents so you can search the important fields. The configuration we create identifies the language of the documents in your collection. A collection can contain documents in only a single language. Enter kale create solr-configuration {language} where {language} is one of the following: english, german, spanish, arabic, brazilian, french, italian, japanese. (The languages are listed in lower case because they must be entered that way.)

  3. A collection is the location of your data in the cloud. Only one collection is required. To create a collection, enter kale create collection {collection_name}.
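
Putting the individual commands together, a complete sequence might look like the following; the service, cluster, and collection names here are hypothetical placeholders.

    kale create document_conversion my-dc
    kale create retrieve_and_rank my-rnr
    kale create cluster my-cluster --wait
    kale create solr-configuration english
    kale create collection my-collection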

You have now finished creating and configuring instances of the Document Conversion and Retrieve and Rank services.

Enter kale list to display the details:

  • document_conversion_service: name
  • retrieve_and_rank_service: name
  • cluster: name
  • solr_configuration: language
  • collection: name

What to do next

Test the conversion of a few documents with Kale

Testing the conversion of documents with Kale

This step is optional. It is a quick test to check how the Document Conversion service will convert your documents. It is not meant to convert all of your documents, and the converted test documents will not be used by the Retrieve and Rank service. In production, your documents will be uploaded with the Data Crawler and converted with the Document Conversion service.

You can convert one or several documents by entering:

kale dry-run {file1} {file2} {file3} ...

Note: Supported file types are PDFs, Microsoft Word documents, and HTML pages.

The converted documents will be JSON files that can be found in the converted directory of your current working directory. The files in the converted directory will include the complete file path of your test document(s). You can open and review the JSON files, which will include the conversion schema plus the document text.
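
For example, to test the conversion of two hypothetical files in your current directory, you would enter:

kale dry-run manual.pdf faq.html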

What to do next

Create the Data Crawler configuration with Kale

Creating the Data Crawler configuration with Kale

The Data Crawler requires two configuration files in order to convert your documents into the format required by the Watson services you created earlier. You create these configuration files with Kale. After the Data Crawler is installed, you will copy these files into the config directory of the Data Crawler working directory.

  1. Enter kale create crawler-configuration.

  2. The following files will be created in the local directory:

    • orchestration_service.conf
    • orchestration_service_config.json

What to do next

Downloading and installing the Data Crawler

Downloading and installing the Data Crawler

The Data Crawler collects the raw data that is eventually used to form search results for the Retrieve and Rank service. When crawling data repositories, the Crawler downloads documents and metadata, starting from a user-specified seed URL. The Crawler discovers documents in a hierarchy or otherwise linked from the seed URL and enqueues these for retrieval.

Prerequisites

See Building an Enhanced Information Retrieval Solution.

Downloading and installing the Data Crawler

  1. Open a browser and log into your Bluemix account.

  2. From your Bluemix Dashboard, select the Retrieve and Rank service you previously created with Kale.

  3. Click the Download Data Crawler link to download the Data Crawler.

  4. As an administrator, use the appropriate commands to install the archive file that you downloaded:

    • On systems such as Red Hat and CentOS that use rpm packages, use a command such as the following:

    rpm -i /full/path/to/rpm/package/rpm-file-name
    
    • On systems such as Ubuntu and Debian that use deb packages, use a command such as the following:

    dpkg -i /full/path/to/deb/package/deb-file-name
    
    • The Crawler scripts are installed into {installation_directory}/bin; for example, /opt/ibm/crawler/bin. Ensure that {installation_directory}/bin is in your PATH environment variable for the Crawler commands to work correctly.

    Note: Crawler scripts are also installed to /usr/local/bin, so this can be added to your PATH environment variable as well.
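
    For example, assuming the installation directory shown above, you could add the Crawler's bin directory to your PATH for the current shell session as follows (adjust the path, and add the line to your shell profile to make it permanent):

    export PATH="$PATH:/opt/ibm/crawler/bin"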

Known limitations in this release

  • The Data Crawler may hang when running the Filesystem connector with an invalid or missing URL.
  • Configure the urls_to_filter value in the crawler.conf file so that all of the whitelisted URLs or patterns are included in a single regular expression. See Configuring crawl options for more information.
  • The path to the configuration file passed in the --config | -c option must be a qualified path; that is, a relative path such as config/crawler.conf or ./crawler.conf, or an absolute path such as /path/to/config/crawler.conf. Specifying just crawler.conf is only possible if the orchestration_service.conf file is in-lined instead of referenced using include in the crawler.conf file.
  • The Data Crawler is capable of ingesting data at volumes sufficiently large to trigger a known issue in the Retrieve and Rank service. The issue may cause some documents not to be indexed. The Retrieve and Rank service team is actively investigating this issue as of this release.

Setting up the Data Crawler

To set up the Data Crawler to crawl your repository, you must specify the appropriate input adapter in the crawler.conf file, and then configure repository-specific information in the input adapter configuration files.

First, you must copy the contents of the {installation_directory}/share/examples/config directory to a working directory on your system, for example /home/config.

Warning: Do not modify the provided configuration example files directly. Copy and then edit them. If you edit the example files in-place, your configuration may be overwritten when upgrading the Data Crawler, or may be removed when uninstalling it.

Note: References in this guide to files in the config directory, such as config/crawler.conf, refer to that file in your working directory, and NOT in the installed {installation_directory}/share/examples/config directory.

The specified values below are the defaults in config/crawler.conf, and configure the Filesystem connector:

  1. First, verify that you are running Java Runtime Environment version 8 or higher. Run the command java -version, and look for 1.8. If you are running something earlier than 1.8, you need to upgrade Java by installing the Java Developer Kit (JDK) 8 from your package management system, from the IBM JDK website, or from java.com.

    Note: Your JAVA_HOME environment variable must be set correctly, or not be set at all, in order to run the Crawler.

  2. Copy the configuration files you created with Kale into the config directory; for example:

    cp orchestration_service.conf config/orchestration
    cp orchestration_service_config.json config/orchestration
    
  3. Open the config/crawler.conf file in a text editor.

    • Set the crawl_config_file option to connectors/filesystem.conf.
    • Set the crawl_seed_file option to seeds/filesystem-seed.conf.

    Save and close the config/crawler.conf file.

  4. Open the seeds/filesystem-seed.conf file in a text editor. Modify the value attribute directly under the name="url" attribute to the file path that you want to crawl. For example:

    value="sdk-fs:///TMP/MY_TEST_DATA/"
    

    Fast path: For the "fast path" sample data crawl, the file path that you want to crawl is the directory in which your PDFs are located. The JSON versions of your PDFs that were produced by your document conversion testing are not used.

    Save and close the file.

  5. The other options in this file are set to good defaults.

  6. If necessary, you can edit the connectors/filesystem.conf and seeds/filesystem-seed.conf files. However, the defaults provided in these files do not generally need to be modified. After modifying these files, you are ready to crawl your data. Proceed to Crawling your data repository to continue.
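
As a recap of steps 3 and 4, the edited lines in the two files might look like the following sketch. The key = "value" form matches the sample configuration files, the # lines simply label which file each fragment belongs to (they are not part of the file contents), and the crawl path is a placeholder.

    # config/crawler.conf
    crawl_config_file = "connectors/filesystem.conf"
    crawl_seed_file = "seeds/filesystem-seed.conf"

    # config/seeds/filesystem-seed.conf (under the name="url" attribute)
    value="sdk-fs:///TMP/MY_TEST_DATA/"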

Configuring crawl options

Fast path: Configuring these options is not necessary for the default "fast path" sample data crawl.

The file config/crawler.conf contains information that tells the Data Crawler which files to use for its crawl (input adapter), where to send the collection of crawled files once the crawl has been completed (output adapter), and other crawl management options.

Note: All file paths are relative to the config directory, except where noted.

Important: To access the in-product manual for the crawler.conf file, with the most up-to-date information, type the following command from the Crawler installation directory:

man crawler.conf

The options that can be set in this file are:

Input Adapter

  • class - Internal use only; defines the Data Crawler input adapter class. Currently, the only input adapter that exists is the default value of: com.ibm.watson.crawler.connectorframeworkinputadapter.Crawl

  • config - Internal use only; defines the connector framework configuration. The default configuration key within this block to pass to the chosen input adapter is: connector_framework

    Note: The connector framework is what allows you to talk to your data. It could be internal data within the enterprise, or it could be external data on the web or in the cloud. The connectors allow access to a number of different data sources, while connecting is actually controlled by the crawling process.

    Important: Data retrieved by the Connector Framework Input Adapter is cached locally. It is not stored encrypted. By default, the data is cached to a temporary directory that should be cleared on reboot, and should be readable only by the user who executed the crawler command.

    There is a chance that this directory could outlive the crawler if the connector framework were to go away before it could clean up after itself. Carefully consider the location for your cached data - you may put it on an encrypted filesystem, but be aware of the performance implications of doing so. Only you can decide the appropriate balance between speed and security for your crawls.

  • crawl_config_file - The configuration file to use for the crawl. Default value is: connectors/filesystem.conf

  • crawl_seed_file - The crawl seed file to use for the crawl. Default value is: seeds/filesystem-seed.conf

  • id_vcrypt_file - Keyfile used for data encryption by the Crawler; the default key included with the crawler is id_vcrypt. Use the vcrypt script in the bin folder if you need to generate a new id_vcrypt file.

  • crawler_temp_dir - The Crawler temporary folder for connector logs. Default value, tmp, is provided. If it doesn't already exist, the tmp folder will be created in the current working directory.

  • extra_jars_dir - Adds a directory of extra JARs to the connector framework classpath.

    Note: The path is relative to the connector framework lib/java directory. This value must be oakland when using the SharePoint connector, and database when using the Database connector (see the example at the end of this list). You can leave this value empty (that is, an empty string "") when using other connectors.

  • urls_to_filter - Whitelist of URLs to crawl, in regular expression form. The Data Crawler only crawls URLs which match one of the regular expressions provided.

    The domain list contains the most common top-level domains; add to it if necessary.

    The file extension-type list contains the file extensions that the Orchestration Service supports, as of this release of the Data Crawler.

    Ensure that your seed URL domain is allowed by the filter. For example, if the seed URL looks like http://testdomain.test.in, add "in" to the domain filter.

    Ensure that your seed URL will not be excluded by a filter, or the Crawler may hang.

  • max_text_size - The maximum size, in bytes, that a document can be before it is written to disk by the Connector Framework. Adjusting this higher decreases the number of documents written to disk, but increases the memory requirement. Default value is 1048576

  • extra_vm_params - Allows you to add extra Java parameters to the command used to launch the Connector Framework.

  • bootstrap_logging - Writes connector framework startup log; useful for advanced debugging only. Possible values are true or false. Log file will be written to crawler_temp_dir
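
For example, if you plan to use the Database connector described later in this guide, the extra_jars_dir entry would point at the database directory of extra JARs, shown here in the same key = "value" form used by the sample crawler.conf:

    extra_jars_dir = "database"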

Output Adapter

There are two output adapters to choose from. Select the output adapter by setting the class option.

  • class - Internal use only; defines the Data Crawler output adapter class. The default value is: com.ibm.watson.crawler.orchestrationserviceoutputadapter.oneatatime.OrchestrationServiceOutputAdapter.

    Note: You can also set the value of this class to com.ibm.watson.crawler.testoutputadapter.TestOutputAdapter in order to run the Test Output Adapter.

  • config - Defines which configuration key to pass to the output adapter. The string must correspond to a key within this configuration object. In the following code example:

     orchestration_service {
         include "orchestration_service.conf"
     },
     test {
         output_directory = "/tmp/crawler-test-output"
     }
    

    the configuration key is orchestration_service.

    Note: You may also set this value to test instead of orchestration_service. If config were to be set to test in this example, instead of orchestration_service, the test output directory would be /tmp/crawler-test-output.

  • TestOutputAdapter - The Test Output Adapter writes a representation of the crawled files to disk in a specified location. To configure Data Crawler to use TestOutputAdapter, you need to set the value of class to com.ibm.watson.crawler.testoutputadapter.TestOutputAdapter, and the value of config must be test.

  • retry - Specifies the options for retry in case of failed attempts to push to the output adapter.

    • max_attempts - Maximum number of retry attempts. Default value is 4
    • delay - Minimum amount of delay between attempts, in seconds. Default value is 1
    • exponent_base - Factor that determines the growth of the delay time over each failed attempt. Default value is 2

    The formula is:

    d(nth_retry) = delay * (exponent_base ^ nth_retry)
    

    For example, the default settings with a delay of 1 second and an exponent base of 2, will cause the second retry - the third attempt - to delay 2 seconds instead of 1, and the next to delay 4 seconds.

    d(0) = 1 * (2 ^ 0) = 1 second
    d(1) = 1 * (2 ^ 1) = 2 seconds
    d(2) = 1 * (2 ^ 2) = 4 seconds
    d(3) = 1 * (2 ^ 3) = 8 seconds
    

    So, with the default settings, a submission will be attempted up to five times, waiting up to approximately 15 seconds. This time is approximate because there is additional time added in order to avoid having multiple resubmissions execute simultaneously. This "fuzzed" time is up to 10%, so the last retry in the previous example could delay up to 8.8 seconds. The wait time does not include the time spent connecting to the service, uploading data, or waiting for a response.
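
For example, to switch from the default orchestration output adapter to the Test Output Adapter described above, the relevant output adapter settings would be along the lines of the following sketch; the exact nesting follows the sample crawler.conf, so check that file or man crawler.conf for the precise structure.

    class = "com.ibm.watson.crawler.testoutputadapter.TestOutputAdapter"
    config = "test"

    test {
      output_directory = "/tmp/crawler-test-output"
    }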

Additional crawl management options

  • full_node_debugging - Activates debugging mode; possible values are true or false.

    Warning: This will put the full data of every document crawled into the logs.

  • logging.log4j.configuration_file - The configuration file to use for logging. In the sample crawler.conf file, this option is defined in logging.log4j and its default value is log4j_custom.properties. This option must be similarly defined whether using a .properties or .conf file.

  • shutdown_timeout - Specifies the timeout value, in minutes, before shutting down the application. Default value is 10.

  • output_limit - The highest number of indexable items that the Crawler will try to send simultaneously to the output adapter. This can be further limited by the number of cores available to do the work. In other words, at any given point there will be no more than this number of indexable items sent to the output adapter and waiting to return. Default value is 10.

  • input_limit - Limits the number of URLs that can be requested from the input adapter at one time. Default value is 3.

  • output_timeout - The amount of time, in seconds, before the Data Crawler gives up on a request to the output adapter, and then removes the item from the output adapter queue to allow more processing. Default value is 610.

    Note: Consideration should be given to the constraints imposed by the output adapter, as those constraints may relate to the limits defined here. The output_limit defined above only relates to how many indexable objects can be sent to the output adapter at once. Once an indexable object is sent to the output adapter, it is "on the clock," as defined by the output_timeout variable. It is possible that the output adapter itself has a throttle preventing it from being able to process as many inputs as it receives. For instance, the orchestration output adapter may have a connection pool, configurable for HTTP connections to the service. If it defaults to 8, for example, and if you set the output_limit to a number greater than 8, then you will have processes, on the clock, waiting for a turn to execute. You may then experience timeouts.

  • num_threads - The number of parallel threads that can be run at one time. This value can be either an integer, which specifies the number of parallel threads directly, or it can be a string, with the format "xNUM", specifying the multiplication factor of the number of available processors, for example, "x1.5". The default value is "30"

Configuring connector and seed options

When crawling data, the Crawler first identifies the type of data repository (connector) and the user-specified starting location (seed) to begin downloading information.

Important: When using the Data Crawler, data repository security settings are ignored.

Seeds are the starting points of a crawl, and are used by the Data Crawler to retrieve data from the resource that is identified by the connector. Typically, seeds configure URLs to access protocol-based resources such as fileshares, SMB shares, databases, and other data repositories that are accessible by various protocols. Moreover, different seed URLs have different capabilities. Seeds can also be repository-specific, to enable crawling of specific third-party applications such as customer relationship management (CRM) systems, product life cycle (PLC) systems, content management systems (CMS), cloud-based applications, and web database applications.

To crawl your data correctly, you must ensure that the Crawler is properly configured to read your data repository. The Data Crawler provides connectors to support data collection from the following repositories, each described in its own section below: local filesystems, databases, CMIS-enabled content management systems, SMB/CIFS/Samba fileshares, and Microsoft SharePoint.

A connector configuration template is also provided, which allows you to customize a connector.

Important: To access the in-product manual for the connector and seed configuration files, with the most up-to-date information, type the following commands from the Crawler installation directory:

  • For connector configuration options:

    man crawler-options.conf
    
  • For crawl seed configuration options:

    man crawler-seed.conf
    

Configuring Filesystem crawl options

The Filesystem Connector allows you to crawl files local to the Data Crawler installation.

Configuring the Filesystem Connector

Following are the basic configuration options that are required to use the Filesystem connector. To set these values, open the file config/connectors/filesystem.conf, and modify the following values specific to your use cases:

  • protocol - The name of the connector protocol used for the crawl. Use sdk-fs for this connector.
  • collection - This attribute is used to unpack temporary files. The default value is crawler-fs
  • logging-config - Specifies the file used for configuring logging options; it must be formatted as a log4j XML string.
  • classname - Java class name for the connector. The value to use this connector must be plugin:filesystem.plugin@filesystem.

Configuring the Filesystem Crawl Seed

The following values can be configured for the Filesystem crawl seed file. To set these values, open the file config/seeds/filesystem-seed.conf and specify the following values specific to your use cases:

  • url - Newline-separated list of files and folders to crawl. UNIX users can use a path such as /usr/local/.

    Note: The URLs must start with sdk-fs://. So to crawl, for example, /home/watson/mydocs, the value of this URL would be sdk-fs:///home/watson/mydocs - the third / is necessary!

    Tip: The filesystems used by Linux, UNIX, and UNIX-like computer systems can contain special types of files, such as block and character device nodes and files that represent named pipes, which cannot be crawled because they do not contain data, but serve as device or I/O access points. Attempting to crawl such files will generate errors during the crawl. To avoid such errors, you should exclude the /dev directory in any top-level crawl on a Linux, UNIX, or UNIX-like filesystem. If present on the system that you are crawling, you should also exclude temporary system directories such as /proc, /sys, and /tmp, that contain transient files and system information.

  • hops - Internal use only.

  • default-allow - Internal use only.

Configuring Database crawl options

The database connector allows you to crawl a database by executing a custom SQL command and creating one document per row (record) and one content element per column (field). You can specify a column to be used as a unique key, as well as a column containing a timestamp representing the last-modification date of each record. The connector retrieves all records from the specified database, and can also be restricted to specific tables, joins, and so on in the SQL statement.

The Database connector allows you to crawl the following databases:

  • IBM DB2
  • MySQL
  • Oracle
  • PostgreSQL
  • Microsoft SQL Server
  • Sybase
  • Other SQL-compliant databases, via a JDBC 3.0-compliant driver

The connector retrieves all records from the specified database and table.

JDBC Drivers - The Database connector ships with Oracle JDBC (Java Database Connectivity) driver version 1.5. All third-party JDBC drivers shipped with the Data Crawler are located in the connectorFramework/crawler-connector-framework-#.#.#/lib/java/database directory of your Data Crawler installation, where you can add, remove, and modify them as necessary. You can also use the extra_jars_dir setting in the crawler.conf file to specify another location.

DB2 JDBC Drivers - The Data Crawler does not ship with the JDBC drivers for DB2 due to licensing issues. However, all DB2 installations in which you have installed JDBC support include the JAR files that the Data Crawler requires, in order to be able to crawl a DB2 installation. To crawl a DB2 instance, you must copy these files into the appropriate directory in your Data Crawler installation so that the Database connector can use them. To enable the Data Crawler to crawl a DB2 installation, locate the db2jcc.jar and license (typically, db2jcc_license_cu.jar) JAR files in your DB2 installation, and copy those files to the connectorFramework/crawler-connector-framework-#.#.#/lib/java/database subdirectory of your Data Crawler installation directory, or you can use the extra_jars_dir setting in the crawler.conf file to specify another location.

MySQL JDBC Drivers - The Data Crawler does not ship with the JDBC drivers for MySQL because of possible license issues if they were delivered as part of the product. However, downloading the JAR file that contains the MySQL JDBC drivers and integrating that JAR file into your Data Crawler installation is quite easy to do:

  1. Use a web browser to visit the MySQL download site, and locate the source and binary download link for the archive format that you want to use (typically zip for Microsoft Windows systems or a gzipped tarball for Linux systems). Click that link to initiate the download process. Registration may be required.
  2. Use the appropriate unzip archive-file-name or tar zxf archive-file-name command to extract the contents of that archive, based on the type and name of the archive file that you download.
  3. Change to the directory that was extracted from the archive file, and copy the JAR file from this directory to the connectorFramework/crawler-connector-framework-#.#.#/lib/java/database subdirectory of your Data Crawler installation directory, or you can use the extra_jars_dir setting in the crawler.conf file to specify another location.

Configuring the Database Connector

Following are the basic configuration options that are required to use the Database connector. To set these values, open the file config/connectors/database.conf and modify the following values specific to your use cases:

  • protocol - The name of the connector protocol used for the crawl. The value for this connector is based on the database system to be accessed.
  • collection - This attribute is used to unpack temporary files.
  • classname - Java class name for the connector. The value to use this connector must be plugin:database.plugin@database.
  • logging-config - Specifies the file used for configuring logging options; it must be formatted as a log4j XML string.

Configuring the Database Crawl Seed

The following values can be configured for the Database crawl seed file. To set these values, open the file config/seeds/database-seed.conf and specify the following values specific to your use cases:

  • url - The URL of the table or view to retrieve. Defines your custom SQL database seed URL. The structure is:

    • database-system://host:port/database?[per=num]&[sql=SQL]

    Testing a seed URL will show all of the enqueued URLs. For example, testing the following URL for a database containing 200 records:

    • sqlserver://test.mycompany.com:1433/WWII_Navy/?per=100&sql=select_*_from_vessel&

    shows the following enqueued URLs:

    • sqlserver://test.mycompany.com:1433/WWII_Navy/?key-val=0&
    • sqlserver://test.mycompany.com:1433/WWII_Navy/?key-val=100&
    • sqlserver://test.mycompany.com:1433/WWII_Navy/?key-val=200&

    In contrast, testing the following URL will show the data retrieved from row 43:

    • sqlserver://test.mycompany.com:1433/WWII_Navy/?per=1&key-val=43
  • hops - Internal use only.

  • default-allow - Internal use only.

  • user-password - Credentials for the database system. The username and password need to be separated by a :, and the password must be encrypted using the vcrypt program shipped with the Data Crawler. For example username:[[vcrypt/3]]passwordstring.

  • max-data-size - Maximum size of the data for a document. This is the largest block of memory that will be loaded at one time. Only increase this limit if you have sufficient memory on your computer.

  • filter-exact-duplicates - Internal use only.

  • timeout - Internal use only.

  • jdbc-class (Extender option) - When specified, this string will override the JDBC class used by the connector when (other) is chosen as the database system.

  • connection-string (Extender option) - When specified, this string will override the automatically generated JDBC connection string. This allows you to provide more detailed configuration about the database connection, such as load-balancing or SSL connections. For example: jdbc:netezza://127.0.0.1:5480/databasename

  • save-frequency-for-resume (Extender option) - Specifies the name of a column or associated label, in order to be able to resume a crawl or do a partial refresh. The seed saves the name of this column at regular intervals as it proceeds with the crawl, and saves it again once the last row of your database has been crawled. When resuming or refreshing the crawl, the crawl begins with the row that is identified by the saved value for this field.

Configuring CMIS crawl options

The CMIS (Content Management Interoperability Services) connector lets you crawl CMIS-enabled CMS (Content Management System) repositories, such as Alfresco, Documentum, or IBM Content Manager, and index the data that they contain.

Configuring the CMIS Connector

Following are the basic configuration options that are required to use the CMIS connector. To set these values, open the file config/connectors/cmis.conf and specify the following values specific to your use cases:

  • protocol - The name of the connector protocol used for the crawl. The value to use this connector must be cmis.

  • collection - This attribute is used to unpack temporary files.

  • dns - Unused option.

  • classname - Java class name for the connector. Use plugin:cmis-v1.1.plugin@connector for this connector.

  • logging-config - Specifies the file used for configuring logging options; it must be formatted as a log4j XML string.

  • endpoint - The service endpoint URL of a CMIS-compliant repository. For example, the URL structures for SharePoint are:

    • For AtomPub binding: http://yourserver/_vti_bin/cmis/rest?getRepositories
    • For WebServices binding: http://yourserver/_vti_bin/cmissoapwsdl.aspx
  • username - The user name of the CMIS repository user used to access the content. This user must have access to all the target folders and documents to be crawled and indexed.

  • password - Password of the CMIS repository user used to access the content. The password must NOT be encrypted; it should be given in plain text.

  • repositoryid - The ID of the CMIS repository used to access the content for that specific repository.

  • bindingtype - Identifies what type of binding is to be used to connect to a CMIS repository. Value is either AtomPub or WebServices.

  • authentication - Identifies what type of authentication mechanism to use while contacting a CMIS-compatible repository: Basic HTTP Authentication, NTLM, or WS-Security(Username token).

  • enable-acl - Enables retrieving ACLs for crawled data. If you are not concerned about security for the documents in this collection, disabling this option will increase performance by not requesting this information with the document and not retrieving and encoding this information. Value is either true or false.

  • user-agent - A header sent to the server when crawling documents.

  • method - The method (GET or POST) by which parameters will be passed.

  • url-logging - The extent to which crawled URLs are logged. Possible values are:

    • full-logging - Log all information about the URL.
    • refined-logging - Only log the information necessary to browse the crawler log and for the connector to function correctly; this is the default value.
    • minimal-logging - Only log the minimum amount of information necessary for the connector to function correctly.

    Setting this option to minimal-logging reduces the size of the logs and gives a slight performance increase, because less data is written to the logs.

  • ssl-version - Specifies a version of SSL to use for HTTPS connections. By default the strongest protocol available is used.

Configuring the CMIS Crawl Seed

The following values can be configured for the CMIS crawl seed file. To set these values, open the file config/seeds/cmis-seed.conf and modify the following values specific to your use cases:

  • url - The URL of a folder from the CMIS repository to be used as a starting point of the crawl, for example: cmis://alfresco.test.com:8080/alfresco/cmisatom?folderToProcess=workspace://SpacesStore/guid

    To crawl from the root folder, you need to give the URL as: cmis://alfresco.test.com:8080/alfresco/cmisatom?folderToProcess=

  • at - Unused option.

  • default-allow - Internal use only.

Configuring SMB/CIFS/Samba crawl options

The Samba connector allows you to crawl Server Message Block (SMB) and Common Internet Filesystem (CIFS) fileshares. This type of fileshare is common on Windows networks, and is also provided through the open source project Samba.

Configuring the Samba Connector

Following are the basic configuration options that are required to use the Samba connector. To set these values, open the file config/connectors/samba.conf and specify the following values specific to your use cases:

  • protocol - The name of the connector protocol used for the crawl. The value to use this connector is smb.

  • collection - This attribute is used to unpack temporary files.

  • classname - Java class name for the connector. The value to use this connector must be plugin:smb.plugin@connector.

  • logging-config - Specifies the file used for configuring logging options; it must be formatted as a log4j XML string.

  • username - The Samba username to authenticate with. If provided, domain and password must also be provided. If not provided, the guest account is used.

  • password - The Samba password to authenticate with. If the username is provided, this is required. Password must be encrypted using the vcrypt program shipped with the Data Crawler.

  • archive - Enables the Samba connector to crawl and index files that are compressed within archive files. Value is either true or false; default value is false.

  • max-policies-per-handle - Specifies the maximum number of Local Security Authority (LSA) policies that can be opened for a single RPC handle. These policies define the access permissions that are required to query or modify a particular system under various conditions. The default value for this option is 255.

  • crawl-fs-metadata - Turning on this option will cause the Data Crawler to add a VXML document containing the available filesystem metadata about the file (creation date, last modified date, file attributes, etc.).

  • enable-arc-connector - Unused option.

  • disable-indexes - Newline-separated list of indexes to disable, which may result in a faster crawl, for example:

    • disable-url-index
    • disable-error-state-index
    • disable-crawl-time-index
  • exact-duplicates-hash-size - Sets the size of the hash table used for resolving exact duplicates. Be very careful when modifying this number. The value that you select should be prime, and larger sizes can provide faster lookups but will require more memory, while smaller sizes can slow down crawls but will substantially reduce memory usage.

  • user-agent - Unused option.

  • timeout - Unused option

  • n-concurrent-requests - The number of requests that will be sent in parallel to a single IP address. The default is 1.

  • enqueue-persistence - Unused option.

Configuring the Samba Crawl Seed

The following values can be configured for the Samba crawl seed file. To set these values, open the file config/seeds/samba-seed.conf and specify the following values specific to your use cases:

  • url - A newline-separated list of shares to crawl, for example:

    smb://share.test.com/office
    smb://share.test.com/cash/money/change
    smb://share.test.com/C$/Program Files
    
  • hops - Internal use only.

  • default-allow - Internal use only.

Configuring SharePoint crawl options

Important: The SharePoint connector requires Microsoft SharePoint Server 2007 (MOSS 2007), SharePoint Server 2010, SharePoint Server 2013, or SharePoint Online.

The SharePoint connector allows you to crawl SharePoint objects and index the information that they contain. An object such as a document, user profile, site collection, blog, list item, membership list, directory page, and more, can be indexed with its associated metadata. For list items and documents, indexes can include attachments.

Note: The SharePoint connector respects the noindex attribute on all SharePoint objects, regardless of their specific type (blogs, documents, user profiles, and more). A single document is returned for each result.

Important: The SharePoint account that you use to crawl your SharePoint sites must at least have full read-access privileges.

Configuring the SharePoint Connector

Following are the basic configuration options that are required to use the SharePoint connector. To set these values, open the file config/connectors/sharepoint.conf and modify the following values specific to your use cases:

  • protocol - The name of the connector protocol used for the crawl. The value to use this connector is io-sp.

  • collection - This attribute is used to unpack temporary files.

  • classname - Java class name for the connector. Use plugin:io-sharepoint.plugin@connector for this connector.

  • logging-config - Specifies the file used for configuring logging options; it must be formatted as a log4j XML string.

  • seed-url-type - Identifies what type of SharePoint object the provided seed URLs point to: site collections or web applications (also known as virtual servers).

    • Site Collections - If the Seed URL Type is set to Site Collections, then only the children of the site collection referenced by the URL are crawled.
    • Web Applications - If the Seed URL Type is set to Web Applications, then all of the site collections (and their children) belonging to the web applications referenced by each URL are crawled.
  • auth-type - The authentication mechanism to use when contacting the SharePoint server: BASIC, NTLM2, KERBEROS, or CBA. The default authentication type is NTLM2.

  • spUser - User name of the SharePoint user used to access the content. This user must have access to all the target sites and lists to be crawled and indexed, and must be able to retrieve and resolve the associated permissions. It is best to enter the user name with the domain, for example: MYDOMAIN\\Administrator.

  • spPassword - Password of the SharePoint user used to access the content. Password must be encrypted using the vcrypt program shipped with the Data Crawler.

  • cba-sts - The URL for the Security Token Service (STS) endpoint to attempt to authenticate the crawl user against. For SharePoint on-premise with ADFS, this should be your ADFS endpoint. If the Authentication Type is set to CBA (Claims Based Authentication), then this field is required.

  • cba-realm - The relying party trust identifier to use when requesting a security token from the STS. This is sometimes known as the "AppliesTo" value, or the "Realm". For SharePoint Online, this should be the URL to the root of the SharePoint Online instance (for example, https://mycompany.sharepoint.com). For ADFS, this is the ID value for the Relying Party Trust between SharePoint and ADFS (for example, "urn:SHAREPOINT:adfs").

  • everyone-group - When specified, this group name is used in the ACLs when access should be given to everyone. This field is required when crawling user profiles is enabled.

    Note: Security is not yet respected by the Retrieve and Rank service.

  • user-profile-master-url - The base URL that the connector uses to build links to user profiles. This should be configured to point to the display form for user profiles. If the token %FIRST_SEED% is encountered, it is replaced with the first seed URL. Required when crawling user profiles is enabled.

  • urls - Newline-separated list of HTTP URLs of SharePoint web applications or site collections to crawl.

  • ehcache-config - Unused option.

  • method - The method (GET or POST) by which parameters will be passed.

  • cache-types - Unused option.

  • cache-size - Unused option.

  • enable-acl - Enables crawling of SharePoint user profiles; values are true or false; default value is false.

Configuring the SharePoint Crawl Seed

The following additional values can be configured for the SharePoint crawl seed file. To set these values, open the file config/seeds/sharepoint-seed.conf and specify the following values specific to your use cases:

  • url - Newline-separated list of URLs of SharePoint web applications or site collections to crawl. For example:

    io-sp://a.com
    io-sp://b.com:83/site
    io-sp://c.com/site2
    

    The sub-sites of these sites will also be crawled (unless they are excluded by other crawling rules).

  • filter-url - Newline-separated list of URLs of SharePoint web applications or site collections to crawl. For example:

    http://a.com
    http://b.com:83/site
    http://c.com/site2
    
  • hops - Internal use only.

  • n-concurrent-requests - Internal use only.

  • delay - Internal use only.

  • default-allow - Internal use only.

  • seed-protocol - Sets the seed protocol for children of the site collection. Necessary when the site collection's protocol is SSL, HTTP, or HTTPS. This value must be set the same as the site collection's protocol.

Configuring Orchestration service options

The orchestration service tells the crawler how to manage crawled files.

Important: To access the in-product manual for the orchestration_service.conf file, with the most up-to-date information, type the following command from the Crawler installation directory:

man orchestration_service.conf

Default options can be changed directly by opening the config/orchestration/orchestration_service.conf file and specifying the following values specific to your use case:

  • http_timeout - The timeout, in seconds, for the document read/index operation; the default is 585.

  • concurrent_upload_connection_limit - The number of simultaneous connections allowed for uploading documents; the default is 10.

    Note: Generally, this number should be greater than, or equal to, the output_limit set when configuring crawl options.

  • base_url - The URL to which your crawled documents will be sent.

  • endpoint - The location of your crawled document collection at the base URL.

  • username - Username to authenticate to the endpoint location.

  • password - Password to authenticate to the endpoint location.

    Important: Do NOT use the vcrypt program shipped with the Data Crawler to encrypt this password.

  • config_file - The configuration file that the orchestration service uses. This is the file that you created using the Kale command line tool.

The Orchestration Service Output Adapter can send statistics in order for IBM to better understand and serve its users. The following options can be set for the send_stats variable:

  • jvm - Java Virtual Machine (JVM) statistics sent include the Java vendor and version, as reported by the JVM used to execute the data crawler. Value is either true or false. Default value is true.
  • os - Operating system (OS) statistics sent include OS name, version, and architecture, as reported by the JVM used to execute the data crawler. Value is either true or false. Default value is true.
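
For reference, a sketch of an orchestration service configuration using the options described above might look like the following. The values shown for http_timeout, concurrent_upload_connection_limit, and send_stats are the documented defaults; the connection values (base_url, endpoint, username, password) are placeholders, because in practice they are supplied by the file that Kale generates for you. Check man orchestration_service.conf for the exact form of the send_stats block.

    http_timeout = 585
    concurrent_upload_connection_limit = 10

    base_url = "<your service base URL>"
    endpoint = "<path to your collection>"
    username = "<service username>"
    password = "<service password>"

    config_file = "orchestration_service_config.json"

    send_stats {
      jvm = true
      os = true
    }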

Crawling your data repository

After the crawler options have all been properly configured, you can run a crawl against your data repository.

Note: Never run the Crawler as root, unless you need access to files only root can read.

Run the following command:

crawler

The Crawler displays documentation that explains what to do next. You can run a test crawl or a full crawl, in addition to other crawl options.

Running a test crawl

Run the following command

crawler testit

This will run a test crawl, which crawls only the seed URL and displays any enqueued URLs. If the seed URL results in indexable content (for example, it is a document), then that content is sent to the output adapter, and the content is printed to the screen. If the seed URL retrieval causes URLs to be enqueued, those URLs will be displayed, and no content will be sent to the output adapter. By default, five enqueued URLs are displayed.

You can also specify a custom configuration file as an option for the crawl command, for example:

crawler testit --config [config/myconfigfile.conf]

Note: The path to the configuration file passed in the --config option must be a qualified path; that is, a relative path such as config/myconfigfile.conf or ./myconfigfile.conf, or an absolute path such as /path/to/config/myconfigfile.conf. Specifying just myconfigfile.conf is only possible if the orchestration_service.conf file is in-lined, instead of referenced using include in the crawler.conf file.

Additionally, you can set the limit for the number of enqueued URLs that are displayed as an option for the testit command, for example:

crawler testit --limit [number]

Running a crawl

Run the following command:

crawler crawl

This will run a crawl with the default configuration file (crawler.conf).

You can also specify a custom configuration file as an option for the crawl command, for example:

crawler crawl --config [config/myconfigfile.conf]

Note: The path to the configuration file passed in the --config option must be a qualified path; that is, a relative path such as config/myconfigfile.conf or ./myconfigfile.conf, or an absolute path such as /path/to/config/myconfigfile.conf. Specifying just myconfigfile.conf is only possible if the orchestration_service.conf file is in-lined, instead of referenced using include in the crawler.conf file.

Testing Enhanced Information Retrieval

After you have:

  • Created and configured the Retrieve and Rank and Document Conversion services with Kale
  • Downloaded the Data Crawler, and copied the configuration files to it
  • Uploaded your documents, and run a crawl

The Retrieve and Rank service instance now has your data, and you can test the accuracy of Enhanced Information Retrieval. You do this by searching for terms that you are confident should return results.

Enter:

kale search {field:value}

An example is: kale search body:John, which will return documents from the collection that contain the string John in the body field of the document. The fields available for search depend on the Solr configuration used to create the collection. In general, a configuration will contain a body and title field.

The names of the documents that contain the search term will display.

You have now completed an Enhanced Information Retrieval solution that can be integrated with other applications or services.

Additional Resources

See the sample Retrieve and Rank application for a working application with which you can integrate your crawled data.

You can improve your results by training the ranker in the Retrieve and Rank service. See Preparing Training Data for more information.

You can also see the Retrieve and Rank service API documentation for information about how to integrate your crawled data with your own custom applications.