Integrating with IBM InfoSphere BigInsights

To handle large amounts of data, you can configure an IBM® InfoSphere® BigInsights server to help process the data. You can also run custom global analysis tasks as Jaql scripts on the InfoSphere BigInsights server.

Before you begin

Restriction: The integration with InfoSphere BigInsights is supported only on Linux systems.

About this task

InfoSphere BigInsights is a platform for managing and analyzing large amounts of structured and unstructured data in a reliable, fault-tolerant manner. InfoSphere BigInsights is based on Apache Hadoop, an open source, distributed computing platform. When you integrate Watson Explorer Content Analytics with InfoSphere BigInsights, the Watson Explorer Content Analytics indexing and global analysis processes run on InfoSphere BigInsights.

For the latest information about the versions of IBM InfoSphere BigInsights that are supported by Watson Explorer Content Analytics, see the system requirements technote.

Procedure

To integrate with InfoSphere BigInsights:

  1. Verify that InfoSphere BigInsights is installed and configured properly. Ensure that the InfoSphere BigInsights administrator user ID and group ID are the same on all nodes in the cluster. The product must be installed with the following components:
    • Jaql
    • InfoSphere BigInsights orchestrator
    • BigIndex
    • Hadoop

    In addition, a shared POSIX file system is required.

    For information about prerequisites and installation procedures, see the IBM InfoSphere BigInsights documentation.
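
The requirement that the administrator user ID and group ID match on every node can be checked with a short script. The following is a minimal sketch, not part of the product: the helper name `find_id_mismatches`, the node names, and the ID values are all hypothetical, and it assumes that you have already collected the numeric IDs from each node (for example, from the output of `id -u` and `id -g`).

```python
from typing import Dict, List, Tuple


# Hypothetical helper: node names and ID values are illustrative only.
def find_id_mismatches(
    ids_by_node: Dict[str, Tuple[int, int]], reference_node: str
) -> List[str]:
    """Return nodes whose (uid, gid) pair differs from the reference node's pair."""
    reference = ids_by_node[reference_node]
    return sorted(
        node for node, ids in ids_by_node.items() if ids != reference
    )


# (uid, gid) of the administrator account as collected on each node.
collected = {
    "mgmt-node": (1001, 1001),
    "data-node-1": (1001, 1001),
    "data-node-2": (1002, 1001),  # user ID differs from the management node
}

print(find_id_mismatches(collected, "mgmt-node"))
```

An empty list means that every node reports the same pair as the reference node. In this sample data, only `data-node-2` is reported.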

  2. Install Watson Explorer Content Analytics according to the following guidelines.
    • Install the master server on the same computer as the InfoSphere BigInsights management node.
    • Install the data directory (ES_NODE_ROOT) on a shared file system, such as IBM General Parallel File System (GPFS™). The shared file system must be mounted on all InfoSphere BigInsights nodes, and the path to the shared file system must be the same on all nodes. If you install the data directory on Network File System (NFS), you must export it with the no_root_squash option for at least the master server, and ES_NODE_ROOT must be mounted with the exec and rw options.
    • For the installation directory, specify a location on the local computer.
    • Ensure that the same administrator ID is used for InfoSphere BigInsights and Watson Explorer Content Analytics.
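
The NFS mount requirement (exec and rw) can be verified on each node by inspecting the mount table. The sketch below is illustrative only: it parses text in the format of the Linux `/proc/mounts` file, the helper names and the sample path `/mnt/esdata` are hypothetical, and it assumes that the path you pass is the mount point itself. Because Linux normally lists `noexec` only when execution is disabled, the check treats the absence of `noexec` as exec. The server-side `no_root_squash` export option is not visible from the client mount table and is not checked here.

```python
from typing import Optional, Set


def mount_options(mounts_text: str, mount_point: str) -> Optional[Set[str]]:
    """Return the option set for mount_point from /proc/mounts-style text.

    Returns None if the mount point is not listed. If the mount point is
    listed more than once, the last entry is used.
    """
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, file system type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            found = set(fields[3].split(","))
    return found


def is_rw_and_exec(options: Set[str]) -> bool:
    """True if the mount is read-write and execution is not disabled."""
    return "rw" in options and "noexec" not in options


# Sample mount table; on a real node, read the text from /proc/mounts instead.
SAMPLE_MOUNTS = (
    "nfs-server:/export/esdata /mnt/esdata nfs rw,relatime,vers=3 0 0\n"
    "nfs-server:/export/other /mnt/other nfs ro,noexec,relatime 0 0\n"
)

for path in ("/mnt/esdata", "/mnt/other", "/mnt/missing"):
    opts = mount_options(SAMPLE_MOUNTS, path)
    status = "not mounted" if opts is None else is_rw_and_exec(opts)
    print(path, status)
```

To run the check against a live system, replace `SAMPLE_MOUNTS` with the contents of `/proc/mounts` and pass the actual mount point of the shared file system.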
  3. Configure Watson Explorer Content Analytics to use the InfoSphere BigInsights server.
    1. Verify that InfoSphere BigInsights is running. To configure Watson Explorer Content Analytics to use the InfoSphere BigInsights server, the system must be able to connect to the server.
    2. In the Watson Explorer Content Analytics administration console, click the System tab and click Configure IBM InfoSphere BigInsights.
    3. On the Configure IBM InfoSphere BigInsights Server page, specify the host name of the InfoSphere BigInsights JobTracker node. You must also specify the port on which the JobTracker runs and the path to the InfoSphere BigInsights distributed file system.
      Tip: By default, all crawled documents are indexed as a single job on the InfoSphere BigInsights server. In some cases, however, you might want to limit the amount of data that is processed per job. For example, you might need to minimize the temporary storage that is required for Hadoop and Jaql on the InfoSphere BigInsights server. Alternatively, you might want to divide the indexing task into shorter jobs so that parts of the index are available more quickly. In such a case, specify a limit in the Maximum amount of data to process per indexing job (in megabytes) advanced option.

      The total amount of time that is required to process all the input data is greater if the data is processed in multiple jobs. This time increase arises from the overhead associated with Hadoop and Jaql. In addition, global analysis and optional facet indexing run for each indexing job.
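
To get a rough feel for the trade-off described in this tip, you can estimate how many indexing jobs a given limit produces. The following sketch is an assumption-based approximation, not the product's actual scheduling logic: it assumes that each job processes at most the configured number of megabytes, and it models the per-job overhead as a fixed number of seconds. The function names and the sample figures are hypothetical.

```python
import math


def estimate_job_count(total_mb: float, max_mb_per_job: float) -> int:
    """Estimate the number of indexing jobs, assuming each job takes at most max_mb_per_job."""
    if max_mb_per_job <= 0:
        raise ValueError("max_mb_per_job must be positive")
    if total_mb <= 0:
        return 0
    return math.ceil(total_mb / max_mb_per_job)


def estimate_added_overhead_seconds(job_count: int, overhead_per_job_s: float) -> float:
    """Estimate the extra time compared with one job, assuming a fixed overhead per job."""
    return max(job_count - 1, 0) * overhead_per_job_s


# Hypothetical figures: 10,000 MB of crawled data, a 2,048 MB limit per job,
# and an assumed 120 seconds of Hadoop and Jaql overhead for each job.
jobs = estimate_job_count(10_000, 2_048)
print(jobs)
print(estimate_added_overhead_seconds(jobs, 120))
```

With these sample figures, the data is split into five jobs, which adds four additional rounds of per-job overhead compared with a single job. Measure the real overhead on your own cluster before you choose a limit.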

  4. Configure a collection to use the InfoSphere BigInsights server. When you create or clone a collection, select Use IBM InfoSphere BigInsights. The following differences apply for collections that use InfoSphere BigInsights:
    • If you stop a rebuild of the index before the process completes, all progress is lost. For collections that do not use InfoSphere BigInsights, if you stop and restart the rebuild index process, the system resumes processing for only those documents that were not yet indexed.
    • Thumbnails are generated during the text processing pipeline instead of during global analysis. As a result, thumbnails are always processed when the index is rebuilt. Unlike for other collections, you cannot skip the thumbnail generation process when you rebuild the index. Also, the document cache is not required for thumbnail generation.
    • The optional facet index can be enabled for enterprise search collections that use InfoSphere BigInsights. If the InfoSphere BigInsights index was created with partitions, you must rebuild the optional facet index when you add a search server to the system topology.
    • Document flags are not supported.
    • Reorganizing the index is not supported.
    • To view details about documents that were dropped because they cannot be parsed or indexed, the search server must be running.
    • The real-time natural language processing (NLP) API is not supported.
  5. Optional: Create and deploy a custom global analysis plug-in to run custom Jaql scripts on the InfoSphere BigInsights server. Use custom global analysis to implement custom logic that analyzes the entire document set.

What to do next