IBM Content Analytics with Enterprise Search, Version 3.0.0

Integrating with IBM InfoSphere BigInsights

To handle large amounts of data, you can configure an IBM® InfoSphere® BigInsights server to help process the data. You can also run custom global analysis tasks as Jaql scripts on the BigInsights server.

Restriction: The integration with IBM InfoSphere BigInsights is supported only on Linux platforms.

IBM InfoSphere BigInsights is a platform for managing and analyzing large amounts of structured and unstructured data in a reliable, fault-tolerant manner. BigInsights is based on Apache Hadoop, an open source, distributed computing platform. When you integrate IBM Content Analytics with Enterprise Search with IBM InfoSphere BigInsights, the IBM Content Analytics with Enterprise Search indexing and global analysis processes run on IBM InfoSphere BigInsights.

For the latest information about the versions of IBM InfoSphere BigInsights that are supported by IBM Content Analytics with Enterprise Search, see the system requirements technote.

To integrate with IBM InfoSphere BigInsights:

Verify that IBM InfoSphere BigInsights is installed and configured properly. Ensure that the BigInsights administrator user ID and group ID are the same on all nodes in the cluster. BigInsights must be installed with the following components:
- Jaql
- BigInsights orchestrator
- BigIndex
- Hadoop
In addition, a shared POSIX file system is required.

For IBM InfoSphere BigInsights prerequisites information and installation procedures, see the IBM InfoSphere BigInsights Information Center.
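The requirement that the BigInsights administrator user ID and group ID match on all nodes can be checked with a short script. The following is a minimal sketch, not a product tool: it assumes that you have already collected the numeric IDs from each node (for example, by running `id -u` and `id -g` as the administrator), and the node names and ID values are hypothetical.

```python
from collections import Counter


def find_id_mismatches(node_ids):
    """Return the nodes whose (uid, gid) pair differs from the most common pair.

    node_ids maps a node name to the (uid, gid) tuple reported for the
    BigInsights administrator on that node.
    """
    if not node_ids:
        return {}
    # Treat the most frequently reported pair as the expected one.
    expected, _count = Counter(node_ids.values()).most_common(1)[0]
    return {node: ids for node, ids in node_ids.items() if ids != expected}


# Hypothetical values gathered from three nodes of a cluster.
cluster = {
    "mgmt01": (1001, 1001),
    "data01": (1001, 1001),
    "data02": (1002, 1001),
}

print(find_id_mismatches(cluster))  # prints {'data02': (1002, 1001)}
```

An empty result means that every node reported the same user ID and group ID.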
Install IBM Content Analytics with Enterprise Search according to the following guidelines.
- Install the master server on the same computer as the BigInsights management node.
- Install the data directory (ES_NODE_ROOT) on a shared file system, such as IBM General Parallel File System (GPFS™). If you install the data directory on Network File System (NFS), you must export it with the no_root_squash option for at least the master server. The shared file system must be mounted on all nodes of BigInsights. Ensure that the path to the shared file system is the same on all nodes. If you use NFS, ES_NODE_ROOT must be mounted with the exec and rw options.
- For the installation directory, specify a location on the local computer.
- Ensure that the same administrator ID is used for BigInsights and IBM Content Analytics with Enterprise Search.
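The mount requirements for ES_NODE_ROOT (read-write and executable) can be verified on each node by inspecting the mount table. The following is a minimal sketch that parses text in the format of the Linux `/proc/mounts` file. The mount point `/shared/es` and the sample entry are hypothetical. Note that Linux lists `noexec` when execution is disabled and typically does not list `exec` explicitly, so the sketch checks for the absence of `noexec`.

```python
def check_mount_options(mounts_text, mount_point):
    """Return a list of problems found for mount_point.

    mounts_text uses the /proc/mounts layout:
    device mount_point fs_type options dump pass
    """
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # A later entry for the same mount point overrides an earlier one.
            options = set(fields[3].split(","))
    if options is None:
        return ["mount point not found"]
    problems = []
    if "rw" not in options:
        problems.append("not mounted read-write (rw)")
    if "noexec" in options:
        problems.append("mounted with noexec")
    return problems


# Hypothetical mount table entry for a shared data directory.
sample = "nfs01:/export/es /shared/es nfs rw,noexec,relatime 0 0\n"
print(check_mount_options(sample, "/shared/es"))  # prints ['mounted with noexec']
```

On a real node, you could pass the contents of `/proc/mounts` and the actual ES_NODE_ROOT mount point instead of the sample text. An empty list means that no problem was detected by this check.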
Configure IBM Content Analytics with Enterprise Search to use the BigInsights server.
1. Verify that BigInsights is running. To configure IBM Content Analytics with Enterprise Search to use a BigInsights server, the system must be able to connect to the BigInsights server.
2. In the IBM Content Analytics with Enterprise Search administration console, click the System tab and click Configure IBM InfoSphere BigInsights Server.
3. On the Configure IBM InfoSphere BigInsights Server page, specify the host name of the BigInsights JobTracker node. You must also specify the port on which the JobTracker runs and the path to the BigInsights distributed file system.
  Tip: By default, all crawled documents are indexed as a single job on the BigInsights server. In some cases, however, you might want to limit the amount of data that is processed per job. For example, you might need to minimize the temporary storage that is required for Hadoop and Jaql on the BigInsights server. Alternatively, you might want to divide the indexing task into shorter jobs so that parts of the index are available more quickly. In such a case, specify a limit in the Maximum amount of data to process per indexing job (in megabytes) advanced option.
  Be aware that the total amount of time required to process all the input data is greater if the data is processed in multiple jobs. This time increase is due to the overhead associated with Hadoop and Jaql. In addition, global analysis and optional facet indexing run for each indexing job.
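Because the system must be able to connect to the BigInsights server, it can help to confirm that the JobTracker host and port are reachable from the master server before you enter them in the administration console. The following is a minimal sketch that tests only whether a TCP connection can be opened. It does not confirm that the listener is actually a JobTracker, and the host name and port are values that you supply.

```python
import socket
import sys


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python check_jobtracker.py <host> <port>
    host, port = sys.argv[1], int(sys.argv[2])
    state = "reachable" if can_connect(host, port) else "not reachable"
    print(f"{host}:{port} is {state}")
```

The script name `check_jobtracker.py` is hypothetical. A result of "not reachable" can indicate that BigInsights is not running, that the host name or port is wrong, or that a firewall blocks the connection.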
Configure a collection to use the BigInsights server. When you create or clone a collection, select Use IBM InfoSphere BigInsights. The following differences apply for collections that use IBM InfoSphere BigInsights:
- If you stop a rebuild of the index before the process completes, all progress is lost. For collections that do not use IBM InfoSphere BigInsights, if you stop and restart the rebuild index process, the system resumes processing for only those documents that were not yet indexed.
- Thumbnails are generated during the text processing pipeline instead of during global analysis. As a result, thumbnails are always processed when the index is rebuilt. Unlike for other collections, you cannot skip the thumbnail generation process when you rebuild the index. Also, the document cache is not required for thumbnail generation.
- The optional facet index can be enabled for search collections that use BigInsights.
- Document flags are not supported.
- Reorganizing the index is not supported.
- To view details about documents that were dropped because they could not be parsed or indexed, the search server must be running.
- The real-time natural language processing (NLP) API is not supported.
Optional: Create and deploy a custom global analysis plug-in to run custom Jaql scripts on the BigInsights server. Use custom global analysis to implement custom logic that analyzes the entire document set.
