Integrating with IBM Open Platform with Apache Hadoop

Note: This feature is deprecated. See the Analytical Components release notes for more information about deprecated features and what deprecation means.

To handle large amounts of data, you can configure an IBM Open Platform with Apache Hadoop server to help process the data. You can also run custom global analysis tasks as Jaql scripts on the InfoSphere® BigInsights® server.

Before you begin

Restriction: The integration with InfoSphere BigInsights is supported only on Linux® systems.

About this task

InfoSphere BigInsights is a platform for managing and analyzing large amounts of structured and unstructured data in a reliable, fault-tolerant manner. InfoSphere BigInsights is based on Apache Hadoop, an open source, distributed computing platform. When you integrate Watson™ Explorer Content Analytics with InfoSphere BigInsights, the Watson Explorer Content Analytics indexing and global analysis processes run on InfoSphere BigInsights.

For the latest information about the versions of IBM Open Platform with Apache Hadoop that are supported by Watson Explorer Content Analytics, see the system requirements technote.

Procedure

To integrate with InfoSphere BigInsights:

Verify that InfoSphere BigInsights is installed and configured properly.
Ensure that the InfoSphere BigInsights administrator user ID and group ID are the same on all nodes in the cluster. The product must be installed with the following component:
- Hadoop
In addition, a shared POSIX file system is required.

For information about prerequisites and installation procedures, see the IBM Open Platform with Apache Hadoop documentation.
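The requirement that the administrator user ID and group ID match on every node can be checked with a small script. The following is a minimal sketch, not part of either product: the helper name ids_consistent, the node names, and the user name biadmin are assumptions for illustration. The helper reads one "host uid gid" line per node and succeeds only if every line carries the same pair.

```shell
# Hypothetical helper: reads lines of "host uid gid" on stdin.
# Exits 0 only if at least one line was read, every line has exactly
# three fields, and all uid:gid pairs are identical.
ids_consistent() {
    awk 'NF != 3 { bad = 1; next }
         { seen[$2 ":" $3] = 1; n++ }
         END {
             c = 0
             for (k in seen) c++
             exit (bad || n == 0 || c != 1)
         }'
}

# Example collection step (assumes passwordless ssh; node1..node3 and
# the user name biadmin are placeholders for your own values):
#   for h in node1 node2 node3; do
#       echo "$h $(ssh "$h" id -u biadmin) $(ssh "$h" id -g biadmin)"
#   done | ids_consistent && echo "IDs match on all nodes"
```

A node where the lookup fails produces a line with fewer than three fields, which the helper treats as a mismatch rather than silently skipping it.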
Install Watson Explorer Content Analytics according to the following guidelines.
- Install the master server on the same computer as the InfoSphere BigInsights management node.
- Install the data directory (ES_NODE_ROOT) on a shared file system, such as IBM General Parallel File System (GPFS). If you install the data directory on Network File System (NFS), you must export it with the no_root_squash option for at least the master server. The shared file system must be mounted on all InfoSphere BigInsights nodes. Ensure that the path to the shared file system is the same on all nodes. If you use NFS, ES_NODE_ROOT must be mounted with the exec and rw options.
- For the installation directory, specify a location on the local computer.
- Ensure that the same administrator ID is used for InfoSphere BigInsights and Watson Explorer Content Analytics.
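For the NFS case, the export and mount requirements above might look like the following sketch. The export path, host names, and mount point are hypothetical, and the helper mount_opts_ok is an illustration rather than a product tool. Because Linux typically lists noexec but omits exec when execution is allowed, the helper checks for rw and for the absence of noexec.

```shell
# Example /etc/exports entry on the NFS server (path and host are hypothetical):
#   /export/esdata  master.example.com(rw,no_root_squash)
# Example mount on each node (server name and mount point are hypothetical):
#   mount -t nfs -o rw,exec nfs.example.com:/export/esdata /mnt/esdata

# Hypothetical helper: succeeds if a comma-separated mount option string
# contains "rw" and does not contain "noexec".
mount_opts_ok() {
    case ",$1," in
        *,noexec,*) return 1 ;;
    esac
    case ",$1," in
        *,rw,*) return 0 ;;
    esac
    return 1
}

# Usage on a node, assuming ES_NODE_ROOT is set in the environment:
#   mount_opts_ok "$(findmnt -n -o OPTIONS --target "$ES_NODE_ROOT")" \
#       || echo "ES_NODE_ROOT is not mounted with rw and exec"
```

Running the check on every node also helps confirm that the shared file system is mounted at the same path everywhere.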
Configure Watson Explorer Content Analytics to use the InfoSphere BigInsights server.
1. Verify that InfoSphere BigInsights is running.
  To configure Watson Explorer Content Analytics to use the InfoSphere BigInsights server, the system must be able to connect to the server.
2. In the Watson Explorer Content Analytics administration console, click the System tab and click Configure Apache Hadoop Server.
3. Click OK.
  You must also specify the port on which the JobTracker runs and the path to the InfoSphere BigInsights distributed file system.
  Tip: By default, all crawled documents are indexed as a single job on the InfoSphere BigInsights server. In some cases, however, you might want to limit the amount of data that is processed per job. For example, you might need to minimize the temporary storage that is required for Hadoop and Jaql on the InfoSphere BigInsights server. Alternatively, you might want to divide the indexing task into shorter jobs so that parts of the index are available more quickly. In such a case, specify a limit in the Maximum amount of data to process per indexing job (in megabytes) advanced option.
  The total amount of time that is required to process all the input data is greater if the data is processed in multiple jobs. This time increase arises from the overhead associated with Hadoop and Jaql. In addition, global analysis and optional facet indexing run for each indexing job.
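To get a feel for the trade-off described in the tip, a rough estimate of the number of indexing jobs can help. The sketch below assumes, purely for illustration, that each job is filled up to the configured limit; the actual splitting behavior is not described here, and estimate_jobs is a hypothetical helper name.

```shell
# Hypothetical helper: estimates how many indexing jobs result from a
# per-job limit, assuming every job is filled up to that limit.
# Arguments: total input size in MB, per-job limit in MB.
estimate_jobs() {
    total_mb=$1
    limit_mb=$2
    # Reject a missing, zero, or negative limit.
    [ "$limit_mb" -gt 0 ] 2>/dev/null || return 1
    # Integer ceiling division.
    echo $(( (total_mb + limit_mb - 1) / limit_mb ))
}

# Example: 10240 MB of input with a 2048 MB limit.
#   estimate_jobs 10240 2048    # prints 5
```

Each additional job repeats the Hadoop and Jaql startup overhead and another round of global analysis, so a smaller limit trades total processing time for lower temporary storage and earlier partial results.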
Configure a collection to use the InfoSphere BigInsights server.
When you create or clone a collection, select Use IBM InfoSphere BigInsights. The following differences apply for collections that use InfoSphere BigInsights:
- If you stop a rebuild of the index before the process completes, all progress is lost. For collections that do not use InfoSphere BigInsights, if you stop and restart the rebuild index process, the system resumes processing for only those documents that were not yet indexed.
- Thumbnails are generated during the text processing pipeline instead of during global analysis. As a result, thumbnails are always processed when the index is rebuilt. Unlike for other collections, you cannot skip the thumbnail generation process when you rebuild the index. Also, the document cache is not required for thumbnail generation.
- The optional facet index can be enabled for enterprise search collections that use InfoSphere BigInsights. If the InfoSphere BigInsights index was created with partitions, you must rebuild the optional facet index when you add a search server to the system topology.
- Document flags are not supported.
- Reorganizing the index is not supported.
- To view details about documents that were dropped because they cannot be parsed or indexed, the search server must be running.
- The real-time natural language processing (NLP) API is not supported.
Optional: Create and deploy a custom global analysis plug-in to run custom Jaql scripts on the InfoSphere BigInsights server.
Use custom global analysis to implement custom logic that analyzes the entire document set.

What to do next

If indexing activity seems to be stalled in the administration console, use the InfoSphere BigInsights administration console to check the document processing and indexing status. Until indexing is complete, the status is not relayed to the Watson Explorer Content Analytics administration console from the InfoSphere BigInsights server.
Watson Explorer Content Analytics uses an orchestrator to manage Jaql processes. When parsing and indexing are stopped, some processes on the InfoSphere BigInsights server might continue to run. These processes are orphaned. Before you restart an index build, you must manually stop these processes. Stop the processes by using the InfoSphere BigInsights administration console or by entering the hadoop command at a command prompt.
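As one possible way to stop orphaned jobs from a command prompt, the sketch below assumes the classic MapReduce job subcommands (hadoop job -list and hadoop job -kill), which match a JobTracker-based cluster, and assumes that job IDs appear in the first column of the listing. The helper extract_job_ids and the sample job ID are illustrative only.

```shell
# Hypothetical helper: prints the job IDs found in the output of
# "hadoop job -list" read from stdin. Header and summary lines are
# ignored because their first field does not look like a job ID.
extract_job_ids() {
    awk '$1 ~ /^job_[0-9]+_[0-9]+$/ { print $1 }'
}

# Review the running jobs first:
#   hadoop job -list | extract_job_ids
# Then stop each job that belongs to the stopped index build
# (the job ID below is a made-up example):
#   hadoop job -kill job_201301010000_0001
```

Review the list before you stop anything, because other workloads on the cluster can appear in the same listing.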
If you install IBM Open Platform with Apache Hadoop Version 1.3 Fix Pack 1 or later, you must run the escrbi.sh script, which is provided with Watson Explorer Content Analytics. The script updates the class paths for sessions that require Hadoop libraries from the InfoSphere BigInsights server. Follow these steps:
1. Run the script: ES_INSTALL_ROOT/bin/escrbi.sh
2. Stop the Watson Explorer Content Analytics system: esadmin system stopall
3. Restart the Watson Explorer Content Analytics system: esadmin system startall
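The three steps can be chained so that a failure in one step prevents the next from running. This is a minimal sketch: it assumes that ES_INSTALL_ROOT is set in the administrator's environment and that esadmin is on the PATH. The function names run and apply_classpath_update and the DRY_RUN switch are hypothetical additions for illustration.

```shell
# Hypothetical wrapper: prints the command when DRY_RUN=1,
# otherwise runs it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Runs the three documented steps in order and stops at the first failure.
apply_classpath_update() {
    : "${ES_INSTALL_ROOT:?ES_INSTALL_ROOT is not set}"
    run "$ES_INSTALL_ROOT/bin/escrbi.sh" &&
    run esadmin system stopall &&
    run esadmin system startall
}

# Preview the commands without running them:
#   DRY_RUN=1 apply_classpath_update
```

Previewing with DRY_RUN=1 first is a cheap way to confirm that the script path resolves as expected before the system is stopped.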