To handle large amounts of data, you can configure an IBM Open Platform with Apache Hadoop server to help process the data. You can also run custom global
analysis tasks as Jaql scripts on the InfoSphere® BigInsights®
server.
Before you begin
Restriction: The integration with InfoSphere BigInsights is
supported only on Linux® systems.
About this task
InfoSphere BigInsights is a platform for managing and analyzing large
amounts of structured and unstructured data in a reliable, fault-tolerant manner. InfoSphere BigInsights is based on Apache Hadoop, an open source, distributed computing
platform. When you integrate Watson™ Explorer Content Analytics with InfoSphere BigInsights, the Watson Explorer Content Analytics indexing and
global analysis processes run on InfoSphere BigInsights.
For the latest information about the versions of IBM Open Platform with Apache Hadoop
that are supported by Watson Explorer Content Analytics, see the system requirements
technote.
Procedure
To integrate with InfoSphere BigInsights:
-
Verify that InfoSphere BigInsights is installed and configured
properly.
Ensure that the InfoSphere BigInsights administrator user ID and group ID are the same on all nodes in the cluster.
The product must be installed with the following component:
In addition, a shared POSIX file system is required.
For information about prerequisites and installation procedures, see the IBM Open Platform with Apache Hadoop documentation.
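One way to confirm that the administrator user ID and group ID match across the cluster is to collect them from each node and compare them. The following is a minimal sketch, not a supported tool: the function name ids_consistent, the node names, and the ADMIN_USER placeholder are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: succeed only if every input line carries the same UID:GID pair.
# Input on stdin: one "node uid gid" line per cluster node.
ids_consistent() {
    awk '
        NF < 3 { next }                          # skip blank or incomplete lines
        { pair = $2 ":" $3 }                     # combine UID and GID
        !(pair in seen) { seen[pair] = 1; distinct++ }
        END { exit (distinct == 1 ? 0 : 1) }     # exactly one distinct pair expected
    '
}

# Hypothetical collection loop (node1, node2, and ADMIN_USER are placeholders):
#   for n in node1 node2; do
#       echo "$n $(ssh "$n" id -u "$ADMIN_USER") $(ssh "$n" id -g "$ADMIN_USER")"
#   done | ids_consistent && echo "IDs match on all nodes"
```

The comparison is kept separate from the collection so that the check can be run against any list of node, UID, and GID values, however they were gathered.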
-
Install Watson Explorer Content Analytics according to the following
guidelines.
- Install the master server on the same computer as the InfoSphere BigInsights management node.
- Install the data directory (ES_NODE_ROOT) on a shared file system, such as IBM General Parallel File System (GPFS).
If you install the data directory on Network File System (NFS), you must export it with the
no_root_squash option for at least the master server. The shared file system must
be mounted on all InfoSphere BigInsights nodes. Ensure that the path to the
shared file system is the same on all nodes. If you use NFS, ES_NODE_ROOT must be mounted with the
exec and rw options.
- For the installation directory, specify a location on the local computer.
- Ensure that the same administrator ID is used for InfoSphere BigInsights
and Watson Explorer Content Analytics.
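The mount requirements for ES_NODE_ROOT can be checked on each node before you continue. The following sketch assumes that the mount options are available as a comma-separated string, for example from the Linux findmnt command. The function name mount_opts_ok is hypothetical. Because exec is the default and is usually not listed explicitly, the sketch treats the absence of noexec as meaning exec.

```shell
#!/bin/sh
# Sketch: validate a comma-separated mount option string for ES_NODE_ROOT.
# Succeeds only if the options include rw and do not include noexec.
mount_opts_ok() {
    case ",$1," in
        *,noexec,*) return 1 ;;   # programs must be executable from the mount
    esac
    case ",$1," in
        *,rw,*) return 0 ;;       # the mount must be read-write
    esac
    return 1                      # rw not found (for example, mounted ro)
}

# Possible use on a node (assumes ES_NODE_ROOT is set in the environment):
#   opts=$(findmnt -n -o OPTIONS --target "$ES_NODE_ROOT")
#   mount_opts_ok "$opts" || echo "ES_NODE_ROOT mount options need attention: $opts"
```

The no_root_squash option is set on the NFS server, not on the client mount, so this check does not cover it. On a typical Linux NFS server, the export entry might resemble /export/esdata master.example.com(rw,no_root_squash), where the path and host name are placeholders.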
-
Configure Watson Explorer Content Analytics to use the InfoSphere BigInsights server.
-
Verify that InfoSphere BigInsights is running.
To configure Watson Explorer Content Analytics to use the InfoSphere BigInsights server, the system must be able to connect to the server.
-
In the Watson Explorer Content Analytics administration console, click the
System tab and click Configure Apache Hadoop
Server.
-
Click OK.
You must also specify the port on which the JobTracker runs and the path to the
InfoSphere BigInsights distributed file system.
Tip: By default, all crawled documents are indexed as a single job on the InfoSphere BigInsights server. But in some cases you might want to limit the amount of data that is processed per job. For example, you
might need to minimize the temporary storage that is required for Hadoop and Jaql on the InfoSphere BigInsights server. Alternatively, you might want to divide the indexing task
into shorter jobs so that parts of the index are available more quickly. In such a case, specify a limit in the Maximum amount of data to process per indexing job (in megabytes) advanced option.
The total amount of time that is required to process all
the input data is greater if the data is processed in multiple jobs. This time increase arises from
the overhead associated with Hadoop and Jaql. In addition, global analysis and optional facet
indexing run for each indexing job.
-
Configure a collection to use the InfoSphere BigInsights server.
When you create or clone a collection, select
Use IBM InfoSphere
BigInsights. The following differences apply for collections that use
InfoSphere BigInsights:
- If you stop a rebuild of the index before the process completes, all progress is lost. For
collections that do not use InfoSphere BigInsights, if you stop and restart the
rebuild index process, the system resumes processing for only those documents that were not yet
indexed.
- Thumbnails are generated during the text processing pipeline instead of during global analysis.
As a result, thumbnails are always processed when the index is rebuilt. Unlike for other
collections, you cannot skip the thumbnail generation process when you rebuild the index. Also, the
document cache is not required for thumbnail generation.
- The optional facet index can be enabled for enterprise search collections that use
InfoSphere BigInsights. If the InfoSphere BigInsights
index was created with partitions, you must rebuild the optional facet index when you add a search
server to the system topology.
- Document flags are not supported.
- Reorganizing the index is not supported.
- To view details about documents that were dropped because they cannot be parsed or indexed, the
search server must be running.
- The real-time natural language processing (NLP) API is not supported.
- Optional:
Create and deploy a custom global analysis plug-in to run custom Jaql scripts on the InfoSphere BigInsights server.
Use custom global analysis to implement custom logic that analyzes the entire document
set.
What to do next
- If indexing activity seems to be stalled in the administration console, use the InfoSphere BigInsights administration console to check the document processing and
indexing status. Until indexing is complete, the status is not relayed to the Watson Explorer Content Analytics administration console from the InfoSphere BigInsights server.
- Watson Explorer Content Analytics uses an orchestrator to manage Jaql processes. When
parsing and indexing are stopped, some processes on the InfoSphere BigInsights
server might continue to run. These processes are orphaned. Before you restart an index build, you
must manually stop these processes. Stop the processes by using the InfoSphere BigInsights administration console or by entering the
hadoop command at a command prompt.
- If you install IBM Open Platform with Apache Hadoop Version 1.3 Fix Pack 1 or later, you
must run the escrbi.sh script, which is provided with Watson Explorer Content Analytics. The script updates the class paths for sessions that require
Hadoop libraries from the InfoSphere BigInsights server. Follow these steps:
- Run the script: ES_INSTALL_ROOT/bin/escrbi.sh
- Stop the Watson Explorer Content Analytics system: esadmin system stopall
- Restart the Watson Explorer Content Analytics system: esadmin system startall
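The command-line tasks in this section can be collected into a small script. The following sketch is illustrative only: the helper names run, refresh_hadoop_classpath, and kill_jobs, and the DRY_RUN variable, are assumptions. The hadoop job -kill subcommand is the form used by JobTracker-based Hadoop releases, so confirm it against your Hadoop version. By default the sketch prints each command without running it.

```shell
#!/bin/sh
# Sketch: wrap the documented commands with a dry-run switch.
# DRY_RUN=1 (the default) prints each command; DRY_RUN=0 runs it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Update the class paths, then stop and restart the system.
# Each step runs only if the previous one succeeded.
refresh_hadoop_classpath() {
    run "$ES_INSTALL_ROOT/bin/escrbi.sh" &&
    run esadmin system stopall &&
    run esadmin system startall
}

# Stop Hadoop jobs whose IDs are read from stdin, one per line.
# Deciding which jobs are orphaned is left to the administrator.
kill_jobs() {
    while read -r job_id; do
        if [ -n "$job_id" ]; then
            run hadoop job -kill "$job_id" </dev/null
        fi
    done
}
```

For example, with ES_INSTALL_ROOT set, calling refresh_hadoop_classpath prints the three commands in order. Set DRY_RUN=0 only after you review that output.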