To handle large amounts of data, you can configure an IBM® InfoSphere® BigInsights server to help process
the data. You can also run custom global analysis tasks as Jaql scripts
on the InfoSphere BigInsights server.
Before you begin
Restriction: The integration with InfoSphere BigInsights is supported only on Linux systems.
About this task
InfoSphere BigInsights is
a platform for managing and analyzing large amounts of structured
and unstructured data in a reliable, fault-tolerant manner. InfoSphere BigInsights is based on Apache Hadoop,
an open source, distributed computing platform. When you integrate
Watson Explorer Content Analytics with InfoSphere BigInsights, the Watson Explorer Content Analytics indexing and global analysis
processes run on InfoSphere BigInsights.
For
the latest information about the versions of IBM InfoSphere BigInsights that are supported by Watson Explorer Content Analytics, see the system requirements
technote.
Procedure
To integrate with InfoSphere BigInsights:
- Verify that InfoSphere BigInsights is
installed and configured properly. Ensure that the InfoSphere BigInsights administrator user ID and group ID
are the same on all nodes in the cluster. The product must be installed with the following components:
- Jaql
- InfoSphere BigInsights orchestrator
- BigIndex
- Hadoop
In addition, a shared POSIX file system is required.
For information about
prerequisites and installation procedures, see the IBM InfoSphere BigInsights documentation.
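The requirement that the administrator user ID and group ID match on all nodes can be checked with a short script. The following Python sketch is illustrative only and is not part of either product. It assumes that you have already collected the output of the Linux id command for the administrator account on each node (for example, over SSH); the node names and the biadmin account name are hypothetical placeholders.

```python
import re

# Matches the leading "uid=1001(name) gid=1001(name)" part of `id` output.
_ID_PATTERN = re.compile(r"uid=(\d+)\S*\s+gid=(\d+)")


def parse_id_output(text):
    """Return (uid, gid) parsed from one line of `id` command output."""
    match = _ID_PATTERN.search(text)
    if match is None:
        raise ValueError(f"unrecognized id output: {text!r}")
    return int(match.group(1)), int(match.group(2))


def find_id_mismatches(id_output_by_node):
    """Return the nodes whose (uid, gid) pair differs from the first node's pair."""
    pairs = {node: parse_id_output(out) for node, out in id_output_by_node.items()}
    if not pairs:
        return {}
    # The first node in the mapping is used as the reference.
    reference = next(iter(pairs.values()))
    return {node: pair for node, pair in pairs.items() if pair != reference}


# Hypothetical sample data: node name -> output of `id biadmin` on that node.
sample = {
    "node1": "uid=1001(biadmin) gid=1001(biadmin) groups=1001(biadmin)",
    "node2": "uid=1001(biadmin) gid=1001(biadmin) groups=1001(biadmin)",
    "node3": "uid=1002(biadmin) gid=1001(biadmin) groups=1001(biadmin)",
}
mismatches = find_id_mismatches(sample)
```

An empty result means that every node reports the same numeric user ID and group ID as the first node. In the sample data, node3 is reported because its user ID differs.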
- Install Watson Explorer Content Analytics according
to the following guidelines.
- Install the master server on the same computer as the InfoSphere BigInsights management node.
- Install the data directory (ES_NODE_ROOT) on a shared file system,
such as IBM General Parallel
File System (GPFS™). If you install
the data directory on Network File System (NFS), you must export it
with the no_root_squash option for at least the master
server. The shared file system must be mounted on all InfoSphere BigInsights nodes. Ensure that the
path to the shared file system is the same on all nodes. If you use
NFS, ES_NODE_ROOT must be mounted with the exec and rw options.
- For the installation directory, specify a location on the local
computer.
- Ensure that the same administrator ID is used for InfoSphere BigInsights and Watson Explorer Content Analytics.
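The NFS mount options for the data directory can be verified on each node before you continue. The following Python sketch is an assumption-labeled example, not a product utility. It reads text in the format of the Linux /proc/mounts file and reports whether a mount point is read-write and allows execution. On Linux, exec is the default and is normally not listed, so the check looks for the absence of noexec. The mount point /shared/esdata and the host name nfshost are hypothetical placeholders for your ES_NODE_ROOT location. The no_root_squash option is set on the NFS server export and does not appear in the client mount options, so it is not covered here.

```python
def mount_options(mounts_text, mount_point):
    """Return the set of options for mount_point, or None if it is not mounted.

    mounts_text uses the /proc/mounts layout:
    device, mount point, file system type, options, dump, pass.
    """
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # Keep the last matching entry, which is the most recent mount.
            found = set(fields[3].split(","))
    return found


def allows_read_write_exec(options):
    """Return True if the options include rw and do not include noexec."""
    return options is not None and "rw" in options and "noexec" not in options


# Hypothetical sample in /proc/mounts format.
sample_mounts = (
    "nfshost:/export/esdata /shared/esdata nfs rw,relatime,vers=3 0 0\n"
    "nfshost:/export/other /shared/other nfs rw,noexec,relatime 0 0\n"
)
data_dir_ok = allows_read_write_exec(mount_options(sample_mounts, "/shared/esdata"))
```

To check a live system, pass the contents of /proc/mounts as mounts_text and run the check on every InfoSphere BigInsights node.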
- Configure Watson Explorer Content Analytics to
use the InfoSphere BigInsights server.
- Verify that InfoSphere BigInsights is
running. To configure Watson Explorer Content Analytics to
use the InfoSphere BigInsights server,
the system must be able to connect to the server.
- In the Watson Explorer Content Analytics administration
console, click the System tab and click Configure IBM InfoSphere BigInsights.
- On the Configure IBM InfoSphere BigInsights
Server page, specify the host name of the InfoSphere BigInsights JobTracker node. You
must also specify the port on which the JobTracker runs and the path
to the InfoSphere BigInsights distributed
file system.
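Because the system must be able to connect to the server, it can help to confirm that the JobTracker host and port are reachable from the master server before you save the configuration. The following Python sketch tests only that a TCP connection can be opened; it does not verify that the service on that port is a JobTracker. The host name and port in the usage note are hypothetical placeholders.

```python
import socket


def can_connect(host, port, timeout_seconds=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, can_connect("bi-mgmt.example.com", 9001) returns True only if the connection is accepted. Substitute the host name and port that you specify on the Configure IBM InfoSphere BigInsights Server page.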
Tip: By default, all crawled documents are
indexed as a single job on the
InfoSphere BigInsights server.
However, in some cases you might want to limit the amount of data that
is processed per job. For example, you might need to minimize the
temporary storage that is required for Hadoop and Jaql on the
InfoSphere BigInsights server. Alternatively,
you might want to divide the indexing task into shorter jobs so
that parts of the index are available more quickly. In such a case,
specify a limit in the
Maximum amount of data to process
per indexing job (in megabytes) advanced option.
The
total amount of time that is required to process all the input data
is greater if the data is processed in multiple jobs. This time increase
arises from the overhead associated with Hadoop and Jaql. In addition,
global analysis and optional facet indexing run for each indexing
job.
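The trade-off in this tip can be illustrated with a small model. The following Python sketch is not the algorithm that the product uses to divide work; it is a simplified greedy grouping that assumes documents are processed in order and that each job carries a fixed overhead. The function names, the per-megabyte processing time, and the overhead value are all hypothetical.

```python
def split_into_jobs(document_sizes_mb, max_mb_per_job):
    """Group document sizes, in order, into jobs that stay within the limit.

    A single document that is larger than the limit is placed in a job by itself.
    """
    if max_mb_per_job <= 0:
        raise ValueError("max_mb_per_job must be positive")
    jobs = []
    current = []
    current_total = 0
    for size in document_sizes_mb:
        # Start a new job when adding this document would exceed the limit.
        if current and current_total + size > max_mb_per_job:
            jobs.append(current)
            current, current_total = [], 0
        current.append(size)
        current_total += size
    if current:
        jobs.append(current)
    return jobs


def estimated_total_minutes(jobs, minutes_per_mb, overhead_minutes_per_job):
    """Estimate total time as processing time plus a fixed overhead for each job."""
    processing = sum(sum(job) for job in jobs) * minutes_per_mb
    return processing + len(jobs) * overhead_minutes_per_job


sizes = [400, 300, 500, 200]             # hypothetical document batch sizes in MB
one_job = split_into_jobs(sizes, 2000)   # limit large enough for a single job
two_jobs = split_into_jobs(sizes, 800)   # limit forces a split into two jobs
```

In this model, the two-job split makes the first 700 MB available sooner, but the estimated total time is higher by one extra job overhead, which matches the behavior that the tip describes.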
- Configure a collection to use the InfoSphere BigInsights server. When
you create or clone a collection, select Use IBM InfoSphere
BigInsights. The following differences apply for collections
that use InfoSphere BigInsights:
- If you stop a rebuild of the index before the process completes,
all progress is lost. For collections that do not use InfoSphere BigInsights, if you stop and restart
the rebuild index process, the system resumes processing for only
those documents that were not yet indexed.
- Thumbnails are generated during the text processing pipeline instead
of during global analysis. As a result, thumbnails are always processed
when the index is rebuilt. Unlike for other collections, you cannot
skip the thumbnail generation process when you rebuild the index.
Also, the document cache is not required for thumbnail generation.
- The optional facet index can be enabled for enterprise
search collections that use InfoSphere BigInsights.
If the InfoSphere BigInsights index
was created with partitions, you must rebuild the optional facet index
when you add a search server to the system topology.
- Document flags are not supported.
- Reorganizing the index is not supported.
- To view details about documents that were dropped because they
cannot be parsed or indexed, the search server must be running.
- The real-time natural language processing (NLP) API is not supported.
- Optional: Create and deploy a custom global
analysis plug-in to run custom Jaql scripts on the InfoSphere BigInsights server. Use
custom global analysis to implement custom logic that analyzes the
entire document set.
What to do next
- If indexing activity seems to be stalled in the administration console, use the InfoSphere BigInsights administration console to check the document processing and
indexing status. Until indexing is complete, the status is not relayed to the Watson Explorer Content Analytics administration console from the InfoSphere BigInsights server.
- Watson Explorer Content Analytics uses an orchestrator to manage Jaql processes.
When parsing and indexing are stopped, some processes on the InfoSphere BigInsights server might continue to run. These processes are orphaned.
Before you restart an index build, you must manually stop these processes. Stop the processes
by using the InfoSphere BigInsights administration console or by entering
the hadoop command at a command prompt.
- If you install IBM InfoSphere BigInsights Version 1.3 Fix Pack 1 or later,
you must run the escrbi.sh script, which is provided with Watson Explorer Content Analytics. The script updates the class paths for sessions that
require Hadoop libraries from the InfoSphere BigInsights server. Follow
these steps:
- Run the script: ES_INSTALL_ROOT/bin/escrbi.sh
- Stop the Watson Explorer Content Analytics system: esadmin system
stopall
- Restart the Watson Explorer Content Analytics system: esadmin system
startall
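The three steps for InfoSphere BigInsights Version 1.3 Fix Pack 1 or later must run in order, and a later step should not run if an earlier one fails. The following Python sketch shows one way to script that sequence. It assumes that the esadmin command is on the PATH of the administrator account and that escrbi.sh is run without arguments, as shown in the steps; the installation path in the example is a hypothetical placeholder for ES_INSTALL_ROOT.

```python
import os
import subprocess


def post_upgrade_commands(es_install_root):
    """Return the three commands from the steps above, in the required order."""
    return [
        [os.path.join(es_install_root, "bin", "escrbi.sh")],
        ["esadmin", "system", "stopall"],
        ["esadmin", "system", "startall"],
    ]


def run_in_order(commands, runner=subprocess.run):
    """Run each command in turn and stop at the first failure.

    With the default runner, check=True raises CalledProcessError on a
    nonzero exit status, so the remaining commands are not run.
    """
    for argv in commands:
        runner(argv, check=True)


# Hypothetical installation directory; substitute your ES_INSTALL_ROOT value.
commands = post_upgrade_commands("/opt/IBM/es")
```

Call run_in_order(commands) as the Watson Explorer Content Analytics administrator to run the sequence. The runner parameter exists so that the sequence can be reviewed or tested without starting any process.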