To handle large amounts of data, you can configure an IBM® InfoSphere® BigInsights server to help
process the data. You can also run custom global analysis tasks as
Jaql scripts on the BigInsights server.
Restriction: The integration with IBM InfoSphere BigInsights
is supported only on Linux platforms.
IBM InfoSphere BigInsights is a platform for
managing and analyzing large amounts of structured and unstructured
data in a reliable, fault-tolerant manner. BigInsights is based on
Apache Hadoop, an open source, distributed computing platform. When
you integrate IBM Content
Analytics with Enterprise Search with IBM InfoSphere BigInsights, the IBM Content
Analytics with Enterprise Search indexing and global analysis
processes run on IBM InfoSphere BigInsights.
For
the latest information about the versions of IBM InfoSphere BigInsights
that are supported by IBM Content
Analytics with Enterprise Search,
see the system requirements technote.
To integrate with IBM InfoSphere BigInsights:
- Verify that IBM InfoSphere BigInsights is
installed and configured properly. Ensure that the BigInsights
administrator user ID and group ID are the same on all nodes in the
cluster. BigInsights must be installed with the following components:
- Jaql
- BigInsights orchestrator
- BigIndex
- Hadoop
In addition, a shared POSIX file system is required.
For IBM InfoSphere BigInsights prerequisites information
and installation procedures, see the IBM InfoSphere BigInsights Information
Center.
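The requirement that the administrator user ID and group ID match on every node can be checked by comparing the values that `id -u` and `id -g` report on each node. The following is a minimal sketch, assuming those values have already been collected into a dictionary; the function name, host names, and ID values are hypothetical and are not part of either product.

```python
from collections import Counter


def mismatched_nodes(ids_by_node):
    """Return the nodes whose (uid, gid) pair differs from the most common pair.

    ids_by_node maps a node name to the (uid, gid) of the administrator
    account on that node, for example as reported by `id -u` and `id -g`.
    """
    if not ids_by_node:
        return {}
    # Treat the most frequent pair as the reference value for the cluster.
    reference, _ = Counter(ids_by_node.values()).most_common(1)[0]
    return {node: ids for node, ids in ids_by_node.items() if ids != reference}


# Hypothetical cluster: one data node was created with a different uid.
cluster = {
    "mgmt": (1001, 1001),
    "data1": (1001, 1001),
    "data2": (1002, 1001),
}
print(mismatched_nodes(cluster))
```

An empty result means that all nodes agree; any node that is listed needs its account corrected before you continue.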
- Install IBM Content
Analytics with Enterprise Search according
to the following guidelines.
- Install the master server on the same computer as the BigInsights
management node.
- Install the data directory (ES_NODE_ROOT) on a shared file system,
such as IBM General Parallel
File System (GPFS™). If you install
the data directory on Network File System (NFS), you must export it
with the no_root_squash option for at least the master
server. The shared file system must be mounted on all nodes of BigInsights.
Ensure that the path to the shared file system is the same on all
nodes. If you use NFS, ES_NODE_ROOT must be mounted with the exec and rw options.
- For the installation directory, specify a location on the local
computer.
- Ensure that the same administrator ID is used for BigInsights
and IBM Content
Analytics with Enterprise Search.
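The mount requirements for the data directory can be verified on each node before installation. The following is a minimal sketch, assuming a Linux node whose mounts are listed in the `/proc/mounts` format. The function name and the sample paths are hypothetical. Note that `/proc/mounts` normally lists `noexec` when execution is disabled rather than listing `exec` explicitly, so the sketch checks for the absence of `noexec`.

```python
def mount_problems(mounts_text, mount_point):
    """Return a list of problems for mount_point in /proc/mounts-style text."""
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, file system type, options, dump, pass.
        if len(fields) >= 4 and fields[1] == mount_point:
            options = set(fields[3].split(","))  # the last matching entry wins
    if options is None:
        return ["not mounted"]
    problems = []
    if "rw" not in options:
        problems.append("not mounted read-write (rw)")
    if "noexec" in options:
        problems.append("mounted with noexec")
    return problems


# Hypothetical sample; on a real node, read the text from /proc/mounts.
SAMPLE = """\
nfs01:/export/es /shared/es_node_root nfs rw,relatime,vers=3 0 0
nfs01:/export/tmp /shared/tmp nfs rw,noexec,relatime 0 0
"""
print(mount_problems(SAMPLE, "/shared/es_node_root"))
print(mount_problems(SAMPLE, "/shared/tmp"))
```

This check covers only the client-side mount options. The no_root_squash export option is set on the NFS server and must be verified there.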
- Configure IBM Content
Analytics with Enterprise Search to
use the BigInsights server.
- Verify that BigInsights is running. To configure IBM Content
Analytics with Enterprise Search to use a BigInsights server,
the system must be able to connect to the BigInsights server.
- In the IBM Content
Analytics with Enterprise Search administration
console, click the System tab and click Configure
IBM InfoSphere BigInsights Server.
- On the Configure IBM InfoSphere BigInsights
Server page, specify the host name of the BigInsights JobTracker
node. You must also specify the port on which the JobTracker
runs and the path to the BigInsights distributed file system.
Tip: By default, all crawled documents are indexed as a single
job on the BigInsights server. But in some cases you might want to
limit the amount of data that is processed per job. For example, you
might need to minimize the temporary storage that is required for
Hadoop and Jaql on the BigInsights server. Alternatively, you might
want to divide the indexing task into shorter jobs so that parts
of the index are available more quickly. In such a case, specify a
limit in the
Maximum amount of data to process per indexing
job (in megabytes) advanced option.
Be aware that the
total amount of time required to process all the input data is greater
if the data is processed in multiple jobs. This time increase is due
to the overheads associated with Hadoop and Jaql. In addition, global
analysis and optional facet indexing run for each indexing job.
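The trade-off that the tip describes can be illustrated with simple arithmetic. The following sketch is a rough model only. It assumes that the input is split into jobs of at most the configured limit, that each job adds a fixed overhead, and that processing time grows linearly with data size. The function names and all figures are hypothetical and do not come from the product.

```python
import math


def job_count(total_mb, limit_mb):
    """Number of indexing jobs if at most limit_mb is processed per job."""
    if limit_mb <= 0:
        raise ValueError("limit_mb must be positive")
    return max(1, math.ceil(total_mb / limit_mb))


def estimated_total_seconds(total_mb, limit_mb, overhead_s_per_job, s_per_mb):
    """Assumed model: fixed overhead per job plus linear processing time."""
    jobs = job_count(total_mb, limit_mb)
    return jobs * overhead_s_per_job + total_mb * s_per_mb


# Hypothetical figures: 10,000 MB of input, 300 s overhead per job, 1 s per MB.
single_job = estimated_total_seconds(10000, 10000, 300, 1)
four_jobs = estimated_total_seconds(10000, 2500, 300, 1)
print(single_job, four_jobs)
```

Under these assumptions, four jobs take longer in total than a single job, but the first part of the index becomes available after roughly a quarter of the processing time.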
- Configure a collection to use the BigInsights server. When you create or clone a collection, select Use
IBM InfoSphere BigInsights. The following differences
apply for collections that use IBM InfoSphere BigInsights:
- If you stop a rebuild of the index before the process completes,
all progress is lost. For collections that do not use IBM InfoSphere BigInsights,
if you stop and restart the rebuild index process, the system resumes
processing for only those documents that were not yet indexed.
- Thumbnails are generated during the text processing pipeline instead
of during global analysis. As a result, thumbnails are always processed
when the index is rebuilt (that is, unlike for other collections,
you cannot skip the thumbnail generation process when you rebuild
the index). Also, the document cache is not required for thumbnail
generation.
- The optional facet index can be enabled for search collections
that use BigInsights.
- Document flags are not supported.
- Reorganizing the index is not supported.
- To view details about documents that were dropped because they
could not be parsed or indexed, the search server must be running.
- The real-time natural language processing (NLP) API is not supported.
- Optional: Create and deploy a custom global
analysis plug-in to run custom Jaql scripts on the BigInsights server. Use custom global analysis to implement custom logic that analyzes
the entire document set.