
Toolkit for importing HDFS files into Hadoop-enabled collections

Product Documentation


Abstract

Watson Explorer Advanced Edition Version 11.0.0.1 includes a toolkit for importing HDFS files into a collection that is configured to support Apache Hadoop.

Content

The Hadoop Distributed File System (HDFS) typically contains a large number of files and a large amount of data. In previous versions of Watson Explorer Content Analytics, each HDFS file was treated as a single document during import. The new Hadoop import toolkit speeds up the import process by using the Hadoop MapReduce interface, which is supported by many programs that work with Apache Hadoop.

You can use the toolkit to create custom code for generating raw data store (RDS) files from various input formats. The generated RDS files can be directly imported into a Hadoop-enabled collection and indexed for content analytics.
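
For illustration, such custom code might look like the following minimal mapper sketch. It assumes only the standard Hadoop MapReduce API; the class name, the two-column input layout, and the idea that the toolkit's output format serializes the emitted pairs as RDS files are assumptions made for this example, not the toolkit's actual implementation.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper: reads tab-delimited lines and emits one
    // (document ID, document body) pair per line.
    public class CsvToRdsMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumed layout: the first column is the document ID and
            // the remainder of the line is the document body.
            String[] cols = line.toString().split("\t", 2);
            if (cols.length < 2) {
                return; // skip malformed records
            }
            context.write(new Text(cols[0]), new Text(cols[1]));
        }
    }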

Important: The imported files are moved into the RDS directory when the import request is processed. Documents previously imported by this command into the same crawl space are removed before new documents are imported.

Sample code is provided in the ES_INSTALL_ROOT/samples/hadoop_toolkit directory. The sample code processes tab-delimited CSV files on HDFS and generates RDS files.

The sample code requires a Java SDK and Apache Ant. To perform any of the following steps, you must be logged in as the Content Analytics administrator.

Build the sample code:

1. Set up the Hadoop environment:
    escrbi.sh
2. Go to the toolkit sample directory:
    cd $ES_INSTALL_ROOT/samples/hadoop_toolkit
3. Build the sample code:
    ant
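
As a rough sketch of what a jar like toolkit_sample.jar might do internally, the following hypothetical driver wires a mapper such as the CsvToRdsMapper sketch above into a map-only Hadoop job. It uses only the standard Hadoop MapReduce API; the real sample jar takes a collection ID argument and presumably uses the toolkit's own RDS output format, so the class names and arguments here are assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Hypothetical driver: configures a map-only job that turns CSV input
    // into key/value records. Usage: CsvToRdsDriver <input dir> <output dir>
    public class CsvToRdsDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "csv-to-rds");
            job.setJarByClass(CsvToRdsDriver.class);
            job.setMapperClass(CsvToRdsMapper.class);
            job.setNumReduceTasks(0); // map-only: no reduce phase is needed
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }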

Use the sample code to import files:

1. Create a collection, for example, hadoop_toolkit.
2. Import index fields from the $ES_INSTALL_ROOT/samples/hadoop_toolkit/index_fields.xml file.
3. Run the MapReduce job:
    hadoop jar $ES_INSTALL_ROOT/samples/hadoop_toolkit/build/toolkit_sample.jar <collection ID>
    The input file esdata/sample/input/sample.csv is created on HDFS.
    The RDS files are written to the esdata/sample/output directory on HDFS.
4. Import the RDS files into a collection:
    $ esadmin importer importExternalRDS -cid hadoop_toolkit -url esdata/sample/output

    Messages similar to the following are logged:

    FFQC5303I Importer Service (node1) (sid: importer) CCL session exists. PID: 54036
    FFQC5314I The following result occurred: hdfs://...user/esadmin/esdata/sample/output

5. List the pending jobs:
    $ esadmin importer listPendingExternalRDS -cid hadoop_toolkit
    The output shows the pending job as a property: the key is the request time in epoch milliseconds and the value is the path of the RDS directory (see the conversion sketch after this list). For example:
    FFQC5303I Importer Service (node1) (sid: importer) CCL session exists. PID: 54036
    FFQC5314I The following result occurred: #Wed Sep 16 11:31:02 JST 2015
    1442370395033=hdfs\://.../user/esadmin/esdata/sample/output
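
Because the key is in epoch milliseconds, you can convert it back to a readable timestamp. A minimal example that decodes the key from the sample output above:

    import java.time.Instant;

    public class PendingJobTime {
        public static void main(String[] args) {
            // The property key from the listing above, in epoch milliseconds
            long requestTime = 1442370395033L;
            // Prints 2015-09-16T02:26:35.033Z, that is, 11:26:35 JST
            System.out.println(Instant.ofEpochMilli(requestTime));
        }
    }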
[{"Product":{"code":"SS8NLW","label":"IBM Watson Explorer"},"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Component":"--","Platform":[{"code":"PF016","label":"Linux"}],"Version":"11.0.0.1","Edition":"Advanced","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

Document Information

More support for:
IBM Watson Explorer

Software version:
11.0.0.1

Operating system(s):
Linux

Document number:
617031

Modified date:
17 June 2018

UID

swg27047247
