
Toolkit for importing HDFS files into Hadoop-enabled collections

Product Documentation


Abstract

Watson Explorer Advanced Edition Version 11.0.0.1 includes a toolkit for importing HDFS files into a collection that is configured to support Apache Hadoop.

Content

The Hadoop Distributed File System (HDFS) typically contains a large number of files and a large amount of data. In previous versions of Watson Explorer Content Analytics, each HDFS file was treated as a single document during import. The new Hadoop import toolkit speeds up the import process by using the Hadoop MapReduce interface, which is supported by many programs that work with Apache Hadoop.

You can use the toolkit to create custom code for generating raw data store (RDS) files from various input formats. The generated RDS files can be directly imported into a Hadoop-enabled collection and indexed for content analytics.
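
For illustration, such custom code might look like the following minimal mapper sketch. It assumes only the standard Hadoop MapReduce API; the class name, the two-column input layout, and the idea that the toolkit's output format serializes the emitted pairs as RDS files are assumptions made for this example, not the toolkit's actual implementation.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper: reads tab-delimited lines and emits one
    // (document ID, document body) pair per line.
    public class CsvToRdsMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumed layout: the first column is the document ID and
            // the remainder of the line is the document body.
            String[] cols = line.toString().split("\t", 2);
            if (cols.length < 2) {
                return; // skip malformed records
            }
            context.write(new Text(cols[0]), new Text(cols[1]));
        }
    }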

Important: The imported files are moved into the RDS directory when the import request is processed. Documents previously imported by this command into the same crawl space are removed before new documents are imported.

Sample code is provided in the ES_INSTALL_ROOT/samples/hadoop_toolkit directory. The sample code processes tab-delimited CSV files on HDFS and generates RDS files.

The sample code requires a Java SDK and Apache Ant. To perform any of the following steps, you must be logged in as the Content Analytics administrator.

Build the sample code:

1. Set up the Hadoop environment:
    escrbi.sh
2. Go to the toolkit sample directory:
    cd $ES_INSTALL_ROOT/samples/hadoop_toolkit
3. Build the sample code:
    ant
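
As a rough sketch of what a jar like toolkit_sample.jar might do internally, the following hypothetical driver wires a mapper such as the CsvToRdsMapper sketch above into a map-only Hadoop job. It uses only the standard Hadoop MapReduce API; the real sample jar takes a collection ID argument and presumably uses the toolkit's own RDS output format, so the class names and arguments here are assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Hypothetical driver: configures a map-only job that turns CSV input
    // into key/value records. Usage: CsvToRdsDriver <input dir> <output dir>
    public class CsvToRdsDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "csv-to-rds");
            job.setJarByClass(CsvToRdsDriver.class);
            job.setMapperClass(CsvToRdsMapper.class);
            job.setNumReduceTasks(0); // map-only: no reduce phase is needed
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }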

Use the sample code to import files:

1. Create a collection, for example, hadoop_toolkit.
2. Import index fields from the $ES_INSTALL_ROOT/samples/hadoop_toolkit/index_fields.xml file.
3. Run the MapReduce job:
    hadoop jar $ES_INSTALL_ROOT/samples/hadoop_toolkit/build/toolkit_sample.jar <collection ID>
    The input file esdata/sample/input/sample.csv is created on HDFS.
    The RDS files are written to the esdata/sample/output directory on HDFS.
4. Import the RDS files into a collection:
    $ esadmin importer importExternalRDS -cid hadoop_toolkit -url esdata/sample/output

    Messages similar to the following are logged:

    FFQC5303I Importer Service (node1) (sid: importer) CCL session exists. PID: 54036
    FFQC5314I The following result occurred: hdfs://...user/esadmin/esdata/sample/output

5. List the pending jobs:
    $ esadmin importer listPendingExternalRDS -cid hadoop_toolkit
    The output shows the pending job as a property: the key is the request time in epoch milliseconds and the value is the path of the RDS directory (see the conversion sketch after this list). For example:
    FFQC5303I Importer Service (node1) (sid: importer) CCL session exists. PID: 54036
    FFQC5314I The following result occurred: #Wed Sep 16 11:31:02 JST 2015
    1442370395033=hdfs\://.../user/esadmin/esdata/sample/output
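
Because the key is in epoch milliseconds, you can convert it back to a readable timestamp. A minimal example that decodes the key from the sample output above:

    import java.time.Instant;

    public class PendingJobTime {
        public static void main(String[] args) {
            // The property key from the listing above, in epoch milliseconds
            long requestTime = 1442370395033L;
            // Prints 2015-09-16T02:26:35.033Z, that is, 11:26:35 JST
            System.out.println(Instant.ofEpochMilli(requestTime));
        }
    }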
[{"Product":{"code":"SS8NLW","label":"IBM Watson Explorer"},"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Component":"--","Platform":[{"code":"PF016","label":"Linux"}],"Version":"11.0.0.1","Edition":"Advanced","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

Document Information

More support for:
IBM Watson Explorer

Software version:
11.0.0.1

Operating system(s):
Linux

Document number:
617031

Modified date:
17 June 2018

UID

swg27047247
