Product Documentation
Abstract
Watson Explorer Advanced Edition Version 11.0.0.1 includes a toolkit for importing HDFS files into a collection that is configured to support Apache Hadoop.
Content
The Hadoop Distributed File System (HDFS) usually contains a large number of files and a large amount of data. When importing HDFS files in previous versions of Watson Explorer Content Analytics, each file was treated as a single document. The new Hadoop import toolkit speeds up the import process by using the Hadoop MapReduce interface, which is supported by many programs that work with Apache Hadoop.
You can use the toolkit to create custom code for generating raw data store (RDS) files from various input formats. The generated RDS files can be directly imported into a Hadoop-enabled collection and indexed for content analytics.
Important: The imported files are moved into the RDS directory when the import request is processed. Documents previously imported by this command into the same crawl space are removed before new documents are imported.
Sample code is provided in the ES_INSTALL_ROOT/samples/hadoop_toolkit directory. The sample code processes tab-delimited CSV files on HDFS and generates RDS files.
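For orientation, the following is a minimal, hypothetical sketch of a map-only Hadoop job that parses tab-delimited lines from HDFS; it is not the shipped sample. It uses only the standard Hadoop MapReduce API and writes plain key/value text output, because the classes that write the RDS format are specific to the toolkit. The class and field layout shown here are illustrative assumptions, not part of the product.

// Minimal sketch (not the shipped sample): a map-only job that parses
// tab-delimited lines. The real toolkit sample converts each record into
// RDS format; this sketch emits plain key/value text instead.
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TabDelimitedImportSketch {

    public static class CsvMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Split each HDFS input line on tabs; in this sketch, field 0 is
            // treated as the document ID and the remaining fields as the body.
            String[] fields = line.toString().split("\t", -1);
            if (fields.length < 2) {
                return; // skip malformed lines
            }
            String docId = fields[0];
            String body = String.join("\t", Arrays.copyOfRange(fields, 1, fields.length));
            context.write(new Text(docId), new Text(body));
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0]: HDFS input directory, args[1]: HDFS output directory
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "tab-delimited import sketch");
        job.setJarByClass(TabDelimitedImportSketch.class);
        job.setMapperClass(CsvMapper.class);
        job.setNumReduceTasks(0); // map-only job: one output record per input line
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}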
The sample code requires the Java SDK and Apache Ant. To perform any of the following steps, you must log in as the Content Analytics administrator.
Build the sample code:
1. Set up the Hadoop environment:
- escrbi.sh
2. Build the sample with Apache Ant:
- cd $ES_INSTALL_ROOT/samples/hadoop_toolkit
- ant
Use the sample code to import files:
1. Create a collection, for example, hadoop_toolkit.
2. Import index fields from the $ES_INSTALL_ROOT/samples/hadoop_toolkit/index_fields.xml file.
3. Run the MapReduce job:
- $ hadoop jar $ES_INSTALL_ROOT/samples/hadoop_toolkit/build/toolkit_sample.jar <collection ID>
- The input file esdata/sample/input/sample.csv is created on HDFS, and the generated RDS files are created in the esdata/sample/output directory on HDFS.
4. Import the generated RDS files:
- $ esadmin importer importExternalRDS -cid hadoop_toolkit -url esdata/sample/output
Results similar to the following messages are logged:
FFQC5303I Importer Service (node1) (sid: importer) CCL session exists. PID: 54036
FFQC5314I The following result occurred: hdfs://...user/esadmin/esdata/sample/output
5. List the pending import jobs:
- $ esadmin importer listPendingExternalRDS -cid hadoop_toolkit
- The output shows the job properties. Each key is the request time (in epoch milliseconds) and each value is the path of the RDS directory. For example:
- FFQC5303I Importer Service (node1) (sid: importer) CCL session exists. PID: 54036
FFQC5314I The following result occurred: #Wed Sep 16 11:31:02 JST 2015
1442370395033=hdfs\://.../user/esadmin/esdata/sample/output
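The key/value lines in that result follow the Java properties format (the backslash before the colon is properties escaping). As a hypothetical illustration, assuming the result text has been saved to a local file, the following sketch loads it and converts each epoch-millisecond key to a readable timestamp; the file name is an assumption, not something the product produces.

// Minimal sketch, assuming the listPendingExternalRDS result has been
// saved locally in Java properties format.
import java.io.FileReader;
import java.time.Instant;
import java.util.Properties;

public class ListPendingRdsJobs {
    public static void main(String[] args) throws Exception {
        // args[0]: local copy of the result, e.g. pending_jobs.properties
        // (hypothetical file name)
        Properties pending = new Properties();
        try (FileReader in = new FileReader(args[0])) {
            // Properties.load() also unescapes "hdfs\://..." back to "hdfs://..."
            pending.load(in);
        }
        for (String key : pending.stringPropertyNames()) {
            // The key is the request time in epoch milliseconds;
            // the value is the HDFS path of the RDS directory.
            Instant requestTime = Instant.ofEpochMilli(Long.parseLong(key));
            System.out.println(requestTime + " -> " + pending.getProperty(key));
        }
    }
}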
Document Information
More support for: IBM Watson Explorer
Software version: 11.0.0.1
Operating system(s): Linux
Document number: 617031
Modified date: 17 June 2018
UID: swg27047247