Exporting crawled or analyzed documents

To use information from Watson Explorer Content Analytics in other applications, such as data warehouse, business intelligence, or classification applications, you can export crawled documents, or documents that were both crawled and analyzed, and then import the exported data into those applications.

About this task

How and when documents are exported depends on how content is added to the collection and whether the collection uses a document cache. Documents are exported when the parse and index services run. The parse and index services start when content is added to the collection or when the index is rebuilt.

Crawled documents

How the crawler is configured to run also controls how new, changed, and deleted documents are exported. The first time that the crawler crawls a data source, all documents are crawled. In subsequent crawls:

If the crawler is configured to crawl all updates, then the crawler checks for new, changed, and deleted documents. The export program exports the new and changed documents. You can also configure an option to export information about deleted documents. With that option enabled, when you export documents as XML, an XML file is created for each deleted document. In the XML output, the value of /Document@Type (the Type attribute of the Document element) is DELETED.
If the crawler is configured to crawl new and modified documents only, the crawler does not check for deleted documents and information about the deleted documents is not generated.
If you select the option to crawl new and modified documents only, the crawler looks for documents with modification dates that are later than the previous crawl time. If you copy files to a resource, the modification date might not change, which means that the crawler might not detect that the files were added to the resource. For example, if you copy files to a Windows folder, Windows does not automatically change the modification date of the files. To ensure that such files are crawled, select the option to crawl all updates or the option to do a full crawl.
If the crawler is configured to do a full crawl, then the entire crawl space is crawled and all documents that match your export criteria are exported, regardless of whether the documents were updated or deleted since the previous crawl.
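
An application that imports the exported XML files might need to tell deleted-document files apart from regular ones. The following is a minimal sketch, assuming that Type is an attribute on a root element named Document, as the /Document@Type notation above suggests. The class name DeletedDocumentCheck and the sample value NORMAL are hypothetical and are not part of the product.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

// Hypothetical helper; not part of Watson Explorer Content Analytics.
final class DeletedDocumentCheck {

    // Returns true when the root element is <Document> and its Type
    // attribute equals DELETED (assumed layout of the exported XML).
    static boolean isDeleted(InputStream xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(xml)
                    .getDocumentElement();
            return "Document".equals(root.getTagName())
                    && "DELETED".equals(root.getAttribute("Type"));
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers get one unchecked exception type.
            throw new IllegalArgumentException("Not a readable XML document", e);
        }
    }

    // Convenience overload for XML that is already held in a string.
    static boolean isDeleted(String xml) {
        return isDeleted(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) {
        // "NORMAL" is a made-up value used only to show a non-deleted case.
        System.out.println(isDeleted("<Document Type=\"DELETED\"/>")); // prints true
        System.out.println(isDeleted("<Document Type=\"NORMAL\"/>"));  // prints false
    }
}
```

In practice you would pass a stream for each file in the export output directory and route the deleted entries to whatever removal logic the importing application uses.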

Imported documents

All documents that are imported to a collection are passed to the parse and index services. If you configure a collection to export documents, all imported documents are exported when they are processed by the parse and index services.

Rebuilding the index

If the document cache is enabled for the collection, crawled and imported documents are saved in the cache. When you rebuild the index, documents in the cache are passed to the parse and index services. Thus, documents can be exported by restarting the index build.

If you change the export options, such as enabling the export of analyzed documents as XML files, you must restart the parse and index services for the change to take effect. Restarting the parse and index services also initializes some export actions. For example, if the collection is configured to export documents as CSV files, the export process creates a directory and the CSV files in which the exported documents are saved.

Exporting to IBM Cognos BI: If you use IBM® Cognos® Business Intelligence, the wizard helps you specify information for exporting documents to a relational database or as comma-separated value (CSV) files. Also see related topics about setting up the integration between Watson Explorer Content Analytics and IBM Cognos BI.

Exporting to IBM DB2: If you plan to export documents to an IBM DB2® database, you must install the DB2 Client on the Watson Explorer Content Analytics server. In a distributed installation, install the DB2 Client on the master server. When you configure the export, specify the appropriate JAR files, such as db2jcc.jar and db2jcc_license_cu.jar, which are installed with the DB2 Client.
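
Before you configure the export, it can help to confirm that the JDBC driver in db2jcc.jar is loadable from the class path that you plan to specify. The following is a minimal sketch, assuming the driver class name com.ibm.db2.jcc.DB2Driver and the standard jdbc:db2://host:port/database URL form. The class name Db2DriverCheck, the host dbhost, the port 50000, and the database name EXPORTDB are hypothetical placeholders.

```java
// Hypothetical pre-flight check; not part of Watson Explorer Content Analytics.
final class Db2DriverCheck {

    // JDBC driver class that is packaged in db2jcc.jar.
    static final String DRIVER_CLASS = "com.ibm.db2.jcc.DB2Driver";

    // Returns true when the named class can be loaded from the current class path.
    static boolean isDriverAvailable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    // Builds a JDBC URL of the form jdbc:db2://host:port/database.
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:db2://" + host + ":" + port + "/" + database;
    }

    public static void main(String[] args) {
        // Placeholder connection details; replace them with your own.
        System.out.println("URL: " + jdbcUrl("dbhost", 50000, "EXPORTDB"));
        System.out.println("Driver on class path: " + isDriverAvailable(DRIVER_CLASS));
    }
}
```

Run the check with both JAR files on the class path, for example java -cp db2jcc.jar:db2jcc_license_cu.jar:. Db2DriverCheck on Linux (use ; as the separator on Windows).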

Procedure

To export crawled or analyzed documents:

On the Collections view, expand the collection that you want to configure. In the Parse and Index pane, ensure that the parse and index process is running.
Click the icon to export documents and then click Configure options to export crawled or analyzed documents.
On the Options to Export Crawled or Analyzed Documents page, specify options for which documents you want to export and how you want to export them.
Export as XML files
Specify options for exporting documents as XML files:
- For crawled documents, specify whether you want to export document metadata, document content, or both. You must specify the output paths for where the exported data is to be written. The output directories must exist and allow write access.
- For analyzed documents, you can export the analyzed metadata and content. You can also export information that is stored in the common analysis structure (CAS). You must specify the output paths for where the exported data is to be written. The output directories must exist and allow write access.
  When you export analyzed documents as XML and enable the CAS as XMI format export, the information stored in the CAS is converted to XML Metadata Interchange (XMI) format and exported as XMI files. The purpose of this export option is to help you troubleshoot a custom annotator by viewing the contents of the CAS. The XMI format that is used in the CAS is based on the UIMA standard. If the UIMA standard changes in the future, the XMI format might also change.
- Specify URI patterns to identify the documents that you want to export. Each URI pattern is a regular expression, and you can enter one pattern per line. The regular expression must be interpretable by Java™. For example, to export documents with a URI that ends in .pdf, specify .*\.pdf (the backslash makes the period before pdf a literal period). To export all documents, specify .*. The patterns are evaluated for matches in the order that you list them here.
- For crawled documents, specify whether you want to export documents without analyzing them or adding them to the index. You might select this check box, for example, if you are collecting documents to import into another application.
- For analyzed documents, specify whether you want to export analyzed documents without adding them to the index. You might select this check box, for example, if you are collecting documents or analytical data to import into another application.
- Specify whether you want to export information about documents that were deleted from the crawl space since the crawler last checked for new, changed, and deleted documents.
- Specify whether you want to use the field name or facet path for mapping metadata when the documents are exported. You might select this option, for example, if you plan to import the exported documents into IBM Content Collector.
  This option preserves the original file name extensions when content is exported and allows field names and facet paths to be represented as elements in the exported XML files. If you do not enable this option, then the field names and facet paths are represented as attributes in the exported XML files.
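
The URI pattern behavior described in the options above can be tried out with java.util.regex before you enter patterns in the administration console. The following is a minimal sketch, assuming that a pattern must match the entire URI and that the first matching pattern in the list wins. The product's exact matching rules are not stated here, and the class name UriPatternFilter and the sample URIs are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for testing URI patterns; not part of the product.
final class UriPatternFilter {

    private final List<Pattern> patterns = new ArrayList<>();

    // Compiles the patterns in the order in which they are listed.
    UriPatternFilter(List<String> regexes) {
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex));
        }
    }

    // Returns the index of the first pattern that matches the whole URI, or -1.
    int firstMatch(String uri) {
        for (int i = 0; i < patterns.size(); i++) {
            if (patterns.get(i).matcher(uri).matches()) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        UriPatternFilter filter = new UriPatternFilter(Arrays.asList(".*\\.pdf", ".*"));
        System.out.println(filter.firstMatch("file:///docs/report.pdf")); // prints 0
        System.out.println(filter.firstMatch("file:///docs/report.doc")); // prints 1

        // Without the backslash, the period matches any character.
        UriPatternFilter loose = new UriPatternFilter(Arrays.asList(".*.pdf"));
        System.out.println(loose.firstMatch("file:///docs/notes-pdf"));   // prints 0
    }
}
```

In Java source code the backslash is doubled inside string literals; in the console you would enter the pattern as .*\.pdf.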
Export to a relational database
The options are similar to those that you specify for exporting documents as XML files, such as specifying URI patterns to identify the documents that you want to export. The remaining options differ between content analytics collections and enterprise search collections:
- If you are configuring options for a content analytics collection, click Configure and run a wizard to specify information about the target database. Documents will be exported into star-schema tables. The target database must exist.
- If you are configuring options for an enterprise search collection, specify the path for the database mapping file that controls where the exported data is to be written. See the ES-INSTALL_ROOT/default_config/export_rdb_mapping.xml file for a sample mapping file.
Export as CSV files
Click Configure and run a wizard to specify information about the fields that you want to export and where you want to create the CSV files. You must also specify URI patterns to identify the documents that you want to export.
Export with a custom plug-in
Specify the class path, class name, and properties that you want to pass to the custom plug-in.
Stop and restart the parse and index services for the collection.
Either import documents to the collection or configure at least one crawler to crawl the documents that you want to export and start the crawler.
Optional: In the Export area, check the status of the export request. For example, you can see the number of documents that have been exported so far, whether the export request is completed, and whether any errors occurred.