Setting Content Extractor properties

Configure the Content Extractor by editing the ExtractorP8.properties file.

About this task

You must set the following properties before you run the Content Extractor:
  • Ensure that the directory where output is written, which you identify in the XmlDirectory property, exists and that it is empty. The default value is extractorOutput. Also, ensure that the binaryOutputDirectory directory exists under the XML output directory and that it is empty.
  • Ensure that the Path_n properties specify the paths of IBM® FileNet® Content Manager folders or document classes from which to extract content.
When determining which documents to extract, identify the most relevant content and specify properties to include or exclude the content appropriately, for example:
  • Specify the paths for the IBM FileNet Content Manager folders that contain the content that must be extracted in the format object_store_name/directory: Path_1=JKEnterprises/Accounting
  • Specify the paths for folders with content that must not be extracted: IgnorePath_1=JKEnterprises/Accounting/AccountsPayable
  • Specify that only documents that have specific document properties must be extracted: With_1=DocumentClass=Confidential
  • Specify that documents with certain document properties must not be extracted: Without_1=DocumentClass=Holidays
  • Specify that only documents that were modified after a specific date must be extracted: Date=14-Jul-2008
  • Specify that documents that contain only metadata must be extracted: ExtractEmptyBody=true
Tips:
  • When you specify the object store name, use the display name rather than the symbolic name of the object store. For example, if the display name of the object store is Test File Plan, and its symbolic name is TestFilePlan, specify the following value for the Path_1 property in the ExtractorP8.properties file: Path_1 = Test File Plan. If you specify the symbolic name, content is not extracted.
  • IBM FileNet Content Manager folder names, document class names, and document properties are case-sensitive. If you specify the names incorrectly, content is not extracted.

The properties that you specify are handled as Boolean AND operations. For example, you can extract documents from a specific path that have specific document properties and that were modified after a specific date.
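As an illustration of how these conditions combine, the following fragment reuses the example values from the list above. It is a sketch only: the object store, folder, and property values are the sample names used in this topic, not values from a real system. With all four lines present, a document is extracted only if it satisfies every condition.

```properties
# Extract from the Accounting folder of the JKEnterprises object store
Path_1=JKEnterprises/Accounting
# Skip the AccountsPayable subfolder
IgnorePath_1=JKEnterprises/Accounting/AccountsPayable
# Keep only documents whose DocumentClass property is Confidential
With_1=DocumentClass=Confidential
# Keep only documents modified after this date
Date=14-Jul-2008
```

Comments are placed on their own lines because a Java properties file treats text after the equals sign as part of the value.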

When you gather sample content, you can control how much data is extracted from each folder, for example:
  • Specify the minimum number of documents to extract from each folder: FolderMin=20
  • Specify the maximum number of documents to extract from each folder: FolderMax=100
  • Specify the fraction of eligible documents to extract from each folder: FolderFraction=0.20 (that is, 20%). When this value is less than 1.0 (100%), and the folder contains more than the minimum number of documents to be extracted, the Content Extractor randomly selects documents from the folder until the maximum number of documents to be extracted is reached.
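Taken together, the sampling settings from the list above might look like the following fragment. The values are the sample values from this topic, not recommendations.

```properties
# Sampling limits applied to each folder
FolderMin=20
FolderMax=100
# Fraction of eligible documents, expressed as a decimal (0.20 = 20%)
FolderFraction=0.20
```

Read this way, a folder with 300 eligible documents would be expected to yield about 60 randomly selected documents (20% of 300), which falls between the minimum and the maximum. How the three values interact at the boundaries is not spelled out here, so check the comments in the property file before you rely on exact counts.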
Beginning with IBM Content Classification version 8.6, the Content Extractor no longer includes filtered text in the XML output. The following properties in the Extractor.properties file are ignored if they are present:
  • StellentPath
  • IcmUrl
  • KbName
  • FileMax

You can use the following properties to process document properties and attachments to extracted documents:

propertyList
Specifies a comma-delimited list of IBM FileNet Content Manager properties to extract from documents. If this property is present, only the listed properties are returned in the extracted XML output. If this property is not present, all IBM FileNet Content Manager properties are returned in the extracted XML output. You might want to list specific properties to ensure that only the properties that contain data that is relevant for classification are extracted, for example:
propertyList=ICM_topFolder, ICM_details
binaryOutputDirectory
Specifies the path for an existing directory where the binary output is to be written. This directory must exist under the XML output directory and it must be empty before you run the Content Extractor.
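
The output-related properties could be set together as in the following sketch. The XmlDirectory value is the default named in this topic and the propertyList value is the example given above. The binaryOutputDirectory value is hypothetical: the subdirectory name binary, and the assumption that the path is written relative to the same location as XmlDirectory, are illustrative only. Confirm the expected path form in the property file comments.

```properties
# XML output directory (default value); must exist and be empty
XmlDirectory=extractorOutput
# Binary output directory; must exist under the XML output directory and be empty
# (hypothetical subdirectory name and path form)
binaryOutputDirectory=extractorOutput/binary
# Return only these document properties in the extracted XML output
propertyList=ICM_topFolder, ICM_details
```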

To set Content Extractor properties:

Procedure

  1. Open the ExtractorP8.properties file with a text editor. This file is installed in the Classification_Home/ECMTools/conf/FileNetP8 directory.
  2. Edit the properties and save your changes. For a description of each property, see the detailed comments in the property file.