About this task
You must set the following properties before you run the
Content Extractor:
- Ensure that the directory where output is written, which you identify
in the XmlDirectory property, exists and that it
is empty. The default value is extractorOutput. Also,
ensure that the binaryOutputDirectory directory
exists under the XML output directory and that it is empty.
- Ensure that the Path_n properties
specify the paths of IBM® FileNet® Content Manager folders
or document classes from which to extract content.
When determining which documents to extract, identify
the most relevant content and specify properties to include or exclude
the content appropriately, for example:
- Specify the paths, in the format object_store_name/directory, for the
IBM FileNet Content Manager folders that contain the content that must be extracted: Path_1=JKEnterprises/Accounting
- Specify the paths for folders with content that must not be extracted: IgnorePath_1=JKEnterprises/Accounting/AccountsPayable
- Specify that only documents that have specific document properties
must be extracted: With_1=DocumentClass=Confidential
- Specify that documents with certain document properties must not
be extracted: Without_1=DocumentClass=Holidays
- Specify that only documents that were modified after a specific
date must be extracted: Date=14-Jul-2008
- Specify that documents that contain only metadata must be extracted: ExtractEmptyBody=true
Tips:
- When you specify the object store name, use the display
name rather than the symbolic name of the object store. For example,
if the display name of the object store is Test File Plan, and its
symbolic name is TestFilePlan, specify the following value for the Path_1 property
in the ExtractorP8.properties file: Path_1 = Test File Plan.
If you specify the symbolic name, content is not extracted.
- IBM FileNet Content Manager folder names,
document class names, and document properties are case-sensitive.
If you specify the names incorrectly, content is not extracted.
The properties that you specify are handled as Boolean AND
operations. For example, you can extract documents from a specific
path that have specific document properties and that were modified
after a specific date.
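The Boolean AND behavior can be illustrated with a short Python sketch that parses the example values from this topic and checks whether a document passes every filter. This is only an illustration, not the Content Extractor implementation: the document dictionary, its field names (path, properties, modified), and the is_eligible helper are hypothetical, and the rule that a path matches a folder and everything below it is an assumption.

```python
from datetime import datetime

# Example values from this topic, as they might appear in ExtractorP8.properties.
PROPERTIES_TEXT = """\
Path_1=JKEnterprises/Accounting
IgnorePath_1=JKEnterprises/Accounting/AccountsPayable
With_1=DocumentClass=Confidential
Without_1=DocumentClass=Holidays
Date=14-Jul-2008
"""


def parse_properties(text):
    """Parse key=value lines, splitting on the first '=' only."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def under(path, folder):
    """True if path is the folder itself or lies below it (assumed rule)."""
    return path == folder or path.startswith(folder + "/")


def is_eligible(doc, props):
    """Hypothetical check: a document qualifies only if ALL conditions hold."""
    with_name, _, with_value = props["With_1"].partition("=")
    without_name, _, without_value = props["Without_1"].partition("=")
    # %b expects an English month abbreviation in the default locale.
    cutoff = datetime.strptime(props["Date"], "%d-%b-%Y")
    return (
        under(doc["path"], props["Path_1"])
        and not under(doc["path"], props["IgnorePath_1"])
        and doc["properties"].get(with_name) == with_value
        and doc["properties"].get(without_name) != without_value
        and doc["modified"] > cutoff
    )
```

In this sketch, a Confidential document in JKEnterprises/Accounting/Invoices that was modified in 2009 passes all of the conditions, whereas the same document under JKEnterprises/Accounting/AccountsPayable, or one that was last modified in 2007, fails one condition and is therefore skipped.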
When you gather sample content, you can control how much data is extracted
from each folder, for example:
- Specify the minimum number of documents to extract from each folder: FolderMin=20
- Specify the maximum number of documents to extract from each folder: FolderMax=100
- Specify the percentage of eligible documents to extract from each
folder: FolderFraction=0.20. When this value is less
than 100% (1.0), and the folder contains more than the minimum number
of documents to be extracted, the Content Extractor randomly selects documents
from the folder until the maximum number of documents to be extracted
is reached.
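To get a feel for how these three values interact, the following Python sketch computes a sample size for a single folder. It reflects one plausible reading of the description above (take the fraction, then apply the minimum and maximum, and never exceed the folder size); the actual selection logic of the Content Extractor might differ. The function names and the rounding rule are assumptions made for illustration.

```python
import random


def sample_size(doc_count, folder_min=20, folder_max=100, folder_fraction=0.20):
    """Assumed interpretation of FolderMin, FolderMax, and FolderFraction."""
    target = round(doc_count * folder_fraction)  # share of eligible documents
    target = max(target, folder_min)             # at least the minimum
    target = min(target, folder_max)             # at most the maximum
    return min(target, doc_count)                # never more than the folder holds


def sample_folder(documents, rng=random, **limits):
    """Randomly pick documents from one folder, without replacement."""
    return rng.sample(documents, sample_size(len(documents), **limits))
```

With the example values FolderMin=20, FolderMax=100, and FolderFraction=0.20, this reading gives 12 documents for a folder of 12, 20 for a folder of 50, 60 for a folder of 300, and 100 for a folder of 1000.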
Beginning with IBM Content Classification version 8.6, the Content Extractor no
longer includes filtered text in the XML output. The following properties
in the Extractor.properties file are ignored,
if they are present:
- StellentPath
- IcmUrl
- KbName
- FileMax
You can use the following properties to process document
properties and attachments to extracted documents:
- propertyList: Specifies a comma-delimited list of IBM FileNet Content Manager properties to extract from
documents. If this property is present, then only the listed properties
are returned in the extracted XML output. If this property is not
present, then all IBM FileNet Content Manager properties
are returned in the extracted XML output. You might want to list specific
properties to ensure that only the properties that contain data that
is relevant for classification are extracted, for example:
propertyList=ICM_topFolder, ICM_details
- binaryOutputDirectory: Specifies the path for an existing directory where the binary
output is to be written. This directory must exist under the XML output
directory and it must be empty before you run the Content Extractor.
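Because the Content Extractor expects both output directories to exist and to be empty, a small check before each run can be useful. The following Python sketch is one way to do that. It is not part of the product; the check_output_dirs helper and the directory name binary are hypothetical. It also assumes that "empty" for the XML output directory means that it contains nothing other than the binary output directory.

```python
from pathlib import Path


def check_output_dirs(xml_directory, binary_output_directory):
    """Return a list of problems; an empty list means the directories look ready."""
    xml_dir = Path(xml_directory).resolve()
    bin_dir = Path(binary_output_directory).resolve()
    problems = []

    if not xml_dir.is_dir():
        problems.append(f"XmlDirectory does not exist: {xml_dir}")
    if not bin_dir.is_dir():
        problems.append(f"binaryOutputDirectory does not exist: {bin_dir}")
    if xml_dir not in bin_dir.parents:
        problems.append(f"binaryOutputDirectory is not under XmlDirectory: {bin_dir}")

    # The binary output directory must not contain anything.
    if bin_dir.is_dir() and any(bin_dir.iterdir()):
        problems.append(f"binaryOutputDirectory is not empty: {bin_dir}")

    # Assumption: the XML directory may contain only the path to the binary directory.
    if xml_dir.is_dir():
        extras = [
            entry for entry in xml_dir.iterdir()
            if entry != bin_dir and entry not in bin_dir.parents
        ]
        if extras:
            problems.append(f"XmlDirectory is not empty: {xml_dir}")

    return problems


if __name__ == "__main__":
    # extractorOutput is the documented default; "binary" is a made-up name.
    for problem in check_output_dirs("extractorOutput", "extractorOutput/binary"):
        print(problem)
```

Run the check from the directory that relative paths in the properties file are resolved against, or pass absolute paths.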
To set Content Extractor properties: