Configuring harvester settings

You can use several different harvester settings to fine-tune your index process.

Procedure

  1. Go to Administration > Configuration > Application > Harvester settings.
  2. To configure Basic settings, follow these steps:
    1. Harvester Processes: Select either Content processing or System metadata only.
    2. Harvest miscellaneous email items: Select to harvest contacts, calendar appointments, notes, and tasks from the Exchange server.
    3. Harvest non-standard Exchange message classes: Select to harvest message classes that do not represent standard Exchange email and miscellaneous items.
    4. Include extended characters in object names: Select to allow extended characters to be included in data object names during a harvest.
    5. Determine whether data objects have NSRL digital signature: Select to check data objects for NSRL digital signatures.
    6. Enable parallel grazing: Select to harvest volumes that were already harvested and are going to be reharvested.
      With parallelized grazing, a harvest that was interrupted resumes where it left off, and a harvest that completed normally starts again at the beginning.
    7. Index generated text: Select to index the generated text that OutsideIn extracts and make it available for full-text search.
      For sensitive data detection, especially in spreadsheet files, this option must be enabled.
  3. Specify Skip Content processing.
    In Data object extensions to be skipped, enter the extensions of the file types that you want the harvest to ignore.
  4. To configure Locations to ignore, enter each directory that must be skipped. IBM® StoredIQ® accepts only one entry per line. Regular expressions can be used.
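The one-entry-per-line format with regular expressions can be illustrated with a minimal sketch. It assumes Python `re` syntax and unanchored matching; the regular expression dialect and matching rules that the product applies are not stated here, and the entries and paths are hypothetical.

```python
import re

def compile_ignore_patterns(text):
    """Compile one regular expression per non-empty line."""
    return [re.compile(line.strip()) for line in text.splitlines() if line.strip()]

def is_ignored(path, patterns):
    """Return True if any pattern matches anywhere in the path (assumed semantics)."""
    return any(p.search(path) for p in patterns)

# Hypothetical entries, one per line, as they might be typed into the field.
ignore_entries = r"""
/tmp/
\.snapshot
/home/[^/]+/cache/
"""

patterns = compile_ignore_patterns(ignore_entries)
print(is_ignored("/home/alice/cache/thumbs.db", patterns))
print(is_ignored("/home/alice/reports/q3.docx", patterns))
```

Testing candidate entries this way before entering them can help catch a pattern that is broader than intended, such as an unescaped period.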
  5. To configure Limits, follow these steps:
    1. Maximum data object size: Specify the maximum data object size to be processed during a harvest.
      During a harvest, files that exceed the maximum data object size are not read. As a result, if full-text/content processing is enabled for the volume, they are audited as skipped: Configured max. object size. These objects still appear in the volume cluster along with all file system metadata. Since they were not read, the hash is a hash of the file-path and size of the object, regardless of what the hash settings are for the volume (full/partial/off).
    2. Max entity values per entity: For any entity type (date, city, address, and the like), the system records, per data object, at most the number of values set in this field.
      The values do not need to be unique. For example, if the maximum value is 1,000, and the harvester collects 1,000 instances of the same date (8/15/2009) in a Word document, the system stops counting dates. This setting applies to all user-defined expressions (keyword, regular expression, scoped, and proximity) and all standard attributes.
    3. Max entity values per data object: Across all entity types, the total (cumulative) number of values that is collected from a data object during a harvest. A 0 in this field means "unlimited".
      This setting applies to all user-defined expressions (keyword, regular expression, scoped, and proximity) and all standard attributes.
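How the two entity limits interact can be modeled with a short sketch. This is an illustration of the rules described in this step, not the harvester's implementation: the function name, the discovery order, and the sample values are assumptions. A per-object limit of 0 is treated as unlimited, and duplicate values count toward both limits.

```python
def collect_entity_values(found, max_per_entity, max_per_object):
    """Apply both caps to (entity_type, value) pairs in discovery order.

    max_per_object == 0 means unlimited. Duplicates count toward the
    caps because the values do not need to be unique.
    """
    kept = {}
    total = 0
    for entity_type, value in found:
        if max_per_object and total >= max_per_object:
            break  # cumulative cap reached for this data object
        values = kept.setdefault(entity_type, [])
        if len(values) >= max_per_entity:
            continue  # this entity type is full; other types can still be recorded
        values.append(value)
        total += 1
    return kept

# Hypothetical findings: the same date five times, then the same city twice.
found = [("date", "8/15/2009")] * 5 + [("city", "Austin")] * 2
result = collect_entity_values(found, max_per_entity=3, max_per_object=4)
print({k: len(v) for k, v in result.items()})
```

In this model, the per-entity limit stops the repeated date after three values, and the per-object limit then stops collection after one city value.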
  6. Configure Binary Processing.
    1. Run binary processing when text processing fails: Select this option to run binary processing.
      The system runs further processes against content that failed during harvesting. You can select options for when to start this extended processing and how to scan content. Binary processing does not search image file types such as .GIF and .JPG for text extraction.
    2. Failure reasons to begin binary processing: Select the checkboxes of the options that define when to start extended processing.
      Binary processing can start when text extraction from a file fails in these situations:
      • when the format of the file is unknown to the system parameters;
      • when the data object type is not supported by the harvester scan;
      • when the data object format does not contain actual text.
    3. Data object extensions: Set binary processing to process all data files or only files of entered extensions. To add extensions, enter one per line without a period.
    4. Text encoding: Set options for what data to scan and extract at the start of binary processing.
      This extended processing can accept extended characters and UTF-16 and UTF-32 encoded characters as text. The system searches UTF-16 and UTF-32 by default.
    5. Minimums: Set the minimum required number of located, consecutive characters to begin processing for text extraction.
      For example, if you enter 4, the system begins text processing when four consecutive characters of a selected text encoding are found. This setting helps find and extract useful text during binary processing and reduces the number of false positives.
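The effect of the Minimums setting can be sketched with a scan for runs of consecutive text characters in binary data. This is a simplified illustration that handles printable ASCII only; it does not model the extended-character, UTF-16, or UTF-32 options, and the function name and sample bytes are hypothetical rather than taken from the product.

```python
import re

def extract_text_runs(data, minimum=4):
    """Return runs of at least `minimum` consecutive printable ASCII bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % minimum)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]

# Hypothetical binary content with short and long text fragments.
blob = b"\x00\x01abc\x02\xffInvoice 2009\x00\x10ok\x00total=42\x7f"
print(extract_text_runs(blob, minimum=4))
print(extract_text_runs(blob, minimum=2))
```

With a minimum of 4, the short fragments "abc" and "ok" are dropped. Lowering the minimum to 2 keeps them, which shows how a smaller value admits more incidental byte sequences as text.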
  7. Click OK.
    Changes to harvester settings do not take effect until the appliance is rebooted or the application services are restarted.