Configuring full-text index settings

Use the full-text index settings feature to customize your full-text index.

About this task

Before you configure or search the full-text index, consider the following situations:
  • Full-text filters that contain words might not return all instances of those words: You can limit full-text indexing of words based on their length. For example, if you choose to index only words of up to 50 characters, words longer than 50 characters are not indexed.
  • Full-text filters that contain numbers might not return all instances of those numbers: This situation can occur when number searches are configured as follows:
    • The length of numbers to full-text index is defined. If you configure full-text indexing to include only numbers with 3 or more digits and the data contains the numbers 9, 99, 999, and the word stock, only the number 999 and the word stock are indexed. The numbers 9 and 99 are not indexed.
    • Number indexing is limited by file extension. For example, if you choose to full-text index the number 999 when it appears in data objects with the file extensions .XLS and .DOC, then a full-text filter returns only those instances of the number 999 that exist in data objects with the file extensions .XLS and .DOC. The number 999 might also exist in other harvested data objects, but those instances are not returned because the data objects do not have the file extension .XLS or .DOC.
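
The length limits described above can be pictured with a minimal Python sketch. It is an illustration only, not IBM StoredIQ code: the function name is_indexed and the parameters max_word_length and min_number_digits are hypothetical and simply mirror the two examples in this section. The sketch assumes that numbers are included in the index and that a token counts as a number when it consists of digits only.

```python
from typing import Optional


def is_indexed(token: str,
               max_word_length: Optional[int] = None,
               min_number_digits: Optional[int] = None) -> bool:
    """Return True if the token would be kept under the given limits.

    Hypothetical helper for illustration; None means "no limit".
    """
    if token.isdigit():
        # Numbers shorter than the configured minimum are skipped.
        return min_number_digits is None or len(token) >= min_number_digits
    # Words longer than the configured maximum are skipped.
    return max_word_length is None or len(token) <= max_word_length


tokens = ["9", "99", "999", "stock"]
kept = [t for t in tokens
        if is_indexed(t, max_word_length=50, min_number_digits=3)]
print(kept)  # ['999', 'stock']
```

A filter for 99 would find nothing in this case, because that number never reached the index.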

Procedure

  1. Go to Administration > Configuration > Application > Full-text settings.
  2. To configure Limits:
    1. Do not limit the length of the words that are indexed: Select this option to have no limits on the length of words that are indexed.
    2. Limit the length of words indexed to ___ characters: Select this option to limit the length of words that are indexed. Enter the maximum number of characters that a word can contain to be indexed. Words with more characters than the specified number are not indexed.
  3. To configure Numbers:
    • Do not include numbers in the full-text index: Select this option to have no indexed numbers. This option is selected by default.
    • Include numbers in the full-text index: Select this option to have numbers indexed.
    • Include numbers in full-text index but limit them by: Select this option to have only certain numbers indexed. Define these limits as follows:
      • Number length: Include only numbers that are longer than ____ characters. Enter the number of characters a number must contain to be indexed. The Number length feature indexes longer numbers and ignores shorter numbers. By not indexing shorter numbers, such as one- and two-character numbers, you can focus your filter on meaningful numbers. These numbers can be account numbers, Social Security numbers, credit card numbers, license plate numbers, or telephone numbers.
      • Extensions: Index numbers based on the file extensions of the data objects in which they appear. Select Limit numbers for all extensions to apply the character limit set in Number length to data objects with any file extension. Alternatively, select Limit numbers for these extensions to apply the limit set in Number length only to data objects with certain file extensions. Enter the file extensions that must have limited number indexing, one per line. Any data object with a file extension that is not listed has all numbers indexed.
    For sensitive data detection, especially in spreadsheet files, make sure numbers are included in the full-text index.
  4. To configure Include word lemmas in index, select whether to identify and index the lexical forms of words as well.
    For example, employ is the lemma for words such as employed, employment, and employs. If you use lemmas and search for the word employed, IBM® StoredIQ® denotes any found instances of employment, employ, employee, and so on, when the data object is viewed.
    • Do not include word lemmas in index (faster indexing): By not indexing lemmas, data sources are indexed slightly faster and the index size on disk is smaller.
    • Include word lemmas in index (improved searching): By indexing lemmas, filter results can be more accurate, although somewhat slower. Without lemmas, a filter for trade would need to be written as trade, trades, trading, or traded to get the same effect, and even then a user might miss an interesting variant.
  5. Configure Stop words.
    Stop words are common words that are found in data objects and are indexed like other words. This allows users to find instances of these words where it matters most. A typical example is the search expression 'to be or not to be' (the single quotation marks are a specific usage here). Typically, IBM StoredIQ ignores stop words in search expressions, but because the single quotation marks act as syntax elements, a user can find Shakespeare's "Hamlet." Indexing stop words slightly increases the amount of required storage space, but relevant documents might be missed without these words present in the index. By default, the following words are considered stop words for the English language: a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with.

    To add stop words, enter one word per line without any punctuation, including hyphens and apostrophes.

    Note: As of the IBM StoredIQ 7.6.0.3 release, stop words on the configuration page are for the English language only.
  6. Select Enable OCR image processing to control at a global level whether Optical Character Recognition (OCR) processing is attempted on image files, such as PNG, BMP, GIF, TIFF, and images that are produced from scanned image PDF files. A scanned image PDF file is a PDF file that contains a scanned document. Through OCR processing, the images are extracted from scanned image PDF files and text is extracted from the image files.

    The quality of the text extraction depends on the resolution of the image files and of the images from scanned image PDF files. Therefore, the resolution must be 300 dots per inch (DPI) or higher. Text in images that is rotated, in a small font, or unclear cannot be extracted.

    If you select this option, you must restart services. See Restarting and Rebooting the Appliance.

  7. Select Always process PDFs for images to control at a global level whether text is extracted from scanned image PDF files.
    A scanned image PDF is a special type of PDF that is created by scanning a document into PDF and is different from a normal PDF. A scanned image PDF contains one image per entire page and no other elements such as plain text. In contrast, a normal PDF can contain a mix of plain-text elements, embedded objects, and images per page. Text extraction from a scanned image PDF is processing intensive and involves two steps:
    1. Retrieving images from the scanned image PDF
    2. Extracting text from the retrieved images

    To identify a PDF as a scanned image PDF and then extract text from it, you must select both the Enable OCR image processing option and the Always process PDFs for images option. However, to extract text from image files such as PNG, BMP, GIF, and TIFF, you need to select only the Enable OCR image processing option (as described in step 6) because only the second step, extracting text from the images, needs to be performed on these files.

    You can set a maximum number of images to process by entering a count for the Limit number of images in scanned image PDF to option. This setting affects text extraction only. The default value is zero, which means that text is extracted from all images that are retrieved from scanned image PDFs.

    If you select this option, you must restart services. See Restarting and Rebooting the Appliance.

  8. Click OK.
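
The interaction between the two OCR options in steps 6 and 7 can be summarized in a short Python sketch. It is an illustration only, not IBM StoredIQ code: the function names, the parameter names, and the set of image extensions are assumptions derived from the descriptions in those steps.

```python
from typing import List

# Image types named in step 6; the real list may be longer (assumption).
IMAGE_EXTENSIONS = {".png", ".bmp", ".gif", ".tiff"}


def ocr_steps(extension: str, is_scanned_pdf: bool,
              enable_ocr: bool, always_process_pdfs: bool) -> List[str]:
    """Return the OCR steps that would run for one data object.

    is_scanned_pdf describes the file itself, not a setting.
    """
    if not enable_ocr:
        return []
    ext = extension.lower()
    if ext in IMAGE_EXTENSIONS:
        # Image files need only the text-extraction step.
        return ["extract text"]
    if ext == ".pdf" and is_scanned_pdf and always_process_pdfs:
        # Scanned image PDFs need both steps, so both options must be on.
        return ["retrieve images", "extract text"]
    return []


def images_to_process(retrieved: int, limit: int = 0) -> int:
    """Apply the image limit to text extraction; 0 means no limit."""
    return retrieved if limit == 0 else min(retrieved, limit)


print(ocr_steps(".png", False, True, False))  # ['extract text']
print(ocr_steps(".pdf", True, True, False))   # []
print(ocr_steps(".pdf", True, True, True))    # ['retrieve images', 'extract text']
print(images_to_process(12))                  # 12
print(images_to_process(12, limit=5))         # 5
```

In this sketch, a scanned image PDF yields no text unless both options are selected, while image files depend on the Enable OCR image processing option alone.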