Text search services for a P8 domain

Use Elasticsearch services or IBM® Content Search Services to configure full-text searches for objects.

Overview of Elasticsearch

The IBM FileNet® P8 platform supports the use of Elasticsearch services to search for text in documents and string metadata.

Note: In the Administration Console for Content Engine (ACCE) and in this documentation, the term Elasticsearch refers to both Elasticsearch and OpenSearch services.

A Content Platform Engine domain can support both the IBM Content Search Services feature and the Elasticsearch feature. However, a domain can support only one Elasticsearch content-based search cluster. An object store can support either an Elasticsearch content-based search or IBM Content Search Services, but not both. Elasticsearch can be disabled or enabled at the object store level as well.

Benefits of using Elasticsearch are as follows:
  • Elasticsearch provides improved operational characteristics.
  • Elasticsearch provides a modern indexing and search solution with high availability, scale-out, and a REST API.
  • No requirement to shut down for backups.
  • Elasticsearch can be deployed both as cloud SaaS and on premises.
  • Elasticsearch releases are paired with the current Lucene release.
  • Upgrades to a new version of Elasticsearch can be performed one node at a time, without downtime.
  • More language analyzers and extension mechanisms result in improved search results and support for multilingual documents.
  • Reduced complexity of Content Platform Engine indexing and search.
  • You can run Elasticsearch analytics directly on the underlying indexes.
  • Elasticsearch supports online snapshots, which can be written to a local file system or cloud storage.

The Elasticsearch data node has exclusive access to the shards and replicas that it manages. This model does not require network mounting of shards or replicas. Elasticsearch supports more efficient direct-mounted storage or SAN storage.

Elasticsearch supports multiple language analyzers. When you configure an object store for indexing, you must select one or more language analyzers for the content that must be indexed for that object store.

For indexing purposes, the Elasticsearch indexing pipeline uses a queue sweep that is called the Elasticsearch Indexing Queue Sweep.

Scale out of Elasticsearch indexes is managed by splitting a single index into multiple shards and distributing the shards across the Elasticsearch data nodes. IBM FileNet P8 does not need to determine how to distribute objects for indexing and does not need to determine which set of indexes it needs to search. Elasticsearch supports replicas of shards, which provide high availability and redundancy of the indexed data.

An index is associated with an index area, and each object store supports a single index area. The index area contains one Elasticsearch index for each enabled root class within the object store. For an object store, instances of the Document, Annotation, Folder, and CustomObject root classes, and their subclasses, can be indexed. For example, the index area includes one index for the Document class and its subclasses, and another for the CustomObject class and its subclasses. To enable indexing for a class, you must configure it for content-based retrieval (CBR). String-valued properties of these classes can also be configured for CBR.

When you create an index area, you must specify values for the following properties of the index area: the number of shards, the number of replicas, and the maximum results window.
  • Number of shards

    The desired number of shards depends on the volume of data to be indexed and the number of data nodes. The more data that is indexed, the more shards are needed, and the number of shards is commonly evenly divisible by the number of data nodes. An index with a large volume of indexed objects needs more shards than an index with a small volume of indexed objects. Shards are expensive for Elasticsearch to manage, and altering the number of shards impacts indexing performance.

  • Number of replicas

    The desired number of replicas depends on the number of data nodes, and the high availability and data recovery requirements of the system. For example, with 3 data nodes, having 2 replicas of each shard allows the system to operate with only 1 data node, and allows a corrupted shard to be recovered from 2 locations. You can also choose to replicate a business-critical index more fully than a non-critical index. For example, you might have 6 data nodes, with document search as a business-critical operation. In that case, you can have 5 replicas of the Document index, but only 1 replica of each of the other 3 indexes.

  • Maximum results window

    Increasing this property value affects non-continuous CBR queries. It raises the maximum number of hits that Content Engine loads into the temp table, thereby increasing the probability of fully satisfying a query.

    Note: Altering the maximum results window property value for performance reasons by setting the value lower or higher than the default of 10,000 causes queries to take slightly longer to execute.
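The shard and replica guidance above can be sketched as simple arithmetic. This is an illustrative sketch only: the helper names are hypothetical and are not part of any FileNet P8 or Elasticsearch API, and the calculation assumes that each copy of a shard (the primary or a replica) is placed on a different data node, which reflects general Elasticsearch behavior rather than anything stated in this documentation.

```python
import math


def round_shards_to_nodes(estimated_shards: int, data_nodes: int) -> int:
    """Round a shard estimate up to the next multiple of the data node count.

    Hypothetical helper: it only reflects the guidance that the number of
    shards is commonly evenly divisible by the number of data nodes.
    """
    if estimated_shards < 1 or data_nodes < 1:
        raise ValueError("shard and data node counts must be positive")
    return math.ceil(estimated_shards / data_nodes) * data_nodes


def tolerated_node_losses(replicas: int, data_nodes: int) -> int:
    """Data nodes that can be lost while one copy of every shard survives.

    Assumption: every copy of a shard lives on a different data node, so
    replicas beyond data_nodes - 1 add no further protection.
    """
    if replicas < 0 or data_nodes < 1:
        raise ValueError("replicas must be >= 0 and data_nodes must be >= 1")
    return min(replicas, data_nodes - 1)


# Examples that mirror the text above:
# round_shards_to_nodes(7, 3)  -> 9  (next multiple of 3 data nodes)
# tolerated_node_losses(2, 3)  -> 2  (the system can operate with 1 data node)
# tolerated_node_losses(5, 6)  -> 5  (fully replicated Document index)
# tolerated_node_losses(1, 6)  -> 1  (non-critical indexes)
```

The second helper makes the trade-off in the replica example explicit: a fully replicated index survives the loss of all but one data node, while an index with a single replica survives the loss of only one.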

Indexing is done automatically for all CBR-enabled objects and properties. An index sweep creates queue entries when a class is first enabled for indexing, or when a reindex request is made. The index entries are processed by the indexing queue sweep that uses the Elasticsearch REST API. Because the indexing operation is a batched, asynchronous operation, its results are not immediately evident.
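To illustrate what a batched indexing request over the Elasticsearch REST API can look like, the following sketch builds a request body in the newline-delimited JSON format that the Elasticsearch `_bulk` endpoint accepts. This documentation does not state which endpoint the indexing queue sweep calls, so the use of `_bulk` here is an assumption, and the index name, document IDs, and field names are hypothetical.

```python
import json
from typing import Any, Iterable, Mapping, Tuple


def build_bulk_body(index_name: str,
                    documents: Iterable[Tuple[str, Mapping[str, Any]]]) -> str:
    """Build a newline-delimited JSON body for an Elasticsearch bulk request.

    Each document contributes two lines: an action line that names the
    target index and document ID, followed by the document source.
    The body must end with a newline character.
    """
    lines = []
    for doc_id, source in documents:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(dict(source)))
    if not lines:
        return ""
    return "\n".join(lines) + "\n"


# Hypothetical batch of two queue entries for a Document index.
body = build_bulk_body(
    "example-document-index",
    [
        ("doc-1", {"DocumentTitle": "Quarterly report", "content": "revenue grew"}),
        ("doc-2", {"DocumentTitle": "Meeting notes", "content": "action items"}),
    ],
)
```

Because such a request is sent asynchronously and in batches, a newly added document might not be returned by a search until its batch has been processed.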

Client applications submit full-text search requests, which are processed by the search servers. Client applications include the administration console (if searching a single object store), vendor clients such as reporting tools, and custom applications that use the Content Engine API.

Overview of Content Search Services

IBM Content Search Services consists of index and search servers that are created and registered at the FileNet P8 domain level. The Content Platform Engine’s text search dispatcher passes index and search requests to Content Search Services servers. The text search dispatcher is configured in the text search subsystem.

The servers in a domain are available to all object stores in the domain. For multiple sites in a domain, servers in a site can access only the object stores that belong to the same site. The individual servers in a domain can be enabled or disabled, which affects all of the object stores that they service. Content Search Services can be disabled or enabled at the object store level as well.

Content Search Services supports multiple languages. When configuring an object store for indexing, you must select the languages of the content that will be indexed for that object store.

IBM FileNet P8 manages the scale out of Content Search Services indexes. Indexes can be scaled out as follows:
  • Creation of multiple Index Areas (round robin distribution)
  • Rollover of index based on size
  • Index partitioning based on a property of an index enabled class

IBM FileNet P8 determines which index is used for an object during indexing and which indexes are used during search. An index is a single point of failure for Content Search Services. If an index becomes corrupted, the index either must be recovered from backup (and reconciled with currently indexed objects) or the data must be reindexed.

An index belongs to an index area, and an index area is dedicated to a single object store. An index area is a file system directory that contains the information necessary to perform full-text indexing that is updated and queried by Content Search Services. An object store can have one or more index areas. You can have multiple index areas for an object store on a single file system, or you can distribute multiple index areas across file systems for an object store.

Index areas can be placed in affinity groups. An affinity group associates index servers with index areas so that the servers can only access the index areas in the group. Index areas can also have property partitions. Property partitions group objects into separate indexes in accordance with the value of an object property. These features potentially improve CBR indexing and query performance.
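The idea behind property partitions can be sketched as grouping objects by the value of one property, so that each distinct value maps to its own index. This is a conceptual illustration only: the function name, the dictionary-based object representation, and the `Department` property are hypothetical, and grouping by exact value is a simplification of whatever partitioning rules the product applies.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Mapping


def group_by_partition_property(objects: Iterable[Mapping[str, Any]],
                                property_name: str) -> Dict[Any, List[str]]:
    """Group object IDs by the value of the partitioning property.

    Objects that lack the property are collected under the key None.
    """
    partitions: Dict[Any, List[str]] = defaultdict(list)
    for obj in objects:
        partitions[obj.get(property_name)].append(obj["id"])
    return dict(partitions)


# Hypothetical objects partitioned on a "Department" property.
documents = [
    {"id": "d1", "Department": "HR"},
    {"id": "d2", "Department": "Legal"},
    {"id": "d3", "Department": "HR"},
]
partitions = group_by_partition_property(documents, "Department")
# partitions == {"HR": ["d1", "d3"], "Legal": ["d2"]}
```

A query that restricts the partitioning property to a single value then needs to consult only the index for that value, which is one way such partitioning can improve query performance.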

Instances of the Document, Annotation, Folder, and CustomObject classes and subclasses can be indexed. For instances of a class to be indexed, you must enable the class for content-based retrieval (CBR). String-valued properties of these classes can be CBR-enabled as well.

Indexing is done automatically for all CBR-enabled objects and properties. Because the indexing operation is a batched, asynchronous operation, its results are not immediately evident.

Client applications submit full-text search requests, which are processed by the search servers. Client applications include the administration console (if searching a single object store), vendor clients such as reporting tools, and custom applications that use the Content Engine API.

Server modes

The IBM Content Search Services servers that are registered with a FileNet P8 domain operate in one of the following modes: index mode, search mode, or dual-mode (index and search mode). A server that operates in index mode is called an index server: the server writes index information for objects to indexes. A server that operates in search mode is called a search server: the server reads the indexes to run full-text searches.

An index belongs to an index area, and an index area belongs to an object store. A server accesses indexes only for object stores that belong to the same FileNet P8 site as the server.

  • Index servers

    An index file can be accessed by only one index server at a time. Content Platform Engine assigns indexes to the available index servers so that the workload is roughly the same for all servers. If an index area belongs to an affinity group, the indexes in that area are assigned to servers that belong to the same group.

  • Search servers

    When a CBR query is submitted, Content Platform Engine selects a search server to run the full-text search for the query. (A full-text search expression is specified as part of the CONTAINS function call in the CBR query.) The default search server is the dual-mode server that most recently updated the index that is relevant for the full-text search. The default server might not be eligible to run the search for the following reasons:

    • No server is configured for dual-mode operation.
    • The index server that most recently updated the relevant index is not configured for dual-mode operation.
    • The default server is unavailable. For example, the server might be disabled or busy with an index request or with another search request.

    If the default server is not eligible to run the search, Content Platform Engine assigns the search request to a randomly selected available search server. The random selection is repeated if the selected search server crashes while running the search.
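The search server selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Content Platform Engine implementation: the `Server` class, its fields, and the `select_search_server` function are hypothetical names, and "unavailable" is reduced to a single Boolean flag.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Server:
    name: str
    mode: str               # "index", "search", or "dual"
    available: bool = True  # False if the server is disabled or busy


def select_search_server(servers: Sequence[Server],
                         last_indexer: Optional[Server],
                         rng: Optional[random.Random] = None) -> Server:
    """Pick the server that runs a full-text search.

    last_indexer is the index server that most recently updated the
    relevant index. It is the default choice only if it is a dual-mode
    server and is currently available.
    """
    if (last_indexer is not None
            and last_indexer.mode == "dual"
            and last_indexer.available):
        return last_indexer

    # Fall back to a random choice among available servers that can search.
    candidates = [s for s in servers
                  if s.mode in ("search", "dual") and s.available]
    if not candidates:
        raise RuntimeError("no search server is available")
    return (rng or random).choice(candidates)
```

For example, if the most recent indexer is an available dual-mode server, it is returned directly. If it runs in index mode only, or is busy, one of the remaining available search-capable servers is chosen at random. The sketch leaves out the retry that occurs when the selected server crashes while running the search; a caller could model that by calling the function again.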