Text search services for a Content Cortex domain

This topic provides an overview of the two text search services available for Content Platform Engine: Elasticsearch (or OpenSearch) and Content Search Services.

Note:

IBM® Content Search Services is deprecated as of IBM Content Cortex V5.7.0. Plan to migrate to another search service, such as Elasticsearch or OpenSearch.

You can use Elasticsearch or OpenSearch services or IBM Content Search Services to configure full-text searches for objects. Understanding the architecture, capabilities, and differences between these services helps you plan your search configuration and migration strategy.

Overview of Elasticsearch or OpenSearch services

The IBM Content Cortex platform supports the use of Elasticsearch and OpenSearch services to search for text in documents and string metadata.

Note: In the Administration Console for Content Engine (ACCE) and in this documentation, the term Elasticsearch refers to both Elasticsearch and OpenSearch services.

Elasticsearch can be disabled or enabled at the object store level.

Elasticsearch provides the following benefits:

Elasticsearch provides improved operational characteristics.
Elasticsearch provides a modern indexing and search solution with high availability, scale out, and REST API.
No shutdown is required for backups.
Elasticsearch can be deployed on both cloud SaaS and on-premises.
Elasticsearch releases are paired with the current Lucene release.
Upgrades to a new version of Elasticsearch can be performed one node at a time, without downtime.
More language analyzers and extension mechanisms result in improved search results and support for multilingual documents.
Reduced complexity of Content Platform Engine indexing and search.
You can run the Elasticsearch analytics directly on the underlying indexes.
Elasticsearch supports online snapshots, which can be written to a local file system or cloud storage.

The Elasticsearch data node has exclusive access to the shards and replicas that it manages. This model does not require network mounting of shards or replicas. Elasticsearch supports more efficient direct mounted storage or SAN storage.

Elasticsearch supports multiple language analyzers. When you configure an object store for indexing, you must select one or more language analyzers for the content that must be indexed for that object store.

For indexing purposes, the Elasticsearch indexing pipeline uses a queue sweep called the Elasticsearch Indexing Queue Sweep.

To scale indexes, Elasticsearch splits a single index into multiple shards and distributes the shards to Elasticsearch data nodes. IBM Content Cortex does not need to determine how to distribute objects for indexing and does not need to determine which set of indexes it needs to search. Elasticsearch supports replicas of shards, which provide high availability and redundancy of the indexed data.
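The division of labor described above can be illustrated with a small sketch. In simplified form, Elasticsearch derives a document's shard from a hash of its routing value (by default the document ID) modulo the number of primary shards, so the caller never chooses a shard. The code below uses zlib.crc32 as a stand-in hash, which is not the hash that Elasticsearch uses, and the object IDs are hypothetical.

```python
import zlib
from collections import Counter

def shard_for(object_id: str, num_primary_shards: int) -> int:
    """Pick a shard from a hash of the ID (crc32 is a stand-in hash)."""
    return zlib.crc32(object_id.encode("utf-8")) % num_primary_shards

# Hypothetical object IDs; the caller never names a shard explicitly.
ids = [f"doc-{n}" for n in range(1000)]
distribution = Counter(shard_for(i, 6) for i in ids)
print(sorted(distribution.items()))
```

Because the shard is a pure function of the ID, the same object always lands on the same shard, and a search can fan out to all shards without the client tracking where anything lives.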

An index is associated with an index area, and each object store supports a single index area. The index area contains one Elasticsearch index for each enabled root class within the object store. For an object store, instances of the Document, Annotation, Folder, and CustomObject root classes, and their subclasses, can be indexed. For example, the index area includes one index for the Document class and its subclasses, and another for the CustomObject class and its subclasses. To enable indexing for a class, you must configure it for content-based retrieval (CBR). String-valued properties of these classes can also be configured for CBR.
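The relationship between an object store, its index area, and the per-root-class indexes can be modeled as follows. This is a minimal sketch for illustration only: the IndexArea class, the index naming scheme, and the Invoice, Contract, and Case subclasses are hypothetical and do not reflect the product's internal names.

```python
from dataclasses import dataclass, field

ROOT_CLASSES = ("Document", "Annotation", "Folder", "CustomObject")

@dataclass
class IndexArea:
    """One index area per object store; one index per enabled root class."""
    object_store: str
    enabled_root_classes: set = field(default_factory=set)

    def enable(self, root_class: str) -> None:
        if root_class not in ROOT_CLASSES:
            raise ValueError(f"{root_class} is not an indexable root class")
        self.enabled_root_classes.add(root_class)

    def index_for(self, root_class: str) -> str:
        """Return the (hypothetical) index name that serves a root class."""
        if root_class not in self.enabled_root_classes:
            raise KeyError(f"{root_class} is not enabled for indexing")
        return f"{self.object_store}-{root_class}".lower()

# Hypothetical class hierarchy: subclass -> root class.
ROOT_OF = {"Invoice": "Document", "Contract": "Document", "Case": "CustomObject"}

area = IndexArea("os1")
area.enable("Document")
area.enable("CustomObject")
print(area.index_for(ROOT_OF["Invoice"]))   # os1-document
print(area.index_for(ROOT_OF["Case"]))      # os1-customobject
```

Subclasses share the index of their root class, so Invoice and Contract objects in this sketch are both served by the Document index.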

When you create an index area, you need to specify values for the number of shards, number of replicas and the maximum results window properties of the index area:

Number of shards
The desired number of shards depends on the volume of data to be indexed and the number of data nodes. The more data that is indexed, the more shards are needed: an index with a large volume of indexed objects needs more shards than an index with a small volume of indexed objects. Commonly, the number of shards is evenly divisible by the number of data nodes. Shards are expensive for Elasticsearch to manage, and altering the number of shards impacts indexing performance.
Number of replicas
The desired number of replicas depends on the number of data nodes, and the high availability and data recovery requirements of the system. For example, with 3 data nodes, 2 replicas of each shard enable operation with only 1 data node and recovery of corrupted shards from 2 locations. You can also choose to replicate a business-critical index more fully than a non-critical index. For example, you might have 6 data nodes, with document search as a business-critical operation. In that case, you can have 5 replicas of the Document index, but only 1 replica of each of the other 3 indexes.
Maximum results window
Increasing this property value impacts non-continuous CBR queries. It increases the maximum number of hits that the Content Engine loads into the temp table, thereby increasing the probability of fully satisfying a query.

Note: Altering the maximum results window property value for performance reasons, by setting the value lower or higher than the default of 10,000, causes queries to take slightly longer to execute.
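The sizing guidance for shards and replicas can be checked with simple arithmetic. The sketch below rounds a shard estimate up to a multiple of the data node count and computes how many data node losses a replica count tolerates. It relies on the fact that Elasticsearch does not place two copies of the same shard on one node. The last helper builds the Elasticsearch index settings that correspond to the three properties; whether the administration console passes the index area properties through under exactly these setting names is an assumption.

```python
import math

def round_up_to_multiple(shards_needed: int, data_nodes: int) -> int:
    """Round a shard estimate up so it divides evenly across data nodes."""
    return math.ceil(shards_needed / data_nodes) * data_nodes

def node_losses_tolerated(replicas: int, data_nodes: int) -> int:
    """Copies of a shard sit on distinct nodes, so at most nodes-1 can fail."""
    return min(replicas, data_nodes - 1)

def index_settings(shards: int, replicas: int, max_result_window: int = 10_000) -> dict:
    """Elasticsearch index settings that correspond to the three properties."""
    return {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
                "max_result_window": max_result_window,
            }
        }
    }

print(round_up_to_multiple(4, 3))     # 6
print(node_losses_tolerated(2, 3))    # 2, so a single data node can carry on alone
print(node_losses_tolerated(5, 6))    # 5
print(index_settings(6, 2))
```

The two replica results match the examples in the text: 3 data nodes with 2 replicas, and 6 data nodes with 5 replicas of a business-critical index.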

Indexing is done automatically for all CBR-enabled objects and properties. An index sweep creates queue entries when a class is first enabled for indexing, or when a reindex request is made. The queue entries are processed by the indexing queue sweep, which uses the Elasticsearch REST API. Because the indexing operation is a batched, asynchronous operation, its results are not immediately evident.
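The batched, queue-driven pattern described above can be sketched in a few lines. The code drains a queue in fixed-size batches and formats each batch as the newline-delimited JSON that the Elasticsearch _bulk endpoint accepts (an action line followed by a source line for each document). The queue entry layout, the content field name, and the batch size are hypothetical, and the HTTP call itself is omitted. This is not the product's implementation of the indexing queue sweep.

```python
import json
from collections import deque

def drain_batch(queue: deque, batch_size: int) -> list:
    """Remove up to batch_size entries from the front of the queue."""
    batch = []
    while queue and len(batch) < batch_size:
        batch.append(queue.popleft())
    return batch

def bulk_body(batch: list) -> str:
    """Format a batch as newline-delimited JSON for the _bulk endpoint."""
    lines = []
    for entry in batch:
        lines.append(json.dumps({"index": {"_index": entry["index"], "_id": entry["id"]}}))
        lines.append(json.dumps({"content": entry["text"]}))
    return "\n".join(lines) + "\n"   # the bulk body must end with a newline

# Hypothetical queue entries, as an index sweep might create them.
queue = deque(
    {"index": "os1-document", "id": f"doc-{n}", "text": f"text {n}"} for n in range(5)
)
while queue:
    body = bulk_body(drain_batch(queue, batch_size=2))
    print(body.count("\n") // 2, "documents in this request")
```

Because entries wait in the queue until a sweep picks them up, a newly added document is not searchable until its batch has been processed.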

Client applications submit full-text search requests, which are processed by the search servers. Client applications include the administration console (if searching a single object store), vendor clients such as reporting tools, and custom applications that use the Content Engine API.

Overview of Content Search Services

IBM Content Search Services consists of index and search servers that are created and registered at the Content Cortex domain level. The Content Platform Engine's text search dispatcher passes index and search requests to Content Search Services servers. The text search dispatcher is configured in the text search subsystem.

The servers in a domain are available to all object stores in the domain. For multiple sites in a domain, servers in a site can access only the object stores that belong to the same site. The individual servers in a domain can be enabled or disabled, impacting all of the object stores that they service. Content Search Services can be disabled or enabled at the object store level as well.

Content Search Services supports multiple languages. When configuring an object store for indexing, you must select the languages of the content that will be indexed for that object store.

IBM Content Cortex manages the scale out of Content Search Services indexes. Indexes can be scaled out as follows:

Creation of multiple Index Areas (round robin distribution)
Rollover of index based on size
Index partitioning based on a property of an index enabled class

IBM Content Cortex determines which index is used for an object during indexing and which indexes are used during search. An index is a single point of failure for Content Search Services. If an index becomes corrupted, the index either must be recovered from backup (and reconciled with currently indexed objects) or the data must be reindexed.
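Two of the scale-out mechanisms listed above, round-robin distribution across index areas and size-based rollover, can be sketched as follows. All class names, index names, and the size threshold are hypothetical. The sketch counts objects as a proxy for index size and is not the product's selection logic.

```python
from dataclasses import dataclass, field
from itertools import cycle

@dataclass
class Index:
    name: str
    size: int = 0

@dataclass
class IndexAreaSketch:
    name: str
    max_index_size: int                      # hypothetical rollover threshold
    indexes: list = field(default_factory=list)

    def current_index(self) -> Index:
        """Return the open index, rolling over to a new one when it is full."""
        if not self.indexes or self.indexes[-1].size >= self.max_index_size:
            self.indexes.append(Index(f"{self.name}-idx{len(self.indexes) + 1}"))
        return self.indexes[-1]

areas = [IndexAreaSketch("areaA", 3), IndexAreaSketch("areaB", 3)]
round_robin = cycle(areas)

def index_object(object_size: int = 1) -> str:
    """Pick the next area in rotation and add the object to its open index."""
    target = next(round_robin).current_index()
    target.size += object_size
    return target.name

placements = [index_object() for _ in range(8)]
print(placements)
```

In this run, each area fills its first index with three objects and then rolls over, so the seventh and eighth objects land in areaA-idx2 and areaB-idx2.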

An index belongs to an index area, and an index area is dedicated to a single object store. An index area is a file system directory that contains the information necessary to perform full-text indexing that is updated and queried by Content Search Services. An object store can have one or more index areas. You can have multiple index areas for an object store on a single file system, or you can distribute multiple index areas across file systems for an object store.

Index areas can be placed in affinity groups. An affinity group associates index servers with index areas so that the servers can only access the index areas in the group. Index areas can also have property partitions. Property partitions group objects into separate indexes in accordance with the value of an object property. These features potentially improve CBR indexing and query performance.
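A rough sketch of both features follows. The first function groups objects into separate indexes by the value of a partition property. The second lists the servers that may access an index area according to its affinity group. The index naming scheme, the Department property, and the server names are hypothetical. The handling of index areas that belong to no group is an assumption, because the text describes only the grouped case.

```python
from collections import defaultdict

def partition_index(area: str, obj: dict, partition_property: str) -> str:
    """Name the index that holds objects sharing this property value."""
    value = obj.get(partition_property, "unpartitioned")
    return f"{area}/{partition_property}={value}"

def servers_for(area: str, area_group: dict, server_group: dict) -> list:
    """List the servers allowed to access an area, based on affinity group."""
    group = area_group.get(area)
    # Assumption: an ungrouped area (group None) is served by ungrouped servers.
    return sorted(s for s, g in server_group.items() if g == group)

objects = [
    {"id": "d1", "Department": "HR"},
    {"id": "d2", "Department": "Legal"},
    {"id": "d3", "Department": "HR"},
]
by_index = defaultdict(list)
for obj in objects:
    by_index[partition_index("area1", obj, "Department")].append(obj["id"])
print(dict(by_index))

area_group = {"area1": "groupA", "area2": None}
server_group = {"css1": "groupA", "css2": "groupA", "css3": None}
print(servers_for("area1", area_group, server_group))   # ['css1', 'css2']
```

A query that restricts the partition property to one value would then need to consult only the matching index, which is one way such partitioning could reduce query work.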

Instances of the Document, Annotation, Folder, and CustomObject classes and subclasses can be indexed. For instances of a class to be indexed, you must enable the class for content-based retrieval (CBR). String-valued properties of these classes can be CBR-enabled as well.

Indexing is done automatically for all CBR-enabled objects and properties. Because the indexing operation is a batched, asynchronous operation, its results are not immediately evident.

Server modes

The IBM Content Search Services servers that are registered with a Content Cortex domain operate in one of the following modes: index mode, search mode, or dual-mode (index and search mode). A server that operates in index mode is called an index server: the server writes index information for objects to indexes. A server that operates in search mode is called a search server: the server reads the indexes to run full-text searches.

An index belongs to an index area, and an index area belongs to an object store. A server accesses indexes only for object stores that belong to the same Content Cortex site as the server.

Index servers
An index file can be accessed by only one index server at a time. Content Platform Engine assigns indexes to the available index servers so that the workload is roughly the same for all servers. If an index area belongs to an affinity group, the indexes in that area are assigned to servers that belong to the same group.
Search servers
When a CBR query is submitted, Content Platform Engine selects a search server to run the full-text search for the query. (A full-text search expression is specified as part of the CONTAINS function call in the CBR query.) The default search server is the dual-mode server that most recently updated the index that is relevant for the full-text search. The default server might not be eligible to run the search for the following reasons:
- No server is configured for dual-mode operation.
- The index server that most recently updated the relevant index is not configured for dual-mode operation.
- The default server is unavailable. For example, the server might be disabled or busy with an index request or with another search request.
If the default server is not eligible to run the search, Content Platform Engine assigns the search request to a randomly selected available search server. The random selection is repeated if the selected search server crashes while running the search.
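The search server selection rules above can be sketched as follows. The Server class, its fields, and the use of ConnectionError to signal a crash are hypothetical. The code illustrates only the described order of preference and is not the dispatcher's actual implementation.

```python
import random
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    mode: str              # "index", "search", or "dual"
    available: bool = True

def pick_search_server(last_indexer, servers, rng=random):
    """Prefer the dual-mode server that last updated the index, else pick at random."""
    if last_indexer is not None and last_indexer.mode == "dual" and last_indexer.available:
        return last_indexer
    candidates = [s for s in servers if s.mode in ("search", "dual") and s.available]
    if not candidates:
        raise RuntimeError("no search server is available")
    return rng.choice(candidates)

def run_search(query, last_indexer, servers, execute, rng=random):
    """Repeat the selection whenever the chosen server crashes mid-search."""
    while True:
        server = pick_search_server(last_indexer, servers, rng)
        try:
            return execute(server, query)
        except ConnectionError:
            server.available = False   # treat a crash as unavailability

servers = [
    Server("css1", "index"),
    Server("css2", "dual", available=False),
    Server("css3", "search"),
]
print(pick_search_server(servers[1], servers).name)   # css3
```

In the example, css2 last updated the index but is unavailable, and css1 runs in index mode only, so the random fallback can choose only css3.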

Migration from Content Search Services to Elasticsearch or OpenSearch

If you are migrating from Content Search Services to Elasticsearch or OpenSearch, you can use dual mode indexing to index content to both engines simultaneously during the migration period. Dual mode indexing allows you to validate Elasticsearch or OpenSearch functionality without disrupting Content Search Services search capabilities. The migration process includes mechanisms to migrate previously indexed documents from Content Search Services to Elasticsearch or OpenSearch. After all documents are indexed and you verify that search functionality meets your requirements, you can transition to Elasticsearch-only or OpenSearch-only mode and decommission Content Search Services.
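The validation step can be pictured with a toy model. The sketch below sends each new document to two engines, then compares the ID sets that the engines return for a list of search terms. The InMemoryEngine class is a hypothetical stand-in that does substring matching; it does not represent either product's API, and the document IDs are invented.

```python
class InMemoryEngine:
    """Toy stand-in for a text search engine: substring match over stored text."""
    def __init__(self):
        self.docs = {}

    def index(self, doc_id: str, text: str) -> None:
        self.docs[doc_id] = text

    def search(self, term: str) -> set:
        return {d for d, t in self.docs.items() if term.lower() in t.lower()}

def dual_index(engines, doc_id, text):
    """Send the same document to every engine during the migration period."""
    for engine in engines:
        engine.index(doc_id, text)

def compare(old, new, terms):
    """Report, per term, the IDs that only one engine returned."""
    report = {}
    for term in terms:
        a, b = old.search(term), new.search(term)
        if a != b:
            report[term] = {"only_old": a - b, "only_new": b - a}
    return report

css, es = InMemoryEngine(), InMemoryEngine()
css.index("d0", "legacy contract")            # indexed before dual mode began
dual_index([css, es], "d1", "new contract")
print(compare(css, es, ["contract"]))         # d0 is reported as missing from the new engine
es.index("d0", "legacy contract")             # migrate the previously indexed document
print(compare(css, es, ["contract"]))         # {}
```

With real engines, some differences are expected even after migration completes, because the engines analyze text differently. A comparison like this is a starting point for review rather than a pass/fail gate.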