Tuning Elasticsearch indexing queue sweep

You can adjust the values for an Elasticsearch indexing queue sweep to improve the indexing performance for CBR-enabled objects.

About this task

Setting the maximum workers for Elasticsearch indexing queue sweep

The indexing queue sweep is created automatically during Elasticsearch indexing and is defined with a default of eight workers. You might need to increase the number of workers to improve indexing throughput. In some cases, adding workers improves indexing throughput but has a negative impact on other Content Platform Engine operations. For this reason, increase the number of workers incrementally while you monitor the system by using the System Dashboard or other tools that can monitor Content Platform Engine system resources. To increase the number of Elasticsearch indexing queue sweep workers, follow these steps:

In the object store navigation pane, click the name of the object store (the top-level item).
Select Object Store > Sweep Management > Queue Sweeps > Elasticsearch Indexing Queue Sweep > Properties > Maximum Sweep Workers. Enter the new value for maximum workers for Elasticsearch indexing queue sweep.

Changing the bulk batch size

Tuning the BulkAPI batch size can change the indexing rate. The BulkAPI batch size controls the size of the batches that Content Platform Engine submits to the Elasticsearch cluster for indexing. The default BulkAPI batch size is 40 documents. To change the setting, use the ecm.elasticsearch.bulkapi.max.batch.size JVM argument.
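To get a feel for what the batch size means in practice, the following minimal sketch estimates how many bulk submissions a given number of documents would require. It assumes that every batch is filled to the configured size, which might not hold on a live system. The function name bulk_request_count and the sizes 100 and 200 are illustrative only and are not recommended values.

```python
import math


def bulk_request_count(documents: int, batch_size: int = 40) -> int:
    """Estimate bulk submissions, assuming every batch is filled to batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return math.ceil(documents / batch_size)


# Compare the default of 40 with two hypothetical larger sizes.
for size in (40, 100, 200):
    print(size, bulk_request_count(10_000, size))
```

Fewer, larger batches reduce the number of round trips to the cluster, but each request carries more data, so test any change while you monitor both Content Platform Engine and the Elasticsearch cluster.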

Selecting the number of shards for an index area

Content Platform Engine takes advantage of Elasticsearch index sharding to achieve index scale-out. It is important to use enough shards when you create the index area, because the number of shards cannot be changed without performing a full reindex. The Elasticsearch administrator needs to determine the correct number of shards for the index area based on the predicted ingestion volume and the number of shards that the cluster can manage. Also consider the number of replicas when you create the index area, because the replicas provide high availability for the indexed data.

Indexing is quicker when you have more shards, because each document is stored in only one shard and the indexing work is divided among the shards. If fast ingestion is the major concern, you need to have at least one shard per data node. The ratio of shards to nodes is also important. Indexing work is sent to the shards in a round-robin fashion. For best performance, the number of shards needs to be a multiple of the number of nodes. For instance, if you have three data nodes, configure three, six, or nine shards. If the number of shards is not a multiple of the number of nodes, performance degrades because the workload is not spread evenly over the nodes.
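The multiple-of-nodes guideline can be illustrated with a short sketch. It lists candidate shard counts for a given number of data nodes and shows how many shards each node would hold if shards were spread as evenly as possible. The helper names shard_counts and shards_per_node are hypothetical, and the even placement is an assumption made for illustration. Actual shard allocation is decided by the Elasticsearch cluster.

```python
from collections import Counter


def shard_counts(data_nodes: int, max_multiple: int = 3) -> list[int]:
    """Return shard counts that are whole multiples of the data node count."""
    if data_nodes < 1:
        raise ValueError("data_nodes must be at least 1")
    return [data_nodes * k for k in range(1, max_multiple + 1)]


def shards_per_node(shards: int, data_nodes: int) -> list[int]:
    """Place shards on nodes one at a time and count the shards on each node."""
    placement = Counter(shard % data_nodes for shard in range(shards))
    return [placement[node] for node in range(data_nodes)]


print(shard_counts(3))        # [3, 6, 9]
print(shards_per_node(6, 3))  # [2, 2, 2]  balanced
print(shards_per_node(7, 3))  # [3, 2, 2]  one node carries an extra shard
```

With seven shards on three nodes, one node holds an extra shard and therefore receives a larger share of the indexing work, which is the imbalance that the guideline is meant to avoid.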

What to do next

After tuning Elasticsearch indexing performance, you can optionally configure dual mode indexing to run both Content Search Services and Elasticsearch simultaneously during migration. For more information, see Indexing object text in dual mode.