Continuous index/search failure considerations
The search index might occasionally fail. Users must monitor their search indexes, because the application continues to function even when the search index has a problem.
The search index is not critical for business functionality. Therefore, OMS continues to function correctly even during prolonged problems with the search/index operations. However, such prolonged problems can have the following effects:
- Search APIs, such as getOrderList, might perform badly under certain scenarios. For example, if the search server is unreachable, every ‘search’ call by the Sterling Search Index client must time out before it fails.
- Because an exception is logged through ‘Exception Notification’ for every search/index failure, a continuous problem produces a continuous stream of logged exceptions.
- If indexing fails for a long time while search still works, the search results become unreliable, because the continuous indexing failures leave the index stale.
For these reasons, it makes sense to determine if there is a continuous problem with search/index operations, and disable the operation proactively.
Note that index/search operations may or may not fail together, and this will depend upon the underlying problem, as well as Search Index Server implementation. For problems such as search server unavailability or network outage, both operations will fail together. Yet, there are scenarios where one operation might work, while the other might fail. For example, with an Elasticsearch client, if one or more shards are unavailable, the indexing operation will work fine, but the search operation will fail. In such a case, it is better to disable only the search operation, and let the index operation continue so as to prevent the index from becoming stale. Therefore, it is beneficial to determine which operation is failing and disable that particular operation.
Such a determination is made using the ‘fail fast’ logic. This logic is enabled by default and can be enabled or disabled through the ‘yfs.ssi.fail.fast’ property in yfs.properties_ysc_ext. When an index or search operation fails continuously, this logic disables that operation until you fix the underlying problem and re-enable the operation. If the indexing operation fails continuously, it is disabled and no more attempts are made to index. Similarly, if the ‘search’ operation fails continuously, that operation is disabled.
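The behaviour described above can be pictured as a per-operation counter of consecutive failures. The following Python sketch is an illustration only, not the product's implementation. The class name ‘FailFastTracker’, the ‘threshold’ parameter, and the rule "disable after N consecutive failures" are assumptions made for this example; the product's own definition of a continuous failure is whatever the yfs.ssi.fail.fast.* properties configure.

```python
class FailFastTracker:
    """Hypothetical sketch: tracks consecutive failures per operation."""

    def __init__(self, threshold=3):
        # 'threshold' is an assumed stand-in for the configurable
        # definition of a 'continuous' failure.
        self.threshold = threshold
        self.failures = {"index": 0, "search": 0}
        self.enabled = {"index": True, "search": True}

    def record(self, operation, succeeded):
        """Record one attempt; return True if the operation is still enabled."""
        if not self.enabled[operation]:
            return False  # already disabled; nothing is attempted
        if succeeded:
            self.failures[operation] = 0  # a success breaks the failure streak
        else:
            self.failures[operation] += 1
            if self.failures[operation] >= self.threshold:
                # Only the failing operation is disabled, not both.
                self.enabled[operation] = False
        return self.enabled[operation]

    def re_enable(self, operation):
        """Manual re-enable, done after the underlying problem is fixed."""
        self.failures[operation] = 0
        self.enabled[operation] = True


# Search fails three times in a row while indexing keeps working.
tracker = FailFastTracker(threshold=3)
for _ in range(3):
    tracker.record("search", succeeded=False)
tracker.record("index", succeeded=True)
print(tracker.enabled)  # {'index': True, 'search': False}
```

Keeping one counter per operation is what lets indexing continue while search is disabled, which matches the shard-unavailability scenario described earlier.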
The definition of what constitutes a ‘continuous’ failure is configurable through yfs.properties_ysc_ext. Refer to all properties starting with yfs.ssi.fail.fast.* for more details about how to configure this feature. Note the following:
- A common YFS_Index_Status table of ‘STATISTICS’ table type is used to track the status of index/search operations using the IndexWorking and SearchWorking flags. When one of these operations is to be disabled, the corresponding flag is set to ‘N’. Note that this data is tracked separately for the Order and Shipment indexes.
- For an index, when indexing is disabled (IndexWorking is set to ‘N’), the search operation is no longer performed on that index. As mentioned above, this is because continuous failure of the indexing operation implies that the index is now stale. However, in such a case, the SearchWorking flag is not modified. SearchWorking is set to ‘N’ if, and only if, the search operation itself fails continuously.
- The SearchWorking and IndexWorking flags can be managed through the Index Management Console in SMA. Refer to the corresponding documentation for more details.
- When a JVM detects continuous failure of index/search operations, it notifies you through alerts and events before setting the corresponding flag to ‘N’.
- When an index/search operation is disabled, you need to fix the underlying problem and enable the operation again through the Index Management Console.
- While the index/search operation is disabled, every attempt to index/search writes a warning to the application logs.
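The flag rules in the notes above can be summarised in a short Python sketch. It is a simplified model under stated assumptions, not the product's code: the names ‘IndexStatus’, ‘can_index’, ‘can_search’, and ‘search’ are hypothetical, the flags are held in memory rather than in the YFS_Index_Status table, and returning None for a skipped search is a placeholder, since what the caller does instead is outside the scope of this example.

```python
import logging

logger = logging.getLogger("ssi_sketch")  # hypothetical logger name


class IndexStatus:
    """Hypothetical in-memory copy of the flags for one index."""

    def __init__(self):
        self.index_working = "Y"   # models the IndexWorking flag
        self.search_working = "Y"  # models the SearchWorking flag


def can_index(status):
    return status.index_working == "Y"


def can_search(status):
    # Search is skipped when indexing is disabled, because the index is
    # presumed stale, even though SearchWorking itself stays unchanged.
    return status.index_working == "Y" and status.search_working == "Y"


def search(status, index_name, query):
    if not can_search(status):
        # Every attempt while disabled produces a warning.
        logger.warning("Search on the %s index is disabled; skipping", index_name)
        return None  # placeholder; the real fallback is not modelled here
    return f"results for {query!r}"  # placeholder for the real search call


# Flags are tracked separately for the Order and Shipment indexes.
statuses = {"Order": IndexStatus(), "Shipment": IndexStatus()}
statuses["Order"].index_working = "N"  # indexing disabled for Order only

order_result = search(statuses["Order"], "Order", "abc")
shipment_result = search(statuses["Shipment"], "Shipment", "abc")
```

In this model the Order search is skipped although its SearchWorking flag is still ‘Y’, while the Shipment index is unaffected.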