IBM Match 360 event streaming message reconciliation

Even when Apache Kafka is configured for high availability, there is the potential for message streaming systems to fail. In these situations, the IBM Match 360 event streaming feature includes safeguards and an automatic reconciliation process to help ensure that messages are not lost.

The IBM Match 360 event streaming feature sends record and entity change events to Apache Kafka. To avoid losing event streaming messages when system problems occur, use a Kafka server that is deployed for high availability.

To further protect against message loss, IBM Match 360 follows a strict event streaming lifecycle that includes message publishing, staging, and reconciliation.

The IBM Match 360 event streaming lifecycle

If the IBM Match 360 event streaming capability is enabled with valid subscriptions, the service follows a standard event streaming lifecycle to help protect messages from being lost:

  • The IBM Match 360 data microservice (data-ms) generates events for each record change and entity change and publishes the events through an Apache Kafka connection to an external Kafka server.
  • If there is a publishing failure for any reason, the message gets staged in the internal IBM Match 360 RabbitMQ instance.
  • For all staged messages, IBM Match 360 periodically runs an internal reconciliation process that attempts to republish (synchronize) any failed messages from the internal RabbitMQ instance to Apache Kafka.
  • Within the reconciliation flow, a data engineer can configure the retry count and interval. IBM Match 360 attempts to synchronize failed events based on the retry configuration.
  • If all of the synchronization attempts fail, IBM Match 360 logs the failed event streaming message so that data engineers or administrators can trace, investigate, and resolve the reason for the failure.
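The lifecycle steps above can be sketched as follows. This is a minimal, illustrative simulation of the publish, stage, and reconcile flow; the function names, the in-memory queue, and the return values are hypothetical stand-ins, not IBM Match 360 APIs.

```python
# Illustrative sketch of the event streaming lifecycle. The deque stands in
# for the internal RabbitMQ staging queue; publish_to_kafka stands in for a
# real Kafka producer call. All names here are hypothetical.

from collections import deque

staging_queue = deque()  # stand-in for the internal RabbitMQ staging queue

def publish_to_kafka(event, kafka_available):
    """Attempt to publish one event; return True on success."""
    return kafka_available

def handle_event(event, kafka_available):
    # Step 1: try to publish the change event to the external Kafka server.
    if publish_to_kafka(event, kafka_available):
        return "published"
    # Step 2: on any publishing failure, stage the message internally.
    staging_queue.append(event)
    return "staged"

def reconcile(kafka_available, retry_count=5):
    """Periodically republish staged messages, up to retry_count attempts each."""
    failed = []
    while staging_queue:
        event = staging_queue.popleft()
        for _ in range(retry_count):
            if publish_to_kafka(event, kafka_available):
                break
        else:
            # All retries failed: log the event for investigation.
            failed.append(event)
    return failed
```

For example, an event handled while Kafka is down is staged rather than lost, and a later reconciliation run republishes it once Kafka is reachable again.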

Potential causes for message failures

IBM Match 360 event streaming messages can encounter delivery problems in the following scenarios:
  • Network intermittency. Typically, this type of failure corrects itself on subsequent retry attempts.
  • Kafka server connection failure. This type of failure can happen due to issues such as a password reset, a credential change, or an unavailable Kafka server. Connection failure events must be stored in the internal staging area until Kafka is available again; when the Kafka server is reachable, the events can be republished successfully.
  • Message or configuration issues. These types of failures can require follow-up configuration changes to resolve the underlying cause of the failure. They can occur for reasons such as the following:
    • The message is faulty or too large.
    • The Kafka Producer configuration is not correct. For example, the buffer size might be set too small.
    In the case of a message or configuration issue, republishing the failed message does not help; the root cause of the problem must be addressed first. These failures are logged so that data engineers or administrators can review and fix the problem.

Investigating and resolving message and configuration issues

When an IBM Match 360 event streaming message fails because of a message or configuration issue, a data engineer or administrator must investigate and resolve the underlying configuration problem.

To be able to correct the configuration settings and resolve this type of issue, you must have a good understanding of how the reconciliation process and RabbitMQ listeners work.

The reconciliation process uses an internally managed thread that reruns based on the configured wait time (waitTime), which defaults to five minutes. During each five-minute cycle, this managed thread verifies that a valid event streaming subscription is in place and that a connection to the Kafka server is available.
  • If both conditions are true, the thread starts the RabbitMQ listener if it is not yet running.
  • If either condition is false, the thread stops the RabbitMQ listener.
Essentially, the managed thread manages the RabbitMQ listeners and pulls messages from the staging queue only if Kafka is up and running.
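The managed thread's start/stop decision can be sketched as a small state check. This is a hedged illustration of the behavior described above; the listener class and function names are hypothetical stand-ins, not actual IBM Match 360 internals.

```python
# Minimal sketch of the managed reconciliation thread's per-cycle decision:
# start the RabbitMQ listener only when a valid subscription exists AND
# Kafka is reachable; otherwise stop it so messages stay staged.
# All names here are illustrative.

class RabbitMQListener:
    def __init__(self):
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False

def reconciliation_cycle(listener, has_valid_subscription, kafka_reachable):
    """One waitTime cycle (every five minutes by default)."""
    if has_valid_subscription and kafka_reachable:
        if not listener.running:
            listener.start()  # pull staged messages only while Kafka is up
    else:
        listener.stop()       # leave messages staged until Kafka recovers
    return listener.running
```

With this logic, a Kafka outage stops the listener on the next cycle, so staged messages are not consumed (and potentially lost) while the downstream server is unreachable.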

While processing the RabbitMQ staging queue, RabbitMQ enforces a consumer_time_out property that is set to 30 minutes by default. If a message is not acknowledged within 30 minutes, the channel is restarted and all of the unacknowledged messages are returned to the staging queue, which causes duplicate messages.

Determining the right configuration settings
All of the reconciliation-related configuration settings have carefully chosen default values. Change the settings only if you are certain of the implications. If you plan to use custom settings, adhere to the following guidance.
Note: Ideally, you should not need to change any settings other than the TTL and the maximum queue length. These settings should be defined by your Service Level Agreement (SLA) with IBM. The SLA should include details about what to do if Kafka is down: how many days, or how many events, can be staged in backup before events are lost? In theory, an organization with unlimited storage could stage an unlimited number of events, but in practice there is always a limit. Agree on an SLA that takes into account the storage limit, the event size, and the event generation rate.
Based on the aforementioned waitTime and consumer_time_out behaviors, use the following formulas to guide you when determining certain configuration settings:
  • reconRetryCount * reconRetryInterval > waitTime
  • Prefetch * reconRetryInterval < consumer_time_out
The default values for the settings mentioned in these formulas are as follows:
  • waitTime = 5 minutes
  • reconRetryCount = 5 attempts
  • reconRetryInterval = 2 minutes
  • Prefetch = 10 prefetched messages
  • consumer_time_out = 30 minutes
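With the default values listed above, both constraints hold, as this short check shows (all times are in minutes):

```python
# Verifying the two configuration constraints with the default values
# (all times in minutes).

wait_time = 5
recon_retry_count = 5
recon_retry_interval = 2
prefetch = 10
consumer_time_out = 30

# The retries for one message must outlast a reconciliation wait cycle:
# 5 * 2 = 10 > 5
assert recon_retry_count * recon_retry_interval > wait_time

# All prefetched messages must be retried within the RabbitMQ consumer
# timeout, or unacknowledged messages are requeued and duplicated:
# 10 * 2 = 20 < 30
assert prefetch * recon_retry_interval < consumer_time_out
```

If you change any one of these settings, recheck both inequalities; for example, raising Prefetch to 20 with the default 2-minute retry interval would violate the second constraint (40 is not less than 30).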
For more information about configuring these and other IBM Match 360 event streaming settings, see Configuring event streaming in IBM Match 360.
Example: Connection errors

When there is a connection error, the messages should remain in the staging queue until Kafka is reachable. The managed thread will only start the RabbitMQ listener if it can connect to Kafka.

Using the default configuration, after five minutes the thread notices that Kafka is still down. It then stops the RabbitMQ listener and, because it has not received a positive acknowledgment from the server, the messages remain staged for a later republish attempt when the connection is available again. Human intervention should not be required to resolve this type of issue.

Example: Message configuration errors

When there is a message configuration error that cannot be resolved by the system, human intervention is required. Kafka is up and running, but the message fails because of a different issue. Using the default configuration, the message is retried five times. This can take up to 20 minutes, assuming that all 10 prefetched messages have a problem (Prefetch * reconRetryInterval = 10 * 2 minutes). After five failed retries, the message is logged so that a data engineer or administrator can review and resolve the problem.

A typical example of a message or configuration issue that requires intervention is a buffer size that is too small. In this case, the Kafka client throws a BufferExhaustedException error during message publishing. This error can be caused by situations such as:
  • The message size is too large.
  • The Kafka message size limit is too small.
  • Event throughput is too high, causing the buffer limit to be exceeded.
  • The Kafka buffer limit is too low.
With this type of failure, subsequent retries will not change the result unless the underlying issue is resolved. These errors are logged, and data engineers or administrators should review the logs to diagnose the problem and take corrective action. Some of the possible resolutions for this issue include:
  • Reduce the message size using the includeRecordComposition property in the configmap.
  • Change the serializer type to AVRO binary compressible format by using the serializerType property in the configmap.
  • Increase the message size limit and buffer size in the Kafka Producer (see the general_properties section).
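For the last resolution, the standard Apache Kafka producer settings that govern the buffer and message size limits are buffer.memory and max.request.size. The following fragment is illustrative only; the values shown are examples, not recommendations, and where these properties are set for your deployment (for example, in the general_properties section) depends on your configuration.

```
# Example Kafka producer settings (illustrative values only).
# buffer.memory: total bytes the producer can use to buffer records awaiting
# delivery; exhausting it can raise BufferExhaustedException.
buffer.memory=67108864
# max.request.size: maximum size of a single request, in bytes.
max.request.size=2097152
```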

For more information about reviewing the logs for event streaming messages, see Tracing and debugging IBM Match 360 event streaming issues.

For more information about configuring event streaming settings, see Configuring event streaming in IBM Match 360.