IBM Match 360 event streaming message reconciliation
Even when Apache Kafka is configured for high availability, there is the potential for message streaming systems to fail. In these situations, the IBM Match 360 event streaming feature includes safeguards and an automatic reconciliation process to help ensure that messages are not lost.
The IBM Match 360 event streaming feature sends record and entity change events to Apache Kafka. To avoid losing event streaming messages when system problems occur, use a Kafka server that is deployed for high availability.
To further protect against streaming message loss, IBM Match 360 follows a strict event streaming lifecycle that includes message staging, publishing, and reconciliation. This process helps to ensure that event streaming messages do not get lost.
The IBM Match 360 event streaming lifecycle
If the IBM Match 360 event streaming capability is enabled with valid subscriptions, the service follows a standard event streaming lifecycle to help protect messages from being lost:
- The IBM Match 360 data microservice (data-ms) generates events for each record change and entity change and publishes the events through an Apache Kafka connection to an external Kafka server.
- If there is a publishing failure for any reason, the message gets staged in the internal IBM Match 360 RabbitMQ instance.
- For all staged messages, IBM Match 360 periodically runs an internal reconciliation process that attempts to republish (synchronize) any failed messages from the internal RabbitMQ instance to Apache Kafka.
- Within the reconciliation flow, a data engineer can configure the retry count and interval. IBM Match 360 attempts to synchronize failed events based on the retry configuration.
- If all of the synchronization attempts fail, IBM Match 360 logs the failed event streaming message so that data engineers or administrators can trace, investigate, and resolve the reason for the failure.
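The following Java sketch illustrates this lifecycle in simplified form. The KafkaPublisher interface and the in-memory staging queue are hypothetical stand-ins for the external Kafka connection and the internal RabbitMQ instance; this is not the actual data-ms implementation.
```java
import java.util.ArrayDeque;
import java.util.Queue;

// Simplified model of the IBM Match 360 event streaming lifecycle.
public class EventStreamingLifecycleSketch {

    interface KafkaPublisher {
        void publish(String event) throws Exception;
    }

    private final KafkaPublisher kafka;
    private final Queue<String> stagingQueue = new ArrayDeque<>(); // stand-in for RabbitMQ
    private final int reconRetryCount;        // number of republish attempts
    private final long reconRetryIntervalMs;  // wait between attempts

    public EventStreamingLifecycleSketch(KafkaPublisher kafka,
                                         int reconRetryCount, long reconRetryIntervalMs) {
        this.kafka = kafka;
        this.reconRetryCount = reconRetryCount;
        this.reconRetryIntervalMs = reconRetryIntervalMs;
    }

    // Steps 1 and 2: publish each record or entity change event; stage it on failure.
    public void onRecordOrEntityChange(String event) {
        try {
            kafka.publish(event);
        } catch (Exception publishFailure) {
            stagingQueue.add(event); // held internally until reconciliation
        }
    }

    // Steps 3 to 5: periodically republish staged messages, retrying per configuration.
    public void reconcile() throws InterruptedException {
        int staged = stagingQueue.size();
        for (int i = 0; i < staged; i++) {
            String event = stagingQueue.poll();
            boolean synced = false;
            for (int attempt = 1; attempt <= reconRetryCount && !synced; attempt++) {
                try {
                    kafka.publish(event);
                    synced = true;
                } catch (Exception retryFailure) {
                    Thread.sleep(reconRetryIntervalMs);
                }
            }
            if (!synced) {
                // Step 5: log the failed message so that it can be investigated.
                System.err.println("Event streaming message failed to sync: " + event);
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate a Kafka server that is unreachable.
        EventStreamingLifecycleSketch lifecycle = new EventStreamingLifecycleSketch(
                event -> { throw new Exception("Kafka unavailable"); }, 3, 10);
        lifecycle.onRecordOrEntityChange("record-change-1"); // gets staged
        lifecycle.reconcile();                               // retries, then logs the failure
    }
}
```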
Potential causes for message failures
- Network intermittency. Typically, this type of failure corrects itself on subsequent retry attempts.
- Kafka server connection failure. This type of failure can happen because of issues such as a password reset, a credential change, or the Kafka server being unavailable. In general, events that fail because of a connection failure are held in the internal staging area until Kafka is available again. When the Kafka server is reachable again, the events can be republished successfully.
- Message or configuration issues. These types of failures can require follow-up configuration changes to resolve the underlying cause, and can occur for reasons such as the following:
- The message is faulty or too large.
- The Kafka Producer configuration is not correct. For example, the buffer size might be set too small.
Investigating and resolving message and configuration issues
When an IBM Match 360 event streaming message fails because of a message or configuration issue, data engineers or administrators must carefully investigate and resolve the underlying configuration problem.
To be able to correct the configuration settings and resolve this type of issue, you must have a good understanding of how the reconciliation process and RabbitMQ listeners work.
The reconciliation process runs on a managed thread at a configurable interval (waitTime), which is set to five minutes by default. With each five-minute cycle, this managed thread verifies that there is a valid event streaming subscription in place and that a connection to the Kafka server is available.
- If both conditions are true, the thread starts the RabbitMQ listener if it is not yet running.
- If either condition is false, the thread stops the RabbitMQ listener.
While the RabbitMQ staging queue is being processed, the RabbitMQ consumer_time_out property applies. It is set to 30 minutes by default. If a given message is not acknowledged within 30 minutes, the channel is restarted and all of the messages are sent back to the staging queue, which causes duplicate messages.
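To make this timing behavior concrete, the following Java sketch models the managed thread as a scheduled task. The subscription check, connectivity check, and listener controls are hypothetical placeholders rather than the actual IBM Match 360 internals.
```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Illustrative model of the managed thread: every waitTime minutes it checks
// for a valid subscription and a reachable Kafka server, then starts or stops
// the RabbitMQ listener accordingly.
public class ReconciliationSchedulerSketch {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start(long waitTimeMinutes,
                      BooleanSupplier hasValidSubscription,
                      BooleanSupplier kafkaIsReachable,
                      Runnable startListener,
                      Runnable stopListener) {
        scheduler.scheduleAtFixedRate(() -> {
            if (hasValidSubscription.getAsBoolean() && kafkaIsReachable.getAsBoolean()) {
                startListener.run();  // begin draining the RabbitMQ staging queue
            } else {
                stopListener.run();   // leave staged messages in place until Kafka is back
            }
        }, 0, waitTimeMinutes, TimeUnit.MINUTES);
    }

    public static void main(String[] args) {
        // Simulate cycles where the subscription is valid but Kafka is unreachable.
        new ReconciliationSchedulerSketch().start(5,
                () -> true,
                () -> false,
                () -> System.out.println("Starting RabbitMQ listener"),
                () -> System.out.println("Stopping RabbitMQ listener; messages stay staged"));
    }
}
```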
Determining the right configuration settings
All of the reconciliation-related configuration settings have carefully chosen defaults. Change the settings only if you are certain of the implications. If you plan to use custom settings, be sure to adhere to the following guidance.
Note: Ideally, you should not have to change any settings except for TTL and the maximum queue length. These settings should be defined by your Service Level Agreement (SLA) with IBM. The SLA should include details about what to do if Kafka is down, such as how many days or how many events can be staged in backup before events are lost. In theory, if your organization can provide unlimited storage, then the solution can store an unlimited number of events, but in practice there is always a limit. Agree on an SLA that takes into account the storage limit, event size, and event generation rate.
Based on the aforementioned waitTime and consumer_time_out behaviors, use the following formulas to guide you when determining certain configuration settings:
- reconRetryCount * reconRetryInterval > waitTime
- Prefetch * reconRetryInterval < consumer_time_out
With the default configuration, these values are:
- waitTime = 5 minutes
- reconRetryCount = 5 attempts
- reconRetryInterval = 2 minutes
- Prefetch = 10 prefetched messages
- consumer_time_out = 30 minutes
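As a quick sanity check, the default values can be substituted into those formulas. The following Java snippet simply evaluates both constraints with the defaults listed above; the variable names mirror the configuration properties.
```java
// Quick check of the reconciliation constraints using the default settings.
public class ReconciliationDefaultsCheck {
    public static void main(String[] args) {
        int waitTime = 5;            // minutes
        int reconRetryCount = 5;     // attempts
        int reconRetryInterval = 2;  // minutes
        int prefetch = 10;           // prefetched messages
        int consumerTimeOut = 30;    // minutes (RabbitMQ consumer_time_out)

        // reconRetryCount * reconRetryInterval > waitTime  ->  5 * 2 = 10 > 5
        System.out.println("Retry window exceeds waitTime: "
                + (reconRetryCount * reconRetryInterval > waitTime));

        // Prefetch * reconRetryInterval < consumer_time_out  ->  10 * 2 = 20 < 30
        System.out.println("Prefetched retries fit within consumer_time_out: "
                + (prefetch * reconRetryInterval < consumerTimeOut));
    }
}
```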
Example: Connection errors
When there is a connection error, the messages should remain in the staging queue until Kafka is reachable. The managed thread will only start the RabbitMQ listener if it can connect to Kafka.
Using the default configuration, after five minutes the thread notices that Kafka is still down. It then stops the RabbitMQ listener and, since it has not received a positive acknowledgment from the server, stages the message for a later republish attempt when the connection is available again. Human intervention should not be required to resolve this type of issue.
Example: Message configuration errors
When there is a message configuration error that cannot be resolved by the system, human intervention is required. Kafka is up and running, but the message is failing because of a different issue. Using the default configuration, the message is retried five times. This can take up to 20 minutes, assuming all of the 10 prefetched messages have a problem. After five failed retries, the message gets logged so that a data engineer or administrator can review and resolve the problem.
An example of a typical message and configuration issue that requires intervention is when the buffer size is too small. A BufferExhaustedException error is thrown by the Kafka client during message publishing. This error can be caused by situations such as:
- The message size is too large.
- The Kafka message size limit is too small.
- Event throughput is too high, causing the buffer limit to be exceeded.
- The Kafka buffer limit is too low.
To resolve this type of issue, consider the following actions:
- Reduce the message size by using the includeRecordComposition property in the configmap.
- Change the serializer type to the AVRO binary compressible format by using the serializerType property in the configmap.
- Increase the message size limit and buffer size in the Kafka Producer (see the general_properties section).
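For reference, the Kafka Producer settings that typically drive a BufferExhaustedException are the producer buffer memory and the maximum request size. The following Java sketch shows how these might be tuned directly against the Kafka client; in IBM Match 360 the equivalent values are supplied through the configmap (see the general_properties section), so the exact property names and mechanism can differ. The broker address and sizes shown are example values only.
```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

// Illustrative Kafka Producer tuning for BufferExhaustedException issues.
// Values are examples only; size them to match your SLA and event volumes.
public class TunedProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka.example.com:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        // Increase the in-memory buffer available to the producer (default is 32 MB).
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 64 * 1024 * 1024);
        // Raise the maximum size of a single request (default is 1 MB).
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 4 * 1024 * 1024);
        // Compress batches to reduce the effective message size on the wire.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish events as usual; the larger buffer reduces the chance of
            // BufferExhaustedException under high event throughput.
        }
    }
}
```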
For more information about reviewing the logs for event streaming messages, see Tracing and debugging IBM Match 360 event streaming issues.
For more information about configuring event streaming settings, see Configuring event streaming in IBM Match 360.