Inference service stops processing

The inference service stops processing and does not restart in your IBM® Netcool® Operations Insight® on Red Hat® OpenShift® deployment. Restart pods to workaround the issue.

Problem

The inference service throws a Redis Stack trace, stops processing, but pods remain running and are not restarted by inference self monitoring.

Example stack trace:
> INFO  [2022-07-07 13:06:54,636] org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer clientId=consumer-inferenceService-10, groupId=inferenceService] Member consumer-inferenceService-10-9e5bf334-0510-4741-9d87-908d4cc2d853 sending LeaveGroup request to coordinator evtmanager-kafka-4.evtmanager-kafka.noi.svc.cluster.local:9092 (id: 2147483643 rack: null) due to consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
> ERROR [2022-07-11 02:36:29,154] redis.clients.jedis.JedisSentinelPool: Lost connection to Sentinel at evtmanager-ibm-redis:26379. Sleeping 5000ms and retrying.
> ! java.net.SocketException: Connection timed out (Read failed)
> ! at java.base/java.net.SocketInputStream.socketRead(Unknown Source)
> ! at java.base/java.net.SocketInputStream.read(Unknown Source)
> ! at java.base/java.net.SocketInputStream.read(Unknown Source)
> ! at java.base/java.net.SocketInputStream.read(Unknown Source)
> ! at redis.clients.util.RedisInputStream.ensureFill(RedisInputStream.java:195)
> ! ... 9 common frames omitted
> ! Causing: redis.clients.jedis.exceptions.JedisConnectionException: java.net.SocketException: Connection timed out (Read failed)
> ! at redis.clients.util.RedisInputStream.ensureFill(RedisInputStream.java:201)
> ! at redis.clients.util.RedisInputStream.readByte(RedisInputStream.java:40)
> ! at redis.clients.jedis.Protocol.process(Protocol.java:141)
> ! at redis.clients.jedis.Protocol.read(Protocol.java:205)
> ! at redis.clients.jedis.Connection.readProtocolWithCheckingBroken(Connection.java:297)
> ! at redis.clients.jedis.Connection.getRawObjectMultiBulkReply(Connection.java:242)
> ! at redis.clients.jedis.JedisPubSub.process(JedisPubSub.java:108)
> ! at redis.clients.jedis.JedisPubSub.proceed(JedisPubSub.java:102)
> ! at redis.clients.jedis.Jedis.subscribe(Jedis.java:2628)
> ! at redis.clients.jedis.JedisSentinelPool$MasterListener.run(JedisSentinelPool.java:290)

Resolution

To work around this issue, delete the inference service pods.
  1. Run the following command to delete the inference service pods.
    oc delete po <inferenceservice>
  2. Confirm that the pods have restarted and are in a Running state, by running the following command:
    oc get po | grep inference
    Example output:
    evtmanager-ibm-hdm-analytics-dev-inferenceservice-f897cf7bhf2p9   1/1     Running     0                3m
    evtmanager-ibm-hdm-analytics-dev-inferenceservice-f897cf7bl2pn5   1/1     Running     0                29d