Kafka pods in CrashLoop when storage size is not large enough

Symptoms

After issuing this command:

kubectl get pods | grep kafka

You receive a response similar to this:

<GI_CR_name>-kafka-0 0/1 CrashLoopBackOff 129 (3m53s ago) 2d21h
<GI_CR_name>-kafka-1 0/1 CrashLoopBackOff 136 (4m25s ago) 2d21h
<GI_CR_name>-kafka-2 0/1 CrashLoopBackOff 136 (2m35s ago) 2d21h

Where <GI_CR_name> is the name of the Guardium® Data Security Center custom resource (CR).

And if you check one of the logs (kubectl logs <GI_CR_name>-kafka-0), you receive this:

java.io.IOException: No space left on device
at java.base/java.io.FileOutputStream.write(FileOutputStream.java:349)
at java.base/sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:234)
at java.base/sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:313)
at java.base/sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:318)
at java.base/sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:160)
at java.base/java.io.OutputStreamWriter.flush(OutputStreamWriter.java:248)
at java.base/java.io.BufferedWriter.flush(BufferedWriter.java:257)
at org.apache.kafka.server.common.CheckpointFile.write(CheckpointFile.java:82)
at org.apache.kafka.storage.internals.checkpoint.CheckpointFileWithFailureHandler.write(CheckpointFileWithFailureHandler.java:46)
at org.apache.kafka.storage.internals.checkpoint.LeaderEpochCheckpointFile.write(LeaderEpochCheckpointFile.java:56)
at org.apache.kafka.storage.internals.epoch.LeaderEpochFileCache.flushTo(LeaderEpochFileCache.java:414)
at org.apache.kafka.storage.internals.epoch.LeaderEpochFileCache.flush(LeaderEpochFileCache.java:421)
at org.apache.kafka.storage.internals.epoch.LeaderEpochFileCache.truncateFromEnd(LeaderEpochFileCache.java:310)
at kafka.log.LogLoader.$anonfun$load$12(LogLoader.scala:182)
at kafka.log.LogLoader.$anonfun$load$12$adapted(LogLoader.scala:182)
at scala.Option.foreach(Option.scala:437)
at kafka.log.LogLoader.load(LogLoader.scala:182)
at kafka.log.UnifiedLog$.apply(UnifiedLog.scala:1804)
at kafka.log.LogManager.loadLog(LogManager.scala:278)
at kafka.log.LogManager.$anonfun$loadLogs$15(LogManager.scala:421)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:857)
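
If a failing pod has already restarted and its current log does not show the error, you can retrieve the log from the previous container instance. This is standard kubectl usage; the pod name is an example:

kubectl logs <GI_CR_name>-kafka-0 --previous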

Causes

This problem occurs when the storage size that is allocated to Kafka is not large enough, so the Kafka brokers run out of disk space and cannot start.
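
To confirm the storage size that is currently configured for Kafka, you can read it from the Guardium Data Security Center CR. This is a minimal sketch that assumes the storage settings are at the dependency-kafka path shown in the resolution steps below:

kubectl get guardiuminsights <GI_CR_name> -o yaml | grep -A 5 "dependency-kafka:"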

Resolving the problem

  1. Edit the Guardium Data Security Center CR file:
    kubectl edit guardiuminsights <GI_CR_name>
  2. Locate this section of the file:
    spec:
    ....
      dependency-kafka:
        kafka:
          storage:
            class: <block storage class>
            size: <size>
            type: persistent-claim

    If this section does not exist under spec:, add it as shown in the preceding code block.

  3. Increase the size: value. For example, increase it to 500Gi:
    spec:
    ....
      dependency-kafka:
        kafka:
          storage:
            class: <block storage class>
            size: 500Gi
            type: persistent-claim
  4. Edit the Kafka CR file:
    kubectl edit kafka <GI_CR_name>
  5. Increase the size: value in this file as well (a non-interactive alternative that uses kubectl patch for steps 1 through 5 is sketched after this procedure):
    spec:
    .....
      kafka:
    .....
        storage:
          class: gp3-csi
          size: 500Gi
          type: persistent-claim
  6. To verify that the PVCs for Kafka have been increased, issue this command:
    kubectl get pvc | grep kafka
    This should return the new size:
    data-sysqagi-kafka-0    Bound    pvc-8b9c6b5c-7ee4-4a6d-8a93-12db1c82e7cd    500Gi    RWO    gp3-csi    3d12h
    data-sysqagi-kafka-1    Bound    pvc-be5f5ce4-7627-4971-af65-36fcb42d45ad    500Gi    RWO    gp3-csi    3d12h
    data-sysqagi-kafka-2    Bound    pvc-23808ff4-eb0b-46d3-8883-b73e79841076    500Gi    RWO    gp3-csi    3d12h
  7. Kafka should resume running after 10 to 25 minutes. If the pods do not return to a healthy state in that time, delete them so that they are re-created:
    kubectl delete pod <GI_CR_name>-kafka-0 <GI_CR_name>-kafka-1 <GI_CR_name>-kafka-2
  8. After the pods are deleted, issue this command:
    kubectl get pods | grep kafka

    The status should now appear similar to this:

    <GI_CR_name>-kafka-0 1/1 Running 0 2m30s
    <GI_CR_name>-kafka-1 1/1 Running 0 2m30s
    <GI_CR_name>-kafka-2 1/1 Running 0 2m30s
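
If you prefer not to use an interactive editor for steps 1 through 5, you can apply the same size change with kubectl patch. This is a minimal sketch that assumes the dependency-kafka section already exists in the Guardium Data Security Center CR with class and type set, and that 500Gi is the target size:

kubectl patch guardiuminsights <GI_CR_name> --type merge -p '{"spec":{"dependency-kafka":{"kafka":{"storage":{"size":"500Gi"}}}}}'
kubectl patch kafka <GI_CR_name> --type merge -p '{"spec":{"kafka":{"storage":{"size":"500Gi"}}}}'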

If the Kafka pods remain in CrashLoopBackOff after you complete these steps, wait approximately 6 hours and then issue this command:

kubectl get pv | grep kafka

If the results show that the persistent volumes (PVs) still have the old size, for example:

pvc-1e527b30-0bb8-4e15-885c-0b7498b36656   250Gi   RWO   Delete   Bound   sysqa/data-sysqagi-kafka-2   gp3-csi   6d6h
pvc-b3be4f4e-41bf-45b8-aca9-1bd727866d28   250Gi   RWO   Delete   Bound   sysqa/data-sysqagi-kafka-1   gp3-csi   6d6h
pvc-b568508f-5322-4157-8db0-eb767a93d495   250Gi   RWO   Delete   Bound   sysqa/data-sysqagi-kafka-0   gp3-csi   6d6h
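
One reason the new size might not propagate to the PVs is that the storage class does not allow volume expansion. You can check the standard allowVolumeExpansion field on the block storage class that Kafka uses (<storage_class_name> is a placeholder):

kubectl get storageclass <storage_class_name> -o jsonpath='{.allowVolumeExpansion}'

If the command returns false or nothing, the provisioner does not expand existing volumes automatically.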

You can manually update each PV with the new storage size. First, edit the PV:

kubectl edit pv <pv_name>

Then update the size, for example, from 250Gi to 500Gi:

spec: 
.....
  capacity: 
    storage: 500Gi
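
As an alternative to editing each PV interactively, you can apply the same change with kubectl patch. This is a minimal sketch that assumes 500Gi is the target size and <pv_name> is one of the Kafka PV names from the earlier output; repeat it for each of the three PVs:

kubectl patch pv <pv_name> --type merge -p '{"spec":{"capacity":{"storage":"500Gi"}}}'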