Kafka pods in CrashLoop when storage size is not large enough
Symptoms
After issuing this command:
kubectl get pods | grep kafka
You receive a response similar to this:
<GI_CR_name>-kafka-0 0/1 CrashLoopBackOff 129 (3m53s ago) 2d21h
<GI_CR_name>-kafka-1 0/1 CrashLoopBackOff 136 (4m25s ago) 2d21h
<GI_CR_name>-kafka-2 0/1 CrashLoopBackOff 136 (2m35s ago) 2d21h
Where <GI_CR_name> is the name of the Guardium® Data Security Center custom resource (CR).
If you check the logs of one of the pods (kubectl logs <GI_CR_name>-kafka-0), you receive output similar to this:
java.io.IOException: No space left on device
at java.base/java.io.FileOutputStream.write(FileOutputStream.java:349)
at java.base/sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:234)
at java.base/sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:313)
at java.base/sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:318)
at java.base/sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:160)
at java.base/java.io.OutputStreamWriter.flush(OutputStreamWriter.java:248)
at java.base/java.io.BufferedWriter.flush(BufferedWriter.java:257)
at org.apache.kafka.server.common.CheckpointFile.write(CheckpointFile.java:82)
at org.apache.kafka.storage.internals.checkpoint.CheckpointFileWithFailureHandler.write(CheckpointFileWithFailureHandler.java:46)
at org.apache.kafka.storage.internals.checkpoint.LeaderEpochCheckpointFile.write(LeaderEpochCheckpointFile.java:56)
at org.apache.kafka.storage.internals.epoch.LeaderEpochFileCache.flushTo(LeaderEpochFileCache.java:414)
at org.apache.kafka.storage.internals.epoch.LeaderEpochFileCache.flush(LeaderEpochFileCache.java:421)
at org.apache.kafka.storage.internals.epoch.LeaderEpochFileCache.truncateFromEnd(LeaderEpochFileCache.java:310)
at kafka.log.LogLoader.$anonfun$load$12(LogLoader.scala:182)
at kafka.log.LogLoader.$anonfun$load$12$adapted(LogLoader.scala:182)
at scala.Option.foreach(Option.scala:437)
at kafka.log.LogLoader.load(LogLoader.scala:182)
at kafka.log.UnifiedLog$.apply(UnifiedLog.scala:1804)
at kafka.log.LogManager.loadLog(LogManager.scala:278)
at kafka.log.LogManager.$anonfun$loadLogs$15(LogManager.scala:421)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:857)
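To confirm that the data volume is actually full, you can check the current size of the Kafka PVCs and, if a broker container stays up long enough between restarts, the disk usage on its data volume. This is a sketch only: the data directory mount path (/var/lib/kafka) is an assumption and may differ in your deployment.
kubectl get pvc | grep kafka
kubectl exec <GI_CR_name>-kafka-0 -- df -h /var/lib/kafka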
Causes
This problem occurs when the Kafka storage size is not large enough.

Resolving the problem
- Edit the Guardium Data Security Center CR:
  kubectl edit guardiuminsights <GI_CR_name>
- Locate this section of the file:
  spec:
    ....
    dependency-kafka:
      kafka:
        storage:
          class: <block storage class>
          size: <size>
          type: persistent-claim
  If the section is not under spec:, add it as shown in the code block above.
- Increase the size: value - for example, increase it to 500Gi:
  spec:
    ....
    dependency-kafka:
      kafka:
        storage:
          class: <block storage class>
          size: 500Gi
          type: persistent-claim
- Edit the Kafka CR:
  kubectl edit kafka <GI_CR_name>
- Increase the size: in this file as well:
  spec:
    .....
    kafka:
      .....
      storage:
        class: gp3-csi
        size: 500Gi
        type: persistent-claim
  If you prefer a non-interactive edit of both CRs, see the kubectl patch sketch after this procedure.
- To verify that the PVC for Kafka has been increased, issue this command:
  kubectl get pvc | grep kafka
  This should return the new size:
  data-sysqagi-kafka-0   Bound   pvc-8b9c6b5c-7ee4-4a6d-8a93-12db1c82e7cd   500Gi   RWO   gp3-csi   3d12h
  data-sysqagi-kafka-1   Bound   pvc-be5f5ce4-7627-4971-af65-36fcb42d45ad   500Gi   RWO   gp3-csi   3d12h
  data-sysqagi-kafka-2   Bound   pvc-23808ff4-eb0b-46d3-8883-b73e79841076   500Gi   RWO   gp3-csi   3d12h
- Kafka should resume running after 10 to 25 minutes. If it does not return to a healthy state in that time, force delete the pods:
  kubectl delete pod <GI_CR_name>-kafka-0 <GI_CR_name>-kafka-1 <GI_CR_name>-kafka-2
- After force deletion, issue this command:
  kubectl get pods | grep kafka
  The status should now appear similar to this:
  <GI_CR_name>-kafka-0   1/1   Running   0   2m30s
  <GI_CR_name>-kafka-1   1/1   Running   0   2m30s
  <GI_CR_name>-kafka-2   1/1   Running   0   2m30s
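As a non-interactive alternative to editing the two CRs in the steps above, you can apply the same size change with kubectl patch. This is a sketch that assumes the field paths shown above (spec.dependency-kafka.kafka.storage.size in the Guardium Data Security Center CR and spec.kafka.storage.size in the Kafka CR) and 500Gi as the example target size; verify the paths against your CRs before applying.
kubectl patch guardiuminsights <GI_CR_name> --type merge -p '{"spec":{"dependency-kafka":{"kafka":{"storage":{"size":"500Gi"}}}}}'
kubectl patch kafka <GI_CR_name> --type merge -p '{"spec":{"kafka":{"storage":{"size":"500Gi"}}}}'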
If the Kafka pods remain in a CrashLoop after the above steps, wait approximately 6 hours and then issue this command:
kubectl get pv | grep kafka
If the results of the command show that the persistent volumes are still the old size, for example:
pvc-1e527b30-0bb8-4e15-885c-0b7498b36656 250Gi RWO Delete Bound sysqa/data-sysqagi-kafka-2 gp3-csi 6d6h
pvc-b3be4f4e-41bf-45b8-aca9-1bd727866d28 250Gi RWO Delete Bound sysqa/data-sysqagi-kafka-1 gp3-csi 6d6h
pvc-b568508f-5322-4157-8db0-eb767a93d495 250Gi RWO Delete Bound sysqa/data-sysqagi-kafka-0 gp3-csi 6d6h
You can manually update the size of each PV with the new storage capacity. First, edit the PV:
kubectl edit pv <pv_name>
Then update the size, for example, from 250Gi to 500Gi:
spec:
  .....
  capacity:
    storage: 500Gi
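If you prefer not to edit each PV interactively, the same change can be applied with kubectl patch. This is a sketch of the edit shown above, again using 500Gi as the example target size; repeat it for each of the three Kafka PVs.
kubectl patch pv <pv_name> -p '{"spec":{"capacity":{"storage":"500Gi"}}}'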