Troubleshooting Failed Jobs

Job has failed and has been redelivered

In an application deployed on Kubernetes/OpenShift, an Optimization Server Worker stops before completing its execution, and the container logs show no explicit error. In the Job list widget, the job is marked as failed with the following message:

java.lang.RuntimeException: Job had failed and has been redelivered, then abandoned

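This is usually caused by a worker pod that tries to allocate more memory than its Kubernetes configuration allows. Kubernetes then kills the container (its last state typically shows the reason OOMKilled), so the worker has no chance to log an error. To solve the problem, change the memory limits in the worker service Helm chart to a more appropriate value, keeping the JVM maximum heap size (-Xmx) below the container memory limit: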

   spec:
      containers:
        - env:
            - name: JAVA_TOOL_OPTIONS # Configures the JVM memory
              value: -Xmx4000m -Xms500m -XX:+CrashOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/carhartt-mso-checker-worker-heap-dump.hprof
            ...
          resources: # Configures the Kubernetes Pod resources
            limits:
              memory: 4256Mi
            requests:
              cpu: 100m
              memory: 1000Mi
      ...
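
The relationship between the JVM heap setting and the pod memory limit in the configuration above can be checked with a little arithmetic. The following is a minimal sketch, not part of the product: the function names (jvm_max_heap_bytes, k8s_memory_bytes, heap_headroom_mib) are hypothetical, and it assumes that -Xmx uses a k/m/g suffix (or none, meaning bytes) and that the Kubernetes limit uses a Ki/Mi/Gi suffix.

```python
import re

# Binary multipliers, in bytes. The JVM treats k/m/g as multiples of 1024.
_JVM_UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}
# Only the binary Kubernetes suffixes are handled in this sketch (assumption).
_K8S_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def jvm_max_heap_bytes(java_tool_options: str) -> int:
    """Return the -Xmx value found in a JAVA_TOOL_OPTIONS string, in bytes."""
    match = re.search(r"-Xmx(\d+)([kKmMgG]?)", java_tool_options)
    if match is None:
        raise ValueError("no -Xmx option found")
    return int(match.group(1)) * _JVM_UNITS[match.group(2).lower()]


def k8s_memory_bytes(quantity: str) -> int:
    """Convert a Kubernetes memory quantity such as '4256Mi' to bytes."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi)", quantity)
    if match is None:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    return int(match.group(1)) * _K8S_UNITS[match.group(2)]


def heap_headroom_mib(java_tool_options: str, memory_limit: str) -> float:
    """Memory left in the container for non-heap JVM usage, in MiB."""
    limit = k8s_memory_bytes(memory_limit)
    heap = jvm_max_heap_bytes(java_tool_options)
    return (limit - heap) / 1024**2


if __name__ == "__main__":
    options = "-Xmx4000m -Xms500m -XX:+CrashOnOutOfMemoryError"
    print(heap_headroom_mib(options, "4256Mi"))  # prints 256.0
```

With the values shown above, the headroom is 256 MiB. That remainder has to cover everything the JVM allocates outside the heap (metaspace, thread stacks, direct buffers, and so on), so a zero or negative result means the container can be killed before the JVM itself reports an out-of-memory error.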