
MustGather: Performance, hang, or high CPU issues with WebSphere Application Server on Linux on Containers

Troubleshooting


Problem

If you are experiencing performance, hang, or high CPU issues with WebSphere Application Server on Linux on Containers (e.g. OpenShift or Kubernetes), this MustGather will assist you in collecting the data necessary to diagnose and resolve the issue.

Resolving The Problem

This tool gathers diagnostics without requiring any tool installation or container restarts. It does this by using worker node debug pods to gather diagnostics on the worker node(s) rather than within the container(s).
Steps
  1. Note: This tool requires that you are logged in as a user with the cluster-admin superuser privilege.
  2. Ensure that you have the oc command on your PATH and that you are logged into your cluster.
  3. Determine the name of the deployment or pod that you want to gather diagnostics for and the namespace it's in.
  4. Download a helper script:
    1. macOS or Linux: containerdiag.sh
    2. Windows: containerdiag.bat
  5. On macOS or Linux, make the script executable:
    chmod +x containerdiag.sh
  6. On macOS, remove the download quarantine:
    xattr -d com.apple.quarantine containerdiag.sh
  7. Start the diagnostics:
    1. For WebSphere Liberty or OpenLiberty, use the libertyperf.sh script and replace $DEPLOYMENT with the deployment name:
      ./containerdiag.sh -d $DEPLOYMENT libertyperf.sh
      (containerdiag.bat on Windows)
    2. For WebSphere Application Server traditional, use the twasperf.sh script and replace $DEPLOYMENT with the deployment name:
      ./containerdiag.sh -d $DEPLOYMENT twasperf.sh
      (containerdiag.bat on Windows)
  8. After the command completes on each pod, the output explains how to download the diagnostics from a new terminal window. For example:
    run.sh: Files are ready for download. Download with the following command in another window:
    
      oc cp worker1-debug:/tmp/containerdiag.abc.tar.gz containerdiag.abc.tar.gz --namespace=ffzhc74l4c
    
    After the download is complete, type OK and press ENTER: 
    
  9. After the download is complete, return to the original terminal window, type OK, and press ENTER. The script will continue iterating over the other pods or finish if there are no more pods.
  10. Upload the containerdiag*.tar.gz file(s) to the support case.
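
The prerequisites in steps 1 through 3 can be checked before starting the diagnostics. The following is a minimal sketch, not part of containerdiag.sh: the function name preflight and its two arguments are hypothetical, and the oc auth can-i check only approximates the cluster-admin requirement.

```shell
#!/bin/sh
# Hypothetical pre-flight helper; it is NOT part of containerdiag.sh.
# Usage: preflight DEPLOYMENT NAMESPACE
preflight() {
  deployment="$1"
  namespace="$2"

  # The oc client must be on the PATH (step 2).
  if ! command -v oc >/dev/null 2>&1; then
    echo "oc not found on PATH" >&2
    return 1
  fi

  # oc whoami fails when there is no active login session (step 2).
  if ! oc whoami >/dev/null 2>&1; then
    echo "not logged in to a cluster" >&2
    return 1
  fi

  # Assumption: "may do anything in every namespace" is used here as a
  # stand-in for the cluster-admin privilege (step 1).
  if [ "$(oc auth can-i '*' '*' --all-namespaces 2>/dev/null)" != "yes" ]; then
    echo "current user does not appear to have cluster-admin privileges" >&2
    return 1
  fi

  # The target deployment must exist in the given namespace (step 3).
  if ! oc get deployment "$deployment" --namespace "$namespace" >/dev/null 2>&1; then
    echo "deployment $deployment not found in namespace $namespace" >&2
    return 1
  fi

  echo "ok"
}
```

After sourcing the function, something like preflight "$DEPLOYMENT" "$NAMESPACE" && ./containerdiag.sh -d "$DEPLOYMENT" -n "$NAMESPACE" libertyperf.sh would start the diagnostics only when all checks pass.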

Limitations
  1. libertyperf.sh cannot gather a Liberty server dump on vanilla Kubernetes clusters (i.e. not OpenShift) due to a permissions difference between oc debug node and kubectl debug node. If the cluster is OpenShift, ensure you're using oc instead of kubectl.
  2. libertyperf.sh and twasperf.sh can only gather files (e.g. logs, configuration, javacores, etc.) from the container's ephemeral filesystem and cannot gather files from mounted persistent volumes. You may separately access such files through the underlying filesystem instead (e.g. NFS). You may vote on the feature request to support this if it affects you.
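
To find out whether the second limitation applies to a pod, it may help to list which of its volumes are backed by a persistent volume claim. The following is a sketch only: the function name list_pvc_volumes is hypothetical, and it relies on oc get printing an empty value for volumes that have no persistentVolumeClaim field.

```shell
#!/bin/sh
# Hypothetical helper; it is NOT part of containerdiag.sh.
# Prints "VOLUME -> CLAIM" for each volume of a pod that is backed by a
# persistent volume claim. Files on these volumes are not gathered by
# libertyperf.sh or twasperf.sh.
# Usage: list_pvc_volumes POD NAMESPACE
list_pvc_volumes() {
  pod="$1"
  namespace="$2"

  # Emit one "name<TAB>claimName" line per volume; claimName is empty for
  # volumes that are not PVC-backed, and awk drops those lines.
  oc get pod "$pod" --namespace "$namespace" \
    -o jsonpath='{range .spec.volumes[*]}{.name}{"\t"}{.persistentVolumeClaim.claimName}{"\n"}{end}' |
    awk -F '\t' '$2 != "" { print $1 " -> " $2 }'
}
```

If the function prints nothing, the pod has no PVC-backed volumes and the scripts should be able to reach its logs and javacores on the ephemeral filesystem.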
Notes
  1. By default, the script uses the quay.io/ibm/containerdiag image, which is downloaded from the Red Hat Quay.io registry into your cluster's container registry and executed. Therefore, the first time you run the script, it may appear to pause for a long time after printing the message "To use host binaries, run `chroot /host`"; it is probably downloading the image.
  2. If your cluster does not have internet connectivity to quay.io, you may download the image locally, push the image to your cluster's container registry, and then use the -i option to use your cluster's image (see an example of how this may be done); for example:
    ./containerdiag.sh -i image-registry.openshift-image-registry.svc/ibm/containerdiag -d $DEPLOYMENT libertyperf.sh
  3. If you have any concerns using the quay.io/ibm/containerdiag image, you may build the image yourself using the source Containerfile and then use the -i option to use your custom image.
  4. To target a specific pod, use -p $POD instead of -d $DEPLOYMENT; for example:
    ./containerdiag.sh -p $POD libertyperf.sh
  5. By default, your "current" namespace is used. You may override this with -n $NAMESPACE; for example:
    ./containerdiag.sh -d $DEPLOYMENT -n $NAMESPACE libertyperf.sh
  6. The containerdiag image is built for Linux on the Intel 64-bit (x86_64), POWER 64-bit (ppc64le), IBM z 64-bit (s390x), and ARMv8 64-bit (arm64/aarch64) platforms.
  7. The libertyperf.sh and twasperf.sh scripts support the following options. The interval and span options are in seconds and control the underlying linperf.sh; the -c, -f, and -l options override the javacore, configuration, and logs locations in the container:
    -c Path to javacores (default /output/javacore*)
    -f Configuration directory (default /config)
    -j JAVACORE_INTERVAL
    -l Logs directory (default /logs)
    -m VMSTAT_INTERVAL
    -s SCRIPT_SPAN
    -t TOP_INTERVAL
    -u TOP_DASH_H_INTERVAL
    
    For example, to change the SCRIPT_SPAN to 60 seconds:
    ./containerdiag.sh -d $DEPLOYMENT libertyperf.sh -s 60
  8. This tool also has other capabilities such as:
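    1. Run tcpdump for a number of seconds:
      ./containerdiag.sh -d $DEPLOYMENT -q tcpdump.sh -0 $SECONDS
    2. Run the Linux perf native stack sampler for a number of seconds:
      ./containerdiag.sh -d $DEPLOYMENT -q perf.sh -d $SECONDS
    3. Note that -q is needed in the above examples because otherwise each pod name would be passed as an argument, which would cause the commands to fail. Replace $SECONDS with the desired duration rather than passing it through unchanged; in bash, SECONDS is a built-in variable that holds the number of seconds since the shell started.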
    4. Any other command available in the underlying image.
  9. You may see some output repeated, such as the "Files are ready for download" prompt. This is expected. oc debug may time out after 1 minute of no input or output; if you haven't downloaded the files by then, the pod would be deleted and the files would be gone. Therefore, the script periodically outputs some content so that the pod is not destroyed until you've completed the download.
  10. The debug pod automatically completes once the download is complete, so no further cleanup is required. You may find *debug* pods in the pod list with a status of "Completed" and a "Ready" column value of 0/1. The latter means that 0 containers are running for that pod. This is expected and a general feature of Kubernetes to keep completed pods around in case an administrator wants to look at their logs. They should be automatically deleted by Kubernetes garbage collection when disk usage exceeds certain thresholds, although you may delete them manually if you'd like.
  11. When downloading containerdiag*.tar.gz, you may receive an "EOF" error. Ensure you are running a version of kubectl or oc that matches your cluster version. Try again or try with --retries=-1. If the problem persists, open a support case with your vendor.
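
For the manual cleanup mentioned in note 10, the following sketch lists and deletes leftover debug pods. It is not part of containerdiag.sh: both function names are hypothetical, and it assumes that a pod shown as "Completed" is in the Succeeded phase, that debug pod names contain the string debug (as in worker1-debug above), and that you know which namespace the debug pods were created in.

```shell
#!/bin/sh
# Hypothetical helpers; they are NOT part of containerdiag.sh.

# Lists pods in NAMESPACE that have finished (phase Succeeded, displayed as
# "Completed") and whose name contains "debug". Output is one "pod/NAME"
# entry per line.
# Usage: list_completed_debug_pods NAMESPACE
list_completed_debug_pods() {
  namespace="$1"
  oc get pods --namespace "$namespace" \
    --field-selector=status.phase=Succeeded -o name | grep debug
}

# Deletes every pod reported by list_completed_debug_pods.
# Usage: delete_completed_debug_pods NAMESPACE
delete_completed_debug_pods() {
  namespace="$1"
  list_completed_debug_pods "$namespace" | while read -r pod; do
    oc delete "$pod" --namespace "$namespace"
  done
}
```

Running list_completed_debug_pods first and reviewing its output before calling delete_completed_debug_pods avoids removing an unrelated pod that happens to have debug in its name.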
Version History
  1. April 26, 2023: Handle containerd (e.g. AKS/EKS 1.25) and add -l (logs) and -f (config) directory override options for libertyperf.sh and twasperf.sh
  2. March 1, 2023: Do not assume replicas are active/ready. Diagnostics are still useful (and particularly so) if pods are not in those states.
  3. February 27, 2023: Fix Windows batch script handling of arguments with an asterisk in them
  4. November 8, 2022: Add -c option to libertyperf.sh and twasperf.sh (and new javathreaddumps.sh) to allow overriding the location of javacores in the container
  5. June 27, 2022: First version
This tool is provided as is without any warranty or support. Please report any issues through the GitHub repository and we'll try to resolve any issues as time permits.

Document Location

Worldwide


Document Information

Modified date:
26 April 2023

UID

ibm16594537