Troubleshooting Certified Container Installations
This topic includes a list of troubleshooting techniques for resolving deployment, connectivity, performance, and core product issues on IBM® Sterling B2B Integrator and IBM Sterling File Gateway Certified Container deployments.
Deployment Issues
Gather the following list of information to troubleshoot deployment issues:
- Helm deployment issues:
- Capture the helm install output in verbose mode by running the command:
helm install <release-name> <path/to/charts/directory> --values </path/to/values-override.yaml> --debug --timeout <timeout value>
- Capture the template manifests generated without installing the helm charts by running the command:
helm install <release-name> <path/to/charts/directory> --values </path/to/values-override.yaml> --dry-run --timeout <timeout value>
You can combine both the --debug and --dry-run flags in the same command and provide the output and manifest files generated.
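For example, the following combined command (the output file name is only illustrative) captures both the debug output and the rendered manifests in a single run:
helm install <release-name> <path/to/charts/directory> --values </path/to/values-override.yaml> --debug --dry-run --timeout <timeout value> > helm-dry-run-output.txt 2>&1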
- Pods not getting scheduled:
- Capture details for the pod stuck in pending state by running the command:
[oc/kubectl] describe pod <pod-name> -n <namespace>
Note: Ensure there are enough resources (CPU/memory) on the worker nodes to allocate to the pod. For example, check the pod events and look for messages such as 0/3 nodes are available: Insufficient cpu.
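To check the allocatable and currently allocated capacity on the worker nodes, a command along the following lines can help (the grep context size is illustrative):
[oc/kubectl] describe nodes | grep -A 8 "Allocated resources"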
- Pods constantly restart:
- Check the pod console logs for any application start-up errors causing the pod start-up to fail. If there are application failures, check if the errors are due to any misconfiguration or database connectivity issues.
- If the ASI pod restarts during the start-up and the pod events show start-up probe failures, consider increasing the asi.startupProbe.initialDelaySeconds and asi.startupProbe.failureThreshold values in the helm values overrides file. This allows the application enough time to install the customizations and start the server.
- If the AC or API pods constantly restart during start-up and the pod events show liveness probe failures, consider increasing the ac/api.livenessProbe.initialDelaySeconds or ac/api.livenessProbe.periodSeconds values in the helm values overrides file. This allows the application enough time to start the servers.
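As an illustration only (the release details and numeric values are placeholders, not recommended settings), the probe values can be raised either in the values override file or directly on the command line during an upgrade:
helm upgrade <release-name> <path/to/charts/directory> --reuse-values --set asi.startupProbe.initialDelaySeconds=240 --set asi.startupProbe.failureThreshold=60 --set ac.livenessProbe.initialDelaySeconds=120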
- Capture details for the pod constantly restarting by running the command:
[oc/kubectl] describe pod <pod-name> -n <namespace>
- Capture logs for the pod that is constantly restarting by running the command:
[oc/kubectl] logs <pod-name> -n <namespace>
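If the current container has already restarted, the previous container instance's logs can also be captured (assuming they are still available on the node):
[oc/kubectl] logs <pod-name> --previous -n <namespace>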
- Pods take too long to get ready:
- Check the connection to the database and its latency.
- Check and provide the details of the customizations deployed (custom jars, custom services, user exits, etc.).
- Check the start-up probe, liveness probe, and readiness probe time intervals in the helm values override file and tune the initialDelaySeconds and periodSeconds values to match the approximate application server start-up time, as observed in the pod console logs.
Tip: Configuring unnecessarily large values for initial delays and period intervals can delay the pod from getting into the ready state.
- If the application is still taking time to start, capture the pod console logs with timestamps by running the following command:
[oc/kubectl] logs <pod-name> --timestamps -n <namespace>
- Also, generate about 3 thread dumps at an interval of 10 seconds each. To generate the java core or heap dump files, see Generating Java Core Dumps.
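As a rough sketch only (the Java process ID is a placeholder, and kill -3 assumes an IBM Semeru/OpenJ9 JVM where SIGQUIT writes a javacore to the working directory; follow Generating Java Core Dumps for the supported procedure), the thread dumps could be triggered from outside the pod as follows:
for i in 1 2 3; do
  [oc/kubectl] exec <pod-name> -n <namespace> -- kill -3 <java-pid>
  sleep 10
done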
- Pods getting evicted:
- Capture the pod metrics for the time span during which it was evicted.
- Check the memory usage and verify if it breached the configured limit for memory in the helm values override file.
- Similarly, check the filesystem usage and verify if it breached the configured limit for ephemeral storage in the helm values override file.
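For example, the eviction events and the recorded eviction reason (memory versus ephemeral storage) can typically be retrieved with commands such as:
[oc/kubectl] get events -n <namespace> --field-selector reason=Evicted
[oc/kubectl] describe pod <evicted-pod-name> -n <namespace>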
Core Product Issues
Gather the following list of information to troubleshoot core product issues:
- Capture product logs:
- If the application logs are written to a mapped volume, go to the mapped volume location, then zip and copy the logs (a sample command sequence is shown below).
- If the application logs are pushed to the console (logs.enableAppLogOnConsole is enabled in the helm configurations), extract the logs from the configured logging stack, for example, OpenShift logging or an EFK logging stack. Refer to the logging stack documentation for details on the various ways to extract the relevant logs for a given date or time range.
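For a volume-mapped deployment, one possible way to pull the logs off the pod (the log directory path is a placeholder for the actual mapped location) is:
[oc/kubectl] exec <pod-name> -n <namespace> -- tar czf /tmp/app-logs.tar.gz <mapped-log-directory>
[oc/kubectl] cp <namespace>/<pod-name>:/tmp/app-logs.tar.gz ./app-logs.tar.gz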
- Capture product configurations:
- Obtain customer_override configuration properties from the customization UI.
- Obtain all configurations specific to the issue, such as, Business Process configurations, Adapter configurations, and so on.
- Capture java core and heap dumps:
- To analyze the issue or if the issue is performance related, collect the java core and heap dumps. To generate the java core or heap dump files, see Generating Java Core Dumps.
Network/Connectivity issues
Gather the following list of information to troubleshoot connectivity issues:
- Identify and verify the list of services and port configurations:
[oc/kubectl] get services -n <namespace> -o wide
- Identify and verify the list of ingress/routes along with the host names, service ports, and TLS configurations:
[oc/kubectl] get ingress -n <namespace> -o wide
oc get routes -n <namespace> -o wide
Note: For the dashboard route configured for re-encrypt termination, verify that the destinationCACertificate on the route matches the ASISSLCert or the certificate patched into the secret configured against asi.internalAccess.tlsSecretName in the helm values overrides file.
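One way to inspect that certificate on the route (the route name is a placeholder) is:
oc get route <dashboard-route-name> -n <namespace> -o jsonpath='{.spec.tls.destinationCACertificate}'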
- Identify and verify the list of network policies to check if any policy is blocking traffic on an ingress or egress endpoint:
[oc/kubectl] get networkpolicy -n <namespace> -o wide
- Identify and verify the external load balancer and/or ingress controller configurations.
- Execute into the pod and run basic connectivity tests to check network connectivity to the database or other affected services. To do this:
- Run the following command to open an interactive shell in the pod (the shell can also be opened from the OpenShift web console pod terminal):
[oc/kubectl] exec -it <pod-name> -n <namespace> -- bash
- Once inside the pod, capture the output of basic networking tests using the ping, netstat, and curl tools:
ping <destination address>
netstat -aneo
curl -v <protocol>://<host>:<port>
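For example, a plain TCP reachability check against the database (the host and port are placeholders for the actual database endpoint) can be done with curl's telnet scheme:
curl -v telnet://<database-host>:<database-port>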
Performance Issues
Gather the following list of information to troubleshoot performance issues:
- CPU and Memory:
- Total CPU and memory available on the worker nodes.
- CPU and memory assigned to the pods and the number of application pods.
- Obtain the pod CPU and memory usage metrics from the OpenShift web console or any other basic monitoring tool configured on the cluster. The metrics should be captured for the duration of the current issue.
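If a full monitoring stack is not readily available, a metrics-server based snapshot (assuming the metrics API is enabled on the cluster; on OpenShift the equivalents are oc adm top nodes and oc adm top pods) gives a quick view of current usage:
kubectl top nodes
kubectl top pod -n <namespace> --containers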
- Provide storage details (type and size). Cloud storage is usually controlled by IOPS/GB.
- Type of worker nodes. For example, AWS has several node classes and some limit the network bandwidth.
- Database type and size. Identify if it is an enterprise-class cloud service. Ensure the recommended database settings are enabled. For more information, see Database Management.
- Standard Sterling B2B Integrator tuning. Capture the tuning.properties applied to the ASI/AC pods along with snapshots of the queue watcher for queue depth and active queue threads. Also, capture snapshots of the database statistics from the troubleshooting section on the dashboard.
- Capture java core and heap dumps to analyze the performance related issue. To generate the java core or heap dump files, see Generating Java Core Dumps.