IBM Support

Intermittent 503 error and pod restart issue

Troubleshooting


Problem

 

The customer reported observing intermittent 503 (Service Unavailable) responses from services registered on the mesh. Multiple pod restarts were observed for the consul-controller pod and the consul-connect-injector-webhook-deployment pods.

Versions in use

  • Ambassador version: datawire/aes:2.1.2
  • Consul version: consul-enterprise:1.10.6-ent
  • Controller image: consul-k8s-control-plane:0.38.0
  • Helm chart: v0.38.0

  • The issue can occur with other versions as well.

Cause

  • Check the controller log for a message like the following:
    2022-02-19T16:47:05.397Z INFO Waited for 1.047657907s due to client-side throttling, not priority and fairness, request: GET:https://172.xx.x.x:443/apis/storage.k8s.io/v1?timeout=32s

    This client-side throttling message indicates that the controller's requests to the Kubernetes API server are being delayed, which points to a resource or timeout issue in reaching the API server.

  • The kubectl logs consul-controller-xxxxxx --previous command shows the error message: problem running manager {"error": "leader election lost"}. In the stack trace below, a context deadline exceeded error was observed.

2022-02-20T22:48:21.168Z ERROR error retrieving resource lock service-mesh/consul.hashicorp.com: Get "https://172.xx.x.x:443/api/v1/namespaces/service-mesh/configmaps/consul.hashicorp.com": context deadline exceeded

k8s.io/client-go/tools/leaderelection.(*LeaderElector).renew.func1.1
/go/pkg/mod/k8s.io/client-go@v0.22.2/tools/leaderelection/leaderelection.go:272
k8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1
/go/pkg/mod/k8s.io/apimachinery@v0.22.2/pkg/util/wait/wait.go:217
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/go/pkg/mod/k8s.io/apimachinery@v0.22.2/pkg/util/wait/wait.go:230
k8s.io/apimachinery/pkg/util/wait.poll
/go/pkg/mod/k8s.io/apimachinery@v0.22.2/pkg/util/wait/wait.go:577
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/go/pkg/mod/k8s.io/apimachinery@v0.22.2/pkg/util/wait/wait.go:542
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntil
/go/pkg/mod/k8s.io/apimachinery@v0.22.2/pkg/util/wait/wait.go:533
k8s.io/client-go/tools/leaderelection.(*LeaderElector).renew.func1
/go/pkg/mod/k8s.io/client-go@v0.22.2/tools/leaderelection/leaderelection.go:271
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
/go/pkg/mod/k8s.io/apimachinery@v0.22.2/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
/go/pkg/mod/k8s.io/apimachinery@v0.22.2/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/go/pkg/mod/k8s.io/apimachinery@v0.22.2/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
/go/pkg/mod/k8s.io/apimachinery@v0.22.2/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/leaderelection.(*LeaderElector).renew
/go/pkg/mod/k8s.io/client-go@v0.22.2/tools/leaderelection/leaderelection.go:268
k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
/go/pkg/mod/k8s.io/client-go@v0.22.2/tools/leaderelection/leaderelection.go:212
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).startLeaderElection.func3
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/manager/internal.go:681
2022-02-20T22:48:21.168Z INFO failed to renew lease service-mesh/consul.hashicorp.com: timed out waiting for the condition

2022-02-20T22:48:21.168Z ERROR setup problem running manager {"error": "leader election lost"}

This context deadline exceeded message means that the controller was unable to reach the Kubernetes API server and receive a response in time.

Possible causes of this error include any of the following:

  • Resource Contention
  • Slow I/O
  • Network Latency
  • Firewall Rules / Cloud Security Rules

The controller pod restarts because, when it loses leader election, the manager process exits and Kubernetes restarts the pod. Restarting is effectively a retry while the underlying Kubernetes infrastructure is unable to serve the controller's requests in time.
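As a quick triage step, controller logs can be scanned for the signatures discussed above. A minimal sketch (the grep pattern and the sample heredoc below are illustrative; in a real cluster, pipe in kubectl logs consul-controller-xxxxxx -n service-mesh --previous instead):

```shell
#!/bin/sh
# Sketch: count log lines matching the failure signatures from this article.
scan_logs() {
  grep -E -c 'client-side throttling|context deadline exceeded|leader election lost'
}

# Sample lines taken from the log excerpts above; replace with real output, e.g.:
#   kubectl logs consul-controller-xxxxxx -n service-mesh --previous | scan_logs
scan_logs <<'EOF'
2022-02-19T16:47:05.397Z INFO Waited for 1.047657907s due to client-side throttling, not priority and fairness, request: GET:https://172.xx.x.x:443/apis/storage.k8s.io/v1?timeout=32s
2022-02-20T22:48:21.168Z ERROR error retrieving resource lock service-mesh/consul.hashicorp.com: context deadline exceeded
2022-02-20T22:48:21.168Z ERROR setup problem running manager {"error": "leader election lost"}
EOF
```

A nonzero count on the previous container's logs suggests the controller is struggling to reach the API server rather than failing for an application-level reason.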

 

Solutions

  • Check system monitoring for resource contention on CPU, memory, or disk I/O. If contention is found, increasing the constrained resource should help.

  • If resources are adequate, check network latency between the controller pod and the Kubernetes API server, including any firewall or cloud security rules in the path.
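One way to check for CPU contention is to filter kubectl top nodes output for nodes above a usage threshold. A minimal sketch, assuming metrics-server is installed; the 80% threshold and the sample output below are illustrative assumptions:

```shell
#!/bin/sh
# Sketch: flag nodes whose CPU% exceeds a threshold (assumed 80%).
flag_hot_nodes() {
  # In "kubectl top nodes" output, column 3 is CPU%; strip "%" and compare.
  awk -v limit=80 'NR > 1 { pct = $3; sub(/%/, "", pct); if (pct + 0 > limit) print $1 }'
}

# Illustrative sample resembling "kubectl top nodes" output; in a real
# cluster, run:  kubectl top nodes | flag_hot_nodes
flag_hot_nodes <<'EOF'
NAME      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
worker-1  3800m        95%    6500Mi          81%
worker-2  900m         22%    3100Mi          39%
EOF
```

Any node printed by this filter is a candidate for the CPU contention described above; the same awk pattern can be adapted for the memory column.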

Document Location

Worldwide


Historical Number

5193064804243

Document Information

Modified date:
16 March 2026

UID

ibm17264128