Troubleshooting

This guide provides solutions to common issues encountered during installation, configuration, and operation of HashiCorp Vault on OpenShift Container Platform on the s390x architecture.

HashiCorp Vault on OpenShift (s390x)

Helm installation fails with "repository not found"

Symptoms:


Error: failed to download "hashicorp/vault"

Solution:

  1. Verify the Helm repository is added correctly:
    
    helm repo list
    
  2. If not present, add the repository:
    
    helm repo add hashicorp https://helm.releases.hashicorp.com
    helm repo update
    
  3. Retry the installation.
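
The repository check and add steps above can be combined into one idempotent helper. This is a minimal sketch: the function name ensure_hashicorp_repo is hypothetical, and the repository name and URL are the ones used in this guide.

```shell
# Add the HashiCorp Helm repository only if it is not already configured.
# ensure_hashicorp_repo is a hypothetical helper name, not part of Helm.
ensure_hashicorp_repo() {
  # `helm repo list` prints a header line, then one "NAME URL" row per repository.
  if helm repo list 2>/dev/null | awk '{print $1}' | grep -qx 'hashicorp'; then
    echo "hashicorp repository already configured"
  else
    helm repo add hashicorp https://helm.releases.hashicorp.com || return 1
  fi
  # Refresh the local chart index in both cases.
  helm repo update
}

# Usage:
#   ensure_hashicorp_repo
```

Running the helper twice is harmless, because the second run only refreshes the chart index.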

Image pull errors for s390x architecture

Symptoms:


Failed to pull image "icr.io/ibm-vault-for-z:1.19.12-ent": rpc error: code = Unknown desc = Error reading manifest

Solution:

  1. Inspect the pod status and events to confirm the exact pull error:
    
    oc get pods -n vault
    oc describe pod vault-0 -n vault
    
  2. If the registry requires authentication, create an image pull secret:
    
    oc create secret docker-registry ibm-registry \
      --docker-server=icr.io \
      --docker-username=<username> \
      --docker-password=<password> \
      -n vault
    
  3. Update values.yml to reference the secret:
    
    global:
      imagePullSecrets:
        - name: ibm-registry
    server:
      image:
        repository: "icr.io/ibm-vault-for-z"
        tag: "1.19.12-ent"
        pullPolicy: IfNotPresent
    
  4. Upgrade the Helm release:
    
    helm upgrade vault hashicorp/vault -n vault -f values.yml
    

Namespace does not exist

Symptoms:


Error: namespaces "vault" not found

Solution: Create the namespace before installation:


oc new-project vault

Pods stuck in CrashLoopBackOff

Symptoms:


NAME      READY   STATUS             RESTARTS
vault-0   0/1     CrashLoopBackOff   5

Solution:

  1. Check pod logs for specific errors:
    
    oc logs vault-0 -n vault
    
  2. Common causes and fixes:
    • License issue: Verify the license secret exists and is correctly referenced
    • Storage issue: Check PVC status and storage class availability
    • Configuration error: Review the Raft configuration in values.yml
  3. Verify the license secret:
    
    oc get secret vault-ent-license -n vault
    oc describe secret vault-ent-license -n vault
    

Readiness probe failures

Symptoms:


Readiness probe failed: HTTP probe failed with statuscode: 503

Solution:

  1. This is expected behavior before initialization. Verify the readiness probe configuration in values.yml:
    
    readinessProbe:
      enabled: true
      path: "/v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204"
    
  2. The probe should accept 204 status codes for sealed/uninitialized states.
  3. If pods remain not ready after initialization and unsealing, check:
    
    oc exec -n vault vault-0 -- vault status
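
When reading probe failures, it can help to translate the HTTP status code returned by /v1/sys/health into a Vault state. The sketch below uses Vault's default status codes plus the 204 override from the probe path above; the function name describe_health_code is hypothetical.

```shell
# Translate an HTTP status code from /v1/sys/health into a Vault state.
# 204 only appears when the request overrides sealedcode/uninitcode,
# as the readiness probe path in this guide does.
describe_health_code() {
  case "$1" in
    200) echo "initialized, unsealed, active" ;;
    204) echo "sealed or uninitialized (probe override)" ;;
    429) echo "unsealed, standby" ;;
    472) echo "disaster recovery secondary" ;;
    473) echo "performance standby" ;;
    501) echo "not initialized" ;;
    503) echo "sealed" ;;
    *)   echo "unexpected status code: $1" ;;
  esac
}

# Usage (assumes curl is available in the container image):
#   code=$(oc exec -n vault vault-0 -- curl -s -o /dev/null -w '%{http_code}' \
#     http://localhost:8200/v1/sys/health)
#   describe_health_code "$code"
```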
    

Liveness probe failures after deployment

Symptoms:


Liveness probe failed: Get "http://vault-0:8200/v1/sys/health": context deadline exceeded

Solution:

  1. Increase the initialDelaySeconds in the liveness probe:
    
    livenessProbe:
      enabled: true
      path: "/v1/sys/health?standbyok=true"
      initialDelaySeconds: 120
      periodSeconds: 10
      timeoutSeconds: 5
    
  2. Update the deployment:
    
    helm upgrade vault hashicorp/vault -n vault -f values.yml
    

"Vault is already initialized" error

Symptoms:


Error initializing: Error making API request.
Code: 400. Errors:
* Vault is already initialized

Solution: This is expected if Vault has already been initialized. To verify:


oc exec -n vault vault-0 -- vault status

If the output shows Initialized as true, proceed to unsealing. Reinitializing destroys all existing data; if you must do so:

  1. Delete the PVCs and redeploy (this will erase all data)
  2. Or restore from a backup if available

Initialization hangs or times out

Symptoms: The vault operator init command hangs without output.

Solution:

  1. Check if the pod can reach the Vault API:
    
    oc exec -n vault vault-0 -- curl -s http://localhost:8200/v1/sys/health
    
  2. Verify the listener configuration in values.yml:
    
    listener "tcp" {
      tls_disable = 1
      address = "[::]:8200"
      cluster_address = "[::]:8201"
    }
    
  3. Check for port conflicts or firewall rules blocking port 8200.

"Key not found" when unsealing

Symptoms:


Error unsealing: Error making API request.
Code: 400. Errors:
* invalid key

Solution:

  1. Verify you're using the correct unseal keys from the initialization output.
  2. Ensure you're not mixing keys from different Vault installations.
  3. Keys are case-sensitive; verify there are no copy-paste errors.

Vault reseals after pod restart

Symptoms: After a pod restart, Vault shows sealed: true again.

Solution: This is expected behavior. With the default Shamir seal, Vault must be manually unsealed after each restart. Options:

  1. Manual unsealing (as documented):
    
    oc exec -n vault vault-0 -- vault operator unseal <KEY_1>
    oc exec -n vault vault-0 -- vault operator unseal <KEY_2>
    oc exec -n vault vault-0 -- vault operator unseal <KEY_3>
    
  2. Auto-unseal (recommended for production):

    Configure auto-unseal using a KMS provider. Update values.yml with seal configuration:

    
    server:
      ha:
        raft:
          config: |
            seal "awskms" {
              region     = "us-east-1"
              kms_key_id = "your-kms-key-id"
            }
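
For the manual option, the per-pod unseal commands can be wrapped in a loop. This is a minimal sketch, assuming the three pod names and the vault namespace used in this guide; unseal_all_pods is a hypothetical helper name. As with the manual commands, keys passed as arguments may end up in shell history.

```shell
# Submit each unseal key to every Vault pod in turn.
# Arguments: the unseal keys to apply (as many as the unseal threshold).
# unseal_all_pods is a hypothetical helper; pod names follow this guide.
unseal_all_pods() {
  for pod in vault-0 vault-1 vault-2; do
    echo "=== $pod ==="
    for key in "$@"; do
      # Stop at the first failure so a bad key is noticed immediately.
      oc exec -n vault "$pod" -- vault operator unseal "$key" >/dev/null || return 1
    done
  done
}

# Usage:
#   unseal_all_pods <KEY_1> <KEY_2> <KEY_3>
```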
    

Cannot unseal standby nodes

Symptoms: Unsealing vault-1 or vault-2 fails or shows errors.

Solution:

  1. Ensure vault-0 (leader) is unsealed first.
  2. Verify Raft cluster connectivity:
    
    oc exec -n vault vault-0 -- vault operator raft list-peers
    
  3. Check the internal service DNS resolution:
    
    oc exec -n vault vault-1 -- nslookup vault-0.vault-internal
    

Raft cluster not forming

Symptoms:


HA Enabled: true
HA Cluster: n/a

Solution:

  1. Verify the retry_join configuration in values.yml:
    
    retry_join {
      leader_api_addr = "http://vault-0.vault-internal:8200"
    }
    retry_join {
      leader_api_addr = "http://vault-1.vault-internal:8200"
    }
    retry_join {
      leader_api_addr = "http://vault-2.vault-internal:8200"
    }
    
  2. Check the headless service exists:
    
    oc get svc vault-internal -n vault
    
  3. Verify pod-to-pod connectivity:
    
    oc exec -n vault vault-1 -- curl -s http://vault-0.vault-internal:8200/v1/sys/health
    
  4. Check Raft logs:
    
    oc logs vault-0 -n vault | grep -i raft
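
If the replica count changes, the retry_join stanzas must change with it. The following sketch generates one stanza per replica so the output can be compared against values.yml; gen_retry_join is a hypothetical helper, and the vault-N.vault-internal names are the ones used in this guide.

```shell
# Print one retry_join stanza per replica (default: 3 replicas).
# gen_retry_join is a hypothetical helper; hostnames follow this guide.
gen_retry_join() {
  replicas="${1:-3}"
  i=0
  while [ "$i" -lt "$replicas" ]; do
    printf 'retry_join {\n  leader_api_addr = "http://vault-%s.vault-internal:8200"\n}\n' "$i"
    i=$((i + 1))
  done
}

# Usage:
#   gen_retry_join 3
```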
    

Split-brain scenario

Symptoms: Multiple pods claim to be the active leader.

Solution:

  1. Check the Raft peer list on each pod:
    
    oc exec -n vault vault-0 -- vault operator raft list-peers
    
  2. If inconsistent, remove and re-add problematic peers:
    
    oc exec -n vault vault-0 -- vault operator raft remove-peer vault-2
    
  3. Restart the removed pod to rejoin:
    
    oc delete pod vault-2 -n vault
    

Leader election failures

Symptoms:


No active leader found

Solution:

  1. Ensure at least 2 out of 3 pods are running and unsealed (quorum requirement).
  2. Check network policies aren't blocking cluster communication on port 8201:
    
    oc get networkpolicies -n vault
    
  3. Verify the cluster_address is correctly configured:
    
    listener "tcp" {
      address = "[::]:8200"
      cluster_address = "[::]:8201"
    }
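
The quorum requirement in step 1 follows from simple arithmetic: a Raft cluster of N voters needs a majority, floor(N/2) + 1, to elect a leader. A small sketch with hypothetical helper names:

```shell
# Number of voters that must be available to elect a leader.
raft_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of voters that can fail while the cluster keeps a quorum.
raft_failure_tolerance() {
  echo $(( ($1 - 1) / 2 ))
}

# Usage:
#   raft_quorum 3              # prints 2
#   raft_failure_tolerance 3   # prints 1
```

For the three-pod deployment in this guide, losing two pods at once leaves the cluster without a leader until at least one of them is running and unsealed again.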
    

PVC not binding

Symptoms:


persistentvolumeclaim "data-vault-0" is pending

Solution:

  1. Check PVC status:
    
    oc get pvc -n vault
    oc describe pvc data-vault-0 -n vault
    
  2. Verify the storage class exists and is available:
    
    oc get storageclass
    
  3. If using ocs-storagecluster-ceph-rbd, ensure ODF is properly installed:
    
    oc get pods -n openshift-storage
    
  4. Check for sufficient storage capacity in the cluster.

Raft data corruption

Symptoms:


Error: failed to open raft logs: corruption detected

Solution:

  1. If you have backups, restore from the most recent snapshot:
    
    oc exec -n vault vault-0 -- vault operator raft snapshot restore /path/to/snapshot
    
  2. If no backup is available, you may need to reinitialize, which loses all data:
    • Delete all PVCs.
    • Redeploy Vault.
    • Reinitialize and reconfigure.
  3. Prevention: Set up regular snapshots:
    
    oc exec -n vault vault-0 -- vault operator raft snapshot save /vault/audit/snapshot-$(date +%Y%m%d).snap
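
For scheduled snapshots, a wrapper that reports the snapshot path and fails loudly is easier to automate. This is a sketch under the same assumptions as the command above (the vault-0 pod, the /vault/audit directory, and a Vault token already available in the pod); save_snapshot is a hypothetical helper name.

```shell
# Save a Raft snapshot inside the pod under a timestamped name and print its path.
# save_snapshot is a hypothetical helper; the target directory follows this guide.
save_snapshot() {
  path="/vault/audit/snapshot-$(date +%Y%m%d-%H%M%S).snap"
  oc exec -n vault vault-0 -- vault operator raft snapshot save "$path" >/dev/null || return 1
  echo "$path"
}

# Usage:
#   snap=$(save_snapshot) && echo "saved $snap"
```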
    

Insufficient disk space

Symptoms:


Error: no space left on device

Solution:

  1. Check PVC size in values.yml:
    
    dataStorage:
      enabled: true
      size: 100Mi  # Increase this value
    
  2. For existing deployments, you may need to:
    • Create a snapshot.
    • Delete and recreate PVCs with larger size.
    • Restore from snapshot.
  3. Monitor disk usage:
    
    oc exec -n vault vault-0 -- df -h /vault/data
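
To alert before the volume fills up, the usage percentage can be extracted from df output and compared with a threshold. A minimal sketch, assuming the df in the container supports the POSIX -P flag; the helper names and the default threshold of 80 are hypothetical.

```shell
# Print the Capacity percentage (without the % sign) from `df -P` output on stdin.
disk_usage_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when usage read from stdin is at or above the threshold (default 80).
disk_usage_high() {
  pct=$(disk_usage_pct)
  [ "$pct" -ge "${1:-80}" ]
}

# Usage:
#   oc exec -n vault vault-0 -- df -P /vault/data | disk_usage_high 80 \
#     && echo "vault-0 data volume is above 80% full"
```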
    

Certificate verification failures

Symptoms:


Error: x509: certificate signed by unknown authority

Solution:

  1. Verify the CA certificate is included in the TLS secret:
    
    oc get secret vault-tls -n vault -o yaml
    
  2. Ensure the ca.crt is specified in the Raft configuration:
    
    retry_join {
      leader_api_addr = "https://vault-0.vault-internal:8200"
      leader_ca_cert_file = "/vault/tls/ca.crt"
    }
    
  3. Verify certificate paths are correctly mounted:
    
    oc exec -n vault vault-0 -- ls -la /vault/tls/
    

Certificate expired

Symptoms:


Error: x509: certificate has expired

Solution:

  1. Generate new certificates with appropriate validity period.
  2. Update the secret:
    
    oc delete secret vault-tls -n vault
    oc create secret generic vault-tls \
      --from-file=vault.crt=/path/to/new/vault.crt \
      --from-file=vault.key=/path/to/new/vault.key \
      --from-file=ca.crt=/path/to/new/ca.crt \
      -n vault
    
  3. Restart Vault pods:
    
    oc rollout restart statefulset vault -n vault
    

TLS handshake failures

Symptoms:


Error: tls: handshake failure

Solution:

  1. Verify TLS is properly configured in the listener:
    
    listener "tcp" {
      tls_disable = 0
      tls_cert_file = "/vault/tls/vault.crt"
      tls_key_file = "/vault/tls/vault.key"
      tls_client_ca_file = "/vault/tls/ca.crt"
    }
    
  2. Check certificate permissions:
    
    oc exec -n vault vault-0 -- ls -la /vault/tls/
    

    Files should be readable (mode 0400 or 0444).

  3. Verify certificate and key match:
    
    oc exec -n vault vault-0 -- openssl x509 -noout -modulus -in /vault/tls/vault.crt | openssl md5
    oc exec -n vault vault-0 -- openssl rsa -noout -modulus -in /vault/tls/vault.key | openssl md5
    

    The two MD5 hashes should match. This check applies to RSA keys only.

License secret not found

Symptoms:


Error: secret "vault-ent-license" not found

Solution:

  1. Create the license secret:
    
    oc create secret generic vault-ent-license \
      --from-file=license=/path/to/vault.hclic \
      -n vault
    
  2. Verify the secret exists:
    
    oc get secret vault-ent-license -n vault
    
  3. Ensure the secret name matches the configuration in values.yml:
    
    enterpriseLicense:
      secretName: "vault-ent-license"
      secretKey: "license"
    

Invalid or expired license

Symptoms:


Error: license is invalid or expired

Solution:

  1. Verify license validity:
    
    oc exec -n vault vault-0 -- vault read sys/license/status
    
  2. Contact HashiCorp support to obtain a valid license.
  3. Update the license secret:
    
    oc delete secret vault-ent-license -n vault
    oc create secret generic vault-ent-license \
      --from-file=license=/path/to/new/vault.hclic \
      -n vault
    
  4. Restart Vault pods:
    
    oc rollout restart statefulset vault -n vault
    

Cannot access Vault UI

Symptoms: Cannot reach the Vault UI through the OpenShift route.

Solution:

  1. Verify the UI is enabled in values.yml:
    
    ui:
      enabled: true
    
  2. Check if a route exists:
    
    oc get route -n vault
    
  3. If no route exists, create one:
    
    oc create route edge vault-ui \
      --service=vault \
      --port=8200 \
      -n vault
    
  4. Get the route URL:
    
    oc get route vault-ui -n vault -o jsonpath='{.spec.host}'
    

403 permission denied errors

Symptoms:


Error: permission denied

Solution:

  1. Ensure you're authenticated with a valid token:
    
    oc exec -n vault vault-0 -- vault login <ROOT_TOKEN>
    
  2. Check token capabilities:
    
    oc exec -n vault vault-0 -- vault token capabilities <TOKEN> <PATH>
    
  3. Review and update policies as needed:
    
    oc exec -n vault vault-0 -- vault policy list
    oc exec -n vault vault-0 -- vault policy read <POLICY_NAME>
    

DNS resolution failures

Symptoms:


Error: no such host

Solution:

  1. Verify the internal service exists:
    
    oc get svc vault-internal -n vault
    
  2. Test DNS resolution from within a pod:
    
    oc exec -n vault vault-0 -- nslookup vault-internal
    
  3. Check CoreDNS is functioning:
    
    oc get pods -n openshift-dns
    

Slow response times

Symptoms: API requests take longer than expected to complete.

Solution:

  1. Check pod resource usage:
    
    oc adm top pods -n vault
    
  2. Increase resource limits in values.yml:
    
    server:
      resources:
        requests:
          memory: "256Mi"
          cpu: "250m"
        limits:
          memory: "512Mi"
          cpu: "500m"
    
  3. Check Raft cluster membership and leader status:
    
    oc exec -n vault vault-0 -- vault operator raft list-peers
    
  4. Check storage performance (IOPS, latency).

High memory usage

Symptoms: Vault pods consuming excessive memory, potentially leading to OOMKilled status.

Solution:

  1. Check current memory usage:
    
    oc adm top pod vault-0 -n vault
    
  2. Increase memory limits:
    
    server:
      resources:
        limits:
          memory: "1Gi"
    
  3. Review audit log configuration; excessive logging can increase memory usage.
  4. Consider writing audit logs to a file on the audit volume, where an external tool can rotate them:
    
    oc exec -n vault vault-0 -- vault audit enable file file_path=/vault/audit/audit.log
    

General diagnostics


# Check all Vault resources
oc get all -n vault

# View detailed pod information
oc describe pod vault-0 -n vault

# Check pod events
oc get events -n vault --sort-by='.lastTimestamp'

# View Vault logs
oc logs vault-0 -n vault --tail=100

# Check Vault status from all pods
for i in 0 1 2; do
  echo "=== vault-$i ==="
  oc exec -n vault vault-$i -- vault status
done

# Check Raft cluster health
oc exec -n vault vault-0 -- vault operator raft list-peers

# View Vault configuration
oc exec -n vault vault-0 -- cat /vault/config/extraconfig-from-values.hcl

Network diagnostics


# Test connectivity between pods
oc exec -n vault vault-1 -- curl -v http://vault-0.vault-internal:8200/v1/sys/health

# Check service endpoints
oc get endpoints vault-internal -n vault

# Verify network policies
oc get networkpolicies -n vault

Storage diagnostics


# Check PVC status
oc get pvc -n vault

# View PVC details
oc describe pvc data-vault-0 -n vault

# Check disk usage
oc exec -n vault vault-0 -- df -h

# List Raft snapshots
oc exec -n vault vault-0 -- ls -lh /vault/audit/

Getting help

If you continue to experience issues after following this troubleshooting guide:

  1. Check Vault Logs: Detailed error messages are often found in the pod logs.
    
    oc logs vault-0 -n vault --tail=200
    
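  2. Review HashiCorp Documentation: Consult the official Vault documentation for the feature involved.
  3. Community Support: Ask the HashiCorp community for help.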
  4. Enterprise Support: If you have a Vault Enterprise license, contact HashiCorp support with:
    • Vault version.
    • OpenShift version.
    • Architecture (s390x).
    • Detailed error messages.
    • Relevant logs and configuration.

Best practices to avoid common issues

  1. Always backup before making changes:
    
    oc exec -n vault vault-0 -- vault operator raft snapshot save /vault/audit/backup.snap
    
  2. Monitor Vault health regularly:
    • Set up monitoring and alerting.
    • Check seal status.
    • Monitor Raft cluster health.
  3. Keep unseal keys secure:
    • Store in a secure location (e.g., password manager, HSM).
    • Never commit to version control.
    • Consider using auto-unseal for production.
  4. Test disaster recovery procedures:
    • Practice unsealing.
    • Test snapshot restore.
    • Document recovery procedures.
  5. Use TLS in production:
    • Always enable TLS for production deployments.
    • Use valid certificates from a trusted CA.
    • Regularly rotate certificates before expiration.
  6. Plan for capacity:
    • Monitor storage usage.
    • Plan for growth.
    • Set appropriate resource limits.
  7. Keep Vault updated:
    • Review release notes.
    • Test updates in non-production first.
    • Follow HashiCorp's upgrade guides.

Troubleshooting - Vault Secrets Operator

VSO controller not starting

Check the controller logs:


oc logs -n vault-operator deployment/vault-secrets-operator-controller-manager -c manager

VaultAuth authentication failures

Verify the AppRole credentials:


oc describe vaultauth vault-auth -n vault-operator

Check if the secret exists and contains the correct Secret ID:


oc get secret approle-secret -n vault-operator -o yaml

Secret not syncing

Check the VaultStaticSecret status:


oc describe vaultstaticsecret vault-static-secret-v2 -n vault-operator

Verify the path exists in Vault:


oc exec -it vault-0 -n vault -- vault kv get kvv2/myapp/config

Network connectivity issues

Test connectivity from the VSO pod to Vault:


oc exec -it deployment/vault-secrets-operator-controller-manager -n vault-operator -c manager -- curl http://vault.vault.svc.cluster.local:8200/v1/sys/health

Troubleshooting Kubernetes authentication

Authentication failures

Check VaultAuth status:


oc describe vaultauth vault-auth -n vault-operator

Common issues:

  • ServiceAccount not found:
    
    oc get sa vault-secrets-operator -n vault-operator
    
  • Role not configured correctly:
    
    oc exec -it vault-0 -n vault -- vault read auth/kubernetes/role/vso-role
    
  • Kubernetes API connectivity:
    
    oc exec -it vault-0 -n vault -- vault read auth/kubernetes/config
    

Token issues

Check if VSO can obtain tokens:


# View VSO controller logs
oc logs -n vault-operator deployment/vault-secrets-operator-controller-manager -c manager --tail=100

Look for authentication-related errors:


Error authenticating to Vault: permission denied
Error: namespace not authorized

Verify ServiceAccount token

Check if the ServiceAccount has a valid token:


oc get serviceaccount vault-secrets-operator -n vault-operator -o yaml

Migration from AppRole to Kubernetes auth

If you're migrating from AppRole to Kubernetes auth:

  1. Keep AppRole active during migration.
  2. Create Kubernetes auth configuration in parallel.
  3. Test with a single VaultStaticSecret first.
  4. Gradually migrate other secrets.
  5. Remove AppRole only after all secrets are migrated.
  6. Clean up AppRole secrets and configurations.

# After successful migration, disable AppRole
oc exec -it vault-0 -n vault -- /bin/sh
export VAULT_ADDR=http://127.0.0.1:8200
vault login <root_token>

# Delete AppRole role
vault delete auth/approle/role/myapp-role

# Optional: Disable AppRole if not used elsewhere
# vault auth disable approle

exit

Troubleshooting Vault SSH secrets engine

Certificate rejected

If SSH rejects the certificate:

  1. Verify the CA public key is correctly configured on the target server.
  2. Check certificate validity period hasn't expired.
  3. Ensure valid_principals matches the username you're connecting as.
  4. Verify SSH daemon configuration includes TrustedUserCAKeys.

Permission denied

If you get permission denied:

  1. Check the allowed_users in the Vault role configuration.
  2. Verify your Vault token has permission to sign certificates.
  3. Ensure the certificate file has correct permissions (600).

Certificate expired

Certificates have a TTL. If expired:

  1. Request a new signed certificate from Vault.
  2. Consider increasing the TTL in the role configuration (balance security vs convenience).

Troubleshooting Vault Disaster Recovery

Use these commands to troubleshoot and diagnose issues with Vault DR replication:

Check replication status:


vault read sys/replication/dr/status

Check Raft cluster health:


vault operator raft list-peers
vault operator raft autopilot state

Remove a dead peer manually (only when Autopilot cannot):


vault operator raft remove-peer <node-id>

Check Vault pod labels (useful for automation):


oc get pods -n vault --show-labels

Verify DNS resolution for Raft peers:


oc exec vault-0 -n vault -- nslookup vault-0.vault-internal.vault.svc

View operational logs:


oc logs -n vault vault-0

Execute commands in a pod for CLI operations:


oc exec -n vault -it vault-0 -- /bin/sh

Performance replication

Deploy performance replication (PR) when you need to serve Vault client traffic across multiple regions while sharing the same secret material and configuration.

Vault performance replication streams secret data in near-real-time from a primary cluster to secondary clusters. Performance replication secondary clusters manage their own tokens and leases, handle read requests locally, and forward write requests to the primary cluster. This enables low-latency secret access across multiple sites.