Troubleshooting
This guide provides solutions to common issues encountered when installing, configuring, and operating HashiCorp Vault on OpenShift Container Platform on the s390x architecture.
HashiCorp Vault on OpenShift (s390x)
Helm installation fails with "repository not found"
Symptoms:
Error: failed to download "hashicorp/vault"
Solution:
- Verify the Helm repository is added correctly:
helm repo list
- If not present, add the repository:
helm repo add hashicorp https://helm.releases.hashicorp.com
helm repo update
- Retry the installation.
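The repository check can also be scripted. The following is a minimal sketch, assuming the default tabular output of `helm repo list` (NAME and URL columns); `has_hashicorp_repo` is a hypothetical helper name, not part of Helm.

```shell
# Hypothetical helper: succeeds if `helm repo list` output (read from stdin)
# contains a repo named "hashicorp" that points at the official chart URL.
has_hashicorp_repo() {
  awk '$1 == "hashicorp" && $2 ~ /^https:\/\/helm\.releases\.hashicorp\.com\/?$/ { found = 1 }
       END { exit !found }'
}

# Usage against a live Helm client:
#   helm repo list | has_hashicorp_repo || \
#     helm repo add hashicorp https://helm.releases.hashicorp.com
```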
Image pull errors for s390x architecture
Symptoms:
Failed to pull image "icr.io/ibm-vault-for-z:1.19.12-ent": rpc error: code = Unknown desc = Error reading manifest
Solution:
- Verify image repository access:
oc get pods -n vault
oc describe pod vault-0 -n vault
- Check if image pull secrets are required:
oc create secret docker-registry ibm-registry \
  --docker-server=icr.io \
  --docker-username=<username> \
  --docker-password=<password> \
  -n vault
- Update values.yml to reference the secret:
server:
  image:
    repository: "icr.io/ibm-vault-for-z"
    tag: "1.19.12-ent"
    pullPolicy: IfNotPresent
  imagePullSecrets:
    - name: ibm-registry
- Upgrade the Helm release:
helm upgrade vault hashicorp/vault -n vault -f values.yml
Namespace does not exist
Symptoms:
Error: namespaces "vault" not found
Solution: Create the namespace before installation:
oc new-project vault
Pods stuck in CrashLoopBackOff
Symptoms:
NAME READY STATUS RESTARTS
vault-0 0/1 CrashLoopBackOff 5
Solution:
- Check pod logs for specific errors:
oc logs vault-0 -n vault
- Common causes and fixes:
- License issue: Verify the license secret exists and is correctly referenced
- Storage issue: Check PVC status and storage class availability
- Configuration error: Review the Raft configuration in values.yml
- Verify the license secret:
oc get secret vault-ent-license -n vault
oc describe secret vault-ent-license -n vault
Readiness probe failures
Symptoms:
Readiness probe failed: HTTP probe failed with statuscode: 503
Solution:
- This is expected behavior before initialization. Verify the readiness probe configuration in values.yml:
readinessProbe:
  enabled: true
  path: "/v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204"
- The probe should accept 204 status codes for sealed/uninitialized states.
- If pods remain not ready after initialization and unsealing, check:
oc exec -n vault vault-0 -- vault status
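To interpret a probe failure, it helps to know what each status code from `/v1/sys/health` means when no query parameters override it. The sketch below maps Vault's documented default codes to a short description; `vault_health_meaning` is a hypothetical helper name. A 503 is the default code for a sealed node, which may indicate that the probe path in effect does not carry `sealedcode=204`.

```shell
# Hypothetical helper: map an HTTP status from /v1/sys/health to its default
# meaning. Query parameters such as sealedcode, uninitcode, or standbyok
# change which code Vault returns for a given state.
vault_health_meaning() {
  case "$1" in
    200) echo "initialized, unsealed, active" ;;
    429) echo "unsealed, standby" ;;
    472) echo "disaster recovery secondary" ;;
    473) echo "performance standby" ;;
    501) echo "not initialized" ;;
    503) echo "sealed" ;;
    *)   echo "unknown" ;;
  esac
}

# Usage (assumes curl is available in the pod, as elsewhere in this guide):
#   code=$(oc exec -n vault vault-0 -- \
#     curl -s -o /dev/null -w '%{http_code}' http://localhost:8200/v1/sys/health)
#   vault_health_meaning "$code"
```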
Liveness probe failures after deployment
Symptoms:
Liveness probe failed: Get "http://vault-0:8200/v1/sys/health": context deadline exceeded
Solution:
- Increase the initialDelaySeconds in the liveness probe:
livenessProbe:
  enabled: true
  path: "/v1/sys/health?standbyok=true"
  initialDelaySeconds: 120
  periodSeconds: 10
  timeoutSeconds: 5
- Update the deployment:
helm upgrade vault hashicorp/vault -n vault -f values.yml
"Vault is already initialized" error
Symptoms:
Error initializing: Error making API request.
Code: 400. Errors:
* Vault is already initialized
Solution: This is expected if Vault has already been initialized. To verify:
oc exec -n vault vault-0 -- vault status
If initialized: true, proceed to unsealing. If you need to reinitialize (data loss warning):
- Delete the PVCs and redeploy (this will erase all data)
- Or restore from a backup if available
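The decision between initializing, unsealing, and doing nothing can be read directly from the `vault status` table. The following sketch assumes the default table output with `Initialized` and `Sealed` rows; `vault_next_step` is a hypothetical helper name.

```shell
# Hypothetical helper: read `vault status` table output on stdin and print
# the next action: "init", "unseal", or "ready".
vault_next_step() {
  awk '$1 == "Initialized" { init = $2 }
       $1 == "Sealed"      { sealed = $2 }
       END {
         if (init != "true")        print "init"
         else if (sealed == "true") print "unseal"
         else                       print "ready"
       }'
}

# Usage:
#   oc exec -n vault vault-0 -- vault status | vault_next_step
```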
Initialization hangs or times out
Symptoms: The vault operator init command hangs without output.
Solution:
- Check if the pod can reach the Vault API:
oc exec -n vault vault-0 -- curl -s http://localhost:8200/v1/sys/health
- Verify the listener configuration in values.yml:
listener "tcp" {
  tls_disable = 1
  address = "[::]:8200"
  cluster_address = "[::]:8201"
}
- Check for port conflicts or firewall rules blocking port 8200.
"Key not found" when unsealing
Symptoms:
Error unsealing: Error making API request.
Code: 400. Errors:
* invalid key
Solution:
- Verify you're using the correct unseal keys from the initialization output.
- Ensure you're not mixing keys from different Vault installations.
- Keys are case-sensitive; verify there are no copy-paste errors.
Vault reseals after pod restart
Symptoms: After a pod restart, Vault shows sealed: true again.
Solution: This is expected behavior. Vault must be manually unsealed after each restart. Options:
- Manual unsealing (as documented):
oc exec -n vault vault-0 -- vault operator unseal <KEY_1>
oc exec -n vault vault-0 -- vault operator unseal <KEY_2>
oc exec -n vault vault-0 -- vault operator unseal <KEY_3>
- Auto-unseal (recommended for production):
Configure auto-unseal using a KMS provider. Update values.yml with seal configuration:
server:
  ha:
    raft:
      config: |
        seal "awskms" {
          region = "us-east-1"
          kms_key_id = "your-kms-key-id"
        }
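When unsealing manually, the `Unseal Progress` row of `vault status` shows how many key shares have been accepted so far. The sketch below derives how many more are needed; it assumes the default table output, and `unseal_keys_remaining` is a hypothetical helper name.

```shell
# Hypothetical helper: read `vault status` table output on stdin and print how
# many more unseal keys are required. Prints 0 when no "Unseal Progress" row
# is present, which is the case for an unsealed node.
unseal_keys_remaining() {
  awk '$1 == "Unseal" && $2 == "Progress" {
         split($3, p, "/")
         print p[2] - p[1]
         found = 1
       }
       END { if (!found) print 0 }'
}

# Usage:
#   oc exec -n vault vault-0 -- vault status | unseal_keys_remaining
```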
Cannot unseal standby nodes
Symptoms: Unsealing vault-1 or vault-2 fails or shows errors.
Solution:
- Ensure vault-0 (leader) is unsealed first.
- Verify Raft cluster connectivity:
oc exec -n vault vault-0 -- vault operator raft list-peers
- Check the internal service DNS resolution:
oc exec -n vault vault-1 -- nslookup vault-0.vault-internal
Raft cluster not forming
Symptoms:
HA Enabled: true
HA Cluster: n/a
Solution:
- Verify the retry_join configuration in values.yml:
retry_join {
  leader_api_addr = "http://vault-0.vault-internal:8200"
}
retry_join {
  leader_api_addr = "http://vault-1.vault-internal:8200"
}
retry_join {
  leader_api_addr = "http://vault-2.vault-internal:8200"
}
- Check the headless service exists:
oc get svc vault-internal -n vault
- Verify pod-to-pod connectivity:
oc exec -n vault vault-1 -- curl -s http://vault-0.vault-internal:8200/v1/sys/health
- Check Raft logs:
oc logs vault-0 -n vault | grep -i raft
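Because the `retry_join` stanzas follow a fixed pattern, they can be generated and compared against what is in values.yml. This is a minimal sketch, assuming the `vault-N.vault-internal` pod naming used in this guide; `gen_retry_join` is a hypothetical helper name.

```shell
# Hypothetical helper: print one retry_join stanza per replica.
# Arguments: replica count, then scheme (http or https, default http).
gen_retry_join() {
  replicas=$1
  scheme=${2:-http}
  i=0
  while [ "$i" -lt "$replicas" ]; do
    printf 'retry_join {\n  leader_api_addr = "%s://vault-%d.vault-internal:8200"\n}\n' \
      "$scheme" "$i"
    i=$((i + 1))
  done
}

# Print the stanzas for a three-node cluster without TLS.
gen_retry_join 3
```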
Split-brain scenario
Symptoms: Multiple pods claim to be the active leader.
Solution:
- Check the Raft peer list on each pod:
oc exec -n vault vault-0 -- vault operator raft list-peers
- If inconsistent, remove and re-add problematic peers:
oc exec -n vault vault-0 -- vault operator raft remove-peer vault-2
- Restart the removed pod to rejoin:
oc delete pod vault-2 -n vault
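Comparing peer lists by eye is error-prone, so the leader reported by each pod can be extracted and compared instead. The sketch below assumes the default table output of `vault operator raft list-peers` (Node, Address, State, Voter columns); `leader_of` is a hypothetical helper name.

```shell
# Hypothetical helper: read `vault operator raft list-peers` output on stdin
# and print the node name whose State column is "leader".
leader_of() {
  awk '$3 == "leader" { print $1 }'
}

# Usage: collect the leader as seen from every pod.
#   for i in 0 1 2; do
#     oc exec -n vault "vault-$i" -- vault operator raft list-peers | leader_of
#   done | sort -u
# More than one distinct name in the result means the pods disagree.
```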
Leader election failures
Symptoms:
No active leader found
Solution:
- Ensure at least 2 out of 3 pods are running and unsealed (quorum requirement).
- Check network policies aren't blocking cluster communication on port 8201:
oc get networkpolicies -n vault
- Verify the cluster_address is correctly configured:
listener "tcp" {
  address = "[::]:8200"
  cluster_address = "[::]:8201"
}
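The "2 out of 3" quorum requirement follows from the Raft majority rule: a cluster of N voters needs floor(N/2) + 1 of them available. A short sketch of that arithmetic, with hypothetical helper names:

```shell
# Hypothetical helpers for Raft majority arithmetic.
# raft_quorum N: voters that must be available for a cluster of N voters.
raft_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# raft_fault_tolerance N: voters that can be lost while keeping quorum.
raft_fault_tolerance() {
  echo $(( ($1 - 1) / 2 ))
}

# For the three-node cluster in this guide:
raft_quorum 3            # prints 2
raft_fault_tolerance 3   # prints 1
```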
PVC not binding
Symptoms:
persistentvolumeclaim "data-vault-0" is pending
Solution:
- Check PVC status:
oc get pvc -n vault
oc describe pvc data-vault-0 -n vault
- Verify the storage class exists and is available:
oc get storageclass
- If using ocs-storagecluster-ceph-rbd, ensure ODF is properly installed:
oc get pods -n openshift-storage
- Check for sufficient storage capacity in the cluster.
Raft data corruption
Symptoms:
Error: failed to open raft logs: corruption detected
Solution:
- If you have backups, restore from the most recent snapshot:
oc exec -n vault vault-0 -- vault operator raft snapshot restore /path/to/snapshot
- If no backups, you may need to reinitialize (data loss):
- Delete all PVCs.
- Redeploy Vault.
- Reinitialize and reconfigure.
- Prevention: Set up regular snapshots:
oc exec -n vault vault-0 -- vault operator raft snapshot save /vault/audit/snapshot-$(date +%Y%m%d).snap
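Regular snapshots accumulate, so a retention step is usually paired with the save command. The following is a minimal sketch that keeps the newest N files, assuming the `snapshot-YYYYMMDD.snap` naming from the command above; `prune_snapshots` is a hypothetical helper name, and where it runs (inside the pod or wherever snapshots are copied) is left to the operator.

```shell
# Hypothetical helper: keep only the newest N snapshot-YYYYMMDD.snap files
# in a directory. Arguments: directory, number of snapshots to keep.
prune_snapshots() {
  dir=$1
  keep=$2
  # Date-stamped names sort chronologically, so reverse name order lists the
  # newest first; everything after the first $keep entries is deleted.
  ls -1 "$dir"/snapshot-*.snap 2>/dev/null | sort -r | tail -n +"$((keep + 1))" |
    while IFS= read -r f; do
      rm -f -- "$f"
    done
}

# Usage:
#   prune_snapshots /path/to/snapshots 7
```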
Insufficient disk space
Symptoms:
Error: no space left on device
Solution:
- Check PVC size in values.yml:
dataStorage:
  enabled: true
  size: 100Mi # Increase this value
- For existing deployments, you may need to:
- Create a snapshot.
- Delete and recreate PVCs with larger size.
- Restore from snapshot.
- Monitor disk usage:
oc exec -n vault vault-0 -- df -h /vault/data
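Disk monitoring is easier to automate with a threshold check than by reading `df -h` output. The sketch below assumes POSIX-format output (`df -P`, where the fifth column is the usage percentage and the sixth is the mount point) and that the `df` in the Vault image accepts `-P`; `mounts_over_threshold` is a hypothetical helper name.

```shell
# Hypothetical helper: read `df -P` output on stdin and print every mount
# whose usage is at or above the given percentage.
mounts_over_threshold() {
  awk -v limit="$1" 'NR > 1 {
    use = $5
    sub(/%$/, "", use)
    if (use + 0 >= limit + 0) print $6, $5
  }'
}

# Usage:
#   oc exec -n vault vault-0 -- df -P /vault/data | mounts_over_threshold 80
```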
Certificate verification failures
Symptoms:
Error: x509: certificate signed by unknown authority
Solution:
- Verify the CA certificate is included in the TLS secret:
oc get secret vault-tls -n vault -o yaml
- Ensure the ca.crt is specified in the Raft configuration:
retry_join {
  leader_api_addr = "https://vault-0.vault-internal:8200"
  leader_ca_cert_file = "/vault/tls/ca.crt"
}
- Verify certificate paths are correctly mounted:
oc exec -n vault vault-0 -- ls -la /vault/tls/
Certificate expired
Symptoms:
Error: x509: certificate has expired
Solution:
- Generate new certificates with appropriate validity period.
- Update the secret:
oc delete secret vault-tls -n vault
oc create secret generic vault-tls \
  --from-file=vault.crt=/path/to/new/vault.crt \
  --from-file=vault.key=/path/to/new/vault.key \
  --from-file=ca.crt=/path/to/new/ca.crt \
  -n vault
- Restart Vault pods:
oc rollout restart statefulset vault -n vault
TLS handshake failures
Symptoms:
Error: tls: handshake failure
Solution:
- Verify TLS is properly configured in the listener:
listener "tcp" {
  tls_disable = 0
  tls_cert_file = "/vault/tls/vault.crt"
  tls_key_file = "/vault/tls/vault.key"
  tls_client_ca_file = "/vault/tls/ca.crt"
}
- Check certificate permissions:
oc exec -n vault vault-0 -- ls -la /vault/tls/
Files should be readable (mode 0400 or 0444).
- Verify certificate and key match:
oc exec -n vault vault-0 -- openssl x509 -noout -modulus -in /vault/tls/vault.crt | openssl md5
oc exec -n vault vault-0 -- openssl rsa -noout -modulus -in /vault/tls/vault.key | openssl md5
The MD5 hashes should match.
License secret not found
Symptoms:
Error: secret "vault-ent-license" not found
Solution:
- Create the license secret:
oc create secret generic vault-ent-license \
  --from-file=license=/path/to/vault.hclic \
  -n vault
- Verify the secret exists:
oc get secret vault-ent-license -n vault
- Ensure the secret name matches the configuration in values.yml:
enterpriseLicense:
  secretName: "vault-ent-license"
  secretKey: "license"
Invalid or expired license
Symptoms:
Error: license is invalid or expired
Solution:
- Verify license validity:
oc exec -n vault vault-0 -- vault read sys/license/status
- Contact HashiCorp support to obtain a valid license.
- Update the license secret:
oc delete secret vault-ent-license -n vault
oc create secret generic vault-ent-license \
  --from-file=license=/path/to/new/vault.hclic \
  -n vault
- Restart Vault pods:
oc rollout restart statefulset vault -n vault
Cannot access Vault UI
Symptoms: Cannot reach the Vault UI through the OpenShift route.
Solution:
- Verify the UI is enabled in values.yml:
ui:
  enabled: true
- Check if a route exists:
oc get route -n vault
- If no route exists, create one:
oc create route edge vault-ui \
  --service=vault \
  --port=8200 \
  -n vault
- Get the route URL:
oc get route vault-ui -n vault -o jsonpath='{.spec.host}'
403 permission denied errors
Symptoms:
Error: permission denied
Solution:
- Ensure you're authenticated with a valid token:
oc exec -n vault vault-0 -- vault login <ROOT_TOKEN>
- Check token capabilities:
oc exec -n vault vault-0 -- vault token capabilities <TOKEN> <PATH>
- Review and update policies as needed:
oc exec -n vault vault-0 -- vault policy list
oc exec -n vault vault-0 -- vault policy read <POLICY_NAME>
DNS resolution failures
Symptoms:
Error: no such host
Solution:
- Verify the internal service exists:
oc get svc vault-internal -n vault
- Test DNS resolution from within a pod:
oc exec -n vault vault-0 -- nslookup vault-internal
- Check CoreDNS is functioning:
oc get pods -n openshift-dns
Slow response times
Symptoms: API requests take longer than expected to complete.
Solution:
- Check pod resource usage:
oc top pods -n vault
- Increase resource limits in values.yml:
server:
  resources:
    requests:
      memory: "256Mi"
      cpu: "250m"
    limits:
      memory: "512Mi"
      cpu: "500m"
- Monitor Raft performance:
oc exec -n vault vault-0 -- vault operator raft list-peers
- Check storage performance (IOPS, latency).
High memory usage
Symptoms: Vault pods consuming excessive memory, potentially leading to OOMKilled status.
Solution:
- Check current memory usage:
oc exec -n vault vault-0 -- cat /proc/meminfo
- Increase memory limits:
server:
  resources:
    limits:
      memory: "1Gi"
- Review audit log configuration; excessive logging can increase memory usage.
- If no file audit device is enabled yet, enable one at a known path so that an external tool can rotate the log (enabling the device does not rotate the log by itself):
oc exec -n vault vault-0 -- vault audit enable file file_path=/vault/audit/audit.log
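To catch pods approaching their memory limit before they are OOMKilled, the output of `oc top pods` can be compared against the configured limit. This is a minimal sketch, assuming the default three-column output (NAME, CPU, MEMORY) with memory reported in Mi; `pods_near_memory_limit` is a hypothetical helper name, and rows in other units are ignored.

```shell
# Hypothetical helper: read `oc top pods` output on stdin and print pods whose
# memory usage is at or above a percentage of the limit.
# Arguments: memory limit in Mi, threshold percentage.
pods_near_memory_limit() {
  awk -v limit="$1" -v pct="$2" 'NR > 1 && $3 ~ /Mi$/ {
    mem = $3
    sub(/Mi$/, "", mem)
    if (mem * 100 >= limit * pct) print $1, $3
  }'
}

# Usage with the 512Mi limit shown earlier and a 90% threshold:
#   oc top pods -n vault | pods_near_memory_limit 512 90
```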
General diagnostics
# Check all Vault resources
oc get all -n vault
# View detailed pod information
oc describe pod vault-0 -n vault
# Check pod events
oc get events -n vault --sort-by='.lastTimestamp'
# View Vault logs
oc logs vault-0 -n vault --tail=100
# Check Vault status from all pods
for i in 0 1 2; do
echo "=== vault-$i ==="
oc exec -n vault vault-$i -- vault status
done
# Check Raft cluster health
oc exec -n vault vault-0 -- vault operator raft list-peers
# View Vault configuration
oc exec -n vault vault-0 -- cat /vault/config/extraconfig-from-values.hcl
Network diagnostics
# Test connectivity between pods
oc exec -n vault vault-1 -- curl -v http://vault-0.vault-internal:8200/v1/sys/health
# Check service endpoints
oc get endpoints vault-internal -n vault
# Verify network policies
oc get networkpolicies -n vault
Storage diagnostics
# Check PVC status
oc get pvc -n vault
# View PVC details
oc describe pvc data-vault-0 -n vault
# Check disk usage
oc exec -n vault vault-0 -- df -h
# List Raft snapshots
oc exec -n vault vault-0 -- ls -lh /vault/audit/
Getting help
If you continue to experience issues after following this troubleshooting guide:
- Check Vault Logs: Detailed error messages are often found in the pod logs.
oc logs vault-0 -n vault --tail=200
- Review HashiCorp Documentation:
- Community Support:
- Enterprise Support: If you have a Vault Enterprise license, contact HashiCorp support with:
- Vault version.
- OpenShift version.
- Architecture (s390x).
- Detailed error messages.
- Relevant logs and configuration.
Best practices to avoid common issues
- Always back up before making changes:
oc exec -n vault vault-0 -- vault operator raft snapshot save /vault/audit/backup.snap
- Monitor Vault health regularly:
- Set up monitoring and alerting.
- Check seal status.
- Monitor Raft cluster health.
- Keep unseal keys secure:
- Store in a secure location (e.g., password manager, HSM).
- Never commit to version control.
- Consider using auto-unseal for production.
- Test disaster recovery procedures:
- Practice unsealing.
- Test snapshot restore.
- Document recovery procedures.
- Use TLS in production:
- Always enable TLS for production deployments.
- Use valid certificates from a trusted CA.
- Regularly rotate certificates before expiration.
- Plan for capacity:
- Monitor storage usage.
- Plan for growth.
- Set appropriate resource limits.
- Keep Vault updated:
- Review release notes.
- Test updates in non-production first.
- Follow HashiCorp's upgrade guides.
Troubleshooting - Vault Secrets Operator
VSO controller not starting
Check the controller logs:
oc logs -n vault-operator deployment/vault-secrets-operator-controller-manager -c manager
VaultAuth authentication failures
Verify the AppRole credentials:
oc describe vaultauth vault-auth -n vault-operator
Check if the secret exists and contains the correct Secret ID:
oc get secret approle-secret -n vault-operator -o yaml
Secret not syncing
Check the VaultStaticSecret status:
oc describe vaultstaticsecret vault-static-secret-v2 -n vault-operator
Verify the path exists in Vault:
oc exec -it vault-0 -n vault -- vault kv get kvv2/myapp/config
Network connectivity issues
Test connectivity from the VSO pod to Vault:
oc exec -it deployment/vault-secrets-operator-controller-manager -n vault-operator -c manager -- curl http://vault.vault.svc.cluster.local:8200/v1/sys/health
Troubleshooting Kubernetes authentication
Authentication failures
Check VaultAuth status:
oc describe vaultauth vault-auth -n vault-operator
Common issues:
- ServiceAccount not found:
oc get sa vault-secrets-operator -n vault-operator
- Role not configured correctly:
oc exec -it vault-0 -n vault -- vault read auth/kubernetes/role/vso-role
- Kubernetes API connectivity:
oc exec -it vault-0 -n vault -- vault read auth/kubernetes/config
Token issues
Check if VSO can obtain tokens:
# View VSO controller logs
oc logs -n vault-operator deployment/vault-secrets-operator-controller-manager -c manager --tail=100
Look for authentication-related errors:
Error authenticating to Vault: permission denied
Error: namespace not authorized
Verify ServiceAccount token
Check if the ServiceAccount has a valid token:
oc get serviceaccount vault-secrets-operator -n vault-operator -o yaml
Migration from AppRole to Kubernetes auth
If you're migrating from AppRole to Kubernetes auth:
- Keep AppRole active during migration.
- Create Kubernetes auth configuration in parallel.
- Test with a single VaultStaticSecret first.
- Gradually migrate other secrets.
- Remove AppRole only after all secrets are migrated.
- Clean up AppRole secrets and configurations.
# After successful migration, disable AppRole
oc exec -it vault-0 -n vault -- /bin/sh
export VAULT_ADDR=http://127.0.0.1:8200
vault login <root_token>
# Delete AppRole role
vault delete auth/approle/role/myapp-role
# Optional: Disable AppRole if not used elsewhere
# vault auth disable approle
exit
Troubleshooting Vault SSH Secret Engine
Certificate rejected
If SSH rejects the certificate:
- Verify the CA public key is correctly configured on the target server.
- Check certificate validity period hasn't expired.
- Ensure valid_principals matches the username you're connecting as.
- Verify SSH daemon configuration includes TrustedUserCAKeys.
Permission denied
If you get permission denied:
- Check the allowed_users in the Vault role configuration.
- Verify your Vault token has permission to sign certificates.
- Ensure the certificate file has correct permissions (600).
Certificate expired
Certificates have a TTL. If expired:
- Request a new signed certificate from Vault.
- Consider increasing the TTL in the role configuration (balance security vs convenience).
Troubleshooting Vault Disaster Recovery
Use these commands to troubleshoot and diagnose issues with Vault DR replication:
Check replication status:
vault read sys/replication/dr/status
Check Raft cluster health:
vault operator raft list-peers
vault operator raft autopilot state
Remove a dead peer manually (only when Autopilot cannot):
vault operator raft remove-peer <node-id>
Check Vault pod labels (useful for automation):
oc get pods -n vault --show-labels
Verify DNS resolution for Raft peers:
oc exec vault-0 -n vault -- nslookup vault-0.vault-internal.vault.svc
View operational logs:
oc logs -n vault vault-0
Execute commands in a pod for CLI operations:
oc exec -n vault -it vault-0 -- /bin/sh
Performance replication
Deploy performance replication (PR) when you need to serve Vault client traffic across multiple regions while sharing the same secret material and configuration.
Vault performance replication streams secret data in near-real-time from a primary cluster to secondary clusters. Performance replication secondary clusters manage their own tokens and leases, handle read requests locally, and forward write requests to the primary cluster. This enables low-latency secret access across multiple sites.