BOSH virtual machine shows as unresponsive
Symptoms (Detection)
- A Prometheus alert shows a virtual machine with a failure message.
- The bosh -e IBMCloudPrivate vms command shows a virtual machine (job) with a failure message.
Determine if disk usage is at 100%
- Log in to the bosh client.
- Check the disk usage by running the following command:
    bosh -e IBMCloudPrivate vms --vitals
- Connect to the virtual machine in question by using bosh, where 0 is the instance index of the virtual machine:
    bosh -e IBMCloudPrivate -d Bluemix ssh JOB_NAME/0
Useful commands
The following commands can be run when you are connected to the virtual machine:
    df -k               # List disk usage for every file system on the virtual machine
    du --max-depth=1    # List the sizes of the directories in the current location
Fixing persistent disk usage at 100% for ccdb_ng and uaadb [ /var/vcap/store ]
- Run the following command to become the vcap user:
    sudo su vcap
- Clean up the transaction logs:
    /var/vcap/packages/postgres-9.4.9/bin/pg_resetxlog -f /var/vcap/store/postgres/postgres-9.4.9/
  Note: This command can take a while, but it reduces the size of /var/vcap/store.
- Validate that the disk usage is no longer 100%.
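The validation step can be scripted. The following is a minimal sketch, not part of the original procedure: the function names disk_usage_pct and disk_usage_below are hypothetical, and the parsing assumes the POSIX df -P layout, in which the fifth column of the second line holds the usage percentage.

```shell
# Print the usage percentage (digits only) of the file system that holds PATH.
# Assumption: POSIX "df -P" output, where column 5 of line 2 is the capacity.
disk_usage_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed only if the usage of PATH is below LIMIT percent (default 100).
# Fails if the percentage could not be read at all.
disk_usage_below() {
  local pct
  pct=$(disk_usage_pct "$1")
  [ -n "$pct" ] && [ "$pct" -lt "${2:-100}" ]
}

# Example, run on the virtual machine:
#   disk_usage_below /var/vcap/store 100 && echo "persistent disk OK"
```

The same check works for the ephemeral disk by passing /var/vcap/data instead of /var/vcap/store.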
Fixing Ephemeral disk usage at 100% [ /var/vcap/data ]
- Log in as a root user:
    sudo su -
- On the virtual machine, issue the following command to change the directory:
    cd /var/vcap/data
- Run the following command and descend into the largest directories to determine whether any large files or directories can be removed:
    du --max-depth=1
- Most files in /var/vcap/data/sys/log can be removed. If the logs are required, copy them to an external location, then remove the local copies.
- Validate that the disk usage is no longer 100%.
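Following the largest sizes by hand can be slow, so sorting the output helps. The sketch below is one possible approach rather than the documented procedure: it uses du -sk per entry instead of du --max-depth=1 so that large files are listed alongside directories. The function name largest_entries is hypothetical, and the -mindepth and -maxdepth options assume GNU or BSD find.

```shell
# List the COUNT largest entries (files and directories) directly under DIR,
# largest first, one per line as "SIZE_KB<TAB>PATH".
# Assumption: find supports -mindepth/-maxdepth (GNU and BSD find do).
largest_entries() {
  local dir="${1:-.}"
  local count="${2:-10}"
  find "$dir" -mindepth 1 -maxdepth 1 -exec du -sk {} + | sort -rn | head -n "$count"
}

# Example, run as root on the virtual machine:
#   largest_entries /var/vcap/data 5
#   largest_entries /var/vcap/data/sys/log 5
```

Repeating the call on the top entry narrows the search one level at a time, which mirrors the manual du walk described above.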
Alternative fix for Ephemeral disk usage at 100% [ /var/vcap/data ]
NOTE: This solution fixes only the ephemeral disk and must not be run if the persistent disk usage is at 100%.
- Because the ephemeral disk does not contain persistent data, the virtual machine can be rebuilt.
- Re-create the virtual machine by using bosh.
    bosh -e IBMCloudPrivate -d Bluemix recreate JOB_NAME/INDEX    # Example: JOB_NAME/INDEX = ccdb_ng/0
    Continue? [yN]: y
- After the virtual machine is re-created, check that the disk usage is no longer 100%.
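The rule in the NOTE above (re-create only when the ephemeral disk is full and the persistent disk is not) can be encoded as a small guard. This is an illustrative sketch: recreate_is_safe is a hypothetical helper, and it assumes the operator passes in the two usage percentages as whole numbers, for example as read from the bosh -e IBMCloudPrivate vms --vitals output.

```shell
# Decide whether the re-create shortcut applies.
# Arguments: PERSISTENT_PCT EPHEMERAL_PCT (integers without the % sign).
# Returns 0 only when the ephemeral disk is full and the persistent disk is not.
# Assumption: both values are numeric; non-numeric input makes the check fail.
recreate_is_safe() {
  local persistent="$1"
  local ephemeral="$2"
  [ "$ephemeral" -ge 100 ] && [ "$persistent" -lt 100 ]
}

# Example, run from the bosh client:
#   recreate_is_safe 45 100 && bosh -e IBMCloudPrivate -d Bluemix recreate JOB_NAME/INDEX
```

When the guard fails because the persistent disk is also full, the persistent disk procedure earlier in this document applies instead.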