BOSH virtual machine shows as unresponsive

Symptoms (Detection)

  1. A Prometheus alert shows a virtual machine with a failure message.
  2. The bosh -e IBMCloudPrivate vms command shows a virtual machine (job) with a failure message.
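
To spot affected instances quickly, the vms listing can be filtered for anything that is not running. The following is a minimal sketch, not part of the product: the function name list_unhealthy_vms is hypothetical, and it assumes that the input is tab-separated with no header row, with the instance name in column 1 and the process state in column 2. Check the layout of your own bosh output before relying on it.

```shell
# Print the instances whose process state is not "running".
# Assumption: tab-separated input without a header row,
# column 1 = instance name, column 2 = process state.
list_unhealthy_vms() {
  awk -F'\t' '
    NF >= 2 {
      name = $1
      state = $2
      gsub(/^ +| +$/, "", name)    # trim padding around the instance name
      gsub(/^ +| +$/, "", state)   # trim padding around the state
      if (state != "running") {
        printf "%s\t%s\n", name, state
      }
    }'
}

# Example (not run here):
#   bosh -e IBMCloudPrivate vms | list_unhealthy_vms
```

Lines with fewer than two columns, such as a deployment banner, are skipped.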

Determine if disk usage is at 100%

  1. Log in to the bosh client.
  2. Check the disk usage by running the following command:
    bosh -e IBMCloudPrivate vms --vitals
    
  3. Connect to the affected virtual machine by using bosh ssh, where JOB_NAME is the name of the job and 0 is the instance index of the virtual machine:
    bosh -e IBMCloudPrivate -d Bluemix ssh JOB_NAME/0
    

    Useful commands

    The following commands can be run when you are connected to the virtual machine:
    df -k             # List disk usage for every mounted file system, in 1 KB blocks
    du --max-depth=1  # List the total size of each directory directly under the current location
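
When a directory has many entries, sorting the du output makes the largest ones easy to find. This is a minimal sketch under stated assumptions: the function name largest_entries is hypothetical, and it uses -d 1, the short form of --max-depth=1, together with -a so that files are listed as well as directories.

```shell
# List the N largest entries directly under a directory, largest first.
# $1 = directory (default: current directory), $2 = number of lines (default: 10)
largest_entries() {
  local dir="${1:-.}"
  local count="${2:-10}"
  # -a includes files, -k reports sizes in KB, -d 1 limits the depth to direct children.
  du -a -k -d 1 "$dir" 2>/dev/null | sort -rn | head -n "$count"
}

# Example (not run here):
#   largest_entries /var/vcap/data 5
```

The first line of the output is the total for the directory itself; the lines after it are its largest children.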
    

    Fixing persistent disk usage at 100% for ccdb_ng and uaadb [ /var/vcap/store ]

  4. Run the following command to become the vcap user:
    sudo su vcap
    
  5. Clean up the transaction logs:

    /var/vcap/packages/postgres-9.4.9/bin/pg_resetxlog -f /var/vcap/store/postgres/postgres-9.4.9/
    

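    Note: This command can take a while to complete, but it reduces the size of /var/vcap/store. Because pg_resetxlog clears the write-ahead log, run it only while the PostgreSQL server on the virtual machine is stopped.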

  6. Validate that the disk usage is no longer 100%.
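
The validation step can be scripted by reading the capacity column of df. The following sketch assumes POSIX df output (the -P option); the function names usage_percent and is_below_limit are hypothetical helpers, not commands that ship with BOSH.

```shell
# Print the usage percentage (without the % sign) of the file system that holds a path.
usage_percent() {
  # -P forces POSIX output: one line per file system, capacity in column 5.
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed only when the usage of the given path is below the limit (default: 100).
is_below_limit() {
  local pct
  pct=$(usage_percent "$1") || return 2
  [ "$pct" -lt "${2:-100}" ]
}

# Example (not run here):
#   is_below_limit /var/vcap/store && echo "persistent disk is no longer full"
```

The same check applies to the ephemeral disk by passing /var/vcap/data instead.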

Fixing ephemeral disk usage at 100% [ /var/vcap/data ]

  1. Log in as the root user:
    sudo su -
    
  2. On the virtual machine, change to the ephemeral disk directory:
    cd /var/vcap/data
    
  3. List the directory sizes and drill down into the largest directories to determine whether any large files or directories can be removed:
    du --max-depth=1
    
  4. Most files in /var/vcap/data/sys/log can be removed. If the logs are required, copy them to an external location, then remove the local copies.
  5. Validate that the disk usage is no longer 100%.
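
If the logs must be kept, archiving and removal can be combined so that nothing is deleted until the archive is verified. This is a minimal sketch: the function name archive_and_clear_logs is hypothetical, and the destination directory is an assumption that must point to a location outside the full ephemeral disk.

```shell
# Archive a log directory, verify the archive, then delete the local log files.
# $1 = source log directory, $2 = destination directory on another disk
archive_and_clear_logs() {
  local src="$1"
  local dest="$2"
  local archive
  archive="$dest/logs-$(date +%Y%m%d%H%M%S).tar.gz"
  # Create the archive with paths relative to the source directory.
  tar -czf "$archive" -C "$src" . || return 1
  # Confirm that the archive is readable before deleting anything.
  tar -tzf "$archive" >/dev/null || return 1
  # Remove regular files only; the directory tree stays in place.
  find "$src" -type f -exec rm -f {} +
  echo "$archive"
}

# Example (not run here; /mnt/backup is a placeholder):
#   archive_and_clear_logs /var/vcap/data/sys/log /mnt/backup
```

The function prints the path of the archive that it created. Space that is held by a log file that a running process still has open is released only after the process closes the file.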

Alternative fix for ephemeral disk usage at 100% [ /var/vcap/data ]

Note: This solution fixes only the ephemeral disk. Do not use it if the persistent disk usage is at 100%.

  1. Because the ephemeral disk does not contain persistent data, the virtual machine can be rebuilt.
  2. Re-create the virtual machine by using bosh, and enter y when you are prompted to continue:
    bosh -e IBMCloudPrivate -d Bluemix recreate JOB_NAME/INDEX  # Example: JOB_NAME/INDEX = ccdb_ng/0
    Continue? [yN]: y
    
  3. After the virtual machine is re-created, check that the disk usage is no longer 100%.
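
For scripted use, the re-create command can be wrapped so that the instance name is validated before anything runs. The following is a sketch under stated assumptions: the function name recreate_instance is hypothetical, the environment and deployment names are copied from the steps above, and the bosh global option -n (non-interactive) is used to answer the confirmation prompt automatically. By default the function only prints the command.

```shell
# Build, and optionally run, the bosh recreate command for one instance.
# $1 = instance in JOB_NAME/INDEX form, $2 = "run" to execute (default: print only)
recreate_instance() {
  local instance="$1"
  local mode="${2:-dry-run}"
  case "$instance" in
    ""|/*|*/|*/*/*)
      echo "expected JOB_NAME/INDEX, got: '$instance'" >&2
      return 1 ;;
    */*)
      ;;  # exactly one slash with text on both sides
    *)
      echo "expected JOB_NAME/INDEX, got: '$instance'" >&2
      return 1 ;;
  esac
  # -n answers the "Continue?" prompt automatically.
  set -- bosh -n -e IBMCloudPrivate -d Bluemix recreate "$instance"
  if [ "$mode" = "run" ]; then
    "$@"
  else
    echo "$*"
  fi
}

# Example (not run here):
#   recreate_instance ccdb_ng/0        # prints the command
#   recreate_instance ccdb_ng/0 run    # runs the command
```

Printing the command first gives an operator the chance to confirm that the persistent disk is not the one that is full before the virtual machine is rebuilt.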