Troubleshooting
Problem
At a sufficient level of concurrent deploys using a VMAX as the backend storage, it is possible to encounter deploy failures because the volume clone operations take longer than the compute host (nova) waits for them to complete. By default, a nova compute host checks 600 times, once every 3 seconds, for the volume clone to complete before it times out; that is, it waits 30 minutes (600 * 3 = 1800 seconds), plus loop processing time. Because create and delete volume operations are serialized on the VMAX, sufficient concurrency can result in deploy failures.
Symptom
Here is an example of the error message. It is taken from the nova compute log; a portion of the message is also displayed in the console messages queue:
nova-compute.log.1:2016-05-18 03:06:30.324 99255 ERROR nova.compute.manager [instance: c8715824-2991-42cb-8602-8b27ef0367b4] VolumeNotCreated: Volume <vol_id> did not finish being created even after we waited <xxx> seconds or 601 attempts. And its status is creating.
Where <xxx> is a number larger than 1800, assuming the retry and interval values have not been changed from their default settings.
Cause
PowerVC Knowledge Center recommends a maximum of eight concurrent operations when the backend storage type is VMAX. Because create and delete volume operations are serialized by the VMAX (along with other configuration change operations), there is a point at which volume creation or deletion takes longer than 30 minutes. Generally, eight concurrent operations finish within the 30-minute window: assuming 3.5 minutes per create or delete, eight issued simultaneously result in the longest operation taking 28 minutes. However, if each create/clone operation actually takes 4 minutes, for example, the last of the eight does not finish until 32 minutes have passed, so the deployment times out while waiting for the volume to be cloned.
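To make the arithmetic concrete, here is a small shell sketch; the per-operation time is an assumption for illustration, not a measured value. It computes when each of eight serialized operations finishes and compares that against the default nova timeout:

    per_op=240                  # assumed: 4 minutes (240 seconds) per clone operation
    timeout=$((600 * 3))        # nova default: 600 retries * 3-second interval = 1800 s
    for n in 1 2 3 4 5 6 7 8; do
        finish=$((n * per_op))  # serialized: the nth operation finishes at n * per_op
        if [ "$finish" -gt "$timeout" ]; then
            echo "operation $n finishes at ${finish}s: exceeds the ${timeout}s timeout"
        else
            echo "operation $n finishes at ${finish}s: within the ${timeout}s timeout"
        fi
    done

Running this shows operations 1 through 7 completing inside the window and operation 8 finishing at 1920 seconds (32 minutes), past the 1800-second default.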
Environment
o Standard HMC or NovaLink
o Backend storage is a VMAX array
Any PowerVC release can be affected by this type of failure; the example error message shown here comes from 1.3.1.1.
Diagnosing The Problem
Check the PowerVC compute and cinder logs for messages about timing out a request after waiting a given number of seconds or attempts. An example is given in the Symptom section above.
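For example, the relevant messages can be located with a quick search; the log directory locations used here are assumptions and may differ on your installation:

    grep -r "did not finish being created" /var/log/nova/
    grep -r "Timed out waiting for a reply" /var/log/cinder/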
Resolving The Problem
There are two main ways to resolve the problem:
1. Lower the level of concurrency when issuing deployments or VM/volume deletion requests.
2. Increase the number of attempts for checking that an operation has completed.
For item 2 above, OpenStack offers a wide range of configuration options governing operation timeouts, so the solution depends on which timeout you want to change. This technical note focuses on the volume creation/clone failure caused by not waiting long enough, so the options reviewed below pertain to that case.
1. Check and possibly edit these settings in each compute host's configuration file. This is /etc/nova/nova.conf for NovaLink hosts, or /etc/nova/nova-<compute_hosts>.conf for HMC-managed hosts. Under the [DEFAULT] configuration section, the following two settings are shown with their default values:

   o block_device_allocate_retries = 600
   o block_device_allocate_retries_interval = 3

   This computes to a timeout of 600 * 3 = 1800 seconds, or 30 minutes. To increase this timeout to 90 minutes, for example, change the retry setting as follows:

   block_device_allocate_retries = 1800

   The compute host will now wait at least 1800 * 3 = 5400 seconds before timing out. Do this for every applicable compute host that is managed. After the configuration file is edited, the compute service must be restarted in order for the changes to take effect. For a NovaLink-managed compute host, for example, the command to restart its process is:

   systemctl restart openstack-nova-compute
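   As a convenience, the settings can also be checked and changed from the command line. This is a sketch, assuming the crudini utility is installed and using the NovaLink configuration path:

   # Print the current values. crudini exits with an error if the option
   # is not explicitly set, in which case the built-in default applies.
   crudini --get /etc/nova/nova.conf DEFAULT block_device_allocate_retries
   crudini --get /etc/nova/nova.conf DEFAULT block_device_allocate_retries_interval

   # Raise the retry count to 1800 (1800 * 3 s = 5400 s, or 90 minutes).
   crudini --set /etc/nova/nova.conf DEFAULT block_device_allocate_retries 1800

   # Restart the compute service so the change takes effect.
   systemctl restart openstack-nova-compute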
2. You also need to check the timeout options used by the VMAX cinder driver when it waits for the SMI-S provider to respond to a request. These values are stored in the following XML file:

   /etc/cinder/cinder_emc_config_<vmax_array_id>.xml

   The options to check and adjust are as follows:

   <Interval>5</Interval>
   <Retries>720</Retries>

   The example values here are valid defaults for a new VMAX provider registered under PowerVC 1.3.1.1; however, the defaults may vary in other releases. With these values, the volume driver waits 5 * 720 = 3600 seconds, or 1 hour, before giving up on the SMI-S provider. If the nova-compute host settings specify 1 hour or less, this is satisfactory. For the 90-minute timeout example, however, increase the Retries value so that the driver does not time out before the 90-minute window runs out:

   <Interval>5</Interval>
   <Retries>1080</Retries>

   These changes take effect immediately. A service restart is not required.
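   To verify the current values, a simple grep works; substitute your array's ID for the <vmax_array_id> placeholder:

   # Show the current Interval and Retries settings.
   grep -E '<Interval>|<Retries>' /etc/cinder/cinder_emc_config_<vmax_array_id>.xml

   # Sanity check: Interval * Retries should cover the nova-compute timeout.
   # 5 * 1080 = 5400 seconds, matching the 90-minute example above.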
3. Finally, you might need to change the RPC timeout value. If this timeout is exceeded, the /var/log/cinder/api.log file will contain a log message with the following text:

   ERROR ...snip... "Caught error: Timed out waiting for a reply to message ID <uuid>"

   This means the cinder RPC timeout has been exceeded: the cinder API process gave up waiting for the volume driver process to respond. Check this configuration property in /etc/cinder/cinder.conf, under the [DEFAULT] section:

   rpc_response_timeout = 1800

   The value shown here is the default for PowerVC 1.3.1.1; however, this could vary with other releases. Continuing the 90-minute timeout example, you could change this to have a value of 5400:

   rpc_response_timeout = 5400

   After making this change, restart the cinder API service for the new setting to take effect. For example:

   systemctl restart openstack-cinder-api
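   The same crudini sketch applies here, again assuming the utility is available:

   # Print the current RPC timeout (errors if the option is not explicitly set).
   crudini --get /etc/cinder/cinder.conf DEFAULT rpc_response_timeout

   # Raise it to 5400 seconds to match the 90-minute example.
   crudini --set /etc/cinder/cinder.conf DEFAULT rpc_response_timeout 5400

   # Restart the cinder API service so the change takes effect.
   systemctl restart openstack-cinder-api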
Document Information
More support for: PowerVC Standard Edition
Software version: Version Independent
Operating system(s): Linux
Document number: 667151
Modified date: 17 June 2018
UID: nas8N1021384