Cloudkit issues on GCP
This topic describes common issues that you might encounter while running cloudkit on Google Cloud Platform (GCP), along with their workarounds.
Create cluster issues
- Cloudkit produces logs that detail any errors or issues encountered during the provisioning and configuration process. Review the logs to identify any specific errors or issues that may be causing problems.
- Review the cloudkit inputs to ensure that all values are set correctly.
Quota issues
IBM Storage Scale cluster creation might fail because of insufficient quotas in the GCP account.
For example, if the routers quota is exceeded, the following error message is logged:
Error waiting to create Router: Error waiting for Creating Router: Quota 'ROUTERS' exceeded. Limit: 20.0 globally.
Fix: Increase the affected quota, or free resources that count against it.
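To find which quota was exceeded without reading the whole log, the message can be matched with a regular expression. The following is a minimal sketch, assuming that other quota errors follow the same wording as the sample message above; the function name is hypothetical and not part of cloudkit.

```python
import re

# Pattern modeled on the sample message in this topic.
# Assumption: other GCP quota errors use the same "Quota '<NAME>' exceeded. Limit: <N>" wording.
QUOTA_ERROR = re.compile(
    r"Quota '(?P<name>[A-Z0-9_]+)' exceeded\.\s+Limit: (?P<limit>[0-9]+(?:\.[0-9]+)?)"
)


def find_quota_errors(log_text):
    """Return a (quota_name, limit) pair for each quota error found in log_text."""
    return [
        (match.group("name"), float(match.group("limit")))
        for match in QUOTA_ERROR.finditer(log_text)
    ]


sample = (
    "Error waiting to create Router: Error waiting for Creating Router: "
    "Quota 'ROUTERS' exceeded. Limit: 20.0 globally."
)
print(find_quota_errors(sample))
```

The quota name that is returned identifies which quota to increase, or which kind of resource to free, in the GCP account.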
- Check your GCP credentials. Ensure that the GCP credentials used to provision resources through cloudkit are valid and have the appropriate permissions. To verify this, check the GCP credential JSON file used by cloudkit and confirm that the account it identifies has the necessary permissions. If the credentials cannot be found, the following error message is logged:
google: could not find default credentials. See https://cloud.google.com/docs/authentication/external/set-up-adc for more information
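A quick structural check of the credential JSON file can rule out a truncated or wrong file before a rerun. The sketch below assumes that the file is a GCP service account key, which normally carries the fields listed in the code; the function name is hypothetical, and the check does not verify permissions, which still have to be confirmed in the GCP account.

```python
import json

# Fields that a GCP service account key file normally contains.
# Assumption: cloudkit is given a service account key; other credential types use other fields.
EXPECTED_FIELDS = ("type", "project_id", "private_key", "client_email")


def credential_problems(text):
    """Return a list of problems found in credential JSON text (empty list if none)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    # Report every expected field that is absent or empty.
    return [
        f"missing or empty field: {name}"
        for name in EXPECTED_FIELDS
        if not data.get(name)
    ]
```

For a file on disk, the text can be read first and passed in, for example `credential_problems(open(path).read())`, where `path` is the location of the credential file.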
- Cluster creation failure due to transient network errors:
- In some circumstances, latent network issues might be encountered while deploying new cluster resources.
- If this is encountered, rerun the ./cloudkit command with the same parameters.
- The cloudkit logs might list something like:
Error waiting for instance to create: error while retrieving operation: Get \"https://compute.googleapis.com/compute/v1/projects/<PROJECT_ID>/zones/us-east1-b/operations/operation-1696304597507-1111111-e49f6fc6-a5b7ecc4?alt=json&prettyPrint=false\": http2: client connection lost
- If the problem persists after rerunning the command, delete the existing cluster and create a new cluster.
- If an IBM Storage Scale related problem is suspected, collect data by running gpfs.snap, and upload the resulting snap to the IBM Storage Scale support ticket that you open.
- Plan your network infrastructure to ensure reliable communication between the installer node and the cloud. With jump host based connectivity, ssh can take a little longer to reach the node. If there is a network drop, it is recommended to rerun the operation.
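The advice to rerun the command with the same parameters after a transient network error can be sketched as a small wrapper. This is an illustration only, assuming that the command exits with a nonzero status when it fails; the function name, the attempt count, and the delay are hypothetical choices, not cloudkit behavior.

```python
import subprocess
import time


def rerun_until_success(command, attempts=3, delay_seconds=30,
                        run=subprocess.run, sleep=time.sleep):
    """Run command (a list of arguments) and rerun it unchanged on a nonzero exit status.

    Returns the number of attempts used, or raises RuntimeError if every attempt fails.
    Assumption: a failed run is signaled by a nonzero exit status.
    """
    for attempt in range(1, attempts + 1):
        result = run(command)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Wait before retrying so that a short network drop has time to clear.
            sleep(delay_seconds)
    raise RuntimeError(f"command still failing after {attempts} attempts: {command}")
```

The `command` list would hold ./cloudkit followed by exactly the parameters of the original invocation. If the wrapper raises, the next documented step applies: delete the existing cluster and create a new one.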
Plug-in load errors
Cluster creation fails while the terraform plug-in is loading and the following error message appears:
Error: Failed to load plugin schemas
Error: while loading schemas for plugin components: Failed to obtain schema: Could not load the schema for provider
Fix: Rerun the ./cloudkit command with the same parameters.
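The create cluster errors described so far each map to one documented action, so a log can be triaged by searching for their signatures. The following is a minimal sketch; the substrings are taken from the sample messages in this topic, and the table and function names are hypothetical.

```python
# Substrings taken from the sample messages in this topic, mapped to the documented action.
# Assumption: real log lines contain these substrings verbatim.
KNOWN_ERRORS = [
    ("Quota '", "increase the quota or free resources"),
    ("could not find default credentials", "check the GCP credentials"),
    ("http2: client connection lost", "rerun with the same parameters"),
    ("Failed to load plugin schemas", "rerun with the same parameters"),
]


def suggest_actions(log_text):
    """Return the documented actions whose signature appears in log_text, without duplicates."""
    actions = []
    for signature, action in KNOWN_ERRORS:
        if signature in log_text and action not in actions:
            actions.append(action)
    return actions


print(suggest_actions("Error: Failed to load plugin schemas"))
```

An empty result means that none of the known signatures matched, in which case the logs need to be reviewed manually.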
Delete cluster issue
- If any GCP resources were provisioned into the VPC outside of cloudkit, the cloudkit delete command cannot delete them. In this case, all resources created outside of cloudkit must be deleted manually before the cloudkit delete operation is run.
When this issue occurs, the following message may be displayed:
I: Destroy all remote objects managed by terraform (instance_template) configuration.
I: Destroy all remote objects managed by terraform (bastion_template) configuration.
I: Destroy all remote objects managed by terraform (vpc_template) configuration.
E: Delete cluster received error: exit status 1
Fix: Manually clean the cloud resources and contact IBM Support.
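To see how far a delete operation progressed before it failed, the output can be summarized by its line prefixes. The sketch below assumes that the output follows the format of the sample above, with informational lines starting with I: and error lines starting with E:; the function name is hypothetical.

```python
import re

# Matches the informational lines in the sample delete output above.
DESTROY_LINE = re.compile(
    r"^I: Destroy all remote objects managed by terraform \((\w+)\) configuration\.$"
)


def summarize_delete_log(log_text):
    """Return (templates_attempted, error_messages) from cloudkit delete output."""
    templates, errors = [], []
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        match = DESTROY_LINE.match(line)
        if match:
            templates.append(match.group(1))
        elif line.startswith("E:"):
            errors.append(line[2:].strip())
    return templates, errors


sample_log = """\
I: Destroy all remote objects managed by terraform (instance_template) configuration.
I: Destroy all remote objects managed by terraform (bastion_template) configuration.
I: Destroy all remote objects managed by terraform (vpc_template) configuration.
E: Delete cluster received error: exit status 1
"""
templates, errors = summarize_delete_log(sample_log)
print(templates, errors)
```

In the sample, the last template listed before the error is vpc_template, which is consistent with resources created outside of cloudkit still being present in the VPC, although the log alone does not prove that.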
Grant file system issue
- Check for errors in cloudkit logs. Cloudkit produces logs that detail any errors or issues encountered during the provisioning and configuration process. Review the logs to identify any specific errors or issues that may be causing problems.
- If the logs indicate a failure in accessing the IBM Storage Scale management GUI, ensure that the credentials provided for accessing the GUI are correct. After correcting the credentials, rerun the operation if necessary.
Transient network errors
- In some circumstances, latent network issues might be encountered while deploying new cluster resources.
- If this is encountered, rerun the ./cloudkit command with the same parameters.
- The cloudkit logs might list something like:
"Failed to connect to the host via ssh: Connection timed out during banner exchange"
Revoke file system issue
- Check for errors in cloudkit logs. Cloudkit produces logs that detail any errors or issues encountered during the provisioning and configuration process. Review the logs to identify any specific errors or issues that may be causing problems.
- If the logs indicate a failure in accessing the IBM Storage Scale management GUI, ensure that the credentials provided for accessing the GUI are correct. After correcting the credentials, rerun the operation if necessary.
Transient network errors
- In some circumstances it might be possible to encounter latent network issues when deploying new cluster resources.
- If this is encountered, rerun the ./cloudkit command with the same parameters.
- The cloudkit logs might list something like:
"Failed to connect to the host via ssh: Connection timed out during banner exchange"
Edit cluster issue
Quorum node issues
After you run an edit operation on any cluster, the mmhealth node show command displays the following GPFS TIPS entry:
GPFS TIPS 1 hour ago callhome_not_enabled, quorum_too_little_nodes
This means that the edit operation does not assign any new quorum nodes.
Fix: Manually assign additional quorum nodes from the newly added nodes.
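Whether a node reports the quorum tip can be checked by extracting the event names from the TIPS line. This is a minimal sketch, assuming that the line looks like the sample above and that event names are lowercase words joined by underscores; the function name is hypothetical.

```python
import re

# Assumption: event names are lowercase tokens that contain at least one underscore.
EVENT_NAME = re.compile(r"\b[a-z][a-z0-9]*(?:_[a-z0-9]+)+\b")


def tips_events(status_line):
    """Return the event names listed in a 'GPFS TIPS ...' status line."""
    return EVENT_NAME.findall(status_line)


line = "GPFS TIPS 1 hour ago callhome_not_enabled, quorum_too_little_nodes"
events = tips_events(line)
print("quorum_too_little_nodes" in events)
```

If quorum_too_little_nodes is among the events after an edit operation, additional quorum nodes still need to be assigned manually.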