Cloudkit issues on GCP
This topic describes the common issues and workarounds that you might encounter while running cloudkit on Google Cloud Platform (GCP).
Create cluster issues
- Quota issues
The IBM Storage Scale cluster creation execution may fail due to insufficient quotas in the GCP account.
For example, if the routers quota is exceeded, the following error message is logged:
Error waiting to create Router: Error waiting for Creating Router: Quota 'ROUTERS' exceeded. Limit: 20.0 globally.
Fix: Increase the required quota or free some of the required resources.
Note: You can check if all the required quotas are met by running the ./cloudkit validate quota command.
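The usage and limit for a given quota metric can also be inspected directly with the gcloud CLI. This is a general gcloud pattern, not a cloudkit command; the metric name ROUTERS is taken from the error message above:

```shell
# Show the project-wide ROUTERS quota (usage vs. limit).
gcloud compute project-info describe \
  --flatten="quotas[]" \
  --filter="quotas.metric=ROUTERS" \
  --format="table(quotas.metric,quotas.usage,quotas.limit)"
```

If the usage is at the limit, request a quota increase in the GCP console or delete unused routers before retrying the cluster creation.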
- Credential issues
- Check your GCP credentials. Ensure that the GCP credentials used to provision resources through cloudkit are valid and have the appropriate permissions. Verify this by checking the GCP credential JSON file used by cloudkit and confirming that it grants the necessary permissions.
- If valid credentials cannot be found, an error similar to the following is logged:
google: could not find default credentials. See https://cloud.google.com/docs/authentication/external/set-up-adc for more information
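A quick local sanity check can catch the "could not find default credentials" error before cloudkit runs. The following function is an illustrative sketch, not part of cloudkit; python3 is used only to confirm the key file is valid JSON:

```shell
# Hedged sketch: verify that Application Default Credentials are in place.
check_gcp_creds() {
  creds="${GOOGLE_APPLICATION_CREDENTIALS:-}"
  if [ -z "$creds" ]; then
    echo "GOOGLE_APPLICATION_CREDENTIALS is not set" >&2
    return 1
  fi
  if [ ! -r "$creds" ]; then
    echo "credential file $creds is not readable" >&2
    return 1
  fi
  # python3 is used here only to confirm the key file parses as JSON.
  python3 -m json.tool "$creds" > /dev/null || return 1
  echo "credential file looks valid"
}

# Usage (placeholder path):
#   export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json
#   check_gcp_creds
```

Note that a syntactically valid key file does not by itself prove the service account has the required IAM permissions; verify those in the GCP console.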
- Transient network errors
- Cluster creation failure due to transient network errors:
- In some circumstances, latent network issues may be encountered while deploying new cluster resources.
- If this happens, rerun the ./cloudkit command with the same parameters.
- The cloudkit logs might list something like:
Error waiting for instance to create: error while retrieving operation: Get \"https://compute.googleapis.com/compute/v1/projects/<PROJECT_ID>/zones/us-east1-b/operations/operation-1696304597507-1111111-e49f6fc6-a5b7ecc4?alt=json&prettyPrint=false\": http2: client connection lost
- If the problem persists after rerunning the command, delete the existing cluster and create a new cluster.
- If an IBM Storage Scale related problem is suspected, collect data by running the gpfs.snap command and upload the output to the IBM Storage Scale support ticket that is opened.
- Plan your network infrastructure to ensure reliable communication between the installer node and the cloud. With jump host based connectivity, ssh can take a little longer to reach the node; if a network drop occurs, it is recommended to rerun the command.
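The rerun advice above can be wrapped in a small retry loop. This is an illustrative sketch; the attempt count, delay, and the cloudkit invocation shown in the comment are placeholders, not documented cloudkit options:

```shell
# Hedged sketch: re-run a command a fixed number of times on failure,
# pausing between attempts to let a transient network issue clear.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  until "$@"; do
    if [ "$i" -ge "$attempts" ]; then
      echo "failed after $i attempts" >&2
      return 1
    fi
    echo "attempt $i failed; retrying in ${delay}s" >&2
    sleep "$delay"
    i=$((i + 1))
  done
}

# Example (placeholder parameters):
#   retry 3 30 ./cloudkit create cluster
```

Because the failures in question are transient, rerunning with the same parameters is safe; if all attempts fail, fall back to deleting and recreating the cluster as described above.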
Delete cluster issue
- If any GCP resources have been provisioned into the VPC outside of cloudkit, the cloudkit delete command cannot delete these manually created resources. In this case, all resources created outside of cloudkit must be deleted manually before the cloudkit delete operation can complete.
When this issue occurs, the following message may be displayed:
I: Destroy all remote objects managed by terraform (instance_template) configuration.
I: Destroy all remote objects managed by terraform (bastion_template) configuration.
I: Destroy all remote objects managed by terraform (vpc_template) configuration.
E: Delete cluster received error: exit status 1
Fix: Manually clean up the cloud resources and contact IBM Support.
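To find resources that were created outside of cloudkit, you can list what is still attached to the cluster's VPC with the gcloud CLI. The network name my-cloudkit-vpc below is a placeholder for the actual VPC name:

```shell
# Instances still attached to the VPC (placeholder network name).
gcloud compute instances list \
  --filter="networkInterfaces.network:my-cloudkit-vpc"

# Firewall rules still attached to the same network.
gcloud compute firewall-rules list \
  --filter="network:my-cloudkit-vpc"
```

Delete any resources found this way, then rerun the cloudkit delete command.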
Edit cluster issue
- Quorum node issues
After running edit on any cluster, the following GPFS TIPS event might be displayed when you run the mmhealth node show command:
GPFS TIPS 1 hour ago callhome_not_enabled, quorum_too_little_nodes
This means that the edit operation does not assign any new quorum nodes.
Fix: You need to manually assign an additional quorum node from the newly added nodes.
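In IBM Storage Scale, a node's quorum designation can be changed with the mmchnode command. The node name below is a placeholder for one of the newly added nodes:

```shell
# Designate a newly added node as a quorum node (placeholder node name).
mmchnode --quorum -N scale-node-4

# Verify the updated quorum configuration and node health afterwards.
mmlscluster
mmhealth node show
```

Keeping an odd number of quorum nodes is the usual practice, so the quorum_too_little_nodes tip clears once enough nodes carry the quorum designation.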