Cloudkit issues on GCP
This topic describes the common issues and workarounds that you might encounter while running cloudkit on Google Cloud Platform (GCP).
Create cluster issues
- Quota issues
The IBM Storage Scale cluster creation execution may fail due to insufficient quotas in the GCP account.
For example, if the routers quota is exceeded, the following error message is logged:
Error waiting to create Router: Error waiting for Creating Router: Quota 'ROUTERS' exceeded. Limit: 20.0 globally.
Fix: Increase the required quota or free some of the required resources.
Note: You can check if all the required quotas are met by running the ./cloudkit validate quota command.
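The usage and limit for a given quota metric can also be inspected directly with the gcloud CLI. This is a general gcloud pattern, not a cloudkit command; the metric name ROUTERS is taken from the error message above:

```shell
# Show the project-wide ROUTERS quota (usage vs. limit).
gcloud compute project-info describe \
  --flatten="quotas[]" \
  --filter="quotas.metric=ROUTERS" \
  --format="table(quotas.metric,quotas.usage,quotas.limit)"
```

If the usage is at the limit, request a quota increase in the GCP console or delete unused routers before retrying the cluster creation.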
- Credential issues
- Check your GCP credentials. Ensure that the GCP credentials used to provision resources through cloudkit are valid and have the appropriate permissions. Verify this by checking the GCP credential JSON file used by cloudkit and confirming that it grants the necessary permissions.
- If valid credentials cannot be found, an error similar to the following is logged:
google: could not find default credentials. See https://cloud.google.com/docs/authentication/external/set-up-adc for more information
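A quick local sanity check can catch the "could not find default credentials" error before cloudkit runs. The following function is an illustrative sketch, not part of cloudkit; python3 is used only to confirm the key file is valid JSON:

```shell
# Hedged sketch: verify that Application Default Credentials are in place.
check_gcp_creds() {
  creds="${GOOGLE_APPLICATION_CREDENTIALS:-}"
  if [ -z "$creds" ]; then
    echo "GOOGLE_APPLICATION_CREDENTIALS is not set" >&2
    return 1
  fi
  if [ ! -r "$creds" ]; then
    echo "credential file $creds is not readable" >&2
    return 1
  fi
  # python3 is used here only to confirm the key file parses as JSON.
  python3 -m json.tool "$creds" > /dev/null || return 1
  echo "credential file looks valid"
}

# Usage (placeholder path):
#   export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json
#   check_gcp_creds
```

Note that a syntactically valid key file does not by itself prove the service account has the required IAM permissions; verify those in the GCP console.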
- Transient network errors
- Cluster creation failure due to transient network errors:
- In some circumstances, latent network issues may be encountered while deploying new cluster resources.
- If this happens, rerun the ./cloudkit command with the same parameters.
- The cloudkit logs might list something like:
Error waiting for instance to create: error while retrieving operation: Get \"https://compute.googleapis.com/compute/v1/projects/<PROJECT_ID>/zones/us-east1-b/operations/operation-1696304597507-1111111-e49f6fc6-a5b7ecc4?alt=json&prettyPrint=false\": http2: client connection lost
- If the problem persists after rerunning the command, delete the existing cluster and create a new cluster.
- If an IBM Storage Scale related problem is suspected, collect data by running the gpfs.snap command and upload the output to the IBM Storage Scale support ticket that is opened.
- Plan your network infrastructure to ensure reliable communication between the installer node and the cloud. With jump host based connectivity, ssh can take a little longer to reach the node; if a network drop occurs, it is recommended to rerun the command.
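The rerun advice above can be wrapped in a small retry loop. This is an illustrative sketch; the attempt count, delay, and the cloudkit invocation shown in the comment are placeholders, not documented cloudkit options:

```shell
# Hedged sketch: re-run a command a fixed number of times on failure,
# pausing between attempts to let a transient network issue clear.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  until "$@"; do
    if [ "$i" -ge "$attempts" ]; then
      echo "failed after $i attempts" >&2
      return 1
    fi
    echo "attempt $i failed; retrying in ${delay}s" >&2
    sleep "$delay"
    i=$((i + 1))
  done
}

# Example (placeholder parameters):
#   retry 3 30 ./cloudkit create cluster
```

Because the failures in question are transient, rerunning with the same parameters is safe; if all attempts fail, fall back to deleting and recreating the cluster as described above.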
Delete cluster issue
- If any GCP resources have been provisioned into the VPC outside of cloudkit, the cloudkit delete command cannot delete these manually created resources. In this case, all resources created outside of cloudkit must be deleted manually before the cloudkit delete operation can complete.
When this issue occurs, the following message may be displayed:
I: Destroy all remote objects managed by terraform (instance_template) configuration.
I: Destroy all remote objects managed by terraform (bastion_template) configuration.
I: Destroy all remote objects managed by terraform (vpc_template) configuration.
E: Delete cluster received error: exit status 1
Fix: Manually clean up the cloud resources and contact IBM Support.
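To find resources that were created outside of cloudkit, you can list what is still attached to the cluster's VPC with the gcloud CLI. The network name my-cloudkit-vpc below is a placeholder for the actual VPC name:

```shell
# Instances still attached to the VPC (placeholder network name).
gcloud compute instances list \
  --filter="networkInterfaces.network:my-cloudkit-vpc"

# Firewall rules still attached to the same network.
gcloud compute firewall-rules list \
  --filter="network:my-cloudkit-vpc"
```

Delete any resources found this way, then rerun the cloudkit delete command.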
Edit cluster issue
- Quorum node issues
After running edit on any cluster, the following GPFS TIPS event might be displayed when you run the mmhealth node show command:
GPFS TIPS 1 hour ago callhome_not_enabled, quorum_too_little_nodes
This means that the edit operation does not assign any new quorum nodes.
Fix: You need to manually assign an additional quorum node from the newly added nodes.
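In IBM Storage Scale, a node's quorum designation can be changed with the mmchnode command. The node name below is a placeholder for one of the newly added nodes:

```shell
# Designate a newly added node as a quorum node (placeholder node name).
mmchnode --quorum -N scale-node-4

# Verify the updated quorum configuration and node health afterwards.
mmlscluster
mmhealth node show
```

Keeping an odd number of quorum nodes is the usual practice, so the quorum_too_little_nodes tip clears once enough nodes carry the quorum designation.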