Cloudkit issues on GCP

This topic describes common issues that you might encounter while running cloudkit on Google Cloud Platform (GCP), and their workarounds.

Create cluster issues

Cloudkit produces logs that detail any errors or issues encountered during the provisioning and configuration process. Review the logs to identify any specific errors or issues that may be causing problems.
Review the cloudkit inputs to ensure that all values are set correctly.

Quota issues

IBM Storage Scale cluster creation might fail because of insufficient quotas in the GCP account.

For example, if the routers quota is exceeded, the following error message is logged:

Error waiting to create Router: Error waiting for Creating Router: Quota 'ROUTERS' exceeded.  Limit: 20.0 globally.

Fix: Increase the required quota or free resources that count against it.

Note: You can check if all the required quotas are met by running the ./cloudkit validate quota command.
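
When a quota error appears in the logs, it can help to extract the quota name and limit so that you know which quota to raise. The following is a minimal sketch, assuming the message has the same shape as the sample above; the function name `parse_quota_error` is hypothetical and other quota messages might be worded differently.

```python
import re

# Pattern modeled on the sample message in this topic (assumption: other
# quota errors follow the same "Quota '<NAME>' exceeded. Limit: <N> <scope>." shape).
QUOTA_RE = re.compile(
    r"Quota '(?P<name>[A-Z0-9_]+)' exceeded\.\s+"
    r"Limit: (?P<limit>[0-9]+(?:\.[0-9]+)?)(?:\s+(?P<scope>[^.]+))?"
)


def parse_quota_error(line):
    """Return quota name, limit, and scope from a log line, or None if no match."""
    match = QUOTA_RE.search(line)
    if match is None:
        return None
    scope = match.group("scope")
    return {
        "quota": match.group("name"),
        "limit": float(match.group("limit")),
        "scope": scope.strip() if scope else None,
    }
```

The returned quota name (for example, ROUTERS) is the value to look for when you review or request quota increases in the GCP account.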

Credential issues

Ensure that the GCP credentials used to provision resources through cloudkit are valid and have the appropriate permissions. To verify this, check the GCP credential JSON file that cloudkit uses and confirm that the account it identifies has the necessary permissions. If the credentials cannot be found, an error similar to the following is logged:
```
google: could not find default credentials. See https://cloud.google.com/docs/authentication/external/set-up-adc for more information
```
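
Before rerunning cloudkit, a quick structural check of the credential JSON file can rule out a missing or truncated file. The following is a minimal sketch, assuming the file is a GCP service account key; the function name `check_credential_file` is hypothetical, and the check covers only the shape of the file, not whether the account has the required permissions.

```python
import json
import os

# Fields normally present in a GCP service account key file.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_credential_file(path):
    """Return a list of problems found in the credential JSON file at path."""
    if not path:
        return ["no credential file path given"]
    if not os.path.isfile(path):
        return [f"file not found: {path}"]
    try:
        with open(path, encoding="utf-8") as handle:
            data = json.load(handle)
    except (OSError, json.JSONDecodeError) as exc:
        return [f"cannot read JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    # Report every required field that is absent or empty.
    return [f"missing field: {name}" for name in REQUIRED_FIELDS if not data.get(name)]
```

An empty list means the file looks structurally complete. If you rely on Application Default Credentials, the path is usually the value of the GOOGLE_APPLICATION_CREDENTIALS environment variable; how cloudkit itself locates the file is not covered here.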

Transient network errors

Cluster creation might fail due to transient network errors:
1. In some circumstances, latent network issues might be encountered while deploying new cluster resources.
2. If this occurs, rerun the ./cloudkit command with the same parameters.
3. The cloudkit logs might contain an error similar to the following:
```
Error waiting for instance to create: error while retrieving operation: Get \"https://compute.googleapis.com/compute/v1/projects/<PROJECT_ID>/zones/us-east1-b/operations/operation-1696304597507-1111111-e49f6fc6-a5b7ecc4?alt=json&prettyPrint=false\": http2: client connection lost
```
4. If the problem persists after you rerun the command, delete the existing cluster and create a new cluster.
If you suspect an IBM Storage Scale related problem, collect data by running gpfs.snap and upload the output to the IBM Storage Scale support ticket that you open.
Plan your network infrastructure to ensure reliable communication between the installer node and the cloud. With jump host based connectivity, ssh might take a little longer to reach the node. If a network drop occurs, rerun the command.
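
The rerun advice above can be automated with a small wrapper that repeats the same command a limited number of times. The following is a minimal sketch, assuming that the command exits with a nonzero status on failure; the function name `rerun_until_success` and its parameters are hypothetical, and the caller supplies the exact command and arguments that were used originally.

```python
import subprocess
import time


def rerun_until_success(command, attempts=3, delay_seconds=30.0):
    """Run command (a list of arguments) until it exits 0 or attempts run out.

    Returns the exit code of the last run.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    returncode = 1
    for attempt in range(1, attempts + 1):
        # Same arguments on every attempt, as recommended for transient errors.
        returncode = subprocess.run(command).returncode
        if returncode == 0:
            break
        if attempt < attempts:
            time.sleep(delay_seconds)  # give the network time to recover
    return returncode
```

Keep the number of attempts small. If the command still fails after a few reruns, follow the remaining steps above instead of retrying indefinitely.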

Plug-in load errors

Cluster creation fails while the terraform plug-in is loading and the following error message appears:

Error: Failed to load plugin schemas
Error: while loading schemas for plugin components: Failed to obtain schema: Could not load the schema for provider

Fix: Rerun the ./cloudkit command with the same parameters.
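
The create cluster errors described in this section can be recognized by short signature strings in the log text. The following is a minimal sketch that maps those signatures to the fixes given in this topic; the table `KNOWN_ISSUES` and the function `suggest_fixes` are hypothetical names, the signatures are taken from the sample messages quoted here, and real log lines might be worded differently. The caller supplies the log text.

```python
# (signature found in the log, short issue label, fix described in this topic)
KNOWN_ISSUES = [
    ("Quota '", "quota",
     "Increase the required quota or free resources; "
     "./cloudkit validate quota checks the required quotas."),
    ("could not find default credentials", "credentials",
     "Check the GCP credential JSON file and its permissions."),
    ("client connection lost", "transient network",
     "Rerun the ./cloudkit command with the same parameters."),
    ("Failed to load plugin schemas", "plug-in load",
     "Rerun the ./cloudkit command with the same parameters."),
]


def suggest_fixes(log_text):
    """Return (issue, fix) pairs for every known signature present in log_text."""
    return [
        (issue, fix)
        for signature, issue, fix in KNOWN_ISSUES
        if signature in log_text
    ]
```

An empty result only means that none of these signatures was found; review the full logs in that case.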

Delete cluster issue

The cloudkit delete cluster command might encounter failures for the following reasons:

If any GCP resources were provisioned into the VPC outside of cloudkit, the cloudkit delete command cannot delete them. In this case, manually delete all resources that were created outside of cloudkit before you run the cloudkit delete operation.

When this issue occurs, the following message may be displayed:

I: Destroy all remote objects managed by terraform (instance_template) configuration.
I: Destroy all remote objects managed by terraform (bastion_template) configuration.
I: Destroy all remote objects managed by terraform (vpc_template) configuration.
E: Delete cluster received error: exit status 1

Fix: Manually clean the cloud resources and contact IBM Support.

Grant file system issue

Credential issues

Check for errors in cloudkit logs. Cloudkit produces logs that detail any errors or issues encountered during the provisioning and configuration process. Review the logs to identify any specific errors or issues that may be causing problems.
If the logs indicate a failure in accessing the IBM Storage Scale management GUI, ensure that the credentials provided for accessing the GUI are correct. Correct them if necessary, and then rerun the command.

Transient network errors

The grant operation might fail due to transient network errors:

In some circumstances, latent network issues might be encountered while cloudkit connects to the cluster nodes.
If this occurs, rerun the ./cloudkit command with the same parameters.

The cloudkit logs might contain an error similar to the following:

"Failed to connect to the host via ssh: Connection timed out during banner exchange"
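
A timeout during banner exchange means that the ssh server did not send its identification line in time. Before rerunning, you can check whether a host answers with that line. The following is a minimal sketch, assuming the host is directly reachable on the ssh port from where the script runs (this does not hold as written for jump host based connectivity); the function name `read_ssh_banner` is hypothetical.

```python
import socket


def read_ssh_banner(host, port=22, timeout_seconds=10.0):
    """Return the first line the server sends, or None on timeout or failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds) as conn:
            conn.settimeout(timeout_seconds)
            data = conn.recv(256)  # an ssh server normally sends "SSH-2.0-..." first
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return None
    if not data:
        return None
    return data.split(b"\n", 1)[0].decode("ascii", errors="replace").strip()
```

A result that starts with `SSH-` suggests that the node is answering again and a rerun is worth trying; `None` suggests that the network problem is still present.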

Revoke file system issue

Credential issues

Check for errors in cloudkit logs. Cloudkit produces logs that detail any errors or issues encountered during the provisioning and configuration process. Review the logs to identify any specific errors or issues that may be causing problems.
If the logs indicate a failure in accessing the IBM Storage Scale management GUI, ensure that the credentials provided for accessing the GUI are correct. Correct them if necessary, and then rerun the command.

Transient network errors

The revoke operation might fail due to transient network errors:

In some circumstances, latent network issues might be encountered while cloudkit connects to the cluster nodes.
If this occurs, rerun the ./cloudkit command with the same parameters.

The cloudkit logs might contain an error similar to the following:

"Failed to connect to the host via ssh: Connection timed out during banner exchange"

Edit cluster issue

Quorum node issues

After you run an edit operation on a cluster, the following GPFS TIPS status is displayed when you run the mmhealth node show command:

GPFS TIPS 1 hour ago callhome_not_enabled, quorum_too_little_nodes

This means that the edit operation did not assign any new quorum nodes.

Fix: Manually assign additional quorum nodes from the newly added nodes.
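
To detect this condition in a script, the mmhealth node show output can be scanned for the quorum tip. The following is a minimal sketch, assuming the output lines have the same layout as the sample above (component name, then the TIPS status, then event names); the function names are hypothetical and the real output might contain additional columns.

```python
import re

# Event names in the sample are lowercase words joined by underscores.
EVENT_NAME = re.compile(r"\b[a-z0-9]+(?:_[a-z0-9]+)+\b")


def tips_by_component(mmhealth_output):
    """Map each component whose status is TIPS to the event names on its line."""
    tips = {}
    for line in mmhealth_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == "TIPS":
            tips[fields[0]] = EVENT_NAME.findall(line)
    return tips


def needs_quorum_node(mmhealth_output):
    """Return True if any component reports the quorum_too_little_nodes tip."""
    return any(
        "quorum_too_little_nodes" in events
        for events in tips_by_component(mmhealth_output).values()
    )
```

If `needs_quorum_node` returns True, assign the quorum role to one or more of the newly added nodes. In IBM Storage Scale this is typically done with the mmchnode command and its --quorum option; check the command reference for your release before running it.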