Common cloudkit issues

This topic describes common issues that you might encounter while running cloudkit on any of the supported clouds, along with their workarounds.

Create cluster issues

Cloudkit produces logs that detail any errors or issues that are encountered during the provisioning and configuration process. Review the logs to identify any specific errors or issues that may be causing problems.
Review the cloudkit inputs to ensure that all values are set correctly.
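
As a rough aid for the log review, the following sketch searches a log file for the error strings that are quoted in this topic. The function name scan_cloudkit_log is hypothetical, and the log file path is an argument because this topic does not state where cloudkit writes its logs.

```shell
# Hypothetical helper: print log lines that match error strings quoted in this topic.
# Usage: scan_cloudkit_log <path-to-log-file>
scan_cloudkit_log() {
  log_file=$1
  if [ ! -r "$log_file" ]; then
    echo "Cannot read log file: $log_file" >&2
    return 2
  fi
  # -n prints line numbers; -F treats each pattern as a fixed string, not a regex.
  grep -n -F \
    -e "Connection timed out during banner exchange" \
    -e "No such file or directory: 'ansible-playbook" \
    -e "Failed to load plugin schemas" \
    -e "Could not load the schema for provider" \
    "$log_file"
}
```

The function returns the grep exit status, so a nonzero result with no output means that none of the listed strings were found. Other errors still need a manual read of the log.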

Transient network errors

Cluster creation failure due to transient network errors:
1. In some circumstances, latent network issues might be encountered while deploying new cluster resources.
2. If this is encountered, rerun the ./cloudkit command with the same parameters.
3. The cloudkit logs might list something like:
```
"Failed to connect to the host via ssh: Connection timed out during banner exchange"
```
If an IBM Storage Scale related problem is suspected, collect data by running the gpfs.snap command. Upload the output of the gpfs.snap command to the IBM Storage Scale support ticket that is opened.
Plan your network infrastructure to ensure reliable communication between the installer node and the cloud. With jump host-based connectivity, SSH might take a little longer to reach the node. If a network drop occurs, rerun the command.
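
The rerun advice can be automated with a small wrapper. The following is a minimal sketch, not part of cloudkit: the function name retry_same_params, the attempt count, and the delay are assumptions chosen for illustration.

```shell
# Hypothetical helper: rerun a command unchanged until it succeeds or attempts run out.
# Usage: retry_same_params <max-attempts> <delay-seconds> <command> [args...]
retry_same_params() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    # "$@" reissues exactly the same command line on every attempt.
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "Attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}
```

For example, calling retry_same_params 3 60 followed by the ./cloudkit command line that failed would try that command up to three times, one minute apart. The wrapper retries on any nonzero exit status, so check the logs before you rely on it for errors that are not transient.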

Ansible-playbook error

Cluster creation fails with the error:

FileNotFoundError: [Errno 2] No such file or directory: 'ansible-playbook'

Fix: If this error occurs when you run the cloudkit command on Red Hat® Enterprise Linux (RHEL) 9.x, apply the workaround for the ansible-core package installation.
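
This error generally means that the ansible-playbook executable could not be found when it was launched. The following sketch is a hypothetical pre-flight check (the function name require_command is an assumption) that confirms whether a command can be resolved through PATH before you rerun cloudkit.

```shell
# Hypothetical pre-flight check: confirm that a command can be resolved through PATH.
# Usage: require_command <command-name>
require_command() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1 found at: $(command -v "$1")"
  else
    echo "$1 not found in PATH" >&2
    return 1
  fi
}
```

Running require_command ansible-playbook after you apply the workaround shows whether the executable is now visible to the shell that starts cloudkit.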

Plug-in load errors

Cluster creation fails while the terraform plug-in is loading and the following error message appears:

Error: Failed to load plugin schemas
Error: while loading schemas for plugin components: Failed to obtain schema: Could not load the schema for provider

Fix: Rerun the ./cloudkit command with the same parameters.

Grant file system issue

File system issues

The file system gets unmounted after recovering from node failures. With a remote mount, the file system can become unmounted after recovery from certain node failures.
Fix: Perform a mount on all the nodes by using the mmmount all -a command.
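
The fix can be paired with a verification step. The following sketch assumes that the IBM Storage Scale commands are on PATH (typically /usr/lpp/mmfs/bin) and that it is run with root authority. The function name remount_and_verify is hypothetical.

```shell
# Sketch: remount every file system on all nodes, then list where each is mounted.
# Assumes the IBM Storage Scale commands are on PATH and the caller has root authority.
remount_and_verify() {
  # mmmount all -a: mount all file systems on all nodes (the fix from this topic).
  mmmount all -a || return 1
  # mmlsmount all -L: list the nodes that have each file system mounted.
  mmlsmount all -L
}
```

Compare the mmlsmount output against the expected node list to confirm that no node was missed.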

Credential issues

Check for errors in cloudkit logs. Cloudkit produces logs that detail any errors or issues that are encountered during the provisioning and configuration process. Review the logs to identify any specific errors or issues that may be causing problems.
If the logs indicate a failure in accessing the IBM Storage Scale management GUI, ensure that the credentials provided for accessing the GUI are correct. Correct the credentials if needed, and then rerun the command.

Transient network errors

Cluster creation failure due to transient network errors:

In some circumstances, latent network issues might be encountered while deploying new cluster resources.
If this is encountered, rerun the ./cloudkit command with the same parameters.

The cloudkit logs might list something like:

"Failed to connect to the host via ssh: Connection timed out during banner exchange"

Revoke file system issue

Credential issues

Check for errors in cloudkit logs. Cloudkit produces logs that detail any errors or issues encountered during the provisioning and configuration process. Review the logs to identify any specific errors or issues that may be causing problems.
If the logs indicate a failure in accessing the IBM Storage Scale management GUI, ensure that the credentials provided for accessing the GUI are correct. Correct the credentials if needed, and then rerun the command.

Transient network errors

Cluster creation failure due to transient network errors:

In some circumstances it might be possible to encounter latent network issues when you are deploying new cluster resources.
If this is encountered, rerun the ./cloudkit command with the same parameters.

The cloudkit logs might list something like:

"Failed to connect to the host via ssh: Connection timed out during banner exchange"

Cluster upgrade issues

Expanded new nodes are not associated with a node class

Following an upgrade, in a few scenarios after an expansion of nodes, the new nodes might not have a node class associated with them.

Fix: Check if the new nodes are listed in the respective node class. If the new nodes are missing, use the mmchnodeclass command to manually add the new nodes to the node class. For more information, see mmchnodeclass command.
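
The check and the fix can be sketched as one function. The function name add_nodes_to_class is hypothetical, and the node class and node names are arguments because they depend on your cluster. The sketch assumes that the IBM Storage Scale commands are on PATH.

```shell
# Sketch: show a node class, add the missing nodes to it, then show it again.
# Usage: add_nodes_to_class <node-class> <node1[,node2,...]>
add_nodes_to_class() {
  node_class=$1
  nodes=$2
  # List the current members first so that the change can be compared.
  mmlsnodeclass "$node_class" || return 1
  # Add the comma-separated node list to the node class.
  mmchnodeclass "$node_class" add -N "$nodes" || return 1
  # List the members again to confirm that the new nodes are present.
  mmlsnodeclass "$node_class"
}
```

Run the function only for nodes that the first mmlsnodeclass listing shows as missing.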

GitHub provider issue

The following output shows a known issue that might occur when the update flow starts:

2024-07-11T01:51:25.758-0400: Error while loading schemas for plugin components: Failed to obtain provider schema: Could not load the schema for provider registry.opentofu.org/integrations/github: failed to instantiate provider registry.opentofu.org/integrations/github to obtain schema: unavailable
2024-07-11T01:51:25.759-0400: Command returned non-zero exit code or error.
2024-07-11T01:51:25.759-0400: Error occurred while applying IaC. Details: exit status 1
2024-07-11T01:51:25.760-0400: Error occurred while executing opentofu. exit status 1
2024-07-11T01:51:25.760-0400: Upgrade of IBM Storage Scale cluster 'ibm-storage-scale' received error: exit status 1

To solve the issue, follow these steps.

At /usr/lpp/mmfs/5.2.1.0/cloudkit/dependencies/ibm-spectrum-scale-cloud-install/<cloud>_scale_templates/sub_modules/instance_template, create a file named "backend.tf" and add the following content in it:
```
terraform {
  backend "local" {
    path = "/root/scale-cloudkit/workarea/cluster/<cloud>/<cluster-name>/instance_template/terraform.tfstate"
  }
}
```
Remember: Replace <cluster-name> and <cloud> as applicable.

Copy the inputs.auto.tfvars.json and terraform.tfstate files.

cp /root/scale-cloudkit/workarea/cluster/<cloud>/<cluster-name>/instance_template/terraform.tfstate /usr/lpp/mmfs/5.2.1.0/cloudkit/dependencies/ibm-spectrum-scale-cloud-install/<cloud>_scale_templates/sub_modules/instance_template/
cp /root/scale-cloudkit/workarea/cluster/<cloud>/<cluster-name>/instance_template/inputs.auto.tfvars.json /usr/lpp/mmfs/5.2.1.0/cloudkit/dependencies/ibm-spectrum-scale-cloud-install/<cloud>_scale_templates/sub_modules/instance_template/

Remember: Replace <cluster-name> and <cloud> as applicable.

Run the tofu init command as shown in the next example.

/usr/lpp/mmfs/5.2.1.0/cloudkit/dependencies/tofu -chdir=/usr/lpp/mmfs/5.2.1.0/cloudkit/dependencies/ibm-spectrum-scale-cloud-install/<cloud>_scale_templates/sub_modules/instance_template/ init

Remember: Replace <cloud> as applicable.

After these steps are done, rerun the upgrade job.
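
The manual steps above can be collected into one function. This is a sketch under stated assumptions, not a supported tool: the function name prepare_instance_template and the variables DEPS_DIR and WORKAREA are hypothetical, and their defaults mirror the paths that are shown in this topic.

```shell
# Sketch of the workaround steps as one function. Defaults mirror the paths in this topic.
# Usage: prepare_instance_template <cloud> <cluster-name>
DEPS_DIR=${DEPS_DIR:-/usr/lpp/mmfs/5.2.1.0/cloudkit/dependencies}
WORKAREA=${WORKAREA:-/root/scale-cloudkit/workarea/cluster}

prepare_instance_template() {
  cloud=$1
  cluster=$2
  template_dir="$DEPS_DIR/ibm-spectrum-scale-cloud-install/${cloud}_scale_templates/sub_modules/instance_template"
  state_dir="$WORKAREA/$cloud/$cluster/instance_template"

  # Step 1: create backend.tf so that the local backend points at the cluster state file.
  cat > "$template_dir/backend.tf" <<EOF || return 1
terraform {
  backend "local" {
    path = "$state_dir/terraform.tfstate"
  }
}
EOF

  # Step 2: copy the state file and the input file next to the templates.
  cp "$state_dir/terraform.tfstate" "$state_dir/inputs.auto.tfvars.json" "$template_dir/" || return 1

  # Step 3: initialize the template directory with the bundled tofu binary.
  "$DEPS_DIR/tofu" -chdir="$template_dir" init
}
```

Call the function with your cloud and cluster name as the two arguments, and then rerun the upgrade job. If your installation uses a different release directory or work area, set DEPS_DIR and WORKAREA before you call the function.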