Common cloudkit issues
This topic describes common issues that you might encounter while running cloudkit on any of the supported clouds, together with their workarounds.
Create cluster issues
- Cloudkit produces logs that detail any errors or issues that are encountered during the provisioning and configuration process. Review the logs to identify any specific errors or issues that may be causing problems.
- Review the cloudkit inputs to ensure that all values are set correctly.
- Transient network errors
  Cluster creation failure due to transient network errors:
  - Latent network issues might be encountered while new cluster resources are being deployed.
  - If this occurs, rerun the ./cloudkit command with the same parameters.
  - The cloudkit logs might list something like:
    "Failed to connect to the host via ssh: Connection timed out during banner exchange"
  - If an IBM Storage Scale related problem is suspected, collect data by running the gpfs.snap command. Upload the output of the gpfs.snap command to the IBM Storage Scale support ticket that is opened.
  - Plan your network infrastructure to ensure reliable communication between the installer node and the cloud. With jump host-based connectivity, ssh might take a little longer to reach the node. If there is a network drop, rerun the command.
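The rerun advice for transient network errors can be wrapped in a small retry helper. The following is a minimal sketch and is not part of cloudkit: the retry function, the attempt count, and the delay are assumptions chosen for illustration, and the placeholder in the usage comment stands for whatever parameters the failed run used.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Reruns COMMAND with identical arguments until it succeeds or the
# attempt budget is used up. Returns the exit status of the last attempt.
# (Hypothetical helper; not shipped with cloudkit.)
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        "$@" && return 0
        status=$?
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempt(s)" >&2
            return "$status"
        fi
        echo "attempt $attempt failed (exit $status); retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Usage (substitute the exact invocation that failed):
#   retry 3 60 ./cloudkit <same parameters as the failed run>
```

A fixed delay is used here for simplicity. If the failure persists across all attempts, it is probably not transient and the logs should be reviewed.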
- Ansible-playbook error
  Cluster creation fails with the error:
  FileNotFoundError: [Errno 2] No such file or directory: 'ansible-playbook'
  Fix: If the cloudkit command reports this error on Red Hat® Enterprise Linux (RHEL) 9.x, apply the workaround for the ansible-core package installation.
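This FileNotFoundError generally means that the ansible-playbook executable could not be found on the PATH of the process that tried to start it. After the ansible-core workaround is applied, a quick check such as the following sketch can confirm that the executable is now visible. The require_tool helper is a hypothetical name used only for this illustration.

```shell
# require_tool NAME
# Succeeds and prints the resolved location if NAME is found on PATH;
# otherwise prints a message to stderr and returns 1.
# (Hypothetical helper; not shipped with cloudkit.)
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at: $(command -v "$1")"
    else
        echo "$1 not found on PATH" >&2
        return 1
    fi
}

# Usage on the installer node:
#   require_tool ansible-playbook
```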
- Plug-in load errors
  Cluster creation fails while the Terraform plug-in is loading and the following error message appears:
  Error: Failed to load plugin schemas Error: while loading schemas for plugin components: Failed to obtain schema: Could not load the schema for provider
  Fix: Rerun the ./cloudkit command with the same parameters.
Grant filesystem issue
- File system issues
  - The file system gets unmounted after recovering from node failures. When performing a remote mount, after recovering from certain node failures, the file system can get unmounted.
    Fix: Perform a mount on all the nodes by using the mmmount all -a command.
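The mount fix can be paired with a verification step. The sketch below runs the mmmount all -a command from the fix and then lists where each file system is mounted by using mmlsmount all -L, so that any node still missing a mount stands out. The remount_all function name is hypothetical, and the sketch assumes that the IBM Storage Scale commands are on the PATH of the user running it.

```shell
# remount_all
# Mounts all file systems on all nodes, then lists the nodes on which
# each file system is mounted.
# (Hypothetical helper; assumes mmmount and mmlsmount are on PATH.)
remount_all() {
    mmmount all -a || return 1
    mmlsmount all -L
}

# Usage:
#   remount_all
```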
- Credential issues
  - Check for errors in cloudkit logs. Cloudkit produces logs that detail any errors or issues that are encountered during the provisioning and configuration process. Review the logs to identify any specific errors or issues that may be causing problems.
  - If the logs indicate a failure in accessing the IBM Storage Scale management GUI, ensure that the credentials provided for accessing the GUI are correct and rerun if necessary after rectifying the credentials.
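Reviewing the logs can start with a simple search for the error signatures quoted in this topic. The following is a rough sketch only: scan_logs is a hypothetical helper, the pattern list is illustrative rather than exhaustive, and the log directory is passed as an argument because its location depends on your installer node setup.

```shell
# scan_logs LOG_DIR
# Prints file name, line number, and text of every line under LOG_DIR
# that matches one of a few error signatures. Returns non-zero when
# nothing matches.
# (Hypothetical helper; the patterns are examples, not a complete list.)
scan_logs() {
    log_dir=$1
    grep -rnE 'Error|Failed to connect|Connection timed out' "$log_dir"
}

# Usage:
#   scan_logs <directory that contains the cloudkit logs>
```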
- Transient network errors
  Cluster creation failure due to transient network errors:
  - Latent network issues might be encountered while new cluster resources are being deployed.
  - If this occurs, rerun the ./cloudkit command with the same parameters.
  - The cloudkit logs might list something like:
    "Failed to connect to the host via ssh: Connection timed out during banner exchange"
Revoke file system issue
- Credential issues
  - Check for errors in cloudkit logs. Cloudkit produces logs that detail any errors or issues encountered during the provisioning and configuration process. Review the logs to identify any specific errors or issues that may be causing problems.
  - If the logs indicate a failure in accessing the IBM Storage Scale management GUI, ensure that the credentials provided for accessing the GUI are correct and rerun if necessary after rectifying the credentials.
- Transient network errors
  Cluster creation failure due to transient network errors:
  - Latent network issues might be encountered while new cluster resources are being deployed.
  - If this occurs, rerun the ./cloudkit command with the same parameters.
  - The cloudkit logs might list something like:
    "Failed to connect to the host via ssh: Connection timed out during banner exchange"
Cluster upgrade issues
- Expanded new nodes are not associated with a node class
  Following an upgrade, in a few scenarios after an expansion of nodes, the new nodes might not have a node class associated with them.
  Fix: Check if the new nodes are listed in the respective node class. If the new nodes are missing, use the mmchnodeclass command to manually add the new nodes to the node class. For more information, see mmchnodeclass command.
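The node class fix can be sketched as a short sequence: list the current members, add the missing nodes, and list the members again to confirm. The add_to_nodeclass function and the names in the usage comment are hypothetical. The sketch assumes that the IBM Storage Scale commands are on the PATH and that you have already confirmed the nodes are missing from the class.

```shell
# add_to_nodeclass CLASS NODE[,NODE...]
# Shows the current members of CLASS, adds the given nodes, then shows
# the membership again for confirmation.
# (Hypothetical helper; assumes mmlsnodeclass and mmchnodeclass are on PATH.)
add_to_nodeclass() {
    class=$1
    nodes=$2
    mmlsnodeclass "$class" || return 1
    mmchnodeclass "$class" add -N "$nodes" || return 1
    mmlsnodeclass "$class"
}

# Usage (placeholder names):
#   add_to_nodeclass <node-class> <new-node-1>,<new-node-2>
```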
- GitHub provider issue
  The following output shows a known issue that might occur when the update flow initiates:
  2024-07-11T01:51:25.758-0400: Error while loading schemas for plugin components: Failed to obtain provider schema: Could not load the schema for provider registry.opentofu.org/integrations/github: failed to instantiate provider registry.opentofu.org/integrations/github to obtain schema: unavailable
  2024-07-11T01:51:25.759-0400: Command returned non-zero exit code or error.
  2024-07-11T01:51:25.759-0400: Error occurred while applying IaC. Details: exit status 1
  2024-07-11T01:51:25.760-0400: Error occurred while executing opentofu. exit status 1
  2024-07-11T01:51:25.760-0400: Upgrade of IBM Storage Scale cluster 'ibm-storage-scale' received error: exit status 1
  To solve the issue, follow these steps.
  - At /usr/lpp/mmfs/5.2.1.0/cloudkit/dependencies/ibm-spectrum-scale-cloud-install/<cloud>_scale_templates/sub_modules/instance_template, create a file named "backend.tf" and add the following content in it:
    terraform {
      backend "local" {
        path = "/root/scale-cloudkit/workarea/cluster/<cloud>/<cluster-name>/instance_template/terraform.tfstate"
      }
    }
    Remember: Replace <cluster-name> and <cloud> as applicable.
  - Copy the inputs.auto.tfvars.json and terraform.tfstate files.
    cp /root/scale-cloudkit/workarea/cluster/<cloud>/<cluster-name>/instance_template/terraform.tfstate /usr/lpp/mmfs/5.2.1.0/cloudkit/dependencies/ibm-spectrum-scale-cloud-install/<cloud>_scale_templates/sub_modules/instance_template/
    cp /root/scale-cloudkit/workarea/cluster/<cloud>/<cluster-name>/instance_template/inputs.auto.tfvars.json /usr/lpp/mmfs/5.2.1.0/cloudkit/dependencies/ibm-spectrum-scale-cloud-install/<cloud>_scale_templates/sub_modules/instance_template/
    Remember: Replace <cluster-name> and <cloud> as applicable.
  - Run the tofu init command as shown in the next example.
    /usr/lpp/mmfs/5.2.1.0/cloudkit/dependencies/tofu -chdir=/usr/lpp/mmfs/5.2.1.0/cloudkit/dependencies/ibm-spectrum-scale-cloud-install/<cloud>_scale_templates/sub_modules/instance_template/ init
    Remember: Replace <cloud> as applicable.
  After these steps are done, rerun the upgrade job.
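The three workaround steps for the GitHub provider issue can be collected into one script so that the cloud and cluster name are substituted consistently. The following is a sketch under stated assumptions: the function names are hypothetical, the paths are the ones given in the steps for the 5.2.1.0 package, and the script is meant to be run as a user that can write to those directories.

```shell
# write_backend_tf TEMPLATE_DIR STATE_FILE
# Creates TEMPLATE_DIR/backend.tf with a local backend that points
# at STATE_FILE. (Hypothetical helper name.)
write_backend_tf() {
    template_dir=$1
    state_file=$2
    cat > "$template_dir/backend.tf" <<EOF
terraform {
  backend "local" {
    path = "$state_file"
  }
}
EOF
}

# prepare_instance_template CLOUD CLUSTER_NAME
# Writes backend.tf, copies the state and inputs files, and runs
# tofu init, using the paths from the documented steps.
prepare_instance_template() {
    cloud=$1
    cluster=$2
    deps=/usr/lpp/mmfs/5.2.1.0/cloudkit/dependencies
    template_dir=$deps/ibm-spectrum-scale-cloud-install/${cloud}_scale_templates/sub_modules/instance_template
    work_dir=/root/scale-cloudkit/workarea/cluster/$cloud/$cluster/instance_template

    write_backend_tf "$template_dir" "$work_dir/terraform.tfstate" || return 1
    cp "$work_dir/terraform.tfstate" "$work_dir/inputs.auto.tfvars.json" "$template_dir/" || return 1
    "$deps/tofu" -chdir="$template_dir/" init
}

# Usage (replace the placeholders as applicable):
#   prepare_instance_template <cloud> <cluster-name>
```

Keeping the backend.tf generation in its own function makes it possible to inspect the generated file in a scratch directory before touching the installed templates.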