Cloudkit issues on AWS

This topic describes common issues that you might encounter while running cloudkit on Amazon Web Services (AWS), together with their workarounds.

Create cluster issues

Quota issues

IBM Storage Scale cluster creation might fail because of insufficient quotas in the AWS account. For example, when the Elastic IP address quota is exhausted, the following error is displayed:

creating EC2 EIP: AddressLimitExceeded: The maximum number of addresses has been reached.
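
Before rerunning, it can help to see how many Elastic IP addresses are already allocated in the region. The following is a minimal sketch, not part of cloudkit: the function name `eip_headroom` is hypothetical, the `ec2_client` argument is assumed to be a boto3 EC2 client, and the quota value is assumed to be supplied by you (the sketch does not look it up).

```python
def eip_headroom(ec2_client, limit):
    """Return how many more Elastic IPs can be allocated in the client's region.

    `limit` is the account's Elastic IP quota for that region. This sketch
    does not query the quota; check it in the AWS Service Quotas console.
    """
    # describe_addresses returns every Elastic IP allocated in the region.
    allocated = len(ec2_client.describe_addresses()["Addresses"])
    return limit - allocated


# Typical use (assumes boto3 is installed and credentials are configured):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   print(eip_headroom(ec2, limit=5))
```

A result of zero or less suggests that the quota must be raised, or unused addresses released, before cluster creation can allocate another address.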

iamInstanceProfile issue

Cluster creation fails with the error:

Failed: Value (ibm-storage-scale-20230510042321111100000001) for parameter iamInstanceProfile.name is invalid. Invalid IAM Instance Profile name. Launching EC2 instance failed.

Fix: Rerun the cloudkit cluster creation with the same parameters.

Credential issues
  • Check your AWS credentials. Ensure that the AWS credentials used to provision resources through cloudkit are valid and have the appropriate permissions. To verify this, check the AWS access key and secret key that cloudkit uses, and confirm that they have the necessary IAM permissions. If the credentials cannot be validated, the following error is displayed:
    E: Not able to validate the provided access credentials! Use credentials that has valid permissions.
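
A quick way to confirm that an access key and secret key are at least valid is to ask AWS who they belong to. The following is a sketch under stated assumptions: `check_credentials` is a hypothetical helper, and `sts_client` is assumed to be a boto3 STS client built from the same keys that cloudkit uses. A successful call shows only that the keys are accepted; it does not prove that the IAM permissions needed for provisioning are present.

```python
def check_credentials(sts_client):
    """Return the caller's ARN if the credentials are accepted, else None."""
    try:
        identity = sts_client.get_caller_identity()
    except Exception as exc:  # botocore raises its own error types here
        print(f"Credential check failed: {exc}")
        return None
    return identity["Arn"]


# Typical use (assumes boto3 is installed):
#   import boto3
#   sts = boto3.client("sts")
#   print(check_credentials(sts))
```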

Host unreachable

Cluster creation fails with the error:

Failed to connect to the host via ssh: kex_exchange_identification: Connection closed by remote host\", \"unreachable

Fix: Rerun the cloudkit cluster creation with the same parameters.
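
Both this error and the iamInstanceProfile error are resolved by a plain rerun, so a wrapper can automate the second attempt. The following is a minimal sketch, assuming that the error text appears in the output of the command being wrapped; that is an assumption, because cloudkit may write it only to its log files. The name `run_with_rerun` and the `runner` parameter are hypothetical, and `command` is whatever invocation you originally used.

```python
import subprocess

# Error fragments that this topic lists as fixed by a rerun.
TRANSIENT_MARKERS = (
    "Invalid IAM Instance Profile name",
    "kex_exchange_identification: Connection closed by remote host",
)


def run_with_rerun(command, max_attempts=2, runner=subprocess.run):
    """Run `command`, rerunning only when a known transient error is seen."""
    result = None
    for attempt in range(1, max_attempts + 1):
        result = runner(command, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        output = result.stdout + result.stderr
        if not any(marker in output for marker in TRANSIENT_MARKERS):
            return result  # unknown failure: leave it for manual diagnosis
        print(f"Attempt {attempt} failed with a known transient error")
    return result
```

Limiting the rerun to known markers keeps the wrapper from masking failures, such as quota or credential problems, that a rerun cannot fix.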

Instance boot timeout issues

Cluster creation might fail if an instance takes more than 10 minutes to finish the boot-up.

waiting for EC2 Instance create: timeout while waiting for state to become 'running' (last state: 'pending', timeout: 10m0s)

Fix: Verify whether the instance has finished booting. Clean up this instance (irrespective of its state) and retry the cluster creation.
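
The check and the cleanup can be sketched as follows. This is an illustration, not the cloudkit procedure: `cleanup_stuck_instance` is a hypothetical helper, `ec2_client` is assumed to be a boto3 EC2 client, and "clean up" is interpreted here as terminating the instance, which is destructive and cannot be undone.

```python
def cleanup_stuck_instance(ec2_client, instance_id):
    """Report the instance state, then terminate it regardless of that state."""
    response = ec2_client.describe_instances(InstanceIds=[instance_id])
    state = response["Reservations"][0]["Instances"][0]["State"]["Name"]
    print(f"{instance_id} is currently {state}")
    # Assumption: cleanup means termination. Confirm the ID before running.
    ec2_client.terminate_instances(InstanceIds=[instance_id])
    return state


# Typical use (assumes boto3 is installed and credentials are configured):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   cleanup_stuck_instance(ec2, "i-0123456789abcdef0")  # placeholder ID
```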

KMS (Key Management Service) key issue

When EBS encryption is selected, cluster creation fails with the following error:

create: unexpected state 'shutting-down', wanted target 'running'. last error: Client.InternalError: Client error on launch

Fix: Check whether the user that is running cloudkit has read access to the provided KMS key.
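
One partial way to test read access is to request the key's metadata with the same credentials. The following is a sketch, assuming that `kms_client` is a boto3 KMS client; `can_describe_kms_key` is a hypothetical name. Success shows only that the key is visible and enabled for that user. EBS encryption may require further KMS permissions that this check does not cover.

```python
def can_describe_kms_key(kms_client, key_id):
    """Return True if the key's metadata is readable and the key is enabled."""
    try:
        metadata = kms_client.describe_key(KeyId=key_id)["KeyMetadata"]
    except Exception as exc:  # for example, access denied or key not found
        print(f"Cannot read KMS key {key_id}: {exc}")
        return False
    return bool(metadata["Enabled"])


# Typical use (assumes boto3 is installed and credentials are configured):
#   import boto3
#   kms = boto3.client("kms", region_name="us-east-1")
#   print(can_describe_kms_key(kms, "alias/example-key"))  # placeholder key
```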

Ansible-playbook error

Cluster creation fails with the error:

FileNotFoundError: [Errno 2] No such file or directory: 'ansible-playbook'

Fix: If this error occurs when you run the cloudkit command on Red Hat® Enterprise Linux (RHEL) 9.x, apply the workaround for the ansible-core package installation.
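
The error means that the ansible-playbook executable could not be found on the PATH of the process that tried to start it. A minimal sketch for confirming this before and after applying the workaround follows; `find_ansible_playbook` is a hypothetical helper built on the standard library's `shutil.which`.

```python
import shutil


def find_ansible_playbook(which=shutil.which):
    """Return the path of ansible-playbook on PATH, or None if it is missing."""
    return which("ansible-playbook")


# Typical use: run as the same user, in the same shell environment, as cloudkit.
#   path = find_ansible_playbook()
#   print(path if path else "ansible-playbook is not on PATH")
```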

Delete cluster issue

The cloudkit delete cluster command might encounter failures for the following reasons:
  • If any AWS resources were provisioned into the virtual private cloud (VPC) outside of cloudkit, the cloudkit delete command cannot delete these manually created resources. In this case, manually delete all resources that were created outside of cloudkit before you run the cloudkit delete operation.
  • When IBM Storage Scale clusters are deployed into a VPC that was created by another cloudkit create execution, delete the clusters in the correct order. If the clusters are not deleted in the correct order, the delete operation might fail because resources are still in the VPC. For more information, see Cleanup.
  • A session is open and a user is in a file system mount point.
  • When instance types with greater EBS volume bandwidth are used, the Throughput-Advance-Persistent-Storage profile attaches multiple disks (sometimes more than 10 disks per instance) to each storage instance or NSD to saturate the bandwidth. Because of a parallelism problem, this might lead to a dependency violation during the delete operation, and the following error message is displayed:
    {"level":"debug","ts":"2024-09-20T06:36:47.989Z","caller":"commonhelpers/common_helpers.go:457","msg":"\u001b[31m│\u001b[0m \u001b[0m\u001b[1m\u001b[31mError: \u001b[0m\u001b[0m\u001b[1mdeleting EBS Volume (vol-0f0b6b6615429ce81): operation error EC2: DeleteVolume, https response error StatusCode: 400, RequestID: a834c3ad-e03c-4bd3-8903-35fba6834f3f, api error VolumeInUse: Volume vol-0f0b6b6615429ce81 is currently attached to {i-026915f80b644a0de}\u001b[0m"}

    Workaround: Run the cloudkit delete cluster command again.
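
The log line shown above is a JSON object whose msg field carries terminal color codes, which makes it hard to read. The following sketch pulls out the volume and instance IDs so that you can confirm the failure is the VolumeInUse case before rerunning. It assumes one JSON object per log line, as in the sample; `volume_in_use` is a hypothetical helper, not part of cloudkit.

```python
import json
import re

# ANSI color sequences such as ESC[31m, which appear as \u001b[31m in the JSON text.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")
VOLUME_IN_USE = re.compile(
    r"VolumeInUse: Volume (vol-[0-9a-f]+) is currently attached to \{(i-[0-9a-f]+)\}"
)


def volume_in_use(log_line):
    """Return (volume_id, instance_id) if the line reports VolumeInUse, else None."""
    message = ANSI_ESCAPE.sub("", json.loads(log_line)["msg"])
    match = VOLUME_IN_USE.search(message)
    return match.groups() if match else None
```

For the sample line above, the function would return the volume ID vol-0f0b6b6615429ce81 and the instance ID i-026915f80b644a0de.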

Delete network issues

  • In the 5.2.2.0 release, deleting a network (VPC) may fail if the network was created in a previous release by using the cloudkit create network command.

    Workaround: Make sure that all resource dependencies are deleted. Navigate to the VPC console, delete the VPC, and then delete the dependencies that are associated with the VPC. For more information, see Delete a VPC using the console in the AWS documentation.

Cluster upgrade issues

Incompatible self-extracting packages

Each IBM Storage Scale edition requires a specific upgrade package. For example, the self-extracting package for the IBM Storage Scale Developer Edition does not upgrade a cluster where the deployed edition of IBM Storage Scale is Advanced Edition, Data Management Edition, or Data Access Edition.

If you attempt to upgrade a cluster by using a self-extracting package that does not match the IBM Storage Scale edition that is deployed on that cluster, the upgrade fails with the following error:

Cannot upgrade node 10.0.1.199 due to packages dependent on GPFS. If these are known external dependencies, you can choose to override by setting the environment variable "SSFEATURE_OVERRIDE_EXT_PKG_DEPS=true" environment variable. Instead if you would like to continue an upgrade on all other nodes using the install toolkit, please remove this node from the cluster definition via:  spectrumscale node delete 10.0.1.199 and then re-run spectrumscale upgrade.  Otherwise, either remove the dependent packages manually or manually upgrade GPFS on this node.

Fix: Use a self-extracting package that matches the IBM Storage Scale edition that is deployed on the cluster, and rerun the cloudkit create cluster command.

Cluster edit issues

Editing is not supported for the Balanced profile

Although the option to edit the protocol is available, this operation is not supported for the Balanced profile. If you try to edit the protocol for the Balanced profile, the operation fails and an error message similar to the following one is displayed:

{"level":"debug","ts":"2024-10-27T10:47:30.947-0700","caller":"commonhelpers/common_helpers.go:457","msg":"\u001b[31m│\u001b[0m \u001b[0m\u001b[1m\u001b[31mError: \u001b[0m\u001b[0m\u001b[1mcreating EC2 Instance: operation error EC2: RunInstances, https response error StatusCode: 400, RequestID: e835b084-0c36-44b2-80f4-2c067eccc99e, api error InvalidParameterValue: Value (eu-west-1c) for parameter availabilityZone is invalid. Network interface 'eni-01b2df9d9165965cd' is in the availability zone eu-west-1a\u001b[0m"}
{"level":"debug","ts":"2024-10-27T10:47:30.948-0700","caller":"commonhelpers/common_helpers.go:457","msg":"\u001b[31m│\u001b[0m \u001b[0m"}
{"level":"debug","ts":"2024-10-27T10:47:30.948-0700","caller":"commonhelpers/common_helpers.go:457","msg":"\u001b[31m│\u001b[0m \u001b[0m\u001b[0m with module.protocol_instances[\"scalestrcls1-protocol-1\"].aws_instance.itself,"}
{"level":"debug","ts":"2024-10-27T10:47:30.948-0700","caller":"commonhelpers/common_helpers.go:457","msg":"\u001b[31m│\u001b[0m \u001b[0m on ../../../resources/aws/compute/ec2_multiple_nic/ec2_multiple_nic.tf line 65, in resource \"aws_instance\" \"itself\":"}
{"level":"debug","ts":"2024-10-27T10:47:30.948-0700","caller":"commonhelpers/common_helpers.go:457","msg":"\u001b[31m│\u001b[0m \u001b[0m 65: resource \"aws_instance\" \"itself\" \u001b[4m{\u001b[0m\u001b[0m"}