Cloudkit issues on Azure

This topic describes common issues that you might encounter while running cloudkit on Microsoft Azure, along with their workarounds.

Create cluster issues

Quota issues

IBM Storage Scale cluster creation may fail because of insufficient quotas in the Azure account, with an error similar to the following:

Operation results in exceeding quota limits of Core. Maximum allowed: 20, Current in use: 20, Additional requested: 4

Fix: Increase the quota for the affected resource, or free resources of that type that are already in use.
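
As a rough illustration of the arithmetic behind this fix, the following Python sketch parses a quota error of the form shown above and reports the minimum limit that would let the request succeed. The regular expression and the `required_quota` helper are hypothetical and assume the exact message wording in this example; other Azure quota errors may be worded differently.

```python
import re

# Assumption: the error follows the wording of the example message above.
QUOTA_RE = re.compile(
    r"quota limits of (?P<name>[\w ]+?)\.\s*"
    r"Maximum allowed: (?P<limit>\d+), "
    r"Current in use: (?P<used>\d+), "
    r"Additional requested: (?P<requested>\d+)"
)


def required_quota(message: str) -> dict:
    """Return the quota figures implied by a quota error message."""
    match = QUOTA_RE.search(message)
    if match is None:
        raise ValueError("message does not match the assumed quota error format")
    limit = int(match["limit"])
    used = int(match["used"])
    requested = int(match["requested"])
    needed = used + requested  # smallest limit that satisfies the request
    return {
        "resource": match["name"],
        "limit": limit,
        "needed": needed,
        "shortfall": max(0, needed - limit),
    }


if __name__ == "__main__":
    msg = (
        "Operation results in exceeding quota limits of Core. "
        "Maximum allowed: 20, Current in use: 20, Additional requested: 4"
    )
    print(required_quota(msg))
```

For the message above, the sketch reports that the Core limit would need to be at least 24, that is, 4 more than the current limit of 20.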

Credential issues

  • Check your Azure credentials. Ensure that the Azure credentials used to provision resources through cloudkit are valid and have the appropriate permissions. Verify this by checking the Azure access key used by cloudkit and confirming that it has the necessary permissions. Otherwise, cluster creation may fail with an error similar to the following:
    Error occurred while page advance.
    Details: ClientSecretCredential: unable to resolve an endpoint: http call(https://login.microsoftonline.com/XXXXXX-XXX-XXX-XXX-XXXXXXXX/v2.0/.well-known/openid-configuration)(GET) error: reply status code was 400

Host unreachable

Cluster creation fails with the following error:

Failed to connect to the host via ssh: kex_exchange_identification: Connection closed by remote host\", \"unreachable

Fix: Rerun cloudkit create cluster with the same parameters.

Issues because of instance boot timeout

Cluster creation can fail with the following error if an instance takes more than 10 minutes to finish the initialization.

Error occured while performing metadata query request. 
Details: Get \"http://XXX.XX.XX.XX/metadata/instance?api-version=2021-02-01&format=json\": context deadline exceeded (Client.Timeout exceeded

Fix: Check whether the instance has finished initializing. Regardless of its state, clean up the instance and retry the cluster creation.

Passwordless retries

The deployment of an Azure cluster may intermittently fail with repeated "check passwordless ssh" retry messages, like the ones in the following example:

{"level":"debug","ts":"2024-10-29T11:47:26.235-0400","caller":"commonhelpers/common_helpers.go:457","msg":"[ INFO ] changed: [storagescale-storage-5.cls-defectets.strgscale.com -> localhost]"}
{"level":"debug","ts":"2024-10-29T11:49:26.827-0400","caller":"commonhelpers/common_helpers.go:457","msg":"[ INFO ] FAILED - RETRYING: check | check passwordless ssh (30 retries left)."}
{"level":"debug","ts":"2024-10-29T11:51:46.096-0400","caller":"commonhelpers/common_helpers.go:457","msg":"[ INFO ] FAILED - RETRYING: check | check passwordless ssh (29 retries left)."}
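
To check a cloudkit debug log for this pattern, a sketch along the following lines may help. It assumes that the log consists of JSON objects with `ts` and `msg` fields, as in the example above, either one per line or concatenated on a single line. The function names are hypothetical and are not part of cloudkit.

```python
import json
import re

# Matches the retry message wording shown in the example log.
RETRY_RE = re.compile(
    r"FAILED - RETRYING: .*check passwordless ssh \((\d+) retries left\)"
)


def iter_json_objects(text):
    """Yield JSON objects from text that holds one or more of them back to back."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # Raises json.JSONDecodeError if the text contains non-JSON content.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def passwordless_retries(log_text):
    """Return (timestamp, retries_left) for each passwordless ssh retry entry."""
    hits = []
    for entry in iter_json_objects(log_text):
        match = RETRY_RE.search(entry.get("msg", ""))
        if match:
            hits.append((entry.get("ts"), int(match.group(1))))
    return hits
```

A steadily decreasing retry count over several minutes, as in the example, suggests that the passwordless SSH check is not recovering on its own.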
A line similar to the following snippet may also appear in the ~/.ssh/authorized_keys file on the affected cluster instance:
no-port-forwarding, .... echo;sleep 10

Workaround: Log in to each affected instance, open the ~/.ssh/authorized_keys file, and delete the line that is similar to the previous snippet.
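
The manual workaround can be approximated with a short script run on each affected instance. This is a sketch under assumptions: it treats any line that starts with `no-port-forwarding` and contains `echo;sleep 10` as the offending entry, which is inferred from the abbreviated snippet above and may not match every case. The helper names are hypothetical. The script saves a backup copy (`authorized_keys.bak`) before it rewrites the file.

```python
import shutil
from pathlib import Path


def is_restricted_entry(line: str) -> bool:
    """Assumed signature of the offending authorized_keys entry."""
    stripped = line.strip()
    return stripped.startswith("no-port-forwarding") and "echo;sleep 10" in stripped


def split_entries(text: str):
    """Return (text without offending lines, list of removed lines)."""
    kept, removed = [], []
    for line in text.splitlines(keepends=True):
        (removed if is_restricted_entry(line) else kept).append(line)
    return "".join(kept), removed


def clean_authorized_keys(path=Path.home() / ".ssh" / "authorized_keys"):
    """Remove offending entries in place and return the removed lines."""
    path = Path(path)
    cleaned, removed = split_entries(path.read_text())
    if removed:
        # Keep a copy of the original file next to it before rewriting.
        shutil.copy2(path, path.with_name(path.name + ".bak"))
        path.write_text(cleaned)
    return removed


if __name__ == "__main__":
    for line in clean_authorized_keys():
        print("removed:", line.rstrip())
```

Review the printed lines to confirm that only the intended entry was removed; the backup file can be used to restore the original content.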

Delete cluster issues

The cloudkit delete cluster command might encounter failures for the following reasons:
  • If any Azure resources have been provisioned into the VNET outside of cloudkit, the cloudkit delete command cannot delete them. In this case, manually delete all resources that were created outside of cloudkit before you run the cloudkit delete operation.
  • When IBM Storage Scale clusters have been deployed into a VNET that was created by an earlier cloudkit create execution, ensure that you delete the clusters in the correct order. If the clusters are not deleted in the proper order, the deletion might fail because resources are still present in the VNET. For more information, see Cleanup.
  • Deletion of a large Azure cluster (for example, a cluster with more than sixty nodes) fails with an error similar to the following:
    Operation 'startTenantUpdate' is not allowed on VM 'storage-scale-cls-3' since the VM is marked for deletion. You can only retry the Delete operation (or wait for an ongoing one to complete).

    Fix: Wait for some time and then rerun the cloudkit delete cluster command.

  • Deletion of an Azure cluster fails if you try to delete a partially provisioned cluster, that is, a cluster whose creation failed. A log similar to the following may appear:
    Error: deleting Managed Disk \"scale-compute-storage-4-fs1-system-1\" (Resource Group \"scale-storage-rg\"): performing Delete: unexpected status 409 (409 Conflict) with error: OperationNotAllowed: Disk storage-scale-cls-storage-4-fs1-system-1' is being attached to VM '/subscriptions/xxxxxxxxx-xxxx-xxxxxxx/resourceGroups/scale-storage-rp/providers/Microsoft.Compute/virtualMachines/scale-compute-storage-4', therefore operation is not allowed
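
For the deletion-order case in the list above, the ordering can be sketched as follows. This is an illustration only: it assumes that the correct order is to delete the clusters that were deployed into an existing VNET first and the cluster whose cloudkit create execution created the VNET last, which this topic does not state explicitly (see Cleanup for the authoritative procedure). The `Cluster` type and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str           # hypothetical label for the cluster
    created_vnet: bool  # True if this cluster's create run also created the VNET


def deletion_order(clusters):
    """Order clusters so that the one that created the VNET is deleted last.

    sorted() is stable and False sorts before True, so clusters that reuse
    the VNET keep their relative order and come first.
    """
    return sorted(clusters, key=lambda c: c.created_vnet)


if __name__ == "__main__":
    clusters = [
        Cluster("storage", created_vnet=True),
        Cluster("compute-a", created_vnet=False),
        Cluster("compute-b", created_vnet=False),
    ]
    for cluster in deletion_order(clusters):
        print(cluster.name)
```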