Troubleshooting deployment issues

This topic describes common deployment issues, known limitations, and the logs that help you diagnose them.

Common issues and solutions in IBM Cloud Kubernetes Service

The deployment of an IBM Cloud Kubernetes Service template sometimes fails with a "404 page not found" error. This error can occur when the cluster endpoints become temporarily unavailable in IBM Cloud during the deployment. To resolve the error, run a plan and apply again.

Common issues and solutions in vSphere

  • The following error might occur because of self-signed certificates:

    provider.vsphere: Error setting up client: Post https://192.168.64.140/sdk: x509: cannot validate certificate for 192.168.64.140 because it doesn't contain any IP SANs

    As a resolution, set the parameter "allow_selfsigned_cert" to "true".

  • The following error occurs whenever the vSphere virtual machine is in a powered-on state:

    vsphere_virtual_machine.vm_1: The attempted operation cannot be performed in the current state (Powered on).

    A virtual machine with the same hostname or IP address is already present.

    • To create a folder, use create_vm_folder=1.

      Use the integer 1 to create a virtual machine only if it is not already available.

    • IPv4 prefix: ipv4_prefix_length="24"

      This value is the prefix length, specified as a numeric string.
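
As a quick check on what the prefix length means, the following shell sketch converts a value such as "24" into the equivalent dotted-decimal netmask. The function name prefix_to_netmask is made up for this illustration and is not part of any template.

```shell
# Convert an IPv4 prefix length (0-32) to a dotted-decimal netmask.
# Hypothetical helper for illustration only.
# The body runs in a subshell ( ... ) so its variables stay local.
prefix_to_netmask() (
  prefix=$1
  mask=""
  for i in 1 2 3 4; do
    if [ "$prefix" -ge 8 ]; then
      octet=255
      prefix=$((prefix - 8))
    else
      # Set only the top $prefix bits of this octet (for example, 4 -> 240).
      octet=$((256 - (1 << (8 - prefix))))
      prefix=0
    fi
    mask="${mask}${mask:+.}${octet}"
  done
  echo "$mask"
)

prefix_to_netmask 24   # prints 255.255.255.0
```

A prefix length of "24" therefore corresponds to the netmask 255.255.255.0.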

Common issues in Amazon Web Services (AWS)

  • The following error message might occur whenever the virtual private cloud (VPC) is not found:

    VPC not found

    As a resolution, provide a "Name" for the VPC and refer to that name in the template instead of the VPC ID.

  • Errors occur whenever you do not specify unique public keys.

  • The following error might occur when using an older image for deployment:

    Error: Your query returned no results. You must change your search criteria and try again with data.aws_ami.aws_ami.

    Deployment with Ubuntu 14.04 fails because AWS has removed support for older versions of the operating system.

    The solution is to replace the prefilled AMI with an AMI that is available in AWS. For example, replace ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-* with ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*.

    The sample templates are also updated to use the latest AMI. To pull the latest updated templates, delete the existing templates/services and restart the cam-iaas* and cam-service-composer-api* pods. Alternatively, import the updated sample templates/services. For more information, see Importing a template.
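
If you keep a local copy of an affected template, the AMI replacement described above can be scripted. The following is a minimal sketch that assumes the old name pattern appears literally in the template text. The function name replace_ami_pattern and the sample "values" line are hypothetical and only illustrate the substitution.

```shell
# Rewrite the retired Ubuntu 14.04 AMI name pattern to the Ubuntu 20.04
# pattern. Reads template text on stdin and writes the result to stdout.
# The dot and asterisk are escaped so that sed matches them literally.
replace_ami_pattern() {
  sed 's|ubuntu/images/hvm-ssd/ubuntu-trusty-14\.04-amd64-server-\*|ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*|g'
}

# Hypothetical template line, used only to demonstrate the rewrite.
echo 'values = ["ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-*"]' | replace_ami_pattern
# prints: values = ["ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"]
```

To apply the sketch to a file, redirect the file into the function and write the output to a new file, then review the difference before you import the template.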

Common issues in IBM Cloud

  • The following error message might occur because of SSH keys:

    ibm_compute_ssh_key.orpheus_public_key: Error deleting SSH key: SoftLayer_Exception_Public: SSH key cannot be deleted because it is currently being used in an active transaction. (HTTP 500)

    IBM Cloud does not allow you to upload two identical public keys that have the same fingerprint. However, the IBM Cloud Terraform provider does not throw an error; instead, it reuses the existing key. Issues occur when you try to destroy a key resource that is now associated with another running virtual machine.

    The solution is to use a different key for each deployment, reference an existing key, or retry the destroy or delete operation to get past the error.

  • The following error might occur when using an older image for deployment:

    [ERROR] Error ordering virtual guest: SoftLayer_Exception_Order_Item_Unavailable: Debian GNU/Linux 9.x Stretch/Stable - Minimal Install (64 bit)

    Deployment with Debian 9 fails because IBM Cloud has removed support for older versions of the operating system.

    The solution is to replace the prefilled Operating System ID with an ID that is available in IBM Cloud. For example, replace DEBIAN_9_64 with DEBIAN_10_64.

    The sample templates are also updated to use the latest Operating System ID. To pull the latest updated templates, delete the existing templates/services and restart the cam-iaas* and cam-service-composer-api* pods. Alternatively, import the updated sample templates/services. For more information, see Importing a template.
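
For the SSH key issue in this section, it can help to check whether two public key files hold the same key before you deploy. The fingerprint is computed from the key material only, so two files that differ only in their trailing comment still count as the same key. The following sketch compares the key type and key material of two .pub files. The function names and the placeholder key strings are assumptions made for this illustration, and the strings are not real keys.

```shell
# Print the key type and base64 key material of a .pub file.
# The trailing comment is ignored because it is not part of the fingerprint.
key_material() {
  awk 'NF >= 2 { print $1, $2; exit }' "$1"
}

# Exit 0 when both files hold the same public key, 1 otherwise.
same_public_key() {
  [ "$(key_material "$1")" = "$(key_material "$2")" ]
}

# Demo with placeholder (not real) key material in a temporary directory.
workdir=$(mktemp -d)
echo "ssh-ed25519 AAAAplaceholderKeyOne deploy-a@example" > "$workdir/a.pub"
echo "ssh-ed25519 AAAAplaceholderKeyOne deploy-b@example" > "$workdir/b.pub"
echo "ssh-ed25519 AAAAplaceholderKeyTwo deploy-c@example" > "$workdir/c.pub"

if same_public_key "$workdir/a.pub" "$workdir/b.pub"; then
  echo "a.pub and b.pub hold the same key"
fi
if ! same_public_key "$workdir/a.pub" "$workdir/c.pub"; then
  echo "a.pub and c.pub hold different keys"
fi
rm -rf "$workdir"
```

If two deployments would upload the same key, either generate a separate key for one of them or reference the existing key, as described above.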

Common issues using older Terraform engines or Terraform templates

  • The following error might occur when using a deprecated Terraform engine or deprecated Terraform templates:

    NoCompatibleProviderEngineVersionError

    The Terraform 0.11.x engine is out of support and has been removed, and the Terraform templates are updated to use Terraform 1.x.x.

    The solution is to use the latest Terraform engine and Terraform templates that are based on the latest Terraform version. For more information about using an older Terraform version, see Managing Terraform Versions. To import an older Terraform template, see Importing a template.

General issues

  • Set a custom maximum number of Terraform jobs to process in parallel per cam-provider-terraform pod:

    1. Run the following command to edit the deployment for cam-provider-terraform.

      • If isolateRuntime = false, then run the following command:

        kubectl -n management-infrastructure-management edit deployment cam-provider-terraform-api
        
      • If isolateRuntime = true, then run the following command:

        kubectl -n management-infrastructure-management edit deployment cam-provider-terraform-runtime
        
    2. Update the value of MAX_LOCAL_TERRAFORM_JOBS.

      containers:
      - env:
        - name: MAX_LOCAL_TERRAFORM_JOBS
          value: "10"
      

      In this example, it is set to 10.

  • The IP address might change after you restart a virtual machine, but the change might not be reflected in the Managed services user interface. To refresh the resource state, including the IP address, either run a plan/apply or use the refresh API to refresh the resource.

  • Failures might occur in the deployment of a Terraform template that executes actions on a Windows image by using a WinRM connection type. As a resolution, check whether the "AllowUnencrypted" parameter of the WinRM configuration is set to true in the Windows image that is used in the Terraform template. For more information, see the Terraform documentation.

  • You might observe a failure whenever you do back-to-back interdependent VMware operations. For example, a "Start" followed by "Shutdown OS" causes a failure because the second action requires VMware Tools, which has not started yet.

  • Managed services on IBM Cloud Pak for AIOps has a predefined 6-hour deployment window. To change it, complete the following steps to add the TERRAFORM_JOB_TIMEOUT_MS environment variable to the deployment:

    1. Run the following command to open the deployment in edit mode.

      • If isolateRuntime = false, then run the following command:

        kubectl -n management-infrastructure-management edit deployment cam-provider-terraform-api
        
      • If isolateRuntime = true, then run the following command:

        kubectl -n management-infrastructure-management edit deployment cam-provider-terraform-runtime
        
    2. Add the TERRAFORM_JOB_TIMEOUT_MS variable with a value.

      For example, the following entry sets TERRAFORM_JOB_TIMEOUT_MS to 10 hours, expressed in milliseconds:

      name: TERRAFORM_JOB_TIMEOUT_MS
      value: "36000000"
      
  • During template deployment, you must specify user credentials based on the operating system.

    • For IBM Cloud with the Red Hat Enterprise Linux operating system, specify "root" as the user.
    • For AWS with the Red Hat Enterprise Linux operating system, specify "ec2-user" as the user.
    • For AWS with the Ubuntu operating system, specify "ubuntu" as the user.

    If the appropriate user is not provided for the corresponding operating system, then the deployment fails with an error message.

    For example, if you do not provide "root" as the user while deploying a VMware template and passwordless sudo is not enabled, then the following error message is displayed:

    Error: Response from pattern manager:
    StatusCode:500
    Message:
    {
     "message": "Bootstrap command failed. See the pattern manager logs for more details.",
     "rc": 1,
     "request_id": "1c844be316b74601853d2ddde595d94d",
     "stderr": "Creating new client for cam-apache-1\nCreating new node for cam-apache-1\nConnecting to 0.0.0.0\nstty: ‘standard input’: Inappropriate ioctl for device\nstty: ‘standard input’: Inappropriate ioctl for device\nstty: ‘standard input’: Inappropriate ioctl for device",
     "stdout": "0.0.0.0 knife sudo password: \nEnter your password: \r\n0.0.0.0 \r\n0.0.0.0 Sorry, try again.\r\n0.0.0.0 knife sudo password: \n0.0.0.0 \r\n0.0.0.0 Sorry, try again.\r\n0.0.0.0 knife sudo password: \n0.0.0.0 \r\n0.0.0.0 sudo: 3 incorrect password attempts"
    }
    
  • You can import templates from GitHub or GitLab and deploy them. However, if you remove or change the access token in GitHub or GitLab and then try to destroy the instance, the operation fails with the following error message:

    GithubClientError: vmware/terraform: failed to retrieve content list from github.; caused by {"message":"Not Found", "documentation_url":"https://developer.github.com/v3"}

    Ensure that Managed services has a connection to GitHub or GitLab when you destroy the instance.

  • The CAM deployment stays in progress and the log file keeps repeating "null_resource.clone_git: Still creating..."

    The log file can record the template deployment status as in progress for a very long time. To identify the cause, do the following analysis on the deployed virtual machine where the problem occurred:

    • Run the following command to check whether your user has passwordless sudo:

      sudo cat /etc/sudoers

      If the command prompts for a password, enable passwordless sudo for that user.

    • If the user does not have sudo privileges, add that user to /etc/sudoers.

    • Run the following command to check whether a name server is set up on the machine:

      ping github.com

    • Check whether the curl and git commands are installed on the virtual machine. A missing command might indicate an issue with the apt mirror.

  • An SSL certificate error occurs while deploying a template.

    To set up the Git client locally:

    • Create a .gitconfig file in the same Git repository as the template. The file must point to the correct CA certificates.

      [http]
          sslCAInfo = /home/terraform/certs/ca-bundle.crt
      
    • When the provider Terraform deploys the template, it downloads the template and all its files from the Git repository, including the .gitconfig file. When terraform init runs, the Git client looks in the current folder (the user's home folder for the current process) and uses the CA certificates that are configured there to download the modules from Git.

  • To debug a Terraform template deployment, set the TF_LOG environment variable in the cam-provider-terraform pod. You can set TF_LOG to one of the log levels TRACE, DEBUG, INFO, WARN, or ERROR to change the verbosity of the logs; TRACE is the most verbose. The detailed logs then appear on stderr.

    1. Run the following command to open the deployment in edit mode:

      • If isolateRuntime = false, then run the following command:

        kubectl -n management-infrastructure-management edit deployment cam-provider-terraform-api

      • If isolateRuntime = true, then run the following command:

        kubectl -n management-infrastructure-management edit deployment cam-provider-terraform-runtime

    2. Add the TF_LOG variable with a value.

      name: TF_LOG
      value: "TRACE"
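
The three environment-variable procedures in this list (MAX_LOCAL_TERRAFORM_JOBS, TERRAFORM_JOB_TIMEOUT_MS, and TF_LOG) follow the same pattern: pick the deployment from the isolateRuntime setting, then set one variable. The following shell sketch gathers that pattern in one place. It only prints kubectl set env commands and does not run them. Using kubectl set env instead of the interactive edit is an assumption of this sketch, not a documented procedure, and the helper function names are hypothetical.

```shell
NAMESPACE=management-infrastructure-management

# Deployment to change, based on the isolateRuntime setting (true or false).
terraform_deployment() {
  if [ "$1" = "true" ]; then
    echo "cam-provider-terraform-runtime"
  else
    echo "cam-provider-terraform-api"
  fi
}

# Convert whole hours to milliseconds for TERRAFORM_JOB_TIMEOUT_MS.
hours_to_ms() {
  echo $(( $1 * 60 * 60 * 1000 ))
}

# Succeed only for the TF_LOG levels that are listed in this topic.
valid_tf_log() {
  case "$1" in
    TRACE|DEBUG|INFO|WARN|ERROR) return 0 ;;
    *) return 1 ;;
  esac
}

# Print (do not run) a kubectl command that sets one environment variable.
# Arguments: isolateRuntime value, variable name, variable value.
set_env_command() {
  echo "kubectl -n $NAMESPACE set env deployment/$(terraform_deployment "$1") $2=$3"
}

ISOLATE_RUNTIME=false   # assumption: change this to match your installation

set_env_command "$ISOLATE_RUNTIME" MAX_LOCAL_TERRAFORM_JOBS 10
set_env_command "$ISOLATE_RUNTIME" TERRAFORM_JOB_TIMEOUT_MS "$(hours_to_ms 10)"
if valid_tf_log TRACE; then
  set_env_command "$ISOLATE_RUNTIME" TF_LOG TRACE
fi
```

Review each printed command before you run it. By default, kubectl set env applies the variable to every container in the deployment, so the interactive edit that is described in the steps remains the more precise option when the deployment has more than one container.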
      

Logs

You can examine the following logs:

  • Check the template instance Terraform logs in the UI for Terraform or template errors.

  • Check the container logs for internal Managed services errors. The CAM_logs directory is in the cam-logs-pv path:

    • /CAM_logs/cam-iaas
    • /CAM_logs/provider-terraform-local
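
When you have shell access to a location where these directories are mounted, a recursive search can narrow down the relevant entries. The following sketch searches each directory for lines that contain "error", ignoring case. The function name scan_logs is hypothetical, and the search term is an assumption because the exact log message format is not described here.

```shell
# Search each given log directory for lines that mention "error".
# Directories that do not exist are skipped with a note on stderr.
scan_logs() {
  for dir in "$@"; do
    if [ ! -d "$dir" ]; then
      echo "skipping $dir (directory not found)" >&2
      continue
    fi
    # grep exits with status 1 when nothing matches; ignore that case.
    grep -rin "error" "$dir" || true
  done
}

scan_logs /CAM_logs/cam-iaas /CAM_logs/provider-terraform-local
```

Each match is printed with its file path and line number, which makes it easier to open the right file at the right place.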