IBM Support

Recovering Lost Instances within SmartCloud Provisioning

Troubleshooting


Problem

After a hyper node recovers from a severe kernel panic (or another critical malfunction), some instances might not be successfully restored by IBM SmartCloud Provisioning.

Symptom

From the user's view, the instance is not shown in the web console, the instance cannot be pinged, and the VNC connection fails.

In the admin's view, the instance cannot be described.

On the lost instance's hyper node, the instance's JSON data has been moved to a special folder for lost instances.

Diagnosing The Problem


Resolving The Problem

Locate the lost instances from json_db on hyper nodes

Each instance has its essential information (such as global ID and private IP address) stored as JSON on its hyper node. HBase synchronizes its tables with the json_db. To recover a lost instance, you must locate its JSON file on one of the hyper nodes.


Deployed VM ID is provided

  1. Log in to the hyper node

    #: ssh hyper_node_IP

  2. Go to the json_db directory

    #: cd /iaas/local-storage/json_db/

  3. Each lost VM is stored in a file named after its VM ID in the subdirectory "deleted_instances"; all other VMs are stored in "instances".

    Find the details of this vm.

    #: cd deleted_instances
    #: vim bj02.cn029.52389.u01585

    The file looks similar to the following.

    {"image_id":"sles11-x64-30g","kernel":"","persistent_image_id":null,"is_new":true,"graphics":"vnc","image_info":{"image_id":"sles11-x64-30g","kernel":"","format":"raw","size":"32212254720","image_tag":"","mode":"private","replication":3,"arch":"x86_64","ovf_envelope":"","enable_virtio":"no","ramdisk":"","create_time":"2011-05-31T18:19:06+08:00","subtype":"ide","type":"image","image_acl":"","ovf_type":"no","description":"SLES%2011%20x86_64%20standard%20master%20image%2C%20supports%20volume","user_prefix":"u01585","parent_image_id":"","status":"available","enable_sysprep":"","ovf_envelope_base64":"","platform":"linux","file_id":"sles11-x64-30g"},"graphics_connection":"9.111.108.29:5926","size":"small","key_name":"default","gid":"I-2421149570-u01585","from_cow":true,"addresses":[],"host_name":"v525400b2ef1a","instance_acl":"","memory":1024,"is_persistent":"n","instance_id":"bj02.cn029.52389.u01585","domain_id":"I-24211-49570-u01585","mac":"52:54:00:b2:ef:1a","working_state":"","available_time":"2012-08-01T07:47:09Z","ram_dsk":"","create_time":"2012-08-01T15:38:23+08:00","rack":"bj02","hyper":"bj02.cn029","vlans":{"virtual":[{"interface":0,"static_ip":null,"vlan_name":"br0"}],"physical":[]},"host_name_original":null,"launch_time":"2012-08-01T07:38:23Z","volumes":[],"cow_type":"dm","static_ip":[null],"private_ip":"9.111.108.149","instance_tag":null,"graphics_passwd":"05687","cow_path":"/iaas/local-storage/I-24211-49570-u01585.dm","cpu":1,"user_prefix":"u01585","hostname_managed":"n","vlan_names_original":null,"cow_frontend":"/dev/mapper/I-24211-49570-u01585.dm","macs":[{"mac":"52:54:00:b2:ef:1a","target":"vnet26","bridge":"br0"}],"public_ip":null,"init_disks":[],"nr_virt_cpu":1,"platform":"linux","state":"running","hypervisorid_scp":null}

    Search for the keywords gid and instance_id.
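As an alternative to opening the file in an editor, the two keywords can be pulled out with grep. The following is a minimal sketch, not part of the product: the function name show_instance_keys is hypothetical, and it assumes the entry stores both fields as "key":"value" string pairs, as in the sample above.

```shell
#!/bin/sh
# show_instance_keys FILE
# Hypothetical helper: prints the gid and instance_id pairs of one json_db entry.
show_instance_keys() {
    json_file=$1
    # -E enables the (a|b) alternation; -o prints only the matched
    # "key":"value" pair, one match per line, in file order.
    # The leading quote keeps image_id and domain_id from matching.
    grep -E -o '"(gid|instance_id)":"[^"]*"' "$json_file"
}
```

Usage, from the deleted_instances directory: show_instance_keys bj02.cn029.52389.u01585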


Deployed VM ID is not provided

Note: In this case, either the global ID or the IP address of the instance is known.

Search all hyper nodes iteratively to find the lost instance's JSON data.
  • Method to locate the JSON file:

    For example, suppose the IP address of the lost instance is 9.111.108.149.
    1. Access a node with the ssh command.
    2. Search for the key.

      Replace <part of hyper ip> with the hyper node address pattern.

      Replace <VM key> with the search key (for example, the IP address or the global ID).

      #: for i in `./iaas-describe-nodes -t hyper|awk -F"|" '$5 ~ /<part of hyper ip>/ {print $5}'`;do ssh $i "hostname;grep <VM key> /iaas/local-storage/json_db/* -r"; done

    3. If the output shows, after a hyper node's hostname, a JSON file that contains the instance's IP address, you have located the instance. Otherwise, the IP address might have been recycled.
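The per-node search in step 2 can be sketched as a small function. This is an illustration under assumptions, not a supplied tool: the name find_instance_json is hypothetical, and the default directory is the json_db path used earlier in this document.

```shell
#!/bin/sh
# find_instance_json KEY [JSON_DB_DIR]
# Hypothetical helper: prints the path of every json_db file that contains KEY.
find_instance_json() {
    key=$1
    json_db=${2:-/iaas/local-storage/json_db}
    # -r: recurse into instances/ and deleted_instances/
    # -l: print only the names of matching files, not the long JSON line
    # -F: treat KEY as a fixed string, so the dots in an IP address are literal
    grep -rlF -- "$key" "$json_db"
}
```

Searching with -F avoids false matches where a dot in the IP address would otherwise match any character. Including the surrounding quotation marks in the key (for example '"9.111.108.149"') also prevents a match on a longer address that merely contains the same digits.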

Method to recover instance once located

A script is provided to manually recover the instance.
  1. SSH to the instance's hyper node

  2. Download the script into the default location:

    #: wget -N http://cloud_ip/recovery.tgz
    #: tar zxf recovery.tgz -C /iaas/hyper_bots/rubybots/


    Note: Ensure that recover.cfg matches your environment configuration, for example the IP addresses of the storage nodes and the HBase region server.

  3. Run the script:

    #: cd /iaas/hyper_bots/rubybots/recovery
    #: ruby start_recover.rb recover hostname_instance_recovered


    This script does the following:
    • Finds the image ID from the json_db entry and recreates the multipath device
    • Creates an iSCSI session for the attached volume (optional)
    • Recreates the domain from the backup domain XML
    • Updates the HSLT json_db


    Note: If you see the following error message:

    error: Failed to create domain from /tmp/bj02.cn029.52389.u01585.xml
    error: cannot read header '/dev/mapper/I-24211-49570-u01585.dm': Input/output error

    Review technote 1648306: Manually repair invalid instances by shed for resolution.

    Check whether the instance is now running:

    #: virsh list |grep hostname_instance_recovered
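Because grep matches substrings, the check above also succeeds for a domain whose name merely contains the recovered name, or for one that is listed but paused. A stricter check can be sketched as follows; the function name is_domain_running is hypothetical, and the sketch assumes the usual three-column `virsh list` output (Id, Name, State).

```shell
#!/bin/sh
# is_domain_running NAME
# Hypothetical helper: reads `virsh list` output on stdin and returns 0 only
# if a row has exactly NAME in the Name column and "running" as its state.
is_domain_running() {
    awk -v name="$1" '
        $2 == name && $3 == "running" { found = 1 }
        END { exit !found }
    '
}
```

Usage: virsh list | is_domain_running hostname_instance_recovered && echo "instance is running"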


    If it is running, modify the instance json file and move it into the running instances' directory:

    #: cd /iaas/local-storage/json_db/deleted_instances/
    #: vim hostname_instance_recovered

    Note: Search for the keyword lost and replace it with running.

    #: mv hostname_instance_recovered ../instances/
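The edit and the move can also be done without an editor. The following is a sketch under stated assumptions, not a supplied tool: the function name restore_json_entry is hypothetical, and it assumes that lost appears in the file as a quoted JSON string value (for example "state":"lost"). Confirm this in the file before relying on it.

```shell
#!/bin/sh
# restore_json_entry INSTANCE_FILE [JSON_DB_DIR]
# Hypothetical helper: rewrites the value "lost" to "running" in a
# deleted_instances entry, then moves the entry into instances/.
restore_json_entry() {
    name=$1
    json_db=${2:-/iaas/local-storage/json_db}
    src=$json_db/deleted_instances/$name
    dst=$json_db/instances/$name
    [ -f "$src" ] || { echo "not found: $src" >&2; return 1; }
    tmp=$(mktemp) || return 1
    # Assumption: "lost" is a JSON string value, so only :"lost" is rewritten.
    # Writing back with cat keeps the original file's owner and permissions.
    sed 's/:"lost"/:"running"/g' "$src" > "$tmp" &&
        cat "$tmp" > "$src" &&
        mv "$src" "$dst"
    status=$?
    rm -f "$tmp"
    return $status
}
```

Usage: restore_json_entry hostname_instance_recovered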


    Document Information

    Modified date:
    17 June 2018

    UID

    swg21648160