Troubleshooting
Problem
After a hyper node recovers from a severe kernel panic (or another critical malfunction), IBM SmartCloud Provisioning might fail to restore some instances.
Symptom
From the user's view, the instance is not shown in the web console, cannot be pinged, and its VNC connection fails.
From the admin's view, the instance cannot be described.
On the lost instance's hyper node, the instance's json data has been moved to a special folder for lost instances.
Diagnosing The Problem
Resolving The Problem
Locate the lost instances from json_db on hyper nodes
Each instance has its essential information (such as global id and private ip) stored as json on its hyper node. HBase synchronizes its tables with the json db. To recover a lost instance, you must first locate its json file on one of the hyper nodes.
Deployed VM ID is provided
- Log in to the hyper node
#: ssh hyper_node_IP
- Go to the json_db directory
#: cd /iaas/local-storage/json_db/
- Each lost VM is stored in a file named after its VM ID in the subdirectory "deleted_instances"; all other VMs are in "instances".
Find the details of this VM.
#: cd deleted_instances
#: vim bj02.cn029.52389.u01585
The file looks similar to the following:
{"image_id":"sles11-x64-30g","kernel":"","persistent_image_id":null,"is_new":true,"graphics":"vnc","image_info":{"image_id":"sles11-x64-30g","kernel":"","format":"raw","size":"32212254720","image_tag":"","mode":"private","replication":3,"arch":"x86_64","ovf_envelope":"","enable_virtio":"no","ramdisk":"","create_time":"2011-05-31T18:19:06+08:00","subtype":"ide","type":"image","image_acl":"","ovf_type":"no","description":"SLES%2011%20x86_64%20standard%20master%20image%2C%20supports%20volume","user_prefix":"u01585","parent_image_id":"","status":"available","enable_sysprep":"","ovf_envelope_base64":"","platform":"linux","file_id":"sles11-x64-30g"},"graphics_connection":"9.111.108.29:5926","size":"small","key_name":"default","gid":"I-2421149570-u01585","from_cow":true,"addresses":[],"host_name":"v525400b2ef1a","instance_acl":"","memory":1024,"is_persistent":"n","instance_id":"bj02.cn029.52389.u01585","domain_id":"I-24211-49570-u01585","mac":"52:54:00:b2:ef:1a","working_state":"","available_time":"2012-08-01T07:47:09Z","ram_dsk":"","create_time":"2012-08-01T15:38:23+08:00","rack":"bj02","hyper":"bj02.cn029","vlans":{"virtual":[{"interface":0,"static_ip":null,"vlan_name":"br0"}],"physical":[]},"host_name_original":null,"launch_time":"2012-08-01T07:38:23Z","volumes":[],"cow_type":"dm","static_ip":[null],"private_ip":"9.111.108.149","instance_tag":null,"graphics_passwd":"05687","cow_path":"/iaas/local-storage/I-24211-49570-u01585.dm","cpu":1,"user_prefix":"u01585","hostname_managed":"n","vlan_names_original":null,"cow_frontend":"/dev/mapper/I-24211-49570-u01585.dm","macs":[{"mac":"52:54:00:b2:ef:1a","target":"vnet26","bridge":"br0"}],"public_ip":null,"init_disks":[],"nr_virt_cpu":1,"platform":"linux","state":"running","hypervisorid_scp":null}
Search the file for the keywords gid and instance_id.
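The two keywords can also be read without opening an editor. The following is a minimal sketch, assuming the json_db file holds quoted string values with no embedded double quotes, as in the sample file. The helper name json_field is hypothetical and is not part of IBM SmartCloud Provisioning.

```shell
# json_field FILE KEY
# Prints the value of the first "KEY":"value" pair found in FILE.
# Assumption: the value is a quoted string; unquoted values such as
# null or numbers are not matched by this sketch.
json_field() {
    grep -o "\"$2\":\"[^\"]*\"" "$1" | head -n 1 | cut -d'"' -f4
}

# Example (run in /iaas/local-storage/json_db/deleted_instances):
#   json_field bj02.cn029.52389.u01585 gid
#   json_field bj02.cn029.52389.u01585 instance_id
```

Some keys, such as platform and user_prefix, also occur inside image_info. The sketch returns only the first occurrence in the file, so it is best suited to keys that occur once, such as gid, instance_id and private_ip.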
Deployed VM ID is not provided
Note: In this case, either the global id or the IP address of the instance is known.
Search all hyper nodes iteratively to find the lost instance's json data.
- Method to locate the json file:
For example, suppose the IP address of the lost instance is 9.111.108.149.
- Access a node with the ssh command.
- Search for the key.
Replace <part of hyper ip> with the hyper address pattern.
Replace <VM key> with the search key (here, 9.111.108.149).
#: for i in `./iaas-describe-nodes -t hyper|awk -F"|" '$5 ~ /<part of hyper ip>/ {print $5}'`;do ssh $i "hostname;grep <VM key> /iaas/local-storage/json_db/* -r"; done
- If a json file that contains the instance IP is printed after a hyper node's hostname, you have located the instance. Otherwise, the IP may have been recycled.
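A plain grep for an IP address treats each dot as a wildcard and also matches longer addresses that begin with the same digits. The following is a minimal sketch of a stricter per-node search, assuming the key is stored as a complete quoted value as in the sample json file. The function name find_instance_json is hypothetical.

```shell
# find_instance_json KEY [JSON_DB_DIR]
# Lists the json_db files that contain KEY as a complete quoted value.
# -F treats KEY literally, so the dots in an IP address are not wildcards,
# and the surrounding quotes keep 9.111.108.14 from matching 9.111.108.149.
find_instance_json() {
    grep -rlF -- "\"$1\"" "${2:-/iaas/local-storage/json_db}"
}

# Example on one hyper node:
#   find_instance_json 9.111.108.149
# Equivalent remote form for use inside the loop over hyper nodes:
#   ssh $i "hostname; grep -rlF '\"9.111.108.149\"' /iaas/local-storage/json_db"
```

The printed path shows whether the file is in deleted_instances or in instances.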
Method to recover instance once located
A script is provided to recover an instance manually.
- SSH to the instance's hyper node
- Download the script into the default location:
#: wget -N http://cloud_ip/recovery.tgz
#: tar zxf recovery.tgz -C /iaas/hyper_bots/rubybots/
Note: Ensure that recover.cfg matches your environment, for example the IP addresses of the storage nodes and the HBase region server.
- Run the script:
#: cd /iaas/hyper_bots/rubybots/recovery
#: ruby start_recover.rb recover hostname_instance_recovered
This script does the following:
- Finds the image id from the json db entry and recreates the multipath device
- Creates an iscsi session for the attached volume (optional)
- Recreates the domain from the backup domain xml
- Updates HSLT json db
Note: If you see the following error message:
error: Failed to create domain from /tmp/bj02.cn029.52389.u01585.xml
error: cannot read header '/dev/mapper/I-24211-49570-u01585.dm': Input/output error
Review technote 1648306: Manually repair invalid instances by shed for resolution.
Check if the instance is now running
#: virsh list |grep hostname_instance_recovered
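Because grep also matches partial names, a stricter check can be useful when several domains share a prefix. The following is a minimal sketch, assuming the usual three-column virsh list layout (Id, Name, State) and that the domain name is the same string used in the grep above. The function name domain_is_running is hypothetical.

```shell
# domain_is_running NAME
# Reads `virsh list` output on standard input and succeeds only if a domain
# whose Name column equals NAME exactly is in the "running" state.
domain_is_running() {
    awk -v name="$1" '$2 == name && $3 == "running" { found = 1 }
                      END { exit found ? 0 : 1 }'
}

# Example:
#   virsh list | domain_is_running hostname_instance_recovered && echo running
```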
If it is running, modify the instance json file and move it into the running instances' directory:
#: cd /iaas/local-storage/json_db/deleted_instances/
#: vim hostname_instance_recovered
Note: Search for the keyword lost and replace it with running.
#: mv hostname_instance_recovered ../instances/
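The edit and the move can also be done non-interactively. The following is a minimal sketch, assuming that the keyword lost appears as the value of the "state" field in the json file; verify this in your own file before using it. The function name mark_instance_running is hypothetical.

```shell
# mark_instance_running NAME [JSON_DB_DIR]
# Rewrites "state":"lost" to "state":"running" in the instance's json file
# and moves the result from deleted_instances/ to instances/.
# Assumption: the keyword lost is the value of the "state" field.
mark_instance_running() {
    db=${2:-/iaas/local-storage/json_db}
    src=$db/deleted_instances/$1
    dst=$db/instances/$1
    grep -q '"state":"lost"' "$src" || {
        echo "no lost state found in $src" >&2
        return 1
    }
    # Copy first so the new file keeps the original mode and timestamps,
    # then overwrite it with the edited content and remove the original
    # only if every step succeeded.
    cp -p "$src" "$dst" &&
        sed 's/"state":"lost"/"state":"running"/' "$src" > "$dst" &&
        rm -f "$src"
}

# Example:
#   mark_instance_running hostname_instance_recovered
```

The function refuses to act when the file does not contain the expected state, which leaves the original in deleted_instances for manual inspection.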
Document Information
Modified date:
17 June 2018
UID
swg21648160