In an xCAT managed environment the LPARs are managed by an executive management server (EMS) and, in a hierarchical environment, by a service node (SN). By default, non-root users cannot execute xCAT commands or perform xdsh operations from these systems. However, a subset of xCAT commands can be useful to users who need to investigate an unresponsive node. The following sections explain how the system administrator can set up access for a non-root ID in advance, and how users can then use this access to troubleshoot failing LPARs.
This example uses an AIX environment, but it can easily be modified for another operating system, including Linux. It is based on the xCAT procedure for granting privileges to non-root users.
This section describes the high-level steps from the xCAT documentation, which are executed on the EMS. Start by creating a non-root user ID with a valid home directory. Then run the following command to create the necessary SSL certificates.
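The certificate-generation command is not reproduced here; in the xCAT procedure this is typically done with the setup-local-client.sh script shipped with xCAT. The install prefix /opt/xcat is the default and is an assumption:

```shell
# Generate SSL certificates for the new user (run as root on the EMS).
# /opt/xcat is the default xCAT install prefix; adjust if yours differs.
/opt/xcat/share/xcat/scripts/setup-local-client.sh userid1

# The script places the certificates under the user's home directory.
ls ~userid1/.xcat
```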
Verify these files are located in the user's $HOME/.xcat directory.
Modify the policy table with an entry for the new user. Here we are telling xCAT that userid1 can execute only four commands. The user will receive a permission-denied error if any other xCAT command is issued.
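The policy entry itself is not shown; a sketch using the xCAT chtab command follows. The priority value and the choice of four commands are illustrative assumptions, not taken from the original procedure:

```shell
# Allow userid1 to run exactly four xCAT commands; any other xCAT
# command returns a permission denied error. The priority (6) and the
# command list here are assumptions for illustration.
chtab priority=6 policy.name=userid1 \
    policy.commands="nodestat,rpower,rvitals,rcons" policy.rule=allow

# Display the policy table to verify the new entry.
tabdump policy
```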
Update the user's PATH to include the xCAT directories, and run a simple command to validate the configuration.
echo $PATH | grep xcat
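A minimal sketch of the PATH update, assuming the default /opt/xcat install prefix:

```shell
# Persist the xCAT client directories in the user's .profile and apply
# the change to the current shell (default prefix /opt/xcat assumed).
echo 'export PATH=$PATH:/opt/xcat/bin:/opt/xcat/sbin' >> "$HOME/.profile"
export PATH="$PATH:/opt/xcat/bin:/opt/xcat/sbin"

# Validate the configuration: the xCAT directories should be listed.
echo "$PATH" | grep xcat
```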
The step to set up 'xdsh' is skipped because it is not needed for this implementation.
Environment setup (alternatives)
Alternatively, this access can be set up on a service node or a login node. This requires a few additional steps to set up SSH keys and to install a subset of the xCAT RPMs on the login node (if selected). Refer to the xCAT procedure in the Resources section for more information.
Checking a node status
An unresponsive node is an LPAR that you cannot ssh or telnet to. Log into the designated machine (EMS, SN, or login node) with your non-root user. First check whether the LPAR has a known issue; for example, the system administrator may have documented this in /etc/motd or created a list of LPARs that are broken.
If the node does not appear on this list, proceed with the following instructions. First check the power status; a healthy node shows the following. The 'Running' status indicates that the LPAR has power, and 'sshd' indicates that the EMS and others can ssh to it.
nodestat -p node133
node133: sshd(Running)
If the node is powered off (Not Activated), contact your system administrator.
node133: noping(Not Activated)
If the node is powered on (Running) but has a status of 'noping', this indicates that it cannot be contacted with ssh.
Next, check the LPAR's LCD status to obtain additional information. If it is blank, contact the system administrator; they might be able to break into the console (ctrl + \) and check system resources to see which process got the LPAR into this state.
node133: Current LCD: blank
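The LCD value shown above can be retrieved with the xCAT rvitals command; the lcds attribute used here is an assumption, since the original does not include the command:

```shell
# Query the operator-panel (LCD) value for the node. rvitals must be
# one of the commands permitted for this user in the policy table.
rvitals node133 lcds
```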
If the LCD shows 0c20, the LPAR has crashed. Here the LPAR has the kernel debugger enabled, so the node will stay in kdb until further action is taken.
node133: Current LCD: 0c20
Log into the console to check the stack trace. If it looks related to the user's program, they should gather the appropriate information and open a defect. If the process looks system or network related, contact the system administrator.
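Console access would go through the xCAT rcons command (an assumption here; rcons must be in the user's permitted command list):

```shell
# Open the remote console for the unresponsive node, then press Enter
# a few times to get a response from kdb.
rcons node133
```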
<hit enter a few times>
SLOT NAME STATE PID PPID ADSPACE CL #THS
pvproc+30A400 3113*mpi_coll ACTIVE 0290366 0220394 0000000DF0BDF590 449 0001
STATE...... stat :07 .... xstat :0009
FLAGS...... flag :00210001 LOAD EXIT EXECED
........... flag2 :00000001 64BIT
........... flag3 :00000000
........... atomic :00040000 ORPHANPGRP
If needed, take a dump and contact the system administrator so it can be retrieved.
g (start the dump; should take 5 to 15 minutes)
WARNING: System Crashed!!
To continue this command, you will lose the debug session!!
Do you want to continue? (y/[n]):> y
Exit the console by pressing 'ctrl e', releasing, then pressing 'c' and then '.'
Obtain and review the dump
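The dump details shown below match the output of the AIX sysdumpdev command; naming it here is an assumption, since the original does not show the command:

```shell
# Show statistics for the most recent system dump: dump device, size,
# date/time, and completion status.
sysdumpdev -L
```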
Device name: /dev/hdisk384
Major device number: 20
Minor device number: 384
Size: 1019846656 bytes
Uncompressed Size: 9302634058 bytes
Date/Time: Thu Nov 10 08:05:13 2011
Dump status: 0
Type of dump: fw-assisted
dump completed successfully
In a non-hierarchical environment the dump will be located on the EMS; in a hierarchical environment it will be located on the service node. The same commands can be executed on the appropriate machine.
lsnim | grep dump
cluster_dump resources dump
lsnim -l cluster_dump | grep location
location = /install/nim/dump/cluster_dump
cd /install/nim/dump/cluster_dump/node133
dmpuncompress dump.2011.11.10.14:09:35.BZ
Move or view the dump.
kdb dump.2011.11.10.14:09:35 /unix
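Once inside kdb on the dump image, a few subcommands help identify the failure; the ones below are common starting points (a sketch, not an exhaustive procedure):

```
(0)> stat    # panic string, AIX level, and time of the crash
(0)> th      # thread table; look for the thread that was running
```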
During cluster operation, the system administrator may not always be available. This guide has explained how a non-root user can investigate an unresponsive node and take action in an xCAT-managed environment.