In an xCAT-managed
environment, the LPARs are managed by an executive management server
(EMS) and, in a hierarchical environment, a service node (SN).
By default, non-root users cannot execute xCAT commands or perform
xdsh operations from these systems. A subset of xCAT commands may be
useful to users who need to investigate an unresponsive node. The
following sections explain how the system administrator can
set up access for a non-root ID in advance, and then how users can
use this access to troubleshoot failing LPARs.
This example uses an AIX
environment, but it can be modified easily for another operating
system, including Linux. It is based on the xCAT procedure for
granting privileges to non-root users. This
section describes the high-level steps from the xCAT documentation,
which are executed on the EMS. Start by creating a non-root user ID,
including a valid home directory. Then run the following command to
create the necessary SSL certificates.
These files are located in the user's $HOME/.xcat directory.
Update the policy table with this entry. Here we are telling xCAT that
userid1 can execute four commands only. The user will receive a
permission denied error if any other xCAT command is issued.
Update the user's path to include the xCAT directories and run a simple
command to validate the configuration.
echo $PATH | grep xcat
The step to set up 'xdsh' is skipped because it is not needed here.
Alternatively, this access can be set up
on a service or login node. It requires a few
additional steps to set up SSH keys and install a subset of the xCAT
RPMs on the login node (if selected). Refer to the xCAT procedure
located in the Resources section for more information.
Checking a node status
An unresponsive node is an LPAR that you cannot ssh or telnet to. Log
into the designated machine (EMS, SN, or login node) with your
non-root user ID. First check whether the LPAR has a known issue. For
example, the system administrator may have documented this in
/etc/motd or created a node list of LPARs that are broken.
If the node does not appear on this list, proceed with the following
instructions. First check the power status. A healthy status would
show the following. The 'Running' status indicates the LPAR has power,
and 'sshd' indicates that the EMS and others can ssh to it.
If the node is powered off (Not Activated), contact your system
administrator. If the node is powered on
(Running) but has a status of 'noping', it cannot be
contacted with ssh.
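The decision logic described in this section can be summarized as a small shell function. This is only a sketch: the function name triage_node and its messages are hypothetical, and it takes the two status strings as arguments rather than calling xCAT itself.

```shell
#!/bin/sh
# triage_node POWER_STATE NODE_STATE
# Hypothetical helper: maps the two status values discussed above
# to the next step a non-root user should take.
triage_node() {
    power_state=$1
    node_state=$2
    case "$power_state" in
        "Not Activated")
            echo "powered off: contact the system administrator" ;;
        Running)
            case "$node_state" in
                sshd)   echo "healthy: node is reachable over ssh" ;;
                noping) echo "unreachable: check the LPAR LCD status" ;;
                *)      echo "unknown node status: $node_state" ;;
            esac ;;
        *)
            echo "unknown power state: $power_state" ;;
    esac
}
```

The status strings would normally come from the xCAT commands the administrator allowed in the policy table. Which commands those are is an assumption here; 'rpower <node> stat' and 'nodestat <node>' are the usual xCAT sources for a power state and a node status.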
Next, check the LPAR LCD status to obtain additional information. If
it is blank, contact the system administrator. They might be able to
break into the console (ctrl + \) and check system resources to see
what process got the LPAR into this state.
Current LCD: blank
If the LCD shows 0c20, the LPAR has crashed. Here the LPAR has the kernel
debugger enabled, so the node will stay in kdb until further action is
taken.
Current LCD: 0c20
Log into the console to check the stack trace. If it looks related to
the user's program, the user should gather the appropriate information
and open a defect. If the process looks system or network related,
contact the system administrator.
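The two LCD values covered above can be mapped to their follow-up actions with a short shell function. This is a sketch under assumptions: lcd_action is a hypothetical name, and it expects a line containing "Current LCD: <value>" in the form shown in this article. Any other LCD code is reported as not covered rather than guessed at.

```shell
#!/bin/sh
# lcd_action "LINE"
# Hypothetical helper: takes one line of LCD status output and prints
# the follow-up action described in the text for that LCD value.
lcd_action() {
    # Keep only the text that follows "Current LCD: ".
    code=${1##*Current LCD: }
    case "$code" in
        blank)
            echo "blank: contact the system administrator" ;;
        0c20)
            echo "0c20: LPAR crashed into kdb; check the stack trace on the console" ;;
        *)
            echo "$code: not covered here" ;;
    esac
}
```

For example, lcd_action "Current LCD: 0c20" prints the crash message, which tells the user to open the console next.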
node133<hit enter a few times>
              SLOT NAME     STATE   PID     PPID    ADSPACE          CL  #THS
pvproc+30A400 3113*mpi_coll ACTIVE  0290366 0220394 0000000DF0BDF590 449 0001
STATE...... stat :07 .... xstat
FLAGS...... flag :00210001 LOAD EXIT EXECED
flag2 :00000001 64BIT
........... flag3 :00000000
atomic :00040000 ORPHANPGRP
If needed, take a dump and contact the system administrator so that it
can be reviewed.
g (start the dump; Should take 5 to 15 minutes)
To continue this command, you will lose the debug
Do you want to continue? (y/[n]):> y
Exit the console by pressing 'ctrl e', releasing, then pressing 'c' and '.'
Obtain and review the dump
Once the LPAR has finished dumping, it will reboot. Telnet or ssh to it
and query the dump status.
Device name: /dev/hdisk384
Major device number: 20
Minor device number: 384
Size: 1019846656 bytes
Date/Time: Thu Nov 10 08:05:13 2011
Type of dump: fw-assisted
dump completed successfully
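A dump status report like the one above can be checked from a script. The sketch below assumes the report is supplied on standard input in the format shown; dump_ok and dump_size_bytes are hypothetical helper names, and neither function runs any AIX command itself.

```shell
#!/bin/sh
# Hypothetical helpers that read a dump status report on stdin.

# Succeeds only if the report contains the completion line.
dump_ok() {
    grep -q "dump completed successfully"
}

# Prints the number on the "Size:" line (bytes, per the report).
dump_size_bytes() {
    awk '$1 == "Size:" { print $2 }'
}
```

On AIX, the report would typically be piped in from the command that queries the most recent dump. The command itself is not shown in the text above; 'sysdumpdev -L' is assumed to be the source.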
In a non-hierarchical environment, the dump will be located on the EMS.
In a hierarchical environment, the dump will be located on the service
node. The same commands can be executed on the appropriate machine.
| grep dump
-l cluster_dump | grep locationlocation
or view the dump.
During cluster operation,
the system administrator may not always be available. This guide
explains how a non-root user can investigate and take action in an
xCAT-managed environment.