Checking the status of RMC connections

The lssyscfg and lspartition commands provide the RMC connection status.

You can check the RMC connection status by running one of the following commands:

lssyscfg -r lpar -m frame-name -F lpar_id,state,rmc_state,rmc_ipaddr,
os_version,dlpar_mem_capable,dlpar_proc_capable,dlpar_io_capable
--filter "lpar_ids=LP_ID"

This command provides RMC connection status and the operating system capabilities. Example output follows:

hscroot@myhmc:~> lssyscfg -r lpar -m Frame3-top-9117-MMC-SN10364E7
-F lpar_id,state,rmc_state,rmc_ipaddr,os_version,dlpar_mem_capable,
dlpar_proc_capable,dlpar_io_capable
--filter "lpar_ids=3"

3,Running,active,10.32.244.214,AIX 6.1 6100-06-06-1140,1,1,1
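
When many partitions are checked at once, the same fields can be filtered with a short script. The following is a minimal sketch, not an HMC feature: rmc_not_active is a hypothetical helper name, and it assumes input lines of the form lpar_id,state,rmc_state as produced by the -F option shown above. It prints the IDs of running partitions whose RMC state is not active.

```shell
# Hypothetical helper (not an HMC command). Reads lines of the form
#   lpar_id,state,rmc_state
# and prints the IDs of running partitions whose RMC state is not "active".
rmc_not_active() {
  awk -F, '$2 == "Running" && $3 != "active" { print $1 }'
}

# Possible usage from a workstation (myhmc and frame-name are placeholders):
#   ssh hscroot@myhmc lssyscfg -r lpar -m frame-name -F lpar_id,state,rmc_state | rmc_not_active
```

Running the filter on a workstation through ssh is a design choice made on the assumption that the HMC restricted shell might not allow function definitions.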

lspartition -dlpar

This command is an internal command. However, it is useful for RMC troubleshooting because it provides the raw RMC connection data. Example output follows:

	hscroot@myMC:~> lspartition -dlpar | fgrep 214 -A1
	<#4> Partition:<3*9117-MMC*10364E7, mycompany.com, 10.32.244.214>
	Active:<1>, OS:<AIX, 6.1, 6100-06-06-1140>, DCaps:<0x2c5f>, CmdCaps:<0x1b, 0x1b>, PinnedMem:<1356>
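
The raw lspartition output can be reduced to one line per partition for easier comparison. The following sketch is based only on the single example above, so the layout is an assumption: it treats the value before the first asterisk in the Partition entry as the partition ID and the last comma-separated value as the RMC IP address. The function name parse_lspartition is hypothetical.

```shell
# Hypothetical helper. Reads "lspartition -dlpar" output on standard input and
# prints one line per partition: lpar_id,ip_address,active_flag
# Assumes the two-line layout shown in the example above.
parse_lspartition() {
  awk '
    /Partition:</ {
      entry = $0
      sub(/^.*Partition:</, "", entry)   # drop everything up to the opening bracket
      sub(/>.*$/, "", entry)             # drop the closing bracket and the rest
      n = split(entry, field, /, */)
      split(field[1], id, /[*]/)         # first field looks like 3*9117-MMC*10364E7
      lpar = id[1]
      ip = field[n]
    }
    /Active:</ {
      flag = $0
      sub(/^.*Active:</, "", flag)
      sub(/>.*$/, "", flag)
      print lpar "," ip "," flag
    }'
}
```

For the example output above, the sketch would reduce the two lines to the partition ID, the IP address, and the Active flag.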

Diagnosis

During the first level of verification, you can diagnose the RMC connection issues in the following ways:

If an active partition has RMC Active:<0> in the lspartition command output, refer to the detailed diagnostics to address common RMC connection issues.
If the lspartition command displays an RMC connection as Active:<1> but the lssyscfg command displays none or inactive, the data that supports these two commands is not in agreement. In this case, perform a server rebuild operation on the server or restart the HMC. Either action brings the connection status data back into agreement.
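
The two first-level rules above can be sketched as a small decision function. This is an illustration only: rmc_first_level is a hypothetical name, the returned labels are made up for the example, and the function assumes that the partition itself is active.

```shell
# Hypothetical helper that maps the two status values to the first-level action.
#   $1 = Active flag from lspartition (0 or 1)
#   $2 = rmc_state from lssyscfg (active, inactive, or none)
rmc_first_level() {
  case "$1:$2" in
    0:*)               echo "detailed-diagnosis" ;;
    1:active)          echo "ok" ;;
    1:inactive|1:none) echo "rebuild-server-or-restart-hmc" ;;
    *)                 echo "unknown" ;;
  esac
}

# Example: the two commands disagree, so a rebuild or an HMC restart is suggested.
rmc_first_level 1 none
```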

Detailed diagnosis for RMC connection issues

Notes:

The diagnosis assumes that the RMC subsystem is using port 657 (TCP and UDP) for the communication between the HMC and partitions.
Typically, more than one Ethernet adapter exists on the HMC. If an adapter is designated for partition communication on the HMC graphical user interface (GUI), its IP addresses are ordered first in the IP address list. The RMC component on the operating system attempts to establish a single connection, starting with the first IP address on the list. If no connection is established with that IP address, the next IP address is attempted, and so on, until a connection is established.

Some of the common issues that cause an inactive RMC connection follow:

Verifying server connection states

You can verify that all the managed servers on the HMC have good connections to the service processor on the private service network by running the lssyscfg command.

hscroot@myMC:~> lssyscfg -r sys -F name,type_model,serial_num,state
	9.3.206.220,9179-MHD,1003EFP,No Connection
	9.3.206.223,9179-MHD,1038D0P,No Connection

The following states indicate good connections:

Operating
Standby
Power Off
Error
Other transient states, for example, Powering On

The following states indicate problems:

Incomplete
No Connection
Recovery
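
The state lists above can be applied to the lssyscfg output with a short filter. The following sketch assumes comma-separated input in which the state is the last field, as in the -F name,type_model,serial_num,state example; servers_with_problems is a hypothetical helper name.

```shell
# Hypothetical helper. Reads "lssyscfg -r sys -F name,type_model,serial_num,state"
# output and prints the servers whose state indicates a connection problem.
servers_with_problems() {
  awk -F, '$NF == "Incomplete" || $NF == "No Connection" || $NF == "Recovery" {
    print $1 ": " $NF
  }'
}

# Possible usage from a workstation (myMC is a placeholder):
#   ssh hscroot@myMC lssyscfg -r sys -F name,type_model,serial_num,state | servers_with_problems
```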

Note: In HMC Version 7.7.7.0 Service Pack 1, and earlier, the existing connections are removed in a server that is in No Connection, Incomplete, or Error states and these servers prevent connections for newly activated partitions. This restriction does not apply to HMC Version 7.7.7.0 Service Pack 2, and later.

Diagnosis

If the server connection state is Incomplete, perform a server rebuild operation:
```
 hscroot@trucMC:~> chsysstate -r sys -o rebuild -m CEC_name
```
If the server connection state is No Connection, resolve or remove the connection. Common issues that cause No Connection follow:
- Improper firewall configuration on the network from the HMC to the Flexible Service Processor (FSP).
- More than two HMCs are attempting to manage the server.

Verifying the IP addresses used for RMC connections

List the HMC IP addresses by using the lshmc HMC command. In this example, the HMC has two network adapters that have IPv4 and IPv6 addresses:

  hscroot@myMC:~> lshmc -n -F ipaddrlpar,ipaddr,ipv6addrlpar
  9.53.202.86,9.53.202.86,9.53.202.87,fe80:0:0:0:20c:29ff:fedb:4816,
  fe80:0:0:0:20c:29ff:fedb:4817

The lshmc command output lists the IP addresses that partitions use to establish RMC communication with the HMC. The ipaddrlpar parameter is the preferred IP address that is used to establish the connection. If a connection is not established with this IP address, RMC attempts connections on the other IP addresses in the listed order.
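
To make the attempt order easier to read, the comma-separated lshmc output can be turned into a numbered list. The following is a sketch under assumptions: hmc_rmc_addresses is a hypothetical helper name, and dropping repeated addresses is a choice made for this example (the preferred address can appear twice, as in the output above).

```shell
# Hypothetical helper. Reads the comma-separated address list printed by
#   lshmc -n -F ipaddrlpar,ipaddr,ipv6addrlpar
# and prints each distinct address once, numbered in the listed order.
hmc_rmc_addresses() {
  tr ',' '\n' | awk 'NF && !seen[$1]++ { n++; print n ": " $1 }'
}

# Possible usage from a workstation (myMC is a placeholder):
#   ssh hscroot@myMC lshmc -n -F ipaddrlpar,ipaddr,ipv6addrlpar | hmc_rmc_addresses
```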

Diagnosis

If the IP addresses listed in this command are not correct, one or more of the HMC network interfaces is configured incorrectly.

Verifying RMC port configuration

Verify that RMC is accepting requests on port 657 for both TCP and UDP by using the netstat HMC command:

	hscroot@truchmc:~> netstat -tulpn | grep 657
	tcp    0    0 :::657     :::*   LISTEN   -
	udp    0    0 :::657     :::*            -

Diagnosis

If one of the entries is not listed, restart the HMC.
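
The check for both entries can be scripted. The following sketch assumes the netstat -tulpn column layout shown above (protocol in the first column, local address in the fourth, state in the sixth for TCP); rmc_port_missing is a hypothetical helper name. It prints the protocols for which no port 657 entry was found and prints nothing when both are present.

```shell
# Hypothetical helper. Reads "netstat -tulpn" output and prints "tcp" and/or
# "udp" when the corresponding port 657 entry is missing.
rmc_port_missing() {
  awk '
    $1 ~ /^tcp/ && $4 ~ /:657$/ && $6 == "LISTEN" { tcp = 1 }
    $1 ~ /^udp/ && $4 ~ /:657$/                   { udp = 1 }
    END {
      if (!tcp) print "tcp"
      if (!udp) print "udp"
    }'
}

# Possible usage from a workstation (truchmc is a placeholder):
#   ssh hscroot@truchmc netstat -tulpn | rmc_port_missing
```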

Verifying the RMC port for each partition

Verify that the partition's firewall is open and authenticated for port 657 and that the partition is accessible from the HMC. From the HMC, use the telnet or ssh command to establish a connection to the partition. A successful connection verifies the network path and authenticates the firewall.

	hscroot@truchmc:~> ssh lpar_hostname|IP

This verification must be repeated for each partition as necessary.

Diagnosis

From the HMC GUI, click HMC Management → Change Network Settings → LAN Adapter/Details → Firewall Settings, and then select Allow RMC.

Verifying the HMC RMC port from each partition

Verify whether the HMC firewall is open and authenticated for port 657 and accessible from one or more partitions.

From the partition, use the telnet command to verify whether the HMC port 657 is open for RMC's use.

	# telnet HMC_hostname|IP 657

Diagnosis

The following problems can cause RMC port communication issues:

- The RMC port, specifically TCP 657, is not enabled in the HMC firewall. Navigate to the HMC firewall settings as described earlier and enable the RMC port.
- RMC is not communicating on TCP port 657. Restart the HMC to restart the RMC subsystem.

Verifying partition file systems

Verify that the partition's /var and /tmp file systems are not full.

On each RS/6000® Platform Architecture (RPA) partition that does not have an RMC connection to the HMC, use the df command to display the file system usage.

# df
Filesystem   ... Use% Mounted on
/dev/hda2    ...  44% /
/dev/hda3    ...  23% /var
...
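
The df output can be scanned for full /var and /tmp file systems with a short filter. This sketch assumes the Linux-style layout of the example above, in which the usage percentage is the second-to-last column and the mount point is the last column; the AIX df layout differs, so the column positions would need to be adjusted there. The name full_rmc_filesystems is hypothetical.

```shell
# Hypothetical helper. Reads df output (Linux-style columns) and prints /var
# or /tmp when the file system is reported as 100% used.
full_rmc_filesystems() {
  awk 'NF >= 2 && ($NF == "/var" || $NF == "/tmp") && $(NF - 1) == "100%" { print $NF }'
}

# Possible usage on a Linux partition:
#   df | full_rmc_filesystems
```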

Diagnosis

If the /var or /tmp file system is 100% full, remove unnecessary files or increase the file system sizes by using the smitty command or the equivalent Linux® commands.

After you increase the space in the /var file system, run the following commands to repair potentially corrupted files.

# rmrsrc -s "Hostname!='t' " IBM.ManagementServer
# /opt/rsct/bin/rmcctrl -z
# rm /var/ct/cfg/ct_has.thl
# rm /var/ct/cfg/ctrmc.acls
# /opt/rsct/bin/rmcctrl -A

Checking for reused IP addresses

Similar to the Duplicate NodeId state, reused or recycled IP addresses among partitions can cause an HMC error if a new partition connection is established while the old (probably inactive) connection still exists.

The lssyscfg -r lpar HMC command can be used to list the IP addresses for all RMC connections. When this list is sorted, duplicate RMC addresses appear on adjacent lines and can be identified.

lssyscfg -r lpar -m CEC_name -F rmc_ipaddr,lpar_id,name,state,rmc_state | sort

When you scan the list, you can identify the duplicate addresses as consecutive entries with the same first parameter (RMC IP address).
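
For a large list, the scan for duplicates can be automated. The following sketch assumes the field order of the command above (RMC IP address first) and assumes that partitions without an RMC connection report an empty address, which it skips; duplicate_rmc_addresses is a hypothetical helper name.

```shell
# Hypothetical helper. Reads lines of the form
#   rmc_ipaddr,lpar_id,name,state,rmc_state
# and prints only the lines whose RMC IP address occurs more than once.
duplicate_rmc_addresses() {
  awk -F, '
    $1 != "" { count[$1]++; lines[$1] = lines[$1] $0 "\n" }
    END { for (ip in count) if (count[ip] > 1) printf "%s", lines[ip] }' | LC_ALL=C sort
}

# Possible usage from a workstation (myMC and CEC_name are placeholders):
#   ssh hscroot@myMC lssyscfg -r lpar -m CEC_name -F rmc_ipaddr,lpar_id,name,state,rmc_state | duplicate_rmc_addresses
```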

Diagnosis

If a duplicate address is identified, determine which IP address is valid or expected and which IP address is invalid or stale. To correct the problem, complete the following steps:

On the HMC, unmanage the server corresponding to the stale RMC connection by running the following command:
```
rmsysconn -o remove --ip CEC_IP
```
Wait for 6 minutes or more, then start managing the server again by running the following command:
```
mksysconn --ip CEC_IP
```

Checking for MTU size mismatch

Most of the current versions of RMC require all parties to use the same maximum transmission unit (MTU) size. The recommended MTU setting for RMC on both HMC and partitions is 1500. If jumbo frames are required, all parties on that network must use jumbo frames.

You can use different MTU sizes on other network interfaces. For example, if different HMC network adapters are used for the two networks, jumbo frames can be used on the HMC-to-server network (the Flexible Service Processor (FSP) network) while regular frames (MTU size = 1500) can be used for RMC communication.

Different MTU settings between the HMC and the partitions result in a No Connection condition and an indefinite hang in the partition. This type of hang can be reproduced by running the VIOS lsmap -all command on a large system, where the command produces a large output that requires multiple packets to be transferred between the HMC and the VIOS.

To check MTU size on partitions, run the following command:

#ifconfig | fgrep MTU
UP BROADCAST RUNNING MULTICAST  MTU:1500
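
The MTU values can be extracted and compared against 1500 with a short filter. This sketch assumes the MTU:<value> token format shown in the example output; other ifconfig versions print the value differently and would need a different pattern. The name non_standard_mtu is hypothetical.

```shell
# Hypothetical helper. Reads ifconfig output and prints every MTU value other
# than 1500 that appears as an "MTU:<value>" token.
non_standard_mtu() {
  awk '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /^MTU:/ && $i != "MTU:1500")
        print substr($i, 5)
  }'
}

# Possible usage on a partition:
#   ifconfig | non_standard_mtu
```

An empty result means that every interface that reported an MTU uses 1500.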

To check whether jumbo frames are enabled on the HMC, run the following command:

# lshmc -n
hostname=myhmc,...,jumboframe_eth0=off,lparcomm_eth0=off,..,jumboframe_eth1=on,lparcomm_eth1=on
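
The relevant settings can be picked out of the long lshmc -n line with a short filter. The following sketch assumes attribute names of the form jumboframe_<interface> and lparcomm_<interface>, and reports the jumbo frame setting of each interface that is enabled for partition communication; rmc_jumbo_settings is a hypothetical helper name.

```shell
# Hypothetical helper. Reads "lshmc -n" output and, for each interface with
# lparcomm_<interface>=on, prints the interface and its jumbo frame setting.
rmc_jumbo_settings() {
  tr ',' '\n' | awk -F= '
    $1 ~ /^jumboframe_/ { name = $1; sub(/^jumboframe_/, "", name); jumbo[name] = $2 }
    $1 ~ /^lparcomm_/   { name = $1; sub(/^lparcomm_/, "", name); if ($2 == "on") comm[name] = 1 }
    END { for (name in comm) print name " jumboframe=" jumbo[name] }' | LC_ALL=C sort
}

# Possible usage from a workstation (myhmc is a placeholder):
#   ssh hscroot@myhmc lshmc -n | rmc_jumbo_settings
```

An interface that is reported with jumboframe=on is used for RMC communication with jumbo frames, so the partitions on that network would need a matching MTU.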

Diagnosis

The issue can be addressed by either changing the incorrect MTU sizes or by changing the HMC network interface that is used for RMC communication. To designate a different Ethernet adapter for partition communication, you can use one of the following options:

Run the chhmc HMC command.
Use the HMC GUI (HMC Management → Change Network Settings).

Checking for duplicate node ID on the partitions

RMC uses a unique node ID to identify partitions. Having more than one partition with the same node ID can cause an RMC error.

If a partition is cloned improperly, it can have the same node ID as the partition that it was cloned from, which causes the connections to the partitions to be intermittently enabled and disabled. The connections can also be disabled for all partitions that share the duplicate node ID.

To determine whether duplicate node IDs exist, consider the following options:

For partitions with active RMC connections:
From the HMC, as root user, run the /opt/rsct/bin/rmcdomainstatus -s ctrmc command and identify any duplicate entries. If the HMC is managing a large number of partitions, this can be a difficult task.
For partitions without an active RMC connection:
Compare the /etc/ct_node_id file manually in each partition.
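
When the node IDs have been collected from the partitions, the comparison can be scripted. The following sketch assumes that an administrator has prepared a list with one partition per line in the form partition_name node_id; how the node ID is read from /etc/ct_node_id is left open here. The name duplicate_node_ids is hypothetical.

```shell
# Hypothetical helper. Reads lines of the form
#   partition_name node_id
# and prints each node ID that is shared, followed by the partitions using it.
duplicate_node_ids() {
  awk '
    NF >= 2 {
      hosts[$2] = (hosts[$2] == "" ? $1 : hosts[$2] " " $1)
      count[$2]++
    }
    END { for (id in count) if (count[id] > 1) print id ": " hosts[id] }' | LC_ALL=C sort
}

# Possible usage (node_ids.txt is a placeholder for the collected list):
#   duplicate_node_ids < node_ids.txt
```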

Diagnosis

To repair duplicate node IDs, complete the following steps on the partitions that have duplicate node IDs:

Remove the /etc/ct_node_id file, and then run the recfgct command to generate a new node ID.
Note: Run the recfgct command only if no high availability cluster that uses the IBM® PowerHA SystemMirror® or IBM Tivoli® System Automation for Multiplatforms (SAMP) products is set up on this node.

If the LPARs are running AIX® 6 with 6100-07, or later, run the following commands. The odmdelete command removes the cluster0 entry from the CuAt ODM.

odmdelete -o CuAt -q name=cluster0
/opt/rsct/install/bin/recfgct

If the LPARs are running AIX 6 with 6100-06, or earlier, run the following command:
```
/opt/rsct/install/bin/recfgct
```
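
The choice between the two command sequences can be sketched as a dry-run helper that only prints the commands for a given level. The function name recfgct_steps is hypothetical, the input is assumed to be a level string in the form that the oslevel -s command prints (for example, 6100-06-06-1140), and treating any release number above 6100 as "later" is an assumption of this sketch.

```shell
# Hypothetical dry-run helper. Prints (does not run) the repair commands for
# the given AIX level string, for example 6100-06-06-1140.
recfgct_steps() (
  rel=${1%%-*}          # release part, for example 6100
  rest=${1#*-}
  tl=${rest%%-*}        # technology level, for example 06
  tl=${tl#0}            # drop one leading zero so that 06 becomes 6
  if [ "$rel" -gt 6100 ] || { [ "$rel" -eq 6100 ] && [ "$tl" -ge 7 ]; }; then
    echo "odmdelete -o CuAt -q name=cluster0"
  fi
  echo "/opt/rsct/install/bin/recfgct"
)

# Possible usage on an AIX partition:
#   recfgct_steps "$(oslevel -s)"
```

Printing the commands instead of running them keeps the sketch safe to try; the note about high availability clusters still applies before the commands are run.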