CCR issues
When CCR loses quorum, most IBM Storage Scale administrative commands no longer work, because the mm-commands use CCR to ensure that they are working with the most recent version of the configuration data.
The following tips help you to identify and avoid potential issues that might affect CCR:
- CCR communicates through port 1191. Before you create a new cluster or add a new node to an existing cluster, make sure that this port is not blocked by a firewall.
- CCR performs IP address lookups to communicate between CCR servers and between a CCR client and a CCR server. Make sure that the /etc/hosts entries (and name resolution in general) for the quorum nodes are consistent on all nodes in the cluster.
- CCR must work even if GPFS is shut down. When GPFS is shut down, a separate daemon, the mmsdrserv daemon, is started to provide CCR services instead of the mmfsd daemon. The mmsdrserv daemon writes its own log file, /var/adm/ras/mmsdrserv.log. Examine this log file on the quorum nodes if CCR runs into issues while GPFS is down.
- The mmccr check command can be used on any node in the cluster to verify
whether the CCR is accessible on a particular node and works as expected. The checks are initiated
by the CCR client and the CCR server responds where necessary. The CCR check generates two types of
output depending on whether the CCR command is issued on a quorum node or a non-quorum node. A
sample output is as follows:
# mmccr check -Ye
mmccr::HEADER:version:reserved:reserved:NodeId:CheckMnemonic:ErrorCode:ErrorMsg:ListOfFailedEntities:ListOfSucceedEntities:Severity:
mmccr::0:1:::1:CCR_CLIENT_INIT:0:::/var/mmfs/ccr,/var/mmfs/ccr/committed,/var/mmfs/ccr/ccr.nodes,Security,/var/mmfs/ccr/ccr.disks:OK:
mmccr::0:1:::1:FC_CCR_AUTH_KEYS:0:::/var/mmfs/ssl/authorized_ccr_keys:OK:
mmccr::0:1:::1:FC_CCR_PAXOS_CACHED:0:::/var/mmfs/ccr/cached,/var/mmfs/ccr/cached/ccr.paxos:OK:
mmccr::0:1:::1:FC_CCR_PAXOS_12:0:::/var/mmfs/ccr/ccr.paxos.1,/var/mmfs/ccr/ccr.paxos.2:OK:
mmccr::0:1:::1:PC_LOCAL_SERVER:0:::node-21.localnet.com:OK:
mmccr::0:1:::1:PC_IP_ADDR_LOOKUP:0:::node-21.localnet.com,0.000:OK:
mmccr::0:1:::1:PC_QUORUM_NODES:0:::10.0.100.21,10.0.100.22:OK:
mmccr::0:1:::1:FC_COMMITTED_DIR:0::0:7:OK:
mmccr::0:1:::1:TC_TIEBREAKER_DISKS:0:::1:OK:
The CCR check on the non-quorum node displays an output similar to this:
# mmccr check -Ye
mmccr::HEADER:version:reserved:reserved:NodeId:CheckMnemonic:ErrorCode:ErrorMsg:ListOfFailedEntities:ListOfSucceedEntities:Severity:
mmccr::0:1:::-1:CCR_CLIENT_INIT:0:::/var/mmfs/ccr,/var/mmfs/ccr/committed,/var/mmfs/ccr/ccr.nodes,Security:OK:
mmccr::0:1:::-1:FC_CCR_AUTH_KEYS:0:::/var/mmfs/ssl/authorized_ccr_keys:OK:
mmccr::0:1:::-1:FC_CCR_PAXOS_CACHED:0:::/var/mmfs/ccr/cached,/var/mmfs/ccr/cached/ccr.paxos:OK:
mmccr::0:1:::-1:PC_QUORUM_NODES:0:::10.0.100.21,10.0.100.22:OK:
The following list provides descriptions for each CCR check item:
- CCR_CLIENT_INIT
- Verifies whether the CCR directory structure and files are complete and intact. It also verifies whether the security layer that the CCR is using (GSKit) can be initialized successfully.
- FC_CCR_AUTH_KEYS
- Verifies that the CCR key file needed for authentication by the GSKit layer is available.
- FC_CCR_PAXOS_CACHED and FC_CCR_PAXOS_12
- Verify whether the CCR Paxos state files are available. On quorum nodes, these files are used during CCR's consensus protocol. In addition, a cached copy is kept on every node in the cluster to speed up the process in certain cases.
- PC_LOCAL_SERVER
- Pings the CCR server that is running on the local quorum node by sending a simple authenticated RPC through the configured IP address for the quorum node. This check item is applicable only for quorum nodes.
- PC_IP_ADDR_LOOKUP
- Measures how long the IP address lookup takes during the preceding ping RPC to the local CCR server (the PC_LOCAL_SERVER check). If the lookup exceeds a certain threshold, currently 5 seconds, this check returns a warning. This check item is applicable only for quorum nodes.
- PC_QUORUM_NODES
- Pings all specified quorum nodes by sending a simple RPC through their configured IP addresses. CCR uses the /var/mmfs/ccr/ccr.nodes file to look up the quorum nodes.
- FC_COMMITTED_DIR
- Verifies the integrity of the files in the /var/mmfs/ccr/committed directory. This check item is applicable only for quorum nodes.
- TC_TIEBREAKER_DISKS
- Verifies whether the CCR server has access to the configured tiebreaker disks. This check item is applicable only for quorum nodes.
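The colon-delimited output of mmccr check -Ye shown above lends itself to scripted evaluation. The following is a minimal Python sketch, not an IBM-provided tool: it maps each data row onto the field names from the HEADER row and lists the checks that are not clean. The function names parse_mmccr_check and failed_checks are hypothetical, and the sketch assumes that no field contains an unescaped colon, which holds for the sample output but is not guaranteed here.

```python
# Sample taken from the non-quorum node output shown in this topic.
SAMPLE = """\
mmccr::HEADER:version:reserved:reserved:NodeId:CheckMnemonic:ErrorCode:ErrorMsg:ListOfFailedEntities:ListOfSucceedEntities:Severity:
mmccr::0:1:::-1:CCR_CLIENT_INIT:0:::/var/mmfs/ccr,/var/mmfs/ccr/committed,/var/mmfs/ccr/ccr.nodes,Security:OK:
mmccr::0:1:::-1:FC_CCR_AUTH_KEYS:0:::/var/mmfs/ssl/authorized_ccr_keys:OK:
mmccr::0:1:::-1:FC_CCR_PAXOS_CACHED:0:::/var/mmfs/ccr/cached,/var/mmfs/ccr/cached/ccr.paxos:OK:
mmccr::0:1:::-1:PC_QUORUM_NODES:0:::10.0.100.21,10.0.100.22:OK:
"""


def parse_mmccr_check(output):
    """Return one dict per check row, keyed by the HEADER field names.

    Assumption: fields never contain a raw ':' character.
    """
    header = None
    rows = []
    for line in output.splitlines():
        fields = line.strip().split(":")
        # Skip the command echo, blank lines, and anything that is not mmccr output.
        if len(fields) < 3 or fields[0] != "mmccr":
            continue
        if fields[2] == "HEADER":
            header = fields
            continue
        if header is None:
            continue  # data row before any header: cannot be interpreted
        # Fields 0-2 are the record prefix; drop unnamed and reserved columns.
        row = {
            name: value
            for name, value in zip(header[3:], fields[3:])
            if name and name != "reserved"
        }
        rows.append(row)
    return rows


def failed_checks(rows):
    """Return the rows whose severity is not OK or whose error code is not 0."""
    return [
        row for row in rows
        if row.get("Severity") != "OK" or row.get("ErrorCode") != "0"
    ]


if __name__ == "__main__":
    for row in parse_mmccr_check(SAMPLE):
        print(row["CheckMnemonic"], row["Severity"])
```

In practice, the text passed to parse_mmccr_check would be the captured output of mmccr check -Ye on the node under investigation. For any row returned by failed_checks, the ListOfFailedEntities field names the files, nodes, or disks that the check could not verify.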
- The mmccr echo command can be used to send a simple test string to the
specified CCR server, as shown in the following example:
# mmccr echo -n node-21,node-22 testString
echo testString
echo testString
This command tests the CCR client/server connection. If a server in the specified node list does not echo testString, the connection between this client and that server is not working. In that case, check whether port 1191 is blocked by a firewall, or whether IP address lookup returns wrong results because of inconsistent /etc/hosts entries.
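When mmccr echo fails, the two suspects named above (name resolution and a blocked port 1191) can be checked separately. The following Python sketch is one possible way to do that and uses only the standard library. The function name probe_ccr_endpoint and the 5-second timeout are assumptions for illustration. A successful TCP connect shows only that the port is reachable, not that the CCR server answers correctly, so mmccr echo remains the authoritative test.

```python
import socket


def probe_ccr_endpoint(host, port=1191, timeout=5.0):
    """Resolve host, then try a plain TCP connect to the CCR port.

    Returns a dict with the resolved addresses, a reachability flag,
    and an error string (or None). No CCR protocol traffic is sent.
    """
    result = {"host": host, "addresses": [], "reachable": False, "error": None}

    # Step 1: name resolution, as configured on the node that runs this script.
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        result["error"] = f"name resolution failed: {exc}"
        return result
    result["addresses"] = sorted({info[4][0] for info in infos})

    # Step 2: TCP connect; a firewall typically causes a timeout or a refusal.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            result["reachable"] = True
    except OSError as exc:
        result["error"] = f"TCP connect failed: {exc}"
    return result


# Example use (host names are the ones from the sample output in this topic):
# for node in ("node-21.localnet.com", "node-22.localnet.com"):
#     print(probe_ccr_endpoint(node))
```

Running the probe on several nodes and comparing the addresses lists is a simple way to spot inconsistent /etc/hosts entries: every node should resolve a given quorum node name to the same address.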
- The CCR_DEBUG environment variable can be used in a CCR client command to
print detailed console output for debug purposes, as shown in the following
example:
# CCR_DEBUG=9 mmccr echo
Using /var/mmfs/ccr
Size of file '/var/mmfs/ccr/ccr.nodes': 76 bytes
readNodeList('/var/mmfs/ccr/ccr.nodes') size 2 COMMIT_PENDING NO ccrFileCommitVersion -1 nErr 0
updateNodeMaps: nodes: [(1, 'node-21.localnet.com', ('10.0.100.21', 1191)), (2, 'node-22.localnet.com', ('10.0.100.22', 1191))]
getNodeDataFromFile: nodeId: 5 daemonIpAddr: 10.0.100.25 adminNodeName: node-25.localnet.com daemonNodeName: node-25.localnet.com quorum node: No
Reading '/var/mmfs/gen/mmfsNodeData' returned 0
'/var/mmfs/ccr/ccr.disks' does not exist
ccrio init: ip= port=1191 node=-1 (0)
setCcrSecurity: ccrSecEnabled=1
Auth: secReady 1 cipherList 'AUTHONLY' keyGenNumber 2
connecting to node 127.0.0.1:1191 (timeout -1, handshaketimeout -1)
No cached out-connection found for address: 127.0.0.1:1191 (0)
connected to node 127.0.0.1:1191 (sock 4-0x55fbcad5bdf0)
sending msg 2 'debug' len 6 (sock 4) msgCnt 44065 TO -1
sent msg 2 'debug' (sock 4) ok
receiving from 127.0.0.1:1191 (sock 4) TO -1 0x55FBCAD5BDF0
waiting for hdr (len 8)
In receive after recvRpcHdr sock 4
hdr '0x04 0x00 0x00 0x00 0x00 0x02 0x00 0x01' (8 bytes)
waiting for data (len 4)
received msg 0 'ok' len 4 (sock 4) msgCnt 1
closing connection to 127.0.0.1:1191 (sock 4-0x55fbcad5bdf0)
closeSocket: shutdown socket 4 returned rc 0
closeSocket: close socket 4 (linger: Yes) returned rc 0
debug response: err 0 type 0 (ok) len 4
echo
Command 'echo' returned err: 0
exit code: 0
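To keep such debug output for later analysis, the command can be run from a script with CCR_DEBUG set only for that one invocation. The following Python sketch shows one way to do this. The helper name run_with_ccr_debug and the 60-second timeout are hypothetical, and because this topic does not state whether the debug text goes to standard output or standard error, the sketch captures both streams together.

```python
import os
import subprocess


def run_with_ccr_debug(command, level=9, timeout=60):
    """Run a command with CCR_DEBUG set in its environment only.

    command is an argument list such as ["mmccr", "echo"]; it must be
    resolvable through PATH or given as an absolute path.
    Returns (exit_code, combined stdout and stderr text).
    """
    # Copy the current environment so that the calling shell stays unchanged.
    env = dict(os.environ, CCR_DEBUG=str(level))
    proc = subprocess.run(
        command,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge streams; the destination is not documented here
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout


# Example use on a cluster node (not executed here):
# rc, output = run_with_ccr_debug(["mmccr", "echo"])
# with open("/tmp/ccr_debug.txt", "w") as fh:  # hypothetical output path
#     fh.write(output)
```

Setting the variable through the env argument rather than exporting it in the shell avoids leaving debug output enabled for subsequent commands.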