Topic
  • 2 replies
  • Latest Post - 2013-01-22T19:04:07Z by SystemAdmin
SystemAdmin
2092 Posts

Problem starting GPFS (starts fine on the first server only)

2013-01-22T18:33:03Z
Hi,

Here is a question regarding a server-to-server communication problem; any hints are greatly appreciated.

Many thanks,

Martin


Summary: we are able to start the GPFS daemon on one (even remote) server, but when attempting to start it on the second server in the cluster, the mmfsd process ends up waiting forever for communication with the other server in the cluster.

Details:

We run RHEL 6.3:

root@abc271l ~# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.3 (Santiago)

The GPFS version is 3.5.0.7 (the current patch level, from Dec 2012).

The cluster setup is very simple, just 2 servers:

root@abc271l ~# mmlsconfig
Configuration data for cluster abc270l.<domain_name>:
myNodeConfigNumber 2
clusterName abc270l.<domain_name>
clusterId 10889494516390018906
autoload no
dmapiFileHandleSize 32
minReleaseLevel 3.5.0.7
adminMode allToAll

File systems in cluster abc270l.<domain_name>:
-----------------------------------------------
(none)

root@abc271l ~# cat /var/mmfs/gen/mmsdrfs
%%9999%%:00_VERSION_LINE::1323:3:4::lc:abc270l.<domain_name>:::/usr/bin/ssh:/usr/bin/scp:10889494516390018906:lc2:1358806878::abc270l.<domain_name>:0:0:0:0:::::0.0:
%%home%%:03_COMMENT::1:
%%home%%:03_COMMENT::2:    This is a machine generated file.  Do not edit!
%%home%%:03_COMMENT::3:
%%home%%:03_COMMENT::4:2013.01.21.17.21.18:2:1:mmcrcluster -N abc270l:manager-quorum,abc271l:manager -p abc270l -r /usr/bin/ssh -R /usr/bin/scp
%%home%%:03_COMMENT::5:2013.01.21.17.22.29:3:1:mmchlicense server -N abc270l.<domain_name>,abc271l.<domain_name>
%%home%%:03_COMMENT::6:2013.01.21.17.28.38:4:1:mmchconfig adminMode=allToAll
%%home%%:10_NODESET_HDR:::2:TCP::1191:::11:1323:1323:L:4:::22:22:::::::::
%%home%%:20_MEMBER_NODE::1:1:abc270l:<address_fragment>6:abc270l.<domain_name>:manager::::::abc270l.<domain_name>:abc270l:1323:3.5.0.7:Linux:Q::::::server::
%%home%%:20_MEMBER_NODE::2:2:abc271l:<address_fragment>7:abc271l.<domain_name>:manager::::::abc271l.<domain_name>:abc271l:1323:3.5.0.7:Linux:N::::::server::
%%home%%:70_MMFSCFG::1:#   :::::::::::::::::::::::
%%home%%:70_MMFSCFG::2:#   WARNING:   This is a machine generated file.  Do not edit!      ::::::::::::::::::::::
%%home%%:70_MMFSCFG::3:#   Use the mmchconfig command to change configuration parameters.  :::::::::::::::::::::::
%%home%%:70_MMFSCFG::4:#   :::::::::::::::::::::::
%%home%%:70_MMFSCFG::5:clusterName abc270l.<domain_name>:::::::::::::::::::::::
%%home%%:70_MMFSCFG::6:clusterId 10889494516390018906:::::::::::::::::::::::
%%home%%:70_MMFSCFG::7:autoload no:::::::::::::::::::::::
%%home%%:70_MMFSCFG::8:dmapiFileHandleSize 32:::::::::::::::::::::::
%%home%%:70_MMFSCFG::9:minReleaseLevel 1323 3.5.0.7:::::::::::::::::::::::

[root@abc271l ~]# mmlscluster

GPFS cluster information
========================
GPFS cluster name:         abc270l.<domain_name>
GPFS cluster id:           10889494516390018906
GPFS UID domain:           abc270l.<domain_name>
Remote shell command:      /usr/bin/ssh
Remote file copy command:  /usr/bin/scp

GPFS cluster configuration servers:
-----------------------------------
Primary server:    abc270l.<domain_name>
Secondary server:  (none)

 Node  Daemon node name        IP address   Admin node name         Designation
--------------------------------------------------------------------------------
   1   abc270l.<domain_name>  <address_fragment>6  abc270l.<domain_name>  quorum-manager
   2   abc271l.<domain_name>  <address_fragment>7  abc271l.<domain_name>  manager


Starting with GPFS down on both servers:

root@abc271l ~# mmgetstate -a -L

 Node number  Node name       Quorum  Nodes up  Total nodes  GPFS state  Remarks
------------------------------------------------------------------------------------
       1      abc270l            0        0          2       down        quorum node
       2      abc271l            0        0          2       down


First, start GPFS on the other server:

root@abc271l ~# mmstartup -n abc270l
Tue Jan 22 11:14:44 EST 2013: mmstartup: Starting GPFS ...


This seems to work fine:

root@abc271l ~# mmgetstate -a -L

 Node number  Node name       Quorum  Nodes up  Total nodes  GPFS state  Remarks
------
       1      abc270l            1        1          2       active      quorum node
       2      abc271l            0        0          2       down
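For anyone who wants to script this check instead of eyeballing the table, here is a minimal Python sketch that parses the `mmgetstate -a -L` text and lists the nodes that are not `active`. The column layout is assumed from the output pasted in this post, and the names `parse_mmgetstate` and `nodes_not_active` are made up for illustration; they are not part of GPFS.

```python
import re

# Sample text copied from the `mmgetstate -a -L` output in this post.
SAMPLE = """\
 Node number  Node name       Quorum  Nodes up  Total nodes  GPFS state  Remarks
------
       1      abc270l            1        1          2       active      quorum node
       2      abc271l            0        0          2       down
"""

# One data row: node number, node name, three counters, state, optional remarks.
# Assumption: the column order matches the output shown in this post.
ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s*(.*)$")


def parse_mmgetstate(text):
    """Return one dict per node row; header and separator lines are skipped."""
    nodes = []
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, dashes, or blank line
        number, name, quorum, up, total, state, remarks = match.groups()
        nodes.append({
            "number": int(number),
            "name": name,
            "quorum": int(quorum),
            "nodes_up": int(up),
            "total_nodes": int(total),
            "state": state,
            "remarks": remarks.strip(),
        })
    return nodes


def nodes_not_active(text):
    """Names of nodes whose GPFS state is anything other than 'active'."""
    return [n["name"] for n in parse_mmgetstate(text) if n["state"] != "active"]


if __name__ == "__main__":
    print(nodes_not_active(SAMPLE))
```

In practice the text would come from running `mmgetstate -a -L` (for example via `subprocess`), and a state such as `arbitrating` would be reported the same way as `down`.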


BTW: we've verified that ssh and scp work fine both ways without prompting for passwords (based on the "authorized_keys" files in the ".ssh" directories). We've also verified that /var/mmfs/gen/mmsdrfs and other settings are propagated correctly from the primary server.
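The passwordless ssh check can also be done non-interactively. The sketch below is one possible way, not what was actually run here: it calls `/usr/bin/ssh` (the remote shell path shown by mmlscluster) with the OpenSSH options `BatchMode=yes` and `ConnectTimeout`, so that ssh fails instead of prompting. The function names and the timeout values are assumptions for illustration.

```python
import subprocess


def build_ssh_probe(host, timeout_s=5):
    """Build an ssh command line that fails rather than prompting for a password."""
    return [
        "/usr/bin/ssh",                       # remote shell path from mmlscluster
        "-o", "BatchMode=yes",                # never ask for a password or passphrase
        "-o", f"ConnectTimeout={timeout_s}",  # give up if the host does not answer
        host,
        "true",                               # harmless remote command
    ]


def passwordless_ssh_ok(host, timeout_s=5):
    """Return True if `ssh host true` succeeds without any interaction."""
    try:
        result = subprocess.run(
            build_ssh_probe(host, timeout_s),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 5,            # hard limit for the whole call
        )
    except (OSError, subprocess.TimeoutExpired):
        return False                          # ssh missing, or the call hung
    return result.returncode == 0


if __name__ == "__main__":
    for node in ("abc270l", "abc271l"):
        print(node, passwordless_ssh_ok(node))
```

Running it from each server against the other covers both directions.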

Now let's start the GPFS daemon on the local machine:


root@abc271l ~# mmstartup -n abc271l
Tue Jan 22 11:15:31 EST 2013: mmstartup: Starting GPFS ...

[root@abc271l ~]# mmgetstate -a -L

 Node number  Node name       Quorum  Nodes up  Total nodes  GPFS state  Remarks
-----
       1      abc270l            1        1          2       active      quorum node
       2      abc271l            1        0          2       arbitrating


Not very good...
In fact, the process seems to be stuck waiting for communication with the other server:


root@abc271l ~# cat /var/adm/ras/mmfs.log.latest
Tue Jan 22 11:15:33 EST 2013: runmmfs starting
Removing old /var/adm/ras/mmfs.log.* files:
Unloading modules from /lib/modules/2.6.32-279.19.1.el6.x86_64/extra
Loading modules from /lib/modules/2.6.32-279.19.1.el6.x86_64/extra
Module                  Size  Used by
mmfs26               1749012  0
mmfslinux             310838  1 mmfs26
tracedev               29456  2 mmfs26,mmfslinux
Tue Jan 22 11:15:37.272 2013: mmfsd initializing. {Version: 3.5.0.7   Built: Dec 12 2012 19:00:50}
...
Tue Jan 22 11:15:48.974 2013: Connecting to <address_fragment>6 abc270l <c0p0>


And that's where the mmfsd process seems to get stuck.
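To spot this symptom in a script, one rough heuristic is to check whether the last non-empty line of mmfs.log.latest is a "Connecting to" message. The Python sketch below does that. The line format is assumed from the single log line above, the function name `pending_peer` is hypothetical, and treating a trailing "Connecting to" as "stuck" is only a guess that holds if nothing else is logged afterwards.

```python
import re

# Tail of the log as pasted in this post (the address is redacted in the original).
SAMPLE_LOG = """\
Tue Jan 22 11:15:37.272 2013: mmfsd initializing. {Version: 3.5.0.7   Built: Dec 12 2012 19:00:50}
Tue Jan 22 11:15:48.974 2013: Connecting to <address_fragment>6 abc270l <c0p0>
"""

# Assumption: the message reads "Connecting to <address> <node name> ...".
CONNECTING = re.compile(r"Connecting to (\S+) (\S+)")


def pending_peer(log_text):
    """Return (address, node_name) if the log ends on a 'Connecting to' line, else None."""
    lines = [line for line in log_text.splitlines() if line.strip()]
    if not lines:
        return None
    match = CONNECTING.search(lines[-1])
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    print(pending_peer(SAMPLE_LOG))
```

A non-None result only says which peer the daemon was last trying to reach; it does not say why the connection is not completing.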
Updated on 2013-01-22T19:04:07Z by SystemAdmin
  • dlmcnabb
    1012 Posts

    Re: Problem starting GPFS (starts fine on the first server only)

    2013-01-22T18:51:44Z
    I would guess that you have a firewall blocking connections to port 1191, which the GPFS daemons use to talk to each other.
  • SystemAdmin
    2092 Posts

    Re: Problem starting GPFS (starts fine on the first server only)

    2013-01-22T19:04:07Z
    Nothing like talking to real experts - the firewall suggestion worked.

    Many thanks,

    Martin
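For anyone hitting the same symptom, the firewall guess can be tested quickly with a plain TCP connection attempt to port 1191. The Python sketch below is only an illustration: the function name `can_connect` and the 3-second timeout are assumptions, and the only value taken from this thread is the port number. Run it from the node that hangs, against the node where mmfsd is already active (if mmfsd is not running on the target, nothing listens on the port and the probe fails for that reason instead).

```python
import socket

GPFS_DAEMON_PORT = 1191  # daemon port named in the reply above


def can_connect(host, port=GPFS_DAEMON_PORT, timeout_s=3.0):
    """Return True if a TCP connection to host:port can be opened within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts (dropped packets), and name lookup errors.
        return False


if __name__ == "__main__":
    # Node name taken from the cluster in this thread.
    print("abc270l:1191 reachable:", can_connect("abc270l"))
```

If ssh to the same host works but this probe returns False while mmfsd is active there, a packet filter between the two nodes is a likely suspect.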