SystemAdmin

Pinned topic Help troubleshoot my pureScale setup

‏2012-06-18T20:11:19Z |
I installed DB2 10.1 pureScale on 3 nodes running SLES 11 SP2. The install went OK, with RSCT and GPFS working well. I then shut down the machines and restarted them. GPFS comes up fine, but RSCT does not start.

This is what my lssrc -a output shows on all nodes:

Subsystem         Group            PID     Status
ctrmc             rsct             2960    active
IBM.ERRM          rsct_rm          3073    active
ctcas             rsct             3149    active
IBM.SensorRM      rsct_rm                  inoperative
IBM.LPRM          rsct_rm                  inoperative
cthats            cthats                   inoperative
cthags            cthags                   inoperative
cthagsglsm        cthags                   inoperative
IBM.StorageRM     rsct_rm                  inoperative
IBM.RecoveryRM    rsct_rm                  inoperative
IBM.TestRM        rsct_rm                  inoperative
IBM.GblResRM      rsct_rm                  inoperative
IBM.AuditRM       rsct_rm                  inoperative
IBM.ConfigRM      rsct_rm                  inoperative
IBM.HostRM        rsct_rm                  inoperative
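
For anyone who wants to compare this listing across nodes, here is a rough Python sketch that pulls the inoperative subsystems out of `lssrc -a` style text. It is not part of RSCT; the function names `parse_lssrc` and `inoperative` are made up for illustration, and the parser assumes whitespace-separated columns where the PID column is simply missing for subsystems that are not running, as in the listing above.

```python
# Sample rows copied from the lssrc -a listing in this post.
SAMPLE = """\
Subsystem Group PID Status
ctrmc rsct 2960 active
IBM.ERRM rsct_rm 3073 active
cthats cthats inoperative
IBM.ConfigRM rsct_rm inoperative
"""


def parse_lssrc(text):
    """Parse lssrc -a style text into a list of dicts (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 2 or fields[0] == "Subsystem":
            continue
        name, status = fields[0], fields[-1]
        middle = fields[1:-1]
        pid = None
        # Assumption: a purely numeric field right before the status is the PID.
        if middle and middle[-1].isdigit():
            pid = int(middle.pop())
        group = middle[0] if middle else None
        rows.append({"subsystem": name, "group": group, "pid": pid, "status": status})
    return rows


def inoperative(rows):
    """Return the names of subsystems whose status is 'inoperative'."""
    return [r["subsystem"] for r in rows if r["status"] == "inoperative"]


if __name__ == "__main__":
    for name in inoperative(parse_lssrc(SAMPLE)):
        print(name)
```

To use it on a live node, the output of `lssrc -a` could be saved to a file and passed in as a string in place of `SAMPLE`.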

I understand that RSCT starts the required subsystems automatically, but I cannot figure out the cause of this failure. The lssam and lsrpdomain output is as follows:

node03:~ # lssam
lssam: No resource groups defined or cluster is offline!
node03:~ # lsrpdomain
2610-412 A Resource Manager terminated while attempting to enumerate resources for this command.
2610-408 Resource selection could not be performed.
2610-412 A Resource Manager terminated while attempting to enumerate resources for this command.
2610-408 Resource selection could not be performed.

Even if I start IBM.ConfigRM manually (which I am not supposed to do), it starts and then shows as inoperative again, so it seems that ctrmc shuts it down.

Any help is appreciated. I can open a PMR, but it would take days to reach the right person who knows this stuff, so I am posting here.

Thanks
  • SystemAdmin

    Re: Help troubleshoot my pureScale setup

    ‏2012-06-19T03:23:27Z  
    Some more information:

    When I try to run the command:

    
    # startsrc -s IBM.ConfigRM
    


    I see the following in syslog:

    
    Jun 18 23:14:06 node02 ConfigRM[10771]: (Recorded using libct_ffdc.a cv 2):::Error ID: :::Reference ID:  :::Template ID: 0:::Details File:  :::Location: RSCT,IBM.ConfigRMd.C,1.57,347                 :::CONFIGRM_STARTED_ST IBM.ConfigRM daemon has started.
    Jun 18 23:14:07 node02 condrespV10_resp.ksh[10793]: scriptEntry 'node02' '2'
    Jun 18 23:14:07 node02 condrespV10_resp.ksh[10793]: strongQuorum=1
    Jun 18 23:14:07 node02 condrespV10_resp.ksh[10793]: Strong Quorum == 1, must be reboot - do nothing
    Jun 18 23:14:07 node02 condrespV10_resp.ksh[10846]: scriptEntry 'node02' ''
    Jun 18 23:14:07 node02 srcmstr: src_error=-9035, errno=0, module='srchevn.c'@line:'252', 0513-035 The IBM.ConfigRM Subsystem ended abnormally. SRC will try and restart it.
    Jun 18 23:14:07 node02 condrespV10_resp.ksh[10846]: Error: ERRM_VALUE is not set: determine correct value
    Jun 18 23:14:07 node02 ConfigRM[10856]: (Recorded using libct_ffdc.a cv 2):::Error ID: :::Reference ID:  :::Template ID: 0:::Details File:  :::Location: RSCT,IBM.ConfigRMd.C,1.57,347                 :::CONFIGRM_STARTED_ST IBM.ConfigRM daemon has started.
    Jun 18 23:14:07 node02 /var/mmfs/etc/gpfsready[10888]: runact -c IBM.PeerDomain VerifyStrongQuorumState is non-zero!: 1
    Jun 18 23:14:07 node02 condrespV10_resp.ksh[10885]: scriptEntry 'node02' '2'
    Jun 18 23:14:07 node02 condrespV10_resp.ksh[10846]: ifconfig eth0 produced no output - Unable to determine adapter status
    Jun 18 23:14:08 node02 condrespV10_resp.ksh[10935]: scriptEntry 'node02' ''
    Jun 18 23:14:08 node02 srcmstr: src_error=-9035, errno=0, module='srchevn.c'@line:'252', 0513-035 The IBM.ConfigRM Subsystem ended abnormally. SRC will try and restart it.
    Jun 18 23:14:08 node02 condrespV10_resp.ksh[10935]: Error: ERRM_VALUE is not set: determine correct value
    Jun 18 23:14:08 node02 condrespV10_resp.ksh[10885]: strongQuorum=1
    Jun 18 23:14:08 node02 ConfigRM[10945]: (Recorded using libct_ffdc.a cv 2):::Error ID: :::Reference ID:  :::Template ID: 0:::Details File:  :::Location: RSCT,IBM.ConfigRMd.C,1.57,347                 :::CONFIGRM_STARTED_ST IBM.ConfigRM daemon has started.
    Jun 18 23:14:08 node02 condrespV10_resp.ksh[10885]: Strong Quorum == 1, must be reboot - do nothing
    Jun 18 23:14:08 node02 condrespV10_resp.ksh[10975]: scriptEntry 'node02' '2'
    Jun 18 23:14:08 node02 condrespV10_resp.ksh[10935]: ifconfig eth0 produced no output - Unable to determine adapter status
    Jun 18 23:14:09 node02 condrespV10_resp.ksh[10975]: strongQuorum=1
    Jun 18 23:14:09 node02 condrespV10_resp.ksh[10975]: Strong Quorum == 1, must be reboot - do nothing
    Jun 18 23:14:10 node02 srcmstr: src_error=-9020, errno=0, module='srchevn.c'@line:'409', 0513-020 The IBM.ConfigRM Subsystem did not end normally. The subsystem respawn limit has been exceeded. Check the Subsystem and restart it manually.
    Jun 18 23:14:10 node02 condrespV10_resp.ksh[11026]: scriptEntry 'node02' ''
    Jun 18 23:14:10 node02 condrespV10_resp.ksh[11026]: Error: ERRM_VALUE is not set: determine correct value
    Jun 18 23:14:11 node02 condrespV10_resp.ksh[11026]: ifconfig eth0 produced no output - Unable to determine adapter status
    Jun 18 23:15:01 node02 su: (to db2psc) root on none
    Jun 18 23:15:02 node02 sudo:   db2psc : TTY=unknown ; PWD=/home/db2psc ; USER=root ; COMMAND=/bin/df -k
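
Since the syslog capture above is dense, a small Python sketch like the one below can help separate the entries and tally the SRC message codes (0513-035 for each abnormal end, 0513-020 once the respawn limit is hit). This is only an illustration: the helper names `split_entries` and `count_src_codes` are invented here, and the timestamp pattern assumes the classic `Mon DD HH:MM:SS` syslog prefix seen in this excerpt.

```python
import re
from collections import Counter

# Assumption: every entry starts with a classic syslog timestamp such as "Jun 18 23:14:07 ".
TIMESTAMP = re.compile(r"[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} ")
# SRC message numbers such as 0513-035 or 0513-020.
SRC_CODE = re.compile(r"\b(0513-\d{3})\b")

# Three srcmstr lines and one script line copied from the log in this post.
SAMPLE = (
    "Jun 18 23:14:07 node02 srcmstr: src_error=-9035, errno=0, module='srchevn.c'@line:'252', "
    "0513-035 The IBM.ConfigRM Subsystem ended abnormally. SRC will try and restart it. "
    "Jun 18 23:14:07 node02 condrespV10_resp.ksh[10846]: Error: ERRM_VALUE is not set: "
    "determine correct value "
    "Jun 18 23:14:08 node02 srcmstr: src_error=-9035, errno=0, module='srchevn.c'@line:'252', "
    "0513-035 The IBM.ConfigRM Subsystem ended abnormally. SRC will try and restart it. "
    "Jun 18 23:14:10 node02 srcmstr: src_error=-9020, errno=0, module='srchevn.c'@line:'409', "
    "0513-020 The IBM.ConfigRM Subsystem did not end normally. The subsystem respawn limit "
    "has been exceeded. Check the Subsystem and restart it manually."
)


def split_entries(blob):
    """Split run-together syslog text into one string per entry."""
    starts = [m.start() for m in TIMESTAMP.finditer(blob)]
    ends = starts[1:] + [len(blob)]
    # Collapse internal whitespace so wrapped entries become single lines.
    return [" ".join(blob[a:b].split()) for a, b in zip(starts, ends)]


def count_src_codes(entries):
    """Count 0513-xxx message codes in entries logged by srcmstr."""
    counts = Counter()
    for entry in entries:
        if " srcmstr: " in entry:
            counts.update(SRC_CODE.findall(entry))
    return counts


if __name__ == "__main__":
    entries = split_entries(SAMPLE)
    for entry in entries:
        print(entry)
    print(dict(count_src_codes(entries)))
```

On the sample, this separates four entries and shows two abnormal ends followed by one respawn-limit message, which matches the sequence visible in the full log.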
    


    One thing that is of interest is:

    
    node02 condrespV10_resp.ksh[10846]: ifconfig eth0 produced no output - Unable to determine adapter status
    


    But if I run the command ifconfig eth0 myself, I get the following output. Does this mean that the condrespV10_resp.ksh script is not working?

    
    node02:~ # ifconfig eth0
    eth0      Link encap:Ethernet  HWaddr 00:0C:29:16:14:9F
              inet addr:192.168.142.102  Bcast:192.168.255.255  Mask:255.255.0.0
              inet6 addr: fe80::20c:29ff:fe16:149f/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:153346 errors:0 dropped:0 overruns:0 frame:0
              TX packets:124306 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:47675335 (45.4 Mb)  TX bytes:18589498 (17.7 Mb)
    


    Any clue how to fix this error:

    
    0513-020 The IBM.ConfigRM Subsystem did not end normally. The subsystem respawn limit has been exceeded. Check the Subsystem and restart it manually.
    


    The network between all nodes is working, since GPFS is active. Running mmlscluster on any node gives this output:

    
    node02:~ # mmlscluster

    GPFS cluster information
    ========================
      GPFS cluster name:         db2cluster_20120615191919.purescale.ibm.local
      GPFS cluster id:           13882502421447164663
      GPFS UID domain:           db2cluster_20120615191919.purescale.ibm.local
      Remote shell command:      /usr/bin/ssh
      Remote file copy command:  /usr/bin/scp

    GPFS cluster configuration servers:
    -----------------------------------
      Primary server:    node02.purescale.ibm.local
      Secondary server:  node03.purescale.ibm.local

     Node  Daemon node name            IP address       Admin node name             Designation
    -----------------------------------------------------------------------------------------------
       1   node02.purescale.ibm.local  192.168.142.102  node02.purescale.ibm.local  quorum-manager
       2   node03.purescale.ibm.local  192.168.142.103  node03.purescale.ibm.local  quorum-manager
       3   node04.purescale.ibm.local  192.168.142.104  node04.purescale.ibm.local  quorum-manager
    
  • SystemAdmin

    Re: Help troubleshoot my pureScale setup

    ‏2012-06-24T03:21:11Z  
    Can you please make sure that you have updated ksh to the required version?
    I guess you need to update ksh to ksh-93u-0.6.1.

    We faced similar errors and got rid of them by updating ksh to ksh-93u-0.6.1.

    Thanks,
    Prabhu.S
  • SystemAdmin

    Re: Help troubleshoot my pureScale setup

    ‏2012-07-03T03:39:35Z  
    Did you try again after updating ksh?
  • SystemAdmin

    Re: Help troubleshoot my pureScale setup

    ‏2012-07-03T12:52:06Z  
    Yes, I tried updating ksh too, but it did not help.
  • SystemAdmin

    Re: Help troubleshoot my pureScale setup

    ‏2012-11-08T09:47:50Z  
    Hello Anabas,

    Did you manage to solve this problem? I am facing exactly the same issue in a pureScale 10.1 installation on SLES 11 SP2.

    Thanks in advance and best regards,
    Miguel
  • sedgewick_de

    Re: Help troubleshoot my pureScale setup

    ‏2012-11-08T11:57:24Z  
    Hi,

    This looks like a bug, so I recommend opening a PMR against RSCT for it.

    Regards,
    Markus Müller
  • SystemAdmin

    Re: Help troubleshoot my pureScale setup

    ‏2012-11-08T14:21:44Z  
    I opened a PMR on this and the answer was lightning quick. I love it.

    "You are running on a non-supported platform."
  • SystemAdmin

    Re: Help troubleshoot my pureScale setup

    ‏2012-11-10T14:59:28Z  
    Let me guess... vSphere?
  • SystemAdmin

    Re: Help troubleshoot my pureScale setup

    ‏2012-11-13T17:45:21Z  
    Actually not vSphere. When I opened this thread, SLES 11 SP2 was not on the list of supported OS platforms for DB2 pureScale, so I was told firmly that I could not try this on SLES 11 SP2, and I obeyed.

    Since then, a lot of water has flowed down the Hudson. I installed SLES 11 SP2 on 4 VMs in VMware Workstation 9.0 and installed DB2 10.1 FP1 (please note: FP1), and it went smoothly, as if this had never been a problem.

    So, those who are hitting this problem, please use DB2 10.1 FP1.

    PS: I think RSCT still holds the line that the product is not supported on VMware Workstation, but I see no problem using it.
  • SystemAdmin

    Re: Help troubleshoot my pureScale setup

    ‏2012-11-13T18:04:49Z  
    Oops. I did not reread my own post to check what the root problem was. Sorry for the knee-jerk reply.

    I just checked: after I rebooted all machines, DB2 does not come up. The problem remains, even though the install went fine and DB2 started properly right after the install.

    Alas, it did not get fixed even in DB2 10.1 FP1.