26 replies Latest Post - ‏2012-11-16T16:32:03Z by jerberstark
stevedan

What's the best way to cause a system to create a crashdump in case of a system hang?

2012-09-07T18:35:35Z
 I've been able to set up some of our systems with a kdump installation and get crashdumps.  I believe that what we are getting instead of system crashes is system hangs.  I think that what we may need is a way of causing a crash when the system is unresponsive and we can't get to a command line to initiate a crash.  In some situations we might want to cause a crash to get a vmcore file on an unattended system that is hung.
 
I am considering the following methods of forcing a crash:
 
1.  Using the service processor from a remote system to force a crash over the network.  I know a system can be powered down this way, but we need to initiate a crashdump.
 
2.  Using an NMI (non-maskable interrupt).  I believe I tried to get this working before through the sysctl setup, without success.
 
3.  Using the IPMI interface to set up a watchdog timer to the hardware.  This should, if it works, get a system back into operation as well as get us a crashdump.
 
My question is whether there are any caveats to these features working on our Linux systems running on Power 5 ppc64 hardware.  Stated differently, is there a recommended method to create a crashdump for Linux on Power?
Updated on 2012-11-16T16:32:03Z by jerberstark
  • Brian_King
    ‏2012-09-07T19:49:22Z  in response to stevedan
     The two methods I've used to trigger a crashdump are:
     
    1. Via sysrq:
     a. Enable sysrq: echo 1 > /proc/sys/kernel/sysrq
     b. Trigger the crashdump at the Linux LPAR console via: ctrl-o c
    2. Via the management console. Select the LPAR and issue a "dump restart".
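
While a root shell is still available (for example, when testing the setup before a hang occurs), the same crash can be requested without the console key sequence. The following is a minimal sketch, assuming the standard /proc/sysrq-trigger interface. The function name and the two override variables are made up for illustration, so that the logic can be dry-run against scratch files.

```shell
#!/bin/sh
# Sketch only: enable SysRq, then ask the kernel to crash ('c').
# With the default paths this panics the machine at once, so run it as root
# on a test system only. SYSRQ_ENABLE and SYSRQ_TRIGGER are hypothetical
# overrides that let the function be tried on ordinary files.
trigger_test_crash() {
    sysrq_enable="${SYSRQ_ENABLE:-/proc/sys/kernel/sysrq}"
    sysrq_trigger="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"
    echo 1 > "$sysrq_enable" || return 1   # same as step 1a
    echo c > "$sysrq_trigger"              # shell equivalent of the 'c' key
}

# Intentionally not called here. To use it on a test system:
#   trigger_test_crash
```

Like the console sequence, this only helps while the system still accepts commands; it does not address a hard hang.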
    • robinwcox2
      ‏2012-09-07T20:41:06Z  in response to Brian_King
       Re: a.  I'm getting the impression that it's too late to enter commands once the system is hung. 
       
      Need more info on what LPAR is in relation to RH5 Linux on Power5.  We no longer have working HMCs.
       Where would I go to put this in our context?
      • Bill_Buros
        ‏2012-09-07T21:08:19Z  in response to robinwcox2
        I assume you mean your Power5 is a single-system install of RHEL5. In that case, the "LPAR" is that single-system image.
  • hbabu
    ‏2012-09-07T20:19:03Z  in response to stevedan
    When the system is hard hung, a soft reset is the only reliable way to get a dump. That means 'dump restart' from the HMC if an HMC is used (for example, select 'operations' for the specific LPAR -> 'restart' and 'dump'), or an NMI from the service processor / ASM. If you do not have either of these interfaces, you can press the yellow button on the system softly. Note that pressing this button hard reboots the system, so the HMC or ASM interfaces are the best options.

    First, please check whether kdump is set up properly: 'cat /sys/kernel/kexec_crash_loaded' should return 1. You can also take a test dump using 'echo 1 > /proc/sys/kernel/sysrq' followed by 'echo c > /proc/sysrq-trigger'.
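
The kdump check described above can be wrapped in a small helper. This is a rough sketch rather than an official tool. The function name is hypothetical, and the optional path argument exists only so the check can be exercised against a scratch file instead of the real sysfs entry.

```shell
#!/bin/sh
# Sketch: report whether a kdump (crash) kernel is currently loaded.
# Returns 0 if the sysfs entry contains 1, 1 if it contains anything else,
# and 2 if the entry is missing or unreadable.
kdump_loaded() {
    state_file="${1:-/sys/kernel/kexec_crash_loaded}"
    [ -r "$state_file" ] || return 2
    [ "$(cat "$state_file")" = "1" ]
}

# Example use before relying on a dump:
#   kdump_loaded || echo "kdump kernel is not loaded; fix kdump first" >&2
```

A non-zero result means a forced crash would most likely just reboot the system without producing a vmcore.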
     
     
    • robinwcox2
      ‏2012-09-07T20:42:53Z  in response to hbabu
       Already plan to check out using the service processor.  (Item 1 at top.)  Have no working HMC.
      • robinwcox2
        ‏2012-09-10T21:16:05Z  in response to robinwcox2
         I've seen reference to wd_keepalive (a simplified watchdog daemon),  watchdog man pages, IPMI watchdog on:
         
                 
         
        So far, the documentation doesn't say which platforms this applies to.  Can I assume that it all applies to the P5? Or would it apply to ALL IBM platforms?
         
        The watchdog daemon mentions interfacing with the hardware, but I don't  see where it says if this is automatic or not or describes how to interface with the hardware watchdog.  Just scratching the surface relative to IPMI.
        • jerberstark
          ‏2012-09-10T23:04:20Z  in response to robinwcox2
          Hi Robin,
           
          The blueprint you linked to is written for System x. Below is a snippet from the Scope, requirements, and support page of the blueprint.  I also looked at the Supported features for PowerLinux systems, and we don't have IPMI listed there. Hence, it looks like IPMI isn't supported on PowerLinux systems. If someone else from the team has other info, I will gladly update the documentation. 
           

          Hardware requirements

          The hardware for this blueprint includes IPMI hardware with RHEL 5.2 or SLES 10.2 installed. If you are planning to install the latest version of IPMItool, you will need the Development Tools package group if your machine is running RHEL. For SLES, the C/C++ Compiler & Tools package pattern is sufficient.

          Note that all the instructions here are based on IPMI 2.0 hardware.

          For more information about servers that contain BMCs and thus support IPMI, see Appendix D: System Management overview in the IBM System x Online Configuration and Options Guide (COG) at http://www.ibm.com/systems/xbc/cog/appendixD/appxsysmgmtsupport.html.

          This blueprint was tested on System x stand-alone servers and IBM BladeCenter servers with BMC hardware.

          To discover the IPMI version (1.5 or 2.0) on your server, run the command:

          # ipmitool mc info
           
           
          • robinwcox2
            ‏2012-09-10T23:19:29Z  in response to jerberstark
             Thanks.
             
            I can do a man command on IPMI and get results.  I don't see anything related to watchdog within it.
             
            This leaves open whether the other watchdog interfaces with the hardware.  I don't know if the kernel is running when it hangs.  We can't get to any command line to poke 1 into /proc/sys/kernel/sysrq.  If the system is so hung that the kernel itself is hung, and we can't set up a hardware watchdog, we might not be able to cause a crash.

            Tried the above command on the P5 and got:
             
             Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0:  No such file or directory
            Get Device ID command failed
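
The error above says that none of the three device nodes exists, which normally means no IPMI driver has registered a device on this system. A quick way to confirm that before running ipmitool is sketched below. The function name and the DEV_ROOT override are hypothetical and are there only so the check can be dry-run against a scratch directory.

```shell
#!/bin/sh
# Sketch: look for any of the device nodes named in the ipmitool error.
# Prints the first node found and returns 0; returns 1 if none exists.
# DEV_ROOT is a made-up prefix for testing; it defaults to the real root.
find_ipmi_device() {
    root="${DEV_ROOT:-}"
    for node in /dev/ipmi0 /dev/ipmi/0 /dev/ipmidev/0; do
        if [ -e "$root$node" ]; then
            echo "$node"
            return 0
        fi
    done
    return 1
}

# Example:
#   find_ipmi_device || echo "no IPMI device node present" >&2
```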
        • hbabu
          ‏2012-09-10T23:30:34Z  in response to robinwcox2
          Generally, IPMI is used to send an NMI on x86/x86_64 systems, since they have no other way of sending one. The watchdog daemon monitors the system and reboots it in hang scenarios. I think the user has to send an NMI using the IPMI tool to take a crash dump on these systems. I do not think IPMI is supported on Power.
          But as I mentioned above, we have reliable ways of taking a dump on hung systems using other methods:
          - Using the HMC: since your system is not connected to an HMC, this is not the right option for you.
          - Service processor (ASM interface): we can initiate a dump remotely with this interface. I'm not sure which option/interface you used. Can you explain?
          - Press the yellow button on the system softly.
           
           
           
          • This reply was deleted by MaheshSal 2012-09-11T07:27:32Z.
            • stevedan
              ‏2012-09-12T15:09:11Z  in response to MaheshSal
              There was a post here saying that NMI watchdog works under PowerLinux, which referenced http://publib.boulder.ibm.com/infocenter/lnxinfo/v3r0m0/topic/liaai/crashdump/liaaicrashdumpnmiwatch.htm

              Based on the post above, is it safe to assume that this page is only for x86 and not PowerLinux? If so, we need to update this documentation to state this.
              • jerberstark
                ‏2012-09-12T15:26:55Z  in response to stevedan
                 Hi stevedan,
                 
                I agree that the doc team needs to comb through these blueprints and make it more clear which apply to Power systems running Linux, and which apply only to System x.  In your opinion, do you think we need to include this information on every page within the blueprint, or would it be adequate to update the Scope, requirements, and support topic that is contained in each blueprint?
                 
                 I checked the blueprint you're referencing, and it does state in the Scope, Requirement, and Support section that this wasn't tested on a Power system; but you're right that this statement doesn't exactly convey that NMI watchdog isn't supported on a Power system.
                 http://publib.boulder.ibm.com/infocenter/lnxinfo/v3r0m0/topic/liaai/crashdump/liaaicrashdumpintro.htm

                Hardware and software requirements

                The instructions in this blueprint are written for Kdump servers and clients running the Red Hat Enterprise Linux (RHEL) 5.3 or SLES 10 SP2 operating systems. The Kdump server should have enough storage to receive the crash dumps from the clients.

                Kdump clients are tested on IBM System x servers; Kdump servers are tested on IBM System x and System p® servers. The Kdump utility is not supported if the Kdump client's operating system distribution does not match the Kdump client machine's.


                 
                Updated on 2012-09-12T15:26:55Z by jerberstark
                • stevedan
                  ‏2012-09-12T15:43:12Z  in response to jerberstark
                  I think having it in the Scope, requirements, and support section is fine. I did miss that it existed in that section of the documentation.

                  So based on your reply, this requirement statement would indicate that the watchdog should work on Power servers, correct?

                  But I think I am seeing that we don't think this watchdog approach will work, so I'm confused, as is Robin.
                   
                  • jerberstark
                    ‏2012-09-12T16:08:36Z  in response to stevedan
                    Stevedan, I think we've concluded that watchdog will NOT work on Power servers. My previous reply was basically asking how we could make it easier to see/understand that in the docs. Sorry for the confusion.
                     
                    I think we need someone from the development team to weigh in and confirm that watchdog isn't an option here.
                     
                    hbabu - Haren, can you confirm that Robincox2 should NOT use watchdog in this case?
                     

                    • hbabu
                      ‏2012-09-12T22:49:20Z  in response to jerberstark
                      Yes, nmi_watchdog should not be used on Power to generate a kdump for a system hang.

                      This watchdog might invoke panic() as it does on other architectures, but it cannot stop the other CPUs (for example, if they are deadlocked) because powerpc does not have a software NMI. The kernel uses this software NMI to stop the other CPUs and bring them into the dump path so that their states are captured.

                      So, as I mentioned above, we should always use one of the following ways to take the dump or put the system into the debugger for system hangs:

                      - HMC interface (operations->restart->dump)
                      - ASMI (partition dump option) from the service processor
                      - For blades, select the blade and click 'reboot with NMI' on the BladeCenter management module

                      The above interfaces are the recommended options, but you can also press the yellow button on the system softly, if one is available.

                      Blades have a small hole next to the power button. Pressing softly into this hole with a pin should also invoke a soft reset, but the BladeCenter MM interface is the preferred option.
                       
                       
                       
                       
                      • jerberstark
                        ‏2012-09-28T15:05:40Z  in response to hbabu
                         I have updated the IPMI blueprint to clarify that it does not apply to Power systems. Thanks for your feedback.
  • robinwcox2
    ‏2012-09-12T03:45:47Z  in response to stevedan
     I've been trying to set up a Firefox connection from one P5 Linux test system (a) to another P5 Linux test system's (b) service processor(s).  This is set up on system a as follows:
     
    Configure eth1 as 192.168.2.1/255.255.255.0, eth2 as  192.168.3.1/255.255.255.0
    System a's eth1 port is connected to system b's "HMC0" Ethernet port,  a's eth2 port is connected to b's "HMC1" port.
    Both connections use patch cable, although a crossover cable was tested with eth1/HMC0.
    Systems can communicate over network eth0 ports.
     
    After the network setup, 192.168.2.147 responds to ping, but 192.168.3.147 does not.  Firefox is brought up and pointed at:
     
              https://192.168.2.147/
     
    This connection either times out or continues indefinitely.  The 192.168.3.147 address fails immediately.  Using "http://192.168.2.147/" (without the "s") produces the same results.
     
     Is there another URL required or are there other conditions necessary to access the ASMI? 
     
    The "IBM System p5 570 Technical Overview and Introduction" (http://www.redbooks.ibm.com/redpapers/pdfs/redp9117.pdf) states:
     
              The Web interface to the Advanced System Management Interface is accessible through, at
              the time of writing, Microsoft® Internet Explorer® 6.0, Netscape 7.1, Mozilla Firefox, or
              Opera 7.23 running on a PC or mobile computer connected to the service processor.
     
    Does this mean one cannot use another P5 Firefox to access the service processor (ASMI) and must use a PC?
     
    Thanks.
     
    • jerberstark
      ‏2012-09-12T16:21:02Z  in response to robinwcox2
      Robinwcox2, I found some information in the Systems Hardware Info Center that disagrees with the technical overview on the supported browsers. Can you try one of the browsers listed here? http://pic.dhe.ibm.com/infocenter/powersys/v3r1m5/index.jsp?topic=/iphby/requirements.htm
       
      Some more detailed steps for connecting and troubleshooting are also in the Systems HW Info Center: http://pic.dhe.ibm.com/infocenter/powersys/v3r1m5/index.jsp?topic=/iphby/connect_asmi.htm
      • robinwcox2
        ‏2012-09-13T18:51:39Z  in response to jerberstark
         Netscape 7.1 works from a PC running Windows.  We don't have PCs accessible to most systems.  Firefox comes native on Linux.  If we could get a browser to work from another P5, that would eliminate a number of problems.
         
        My associate hooked the PC into the HMC2 (vs. HMC1) service processor port.  I haven't had a chance to check that with Firefox.
         
        We seem to have to jump through a number of hoops to get what seems like it should be a basic function to work.
        • robinwcox2
          ‏2012-10-22T19:13:17Z  in response to robinwcox2
           We can get to the service processor from a PC running Netscape.  When the system is hung, we have selected the "System Service Aids" --> "System Dump".  This shuts down the system and reboots.  No vmcore file appears in /var/crash/<date>.
           
          We think the system would have been  set up for kdump correctly.  We get vmcores when the system is not hung and use "ALT-sysrq-c" (kernel variable kernel.sysrq set) or when placing a "1" in /proc/sys/kernel/sysrq.
          (The button on the pop-out panel looks white to me (though most have black marks from being pushed with a pen).  If that's the "yellow" button, that hasn't worked so far.  Perhaps we didn't push it lightly enough.)
           
          Are we using the wrong service processor menu option?
          • hbabu
            ‏2012-10-22T20:30:55Z  in response to robinwcox2
            Yes, if Alt-SysRq-c worked, kdump was set up properly.

            "System Service Aids" --> "System Dump" is used to take an FSP (service processor) dump.
            As I mentioned above, can you try "System Service Aids" --> "Partition Dump" to take a kdump in hang scenarios when the system is not managed by an HMC?
            If the system is connected to a console, you will see the system boot into the kdump kernel, take the dump, and then reboot.
            • robinwcox2
              ‏2012-10-22T20:43:42Z  in response to hbabu
               Thanks.  I can see that I should have read your previous response more carefully.
               
              I noticed that there's a "Service Processor Dump" after the "System Dump".  Are these the same?
              • hbabu
                ‏2012-10-22T21:11:56Z  in response to robinwcox2
                Here is complete information on the ASM interfaces:
                 http://pic.dhe.ibm.com/infocenter/powersys/v3r1m5/topic/iphby/iphby.pdf
                  (pages 44, 45, 46)
                System Dump:   to capture overall system information, system processor state, hardware scan rings, caches, and other information. This information can be used to resolve a hardware or server firmware problem.
                 
                Service processor dump: can preserve error data after a service processor application failure, external reset, or user request for a service processor dump
                 
                Partition dump:  By initiating a partition dump, you can preserve error data that can be used to diagnose server firmware or operating system problems. The state of the operating system is saved on the hard disk and the partition restarts. This function can be used when the operating system is in an abnormal wait state or endless loop.
                • robinwcox2
                  ‏2012-10-23T01:48:56Z  in response to hbabu
                  Your description doesn't sound like these options have anything to do with getting a crashdump.  I checked /sys/kernel/kexec_crash_loaded on the system I was working on and it was always 0, both after loading kexec and after a reboot.
                   
                  Thanks for the link.   I'll have to check if I've seen this one.
                  • hbabu
                    ‏2012-10-23T06:38:09Z  in response to robinwcox2
                    If 'cat /sys/kernel/kexec_crash_loaded' gives 0, the kdump kernel is not loaded. In that case, even Alt-SysRq-c should not succeed in taking a dump.
                    Please run '/etc/init.d/kdump restart' and see whether kexec successfully loads the kdump kernel.
                     
  • jerberstark
    ‏2012-11-16T16:32:03Z  in response to stevedan
     Haren's (hbabu) team wrote up a nice wiki topic to address these questions: Trigger dump on PowerLinux (https://www.ibm.com/developerworks/mydeveloperworks/wikis/home?lang=en#/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Trigger%20Dump%20on%20PowerLinux)