11 replies Latest Post - ‏2012-02-21T10:26:16Z by SteveIves
SteveIves

Problem determination of SA Application Manager

‏2012-02-02T13:24:24Z |
Hi,

We discovered on Monday morning that the TSA Application Manager was not running and we had to issue eezdmn to start it. We also had to activate the last policy manually.

We believe that it was caused by a DB2 upgrade on Saturday (that no-one realised would affect AM). None of the AM logs in /var/ibm/tivoli/common/eez/ seem to show anything.

Is there anywhere I can look to find out when and why the AM failed and also why the last policy was not reloaded?

Thanks,

Steve
Updated on 2012-02-21T10:26:16Z at 2012-02-21T10:26:16Z by SteveIves
  • SystemAdmin

    Re: Problem determination of SA Application Manager

    ‏2012-02-02T15:09:34Z  in response to SteveIves
    Hi Steve,

    Indeed, SA AppMan needs DB2 with the EAUTODB database in order to work, so a DB2 update may well have caused an SA AppMan outage.
    Since SA AppMan starts again, the EAUTODB itself was evidently still available and not corrupted after the update. Normally the last active End-to-End Automation policy is activated automatically once everything has started. Why this did not happen I cannot tell without looking at the logs and traces of WebSphere and the automation engine (eezdmn).

    Please look first in the log of the E2E automation domain. You can access it in the Operations Console by selecting the domain and pressing "View log...". Maybe we will find more information there about when and why the SA AM terminated.

    You can also look into the traces of the automation engine: start the SA AM command shell (eezcs) and run the command "eezdiag -t exc" to find exceptions written to the current trace file of the automation engine.

    One other question: do you have SA AppMan automated via a SAMP policy (aka "E2E HA policy")? And is DB2 a local installation just for SA AppMan, or is it hosted on another system and accessed as a remote DB2 from SA AppMan?

    Cheers,

    Josi.
    • SteveIves

      Re: Problem determination of SA Application Manager

      ‏2012-02-02T15:53:43Z  in response to SystemAdmin
      Josi,

      have a look first in the log of the E2E automation domain

      There's nothing in the log except a successful 'online' request issued via cron before the DB2 upgrade and then the result of the eezdmn to start the domain 2 days later. None of the 'eezcs' commands issued by cron during this time show up. However, we redirect the output of the 'eezcs' commands issued by cron to a log file, and that shows lots of Java exceptions. Please see the attached file for details, but here's an example:

      Sun Jan 29 04:00:01 CET 2012
      Connecting to SA Application Manager on Server luu224p.internal.epo.org
      EEZJ0001E The WebSphere infrastructure has reported a severe error situation: javax.ejb.TransactionRolledbackLocalException: ; nested exception is: java.sql.SQLException: jcct4204111392http://3.59.81 Error executing XAResource.start(). Server returned XAER_RMFAIL. ERRORCODE=-4203, SQLSTATE=nullDSRA0010E: SQL State = null, Error Code = -4,203
      Explanation: The application was interrupted by a RuntimeException and cannot complete its task.
      User action: Check the description of the error situation if it indicates that the server database or another subsystem is unavailable. If the problem persists, contact IBM support.
      System action: The current task ends. The transaction is rolled back.

      This happened again at 06:00 when other resources were started and again in the evening when resources were stopped. At 04:00 the next morning, we got:

      Mon Jan 30 04:00:01 CET 2012
      Connecting to SA Application Manager on Server luu224p.internal.epo.org
      EEZS0121W Unable to contact the automation domain.
      Explanation: If you specified the -D flag: a matching automation domain could not be found, or it is not in state Online. If you did not specify the -D flag: the automation manager currently does not host any end-to-end automation domain which is in state Online. In any case, only a limited set of commands work in this situation.
      EEZS0016E The command can not be executed, since no domain is contacted.
      Explanation: This command requires a connected automation domain.
      User action: Open a command shell again and specify an online automation domain to be contacted.

      eezdiag -t exc does not show anything:

      EEZCS>eezdiag -t exc

      EEZCS>

      We have not automated the SA Application Manager. I do not know anything about this but is that using SA MultiPlatforms to automate the AM and make it highly available?

      The DB2 installation is not local, but is remote and shared.

      Thanks,

      Steve
      • SystemAdmin

        Re: Problem determination of SA Application Manager

        ‏2012-02-03T07:52:45Z  in response to SteveIves
        Hi Steve,

        thanks for providing your trace file. For a real problem determination I would like to have a look at the trace files, which SA Application Manager writes.

        How I interpret your cron trace:
        1. Trace entry showing normal / good case result:
        Connecting to SA Application Manager on Server luu224p.internal.epo.org
        EEZS0120I Using domain EPOQUE.
        EEZS0131I The Stop request has been issued against resource INTGEP7.

        I want to make sure: EPOQUE is the "End-to-End" automation domain that you were missing, since you told me you had to restart the automation engine using the command "eezdmn", right?

        2. Trace entry showing that SA AM JEE Framework ("the WAS stuff") was having trouble with DB2 access
        Connecting to SA Application Manager on Server luu224p.internal.epo.org
        EEZJ0001E The WebSphere infrastructure has reported a severe error situation: javax.ejb.TransactionRolledbackLocalException: ; nested exception is: java.sql.SQLException: jcct4204111392http://3.59.81 Error executing XAResource.start(). Server returned XAER_RMFAIL. ERRORCODE=-4203, SQLSTATE=nullDSRA0010E: SQL State = null, Error Code = -4,203
        Explanation: The application was interrupted by a RuntimeException and cannot complete its task.
        User action: Check the description of the error situation if it indicates that the server database or another subsystem is unavailable. If the problem persists, contact IBM support.
        System action: The current task ends. The transaction is rolled back.

        3. Trace entry which tells me that DB2 is reachable again - but domain "EPOQUE" is not reachable because the end-to-end automation domain is not online.
        Connecting to SA Application Manager on Server luu224p.internal.epo.org
        EEZS0121W Unable to contact the automation domain.
        Explanation: If you specified the -D flag: a matching automation domain could not be found, or it is not in state Online. If you did not specify the -D flag: the automation manager currently does not host any end-to-end automation domain which is in state Online. In any case, only a limited set of commands work in this situation.
        EEZS0016E The command can not be executed, since no domain is contacted.
        Explanation: This command requires a connected automation domain.
        User action: Open a command shell again and specify an online automation domain to be contacted.
        System action: Processing ends.
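The three cases above can be told apart mechanically in the redirected cron output. The following is only a rough sketch, not an IBM tool: it assumes each cron run starts with a `date`-style header line such as "Sun Jan 29 04:00:01 CET 2012" (as in the excerpts in this thread) and it maps only the three message IDs quoted here. The names `classify` and `CASES` are made up for illustration.

```python
import re
from datetime import datetime

# Header line that precedes each cron-driven eezcs call in the excerpts,
# e.g. "Sun Jan 29 04:00:01 CET 2012". The time zone token is skipped
# because strptime's %Z handling is platform dependent.
STAMP = re.compile(r"^\w{3} (\w{3}) +(\d{1,2}) (\d{2}:\d{2}:\d{2}) \w+ (\d{4})$")
MSG_ID = re.compile(r"\b(EEZ[A-Z]\d{4}[IWE])\b")

# Only the message IDs quoted in this thread; anything else is "unknown".
CASES = {
    "EEZS0120I": "ok",              # case 1: domain in use, request issued
    "EEZJ0001E": "db_error",        # case 2: WebSphere reports DB2 trouble
    "EEZS0121W": "domain_offline",  # case 3: no online end-to-end domain
}


def classify(lines):
    """Return one (timestamp, case) pair per cron run found in lines."""
    runs = []
    stamp, case = None, None
    for raw in lines:
        line = raw.strip()
        match = STAMP.match(line)
        if match:
            # A new header closes the previous run.
            if stamp is not None:
                runs.append((stamp, case or "unknown"))
            mon, day, clock, year = match.groups()
            stamp = datetime.strptime(f"{mon} {day} {clock} {year}",
                                      "%b %d %H:%M:%S %Y")
            case = None
            continue
        if stamp is not None and case is None:
            # The first known message ID after the header decides the case.
            for msg in MSG_ID.findall(line):
                if msg in CASES:
                    case = CASES[msg]
                    break
    if stamp is not None:
        runs.append((stamp, case or "unknown"))
    return runs
```

Feeding it the cron log file line by line gives a compact list of runs, which makes it easier to see at which run the result changes from "db_error" to "domain_offline".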

        So, assuming that EPOQUE is the end-to-end domain, I need to look at the SA Application Manager trace files to see why the automation engine ("eezdmn") terminated. Can you provide them to me? You will find them in the "Tivoli Common Directory", which is usually /var/ibm/tivoli/common/eez/logs. The file traceFlatEngine.log is especially interesting.

        Cheers,

        Josi.
        • SystemAdmin

          Re: Problem determination of SA Application Manager

          ‏2012-02-03T08:08:33Z  in response to SystemAdmin
          Oops, I forgot...

          "...We have not automated the SA Application Manager. I do not know anything about this but is that using SA MultiPlatforms to automate the AM and make it highly available?..."

          Yes, this is true. You can automate the SA AppMan processes by using a SAMP cluster. With the help of "cfgeezdmn" you can even create the HA policy for this scenario. If you had set this up, SAMP would have restarted the automation engine process automatically.

          Josi.
        • SteveIves

          Re: Problem determination of SA Application Manager

          ‏2012-02-03T09:42:52Z  in response to SystemAdmin
          Josi,

          Sorry for not explaining more. Yes, the EPOQUE domain is our SA AM E2E domain. (The SA MP domains are EPOQUE_INTG, EPOQUE_OSAT and EPOQUE_PROD, which are basically dev, UAT and Production.) There's also PRODPLX1.INGSXG, which is an SA z/OS domain.

          The eezcs commands I sent you are all issued against the E2E domain, so there was no specification of -D.

          The DB2 outage was apparently on Saturday, 28 January: DB2 was stopped at 9:00 and started again at 9:10. The server is called luu063p or db2srv3-p. Prior to the outage, our E2E resources were successfully started at 06:00. Following the DB2 outage, the next eezcs commands issued were the ones at 04:00 on Sunday to stop the resources, which failed. (We start at 06:00 and stop at 04:00, with 2 hours downtime.)

          It is interesting that at 04:00 Sunday and at 06:00 Sunday the E2E manager was still running, although the eezcs commands received Java/DB2 errors, but at 04:00 Monday the eezcs command reported that the domain was unavailable. So the domain failed between 06:00 Sunday and 04:00 Monday, well after the DB2 outage.
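The reasoning in the previous paragraph can be written down as a small sketch. It is only an illustration with made-up names (`failure_window`, the status strings); the timestamps are the rounded times mentioned in this post, and a run that got a Java/DB2 error is treated as "engine still running" because the command shell could still reach the domain.

```python
from datetime import datetime


def failure_window(observations):
    """Return (last_seen_running, first_seen_offline), or None.

    observations: iterable of (datetime, status) pairs, with status one of
    "ok", "db_error" or "domain_offline". A "db_error" run still counts as
    running: the domain answered, only the DB2 access behind it failed.
    """
    last_running = None
    for stamp, status in sorted(observations):
        if status == "domain_offline":
            return (last_running, stamp)
        last_running = stamp
    # The domain was never reported offline in the given observations.
    return None


# Rounded times as described in this post.
observations = [
    (datetime(2012, 1, 28, 6, 0), "ok"),              # Saturday start
    (datetime(2012, 1, 29, 4, 0), "db_error"),        # Sunday stop
    (datetime(2012, 1, 29, 6, 0), "db_error"),        # Sunday start
    (datetime(2012, 1, 30, 4, 0), "domain_offline"),  # Monday stop
]
window = failure_window(observations)
# window == (datetime(2012, 1, 29, 6, 0), datetime(2012, 1, 30, 4, 0))
```

The earlier post also mentions a failing run on Sunday evening; adding its exact time from the cron log as another "db_error" observation would narrow the window further.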

          Our /var/ibm/tivoli/common/eez/logs/traceFlatEngine.log only goes back 24 hours. Can this be changed? I'm checking whether we have a backup from the period covering Sunday, when we think the AM actually failed.

          Steve
          • SteveIves

            Re: Problem determination of SA Application Manager

            ‏2012-02-03T12:34:03Z  in response to SteveIves
            Josi,

            Would you normally expect to have to restart the Automation Engine in the event of a DB2 outage?

            Steve
            • SteveIves

              Re: Problem determination of SA Application Manager

              ‏2012-02-03T13:59:27Z  in response to SteveIves
              Josi,

              Managed to restore traceFlatEngine.log.

              Please find it attached. Do you want me to open a PMR with IBM?

              Steve
            • SystemAdmin

              Re: Problem determination of SA Application Manager

              ‏2012-02-03T14:03:31Z  in response to SteveIves
              Hi Steve,

              Sorry for my late response.
              No, I would not expect this. That's why we need to figure out why the automation engine stopped.
              In the Tivoli common directory there are more trace files. If traceFlatEngine.log only shows the last 24 hours, have a look at traceFlatEngine1.log, which should be there as well. By default, tracing is set up with 8 MB per file: when the file is full, the current file is copied to <traceFileName>1.log and a new file is opened. With the tool "cfgeezdmn", on the "logger" tab, you can specify how many MB you want to spend on tracing/logging on disk.

              Another chance 1): next to the ..../log directory there is a .../ffdc directory. Have a look there as well for snap....log files, which are created whenever an error is logged by the automation engine.

              Another chance 2): the msgFlatEngine.log might also show errors. It is usually not so big and thus contains information covering a longer period.
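To see quickly which of these files can still contain entries from the time in question, one could list them by modification time. This is a minimal sketch under assumptions, not part of the product: the function name `files_modified_since` is made up, the directories are passed in by the caller, and it assumes a file's modification time reflects its last written entry.

```python
from datetime import datetime
from pathlib import Path


def files_modified_since(directories, start, pattern="*.log"):
    """Return (path, mtime) pairs for files last written at or after start.

    A file whose last write happened before `start` cannot contain entries
    from the period of interest, so it is left out. Results are sorted by
    modification time, oldest first.
    """
    hits = []
    for directory in directories:
        for path in Path(directory).glob(pattern):
            if not path.is_file():
                continue
            mtime = datetime.fromtimestamp(path.stat().st_mtime)
            if mtime >= start:
                hits.append((path, mtime))
    return sorted(hits, key=lambda item: item[1])
```

For example, `files_modified_since(["/var/ibm/tivoli/common/eez/logs"], datetime(2012, 1, 28, 9, 0))` would list the trace and message logs written after the start of the DB2 outage; the ffdc directory of your installation can be added to the list in the same way.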

              But again, normally I don't expect the engine to fail. So if we don't find a good explanation in the logs, can I ask you to open a PMR for this?

              Greetings,

              Josi
              • SteveIves

                Re: Problem determination of SA Application Manager

                ‏2012-02-06T15:21:04Z  in response to SystemAdmin
                Josi,

                There is nothing in the ffdc directory from the 28/29th January.

                Regards,

                Steve
                • SystemAdmin

                  Re: Problem determination of SA Application Manager

                  ‏2012-02-07T13:15:15Z  in response to SteveIves
                  Hi Steve,

                  Okay, that means that at least no EEZD****E message was logged at that time, since such a message would have automatically caused an FFDC trace to be written to disk.

                  cheers,

                  Josi.
                  • SteveIves

                    Re: Problem determination of SA Application Manager

                    ‏2012-02-21T10:26:16Z  in response to SystemAdmin
                    Josi,

                    Did you manage to see anything that indicated why the automation engine ABENDed?

                    Steve