Topic
  • 11 replies
  • Latest Post - ‏2012-02-21T10:26:16Z by SteveIves
SteveIves
SteveIves
14 Posts

Pinned topic Problem determination of SA Application Manager

‏2012-02-02T13:24:24Z |
Hi,

We discovered on Monday morning that the TSA Application Manager was not running and we had to issue eezdmn to start it. We also had to activate the last policy manually.

We believe that it was caused by a DB2 upgrade on Saturday (that no-one realised would affect AM). None of the AM logs in /var/ibm/tivoli/common/eez/ seem to show anything.

Is there anywhere I can look to find out when and why the AM failed and also why the last policy was not reloaded?

Thanks,

Steve
Updated on 2012-02-21T10:26:16Z at 2012-02-21T10:26:16Z by SteveIves
  • SystemAdmin
    SystemAdmin
    46 Posts

    Re: Problem determination of SA Application Manager

    ‏2012-02-02T15:09:34Z  
    Hi Steve,

    Indeed, SA AppMan requires DB2 with the EAUTODB to work, so it may well be that an update of DB2 caused an outage of SA AppMan.
    Since SA AppMan is starting again, the EAUTODB itself was evidently still available after this update and not corrupted. Normally the last active End-to-End Automation policy is activated automatically after you start everything. Why this did not happen I cannot tell without looking at the logs and traces of WebSphere and the Automation Engine (eezdmn).

    Please have a look first in the log of the E2E automation domain. You can access it via the Operations Console by selecting this domain and pressing "View log...". Maybe we will find more information there about when and why the SAAM terminated.

    You can also look into the traces of the automation engine: open the SAAM Command Shell (eezcs) and use the command "eezdiag -t exc" to find exceptions written to the current trace file of the automation engine.

    One other question: can you tell me whether you have SA AppMan automated via a SAMP policy (aka "E2E HA policy")? Is the DB2 a local installation just for SA AppMan, or is it hosted on another system and accessed as a remote DB2 from SA AppMan?

    Cheers,

    Josi.
  • SteveIves
    SteveIves
    14 Posts

    Re: Problem determination of SA Application Manager

    ‏2012-02-02T15:53:43Z  
    Josi,

    have a look first in the log of the E2E automation domain

    There's nothing in the log except a successful 'online' request issued via cron before the DB2 upgrade and then the result of the eezdmn to start the domain 2 days later. None of the 'eezcs' commands issued by cron during this time show up. However, we redirect the output of the 'eezcs' commands issued by cron to a log file, and that shows lots of Java exceptions. Please see the attached file for details, but here's an example:

    Sun Jan 29 04:00:01 CET 2012
    Connecting to SA Application Manager on Server luu224p.internal.epo.org
    EEZJ0001E The WebSphere infrastructure has reported a severe error situation: javax.ejb.TransactionRolledbackLocalException: ; nested exception is: java.sql.SQLException: [jcc][t4][2041][11392][3.59.81] Error executing XAResource.start(). Server returned XAER_RMFAIL. ERRORCODE=-4203, SQLSTATE=null DSRA0010E: SQL State = null, Error Code = -4,203
    Explanation: The application was interrupted by a RuntimeException and cannot complete its task.
    User action: Check the description of the error situation if it indicates that the server database or another subsystem is unavailable. If the problem persists, contact IBM support.
    System action: The current task ends. The transaction is rolled back.

    This happened again at 06:00 when other resources were started and again in the evening when resources were stopped. At 04:00 the next morning, we got:

    Mon Jan 30 04:00:01 CET 2012
    Connecting to SA Application Manager on Server luu224p.internal.epo.org
    EEZS0121W Unable to contact the automation domain.
    Explanation: If you specified the -D flag: a matching automation domain could not be found, or it is not in state Online. If you did not specify the -D flag: the automation manager currently does not host any end-to-end automation domain which is in state Online. In any case, only a limited set of commands work in this situation.
    EEZS0016E The command can not be executed, since no domain is contacted.
    Explanation: This command requires a connected automation domain.
    User action: Open a command shell again and specify an online automation domain to be contacted.

    eezdiag -t exc does not show anything:

    EEZCS>eezdiag -t exc

    EEZCS>

    We have not automated the SA Application Manager. I do not know anything about this but is that using SA MultiPlatforms to automate the AM and make it highly available?

    The DB2 installation is not local, but is remote and shared.

    Thanks,

    Steve
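
A minimal sketch for condensing a redirected cron log like the one quoted above into one line per run, which makes it easier to see when the errors started. It assumes each run begins with a `date`-style line (as in "Sun Jan 29 04:00:01 CET 2012") and that message IDs have the shape seen in this thread (EEZJ0001E, EEZS0121W). The function names `summarize` and `first_problem` are hypothetical helpers, not part of any SA Application Manager tooling.

```python
import re
from datetime import datetime

# Assumed layout: a `date`-style line opens each cron run, eezcs output follows.
# The time zone token (e.g. CET) is skipped because strptime handles it poorly.
STAMP = re.compile(r"^(\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2}) \w+ (\d{4})$")
# Assumed ID shape: "EEZ", one component letter, four digits, severity I/W/E.
MSG_ID = re.compile(r"\bEEZ[A-Z]\d{4}[IWE]\b")


def summarize(lines):
    """Return [(timestamp, [message IDs])], one entry per cron run."""
    runs = []
    for line in lines:
        line = line.strip()
        m = STAMP.match(line)
        if m:
            when = datetime.strptime(
                f"{m.group(1)} {m.group(2)}", "%a %b %d %H:%M:%S %Y"
            )
            runs.append((when, []))
        elif runs:
            runs[-1][1].extend(MSG_ID.findall(line))
    return runs


def first_problem(runs):
    """Return the first run that logged a warning (W) or error (E), or None."""
    for when, ids in runs:
        if any(i.endswith(("W", "E")) for i in ids):
            return when, ids
    return None
```

Feeding the lines of the redirected log file to `summarize` and then calling `first_problem` on the result gives the earliest run that did not complete cleanly. The parsing relies on English day and month abbreviations, so it should be run in a C or English locale.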
  • SystemAdmin
    SystemAdmin
    46 Posts

    Re: Problem determination of SA Application Manager

    ‏2012-02-03T07:52:45Z  
    Hi Steve,

    Thanks for providing your trace file. For a real problem determination I would like to look at the trace files that SA Application Manager writes.

    How I interpret your cron trace:
    1. Trace entry showing normal / good case result:
    Connecting to SA Application Manager on Server luu224p.internal.epo.org
    EEZS0120I Using domain EPOQUE.
    EEZS0131I The Stop request has been issued against resource INTGEP7.

    I want to make sure: EPOQUE is the "End-to-End" automation domain that went missing, since you told me you had to restart the automation engine using the command "eezdmn", right?

    2. Trace entry showing that SA AM JEE Framework ("the WAS stuff") was having trouble with DB2 access
    Connecting to SA Application Manager on Server luu224p.internal.epo.org
    EEZJ0001E The WebSphere infrastructure has reported a severe error situation: javax.ejb.TransactionRolledbackLocalException: ; nested exception is: java.sql.SQLException: [jcc][t4][2041][11392][3.59.81] Error executing XAResource.start(). Server returned XAER_RMFAIL. ERRORCODE=-4203, SQLSTATE=null DSRA0010E: SQL State = null, Error Code = -4,203
    Explanation: The application was interrupted by a RuntimeException and cannot complete its task.
    User action: Check the description of the error situation if it indicates that the server database or another subsystem is unavailable. If the problem persists, contact IBM support.
    System action: The current task ends. The transaction is rolled back.

    3. Trace entry which tells me that DB2 is reachable again - but domain "EPOQUE" is not reachable because the end-to-end automation domain is not online.
    Connecting to SA Application Manager on Server luu224p.internal.epo.org
    EEZS0121W Unable to contact the automation domain.
    Explanation: If you specified the -D flag: a matching automation domain could not be found, or it is not in state Online. If you did not specify the -D flag: the automation manager currently does not host any end-to-end automation domain which is in state Online. In any case, only a limited set of commands work in this situation.
    EEZS0016E The command can not be executed, since no domain is contacted.
    Explanation: This command requires a connected automation domain.
    User action: Open a command shell again and specify an online automation domain to be contacted.
    System action: Processing ends.

    So, assuming that EPOQUE is the end-to-end domain, I need to look at the SA Application Manager trace files to see why the automation engine ("eezdmn") terminated. Can you provide them to me? You will find them in the "Tivoli Common Directory", which usually is /var/ibm/tivoli/common/eez/logs. The file traceFlatEngine.log is especially interesting.

    Cheers,

    Josi.
  • SystemAdmin
    SystemAdmin
    46 Posts

    Re: Problem determination of SA Application Manager

    ‏2012-02-03T08:08:33Z  
    Uuups I forgot....

    "...We have not automated the SA Application Manager. I do not know anything about this but is that using SA MultiPlatforms to automate the AM and make it highly available?..."

    Yes, this is true. You can automate the SA AppMan processes by using a SAMP cluster. With the help of "cfgeezdmn" you can even create the HA policy for this scenario. If you had set this up, SAMP would have restarted the automation engine process automatically.

    Josi.
  • SteveIves
    SteveIves
    14 Posts

    Re: Problem determination of SA Application Manager

    ‏2012-02-03T09:42:52Z  
    Josi,

    Sorry for not explaining more. Yes - the EPOQUE domain is our SA AM E2E domain. (The SA MP domains are EPOQUE_INTG, EPOQUE_OSAT and EPOQUE_PROD, which are basically dev, UAT and production.) There's also PRODPLX1.INGSXG, which is an SA z/OS domain.

    The eezcs commands I sent you are all issued against the E2E domain, so there was no specification of -D.

    The DB2 outage was apparently on Saturday, 28 January: DB2 was stopped at 9:00 and started again at 9:10. The server is called luu063p or db2srv3-p. Prior to the outage, our E2E resources were successfully started at 06:00. Following the DB2 outage, the next eezcs commands issued were the ones at 04:00 on Sunday to stop the resources, which failed. (We start at 06:00 and stop at 04:00, with 2 hours downtime.)

    It is interesting that at 04:00 Sunday and at 06:00 Sunday the E2E manager was still running, although the eezcs commands received Java/DB2 errors. At 04:00 Monday, however, the eezcs command reported that the domain was unavailable, so the domain failed between 06:00 Sunday and 04:00 Monday, well after the DB2 outage.

    Our /var/ibm/tivoli/common/eez/logs/traceFlatEngine.log only goes back 24 hours. Can this be changed? I'm seeing if we have a backup from the period covering Sunday, when we think the AM actually failed.

    Steve
  • SteveIves
    SteveIves
    14 Posts

    Re: Problem determination of SA Application Manager

    ‏2012-02-03T12:34:03Z  
    Josi,

    Would you normally expect to have to restart the Automation Engine in event of a DB2 outage?

    Steve
  • SteveIves
    SteveIves
    14 Posts

    Re: Problem determination of SA Application Manager

    ‏2012-02-03T13:59:27Z  
    Josi,

    Managed to restore traceFlatEngine.log.

    Please find it attached. Do you want me to open a PMR with IBM?

    Steve
  • SystemAdmin
    SystemAdmin
    46 Posts

    Re: Problem determination of SA Application Manager

    ‏2012-02-03T14:03:31Z  
    Hi Steve,

    Sorry for my late response.
    No, I would not expect this. That's why we need to figure out why the automation engine stopped.
    In the Tivoli common directory there are more trace files. If traceFlatEngine.log only shows you the last 24 hours, have a look in traceFlatEngine1.log, which should be there as well. By default, tracing is set up with 8 MB per file: when the file is full, the current file is copied to <traceFileName>1.log and a new file is opened. With the tool "cfgeezdmn", on the "logger" tab, you can specify how many MB you want to spend on tracing/logging on disk.

    Another chance 1) - next to the ..../log directory there is a .../ffdc directory. Have a look there as well for snap....log files, which are created whenever an error is logged by the automation engine.

    Another chance 2) - the msgFlatEngine.log might also show errors. It's usually not so big and thus contains information covering a longer time span.

    But again - normally I don't expect the engine to fail. So if we don't find a good explanation in the logs, can I ask you to open a PMR for this?

    Greetings,

    Josi
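
Because the trace rotates into numbered files, a search has to cover the current file and its rotated copies. The following is a rough sketch, assuming the file names mentioned in the post above (traceFlatEngine.log, traceFlatEngine1.log, msgFlatEngine.log) and making no assumption about the timestamp format inside the files: it looks for a plain substring that the caller supplies. The helper `grep_logs` is hypothetical.

```python
from pathlib import Path


def grep_logs(log_dir, needle,
              patterns=("traceFlatEngine*.log", "msgFlatEngine*.log")):
    """Yield (file name, line number, line) for each line containing `needle`.

    The glob patterns cover the current file and any rotated copies
    such as traceFlatEngine1.log.
    """
    for pattern in patterns:
        for path in sorted(Path(log_dir).glob(pattern)):
            # errors="replace" keeps the scan going over undecodable bytes
            with path.open(errors="replace") as fh:
                for lineno, line in enumerate(fh, start=1):
                    if needle in line:
                        yield path.name, lineno, line.rstrip("\n")
```

Calling it with the logs directory and a date string in whatever form the trace actually uses lists every matching line together with the file it came from.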
  • SteveIves
    SteveIves
    14 Posts

    Re: Problem determination of SA Application Manager

    ‏2012-02-06T15:21:04Z  
    Josi,

    There is nothing in the ffdc directory from the 28/29th January.

    Regards,

    Steve
  • SystemAdmin
    SystemAdmin
    46 Posts

    Re: Problem determination of SA Application Manager

    ‏2012-02-07T13:15:15Z  
    Hi Steve,

    Okay, that means that at least no EEZD****E message was logged at that time, which would have automatically caused an FFDC trace to be written to disk.

    cheers,

    Josi.
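
To double-check for such messages in a log file, one could filter on the ID pattern. This is a sketch only: it assumes "EEZD****E" stands for the prefix EEZD, four digits, and severity E, by analogy with the IDs quoted earlier in this thread (EEZJ0001E, EEZS0016E). The helper name `engine_errors` is hypothetical.

```python
import re

# Assumption: "EEZD****E" = prefix EEZD + four digits + severity letter E.
ENGINE_ERROR = re.compile(r"\bEEZD\d{4}E\b")


def engine_errors(lines):
    """Return the lines that carry an EEZD error-severity message ID."""
    return [line.rstrip("\n") for line in lines if ENGINE_ERROR.search(line)]
```

An empty result for the period in question would be consistent with the empty ffdc directory.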
  • SteveIves
    SteveIves
    14 Posts

    Re: Problem determination of SA Application Manager

    ‏2012-02-21T10:26:16Z  
    Josi,

    Did you manage to see anything that indicated why the automation engine ABENDed?

    Steve