Jinhui Qin
17 Posts

Partially Failed when starting an instance on multiple hosts in Streams 3.1

2013-06-21T15:48:53Z

Hi,

With all the new features in Streams 3.1, we are considering upgrading from Streams 3.0 to 3.1. Recently we tested the installation of Streams 3.1 on a cluster of CentOS 6 nodes. The installation of Streams 3.1 (on all nodes) and Streams Studio (only on the head node) was successful. I was able to create and start an instance across multiple hosts, but after a few seconds the "hc" (host controller) services failed on all nodes except the head node, and the instance became "partially failed". I was then able to use the new feature in the Streams 3.1 Web Console to repair the instance successfully, but after a while the "hc" services on all child nodes failed again. All of this happened before I had submitted any jobs to the instance. The services running on the head node all seemed fine. We didn't have this problem when using Streams 3.0 on a cluster. Could anyone give us a clue about what could cause the problem, or tell us whether anything needs to be adjusted in Streams 3.1? Attached are the logs for the instance that I downloaded from the Streams Console. We really hope someone here can help us out. Thanks!

 

Jinhui

 

 

Attachments

  • Jinhui Qin
    17 Posts
    ACCEPTED ANSWER

    Re: Partially Failed when starting an instance on multiple hosts in Streams 3.1

    ‏2013-11-20T18:48:57Z  

    Denny,

    Thanks for your reply. We may just keep using our current Streams 3.0 and consider the upgrade to Streams 3.1 later, when the fix pack for Streams 3.1 is available. Thanks again for your help.

    Jinhui

    I just ran some tests and found that the problem is fixed in Streams 3.2. We are now planning to upgrade our environment from Streams 3.0 to Streams 3.2.

  • Stan
    76 Posts

    Re: Partially Failed when starting an instance on multiple hosts in Streams 3.1

    ‏2013-06-24T15:42:47Z  

    Please increase the process limit and see if the failure still happens

    ### 21 Jun 2013 11:08:17 END:   .. verifyInstallCompat() - rc:0
    #########################################################################################
    #####  WARNING WARNING WARNING !! ULIMIT CHECK on Host ecco-computer19.sharcnet.ca
    #####  ulimit max user processes (-u) setting of 1024 is LOW
    #####  See InfoSphere Streams Information Center for ulimit recommendations
    #########################################################################################
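To confirm the limit that this warning refers to on a given host, the value can be read directly. The following is a minimal Python sketch, assuming a Linux host with Python 3; the function names and the 1024 threshold (taken from the value flagged in the log above) are illustrative and not part of Streams.

```python
import resource

# Illustrative threshold: the log above flags a setting of 1024 as LOW.
LOW_NPROC_THRESHOLD = 1024


def nproc_limits():
    """Return the (soft, hard) 'max user processes' limits for this process."""
    return resource.getrlimit(resource.RLIMIT_NPROC)


def is_low(soft, threshold=LOW_NPROC_THRESHOLD):
    """True when the soft limit is finite and at or below the threshold."""
    if soft == resource.RLIM_INFINITY:
        return False
    return soft <= threshold


if __name__ == "__main__":
    soft, hard = nproc_limits()
    print("max user processes: soft=%s hard=%s" % (soft, hard))
    if is_low(soft):
        print("soft limit looks low; compare with 'ulimit -u' on this host")
```

Running it on every host of the instance, under the account that owns the instance, shows whether the new limit actually applies to that account.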

  • Jinhui Qin
    17 Posts

    Re: Partially Failed when starting an instance on multiple hosts in Streams 3.1

    ‏2013-06-24T17:00:36Z  

    Stan,

    Thanks for your reply. I noticed this warning message, but the same warning appeared when I installed Streams 3.0, and everything seemed to work fine in Streams 3.0 without any adjustment. Anyway, I will try adjusting the max user processes ulimit setting to see whether that solves the problem.

     

    Jinhui

     

  • Jinhui Qin
    17 Posts

    Re: Partially Failed when starting an instance on multiple hosts in Streams 3.1

    ‏2013-06-25T13:41:34Z  

    Stan,

    We tried increasing the process limit from 1024 to 65536 on all nodes. The warning message is gone, but the failure still happened: the instance started properly, and after a couple of minutes the hc services on all the child nodes failed, even though I had not submitted any jobs to the instance. Attached are the new logs that I downloaded from the instance. Could you or anyone else find a clue in them and give us more advice? Your help is really appreciated.

     

    Jinhui 

    Attachments

  • jingdongsun
    3 Posts

    Re: Partially Failed when starting an instance on multiple hosts in Streams 3.1

    ‏2013-06-26T03:57:52Z  

    Please run streamtool checkhost to see whether it reports any errors, especially about network connections among hosts.

    Also, please verify that the firewall is disabled on all hosts.

  • Jinhui Qin
    17 Posts

    Re: Partially Failed when starting an instance on multiple hosts in Streams 3.1

    ‏2013-06-26T14:24:30Z  

    Thanks, Jingdong, for your reply. At the beginning we thought it might be a firewall issue, so we made sure that the firewall was turned off on all hosts; however, the failure still happened.

    Following your suggestion, I ran "streamtool checkhost", and the output didn't show any errors. This was an instance running on three hosts:

     

    [jhqin@ecco-computer19 ~]$ streamtool checkhost

    Date: Wed Jun 26 10:00:32 EDT 2013
    Host: ecco-computer19  
    Instance: streams@jhqin
    3 Hosts to check: 199.241.160.146,199.241.160.148,199.241.160.139
    Reference host: 199.241.160.146



    =============================================================
    Phase 1 - per-host public key ssh connectivity test...
    =============================================================

    Checking host 1 of 3: 199.241.160.146...  host OK
    Checking host 2 of 3: 199.241.160.148...  host OK
    Checking host 3 of 3: 199.241.160.139...  host OK

    Phase 1 - public key ssh connectivity test summary:
    3 OK hosts.
    0 problem hosts:



    =============================================================
    Phase 2 - per-host dependency checking...
    =============================================================

    Checking host 1 of 3: 199.241.160.146...  host OK
    Checking host 2 of 3: 199.241.160.148...  host OK
    Checking host 3 of 3: 199.241.160.139...  host OK

    Phase 2 - per host dependency checking summary:
    3 OK hosts.
    0 problem hosts:
    0 problem categories:


    =============================================================
    Detailed host results
    Verbosity level: 1
    =============================================================




    =============================================================
    Overall Summary
    =============================================================

    3 hosts checked.
    3 OK hosts.
    0 problem hosts:

     

    _________________

    Still, the hc services on the two child nodes failed, while everything on the head node is fine. Attached are the logs for this instance. Again, clicking "Repair Instance" in the Streams Console did make all services run healthy for a few seconds, but once you refresh, the instance becomes "Partially failed" again for the same reason. Any other suggestions?

    Attachments

  • jingdongsun
    3 Posts

    Re: Partially Failed when starting an instance on multiple hosts in Streams 3.1

    ‏2013-06-26T16:23:18Z  

    Based on the trace, all services come up fine, but all CORBA calls across hosts fail.

    I still think this is a network issue, but I do not know what to check next. Possibly double-check the network settings on all hosts?

    Also, "streamtool checkhost --connectivity-only" may give some clues and is worth a try, as it does more cross-checking.

    Another suggestion: if you do not need multiple hosts for your current work, please try running the instance on a single host.

    Thanks.
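One way to double-check basic TCP reachability between hosts, independent of streamtool, is sketched below in Python. The function names are illustrative, and the host list and port must be supplied by the reader. A successful connect only shows that a port is reachable from the machine where the script runs; it says nothing about whether CORBA calls succeed.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def check_hosts(hosts, port, timeout=3.0):
    """Map each host to the result of one TCP connect attempt on the given port."""
    return {host: can_connect(host, port, timeout) for host in hosts}
```

For example, `check_hosts(["199.241.160.146", "199.241.160.148", "199.241.160.139"], 22)` would try the usual ssh port on the three hosts named earlier in this thread. Running the same check from each host in turn covers both directions.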

  • Jinhui Qin
    17 Posts

    Re: Partially Failed when starting an instance on multiple hosts in Streams 3.1

    ‏2013-06-26T17:00:24Z  

    "streamtool checkhost --connectivity-only" also gave me no errors. This is an instance running on 4 hosts with the same problem as before. A single-host instance has only two cores, which is not enough for running our jobs. We do have a working environment for running Streams jobs across multiple hosts in Streams 3.0. Since Streams 3.1 was released, we have been considering upgrading to it, but we ran into this problem. I agree with you that this is a network issue.

    At the "Streams 3.1 Developers Conference Webcast on June 6", one of the speakers (Denny Hatzenbihler?), who talked about the Streams runtime, mentioned that in Streams 3.1 the application network setting is done differently for performance reasons: Streams 3.1 separates the control traffic from the user application traffic. I am only guessing, but that could be related to the issue we encountered, and we may need to adjust something when using Streams 3.1; we just don't know what or where.

    Thanks for your suggestions. We are still hoping someone here can help us solve the problem.

    Here is the output from "streamtool checkhost", and attached are the logs for this instance, which ran on 4 hosts.

    [jhqin@ecco-computer4 bin]$ streamtool checkhost --connectivity-only

    Checking connectivity between the following hosts:

    199.241.160.131,199.241.160.130,199.241.160.167,199.241.160.168

    Checking host: 199.241.160.131...
    Checking host: 199.241.160.130...
    Checking host: 199.241.160.167...
    Checking host: 199.241.160.168...

    There were no failures found validating connectivity between hosts.

    [jhqin@ecco-computer4 bin]$ streamtool checkhost -a

    Date: Wed Jun 26 12:28:43 EDT 2013
    Host: ecco-computer4  
    Instance: streams@jhqin
    4 Hosts to check: 199.241.160.131,199.241.160.130,199.241.160.167,199.241.160.168
    Reference host: 199.241.160.131



    =============================================================
    Phase 1 - per-host public key ssh connectivity test...
    =============================================================

    Checking host 1 of 4: 199.241.160.131...  host OK
    Checking host 2 of 4: 199.241.160.130...  host OK
    Checking host 3 of 4: 199.241.160.167...  host OK
    Checking host 4 of 4: 199.241.160.168...  host OK

    Phase 1 - public key ssh connectivity test summary:
    4 OK hosts.
    0 problem hosts:



    =============================================================
    Phase 2 - per-host dependency checking...
    =============================================================

    Checking host 1 of 4: 199.241.160.131...  host OK
    Checking host 2 of 4: 199.241.160.130...  host OK
    Checking host 3 of 4: 199.241.160.167...  host OK
    Checking host 4 of 4: 199.241.160.168...  host OK

    Phase 2 - per host dependency checking summary:
    4 OK hosts.
    0 problem hosts:
    0 problem categories:


    =============================================================
    Detailed host results
    Verbosity level: 1
    =============================================================




    =============================================================
    Overall Summary
    =============================================================

    4 hosts checked.
    4 OK hosts.
    0 problem hosts:

     

    Attachments

  • Jinhui Qin
    17 Posts

    Re: Partially Failed when starting an instance on multiple hosts in Streams 3.1

    ‏2013-06-26T17:40:19Z  

    Regarding the network configuration of the hosts in our cluster, our system admin just reminded us that all the hosts in the cluster are dual-homed, with public and admin networks. The admin network is locked down and will not allow communication with any non-whitelisted servers (which these are not). This didn't seem to be an issue with Streams 3.0.
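On dual-homed hosts it can help to confirm which address each host name resolves to, and whether that address is on the public network. A minimal Python sketch follows; the 199.241.160.0/24 network is an assumption inferred from the addresses shown in this thread, and the function names are illustrative.

```python
import ipaddress
import socket

# Assumption: the public network is 199.241.160.0/24, inferred from the
# host addresses that appear in the checkhost output in this thread.
PUBLIC_NET = ipaddress.ip_network("199.241.160.0/24")


def resolved_addresses(hostname):
    """Return the sorted, de-duplicated IPv4 addresses that hostname resolves to."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    return sorted({info[4][0] for info in infos})


def on_public_network(address, network=PUBLIC_NET):
    """True if the IPv4 address string belongs to the given network."""
    return ipaddress.ip_address(address) in network
```

Calling `resolved_addresses` with each host's own name, on each host, and passing the results to `on_public_network` shows whether any name resolves to the locked-down admin network instead of the public one.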

  • DennyHatz
    102 Posts

    Re: Partially Failed when starting an instance on multiple hosts in Streams 3.1

    ‏2013-06-26T18:22:57Z  

    I assume from looking at the logs that the interface you want to use (public) is the eth1 199.241.160.xxx address, correct?

    Can you turn on additional logging by issuing:

     streamtool setproperty InfrastructureTraceLevel=trace -i <yourinstanceid>

    Then attach the logs after trying to start the instance.

    Thank you

     

  • Jinhui Qin
    17 Posts

    Re: Partially Failed when starting an instance on multiple hosts in Streams 3.1

    ‏2013-06-26T19:20:09Z  

    Hi DennyHatz,

    Thanks for your quick response. Yes, you are correct about the public IP address that we used.

    Following your suggestions, here is the output from the command line. I have also attached the logs for this instance, which I just created across 4 hosts. I hope you can help us find a clue. Thanks!

    [jhqin@ecco-computer4 bin]$ streamtool setproperty InfrastructureTraceLevel=trace -i streams
    CDISC0008I The InfrastructureTraceLevel property was set to "trace" for the streams@jhqin instance. The previous property value was "error".
    [jhqin@ecco-computer4 bin]$ streamtool getproperty -i streams -a
    AAS.ConfigFile=/home/jhqin/.streams/instances/streams@jhqin/config/security-config.xml
    AAS.TraceLevel=default
    ConfigVersion=5.0
    HC.MetricCollectionInterval=3
    HC.PecStartTimeout=30
    HC.PecStopTimeout=30
    HC.PEC.TraceLevel=default
    HC.TraceLevel=default
    HostLoadProtection=false
    HostLoadThreshold=100
    InfrastructureTraceLevel=trace
    InstanceId=streams@jhqin
    LLMInputPortConfigFile=%STREAMS_INSTALL%/etc/cfg/llm.inputport.properties
    LLMOutputPortConfigFile=%STREAMS_INSTALL%/etc/cfg/llm.outputport.properties
    LogFileMaxFiles=3
    LogFileMaxSize=5000
    LogLevel=warn
    LogPath=/tmp
    LogType=file
    NameServiceUrl=DN:
    NS.MaxReplication=1
    NS.NumPartitions=1
    NS.TraceLevel=default
    OrbGiopMaxMsgSize=33554432
    PamEnableKey=true
    PamService=login
    RecoveryMode=off
    SAM.TraceLevel=default
    SCH.TraceLevel=default
    SecurityPublicKeyDirectory=/home/jhqin/.streams/key
    SecuritySessionTimeout=14400
    SRM.TraceLevel=default
    StreamsServiceStartTimeout=30
    SWS.certificateAuthenticationFormat=${cn}
    SWS.enableClientAuthentication=false
    SWS.graphThreshold=2000
    SWS.httpPort=OFF
    SWS.httpsPort=0
    SWS.jvmInitialSize=256
    SWS.jvmMaximumSize=512
    SWSPath=/tmp
    SWS.StartupPingRetryCount=30
    SWS.TraceLevel=default
    TraceFileMaxFiles=3
    TraceFileMaxSize=5000
    [jhqin@ecco-computer4 bin]$ streamtool startinstance -i streams
    CDISC0059I The system is starting the streams@jhqin instance.
    CDISC0078I The system is starting the runtime services on 4 hosts.
    CDISC0056I The system is starting the distributed name service on the 199.241.160.131 host. The distributed name service has 1 partitions and 1 replications.
    CDISC0057I The system is setting the NameServiceUrl property of the instance to DN:ecco-computer4.sharcnet.ca:42903, which is the URL of the distributed name service that is running.
    CDISC0061I The system is starting in parallel the runtime services of 1 management hosts.
    CDISC0060I The system is starting in parallel the runtime services of 3 application hosts.
    CDISC0003I The streams@jhqin instance was started.
    [jhqin@ecco-computer4 bin]$ streamtool getproperty -i streams -a
    AAS.ConfigFile=/home/jhqin/.streams/instances/streams@jhqin/config/security-config.xml
    AAS.TraceLevel=default
    ConfigVersion=5.0
    DNA.distributedNameServerPartitionServerCnt=0
    DNA.distributedNameServerReplicationCnt=1
    DNA.instanceStartedLock=jhqin
    DNA.instanceStartTime=2013-06-26T18:53:49-0400
    DNA.locale=en_US.UTF-8
    DNA.umask=0022
    HC.MetricCollectionInterval=3
    HC.PecStartTimeout=30
    HC.PecStopTimeout=30
    HC.PEC.TraceLevel=default
    HC.TraceLevel=default
    HostLoadProtection=false
    HostLoadThreshold=100
    InfrastructureTraceLevel=trace
    InstanceId=streams@jhqin
    LLMInputPortConfigFile=%STREAMS_INSTALL%/etc/cfg/llm.inputport.properties
    LLMOutputPortConfigFile=%STREAMS_INSTALL%/etc/cfg/llm.outputport.properties
    LogFileMaxFiles=3
    LogFileMaxSize=5000
    LogLevel=warn
    LogPath=/tmp
    LogType=file
    NameServiceUrl=DN:ecco-computer4.sharcnet.ca:42903
    NS.MaxReplication=1
    NS.NumPartitions=1
    NS.TraceLevel=default
    OrbGiopMaxMsgSize=33554432
    PamEnableKey=true
    PamService=login
    RecoveryMode=off
    SAM.TraceLevel=default
    SCH.TraceLevel=default
    SecurityPublicKeyDirectory=/home/jhqin/.streams/key
    SecuritySessionTimeout=14400
    SRM.TraceLevel=default
    StreamsServiceStartTimeout=30
    SWS.certificateAuthenticationFormat=${cn}
    SWS.enableClientAuthentication=false
    SWS.graphThreshold=2000
    SWS.httpPort=OFF
    SWS.httpsPort=0
    SWS.jvmInitialSize=256
    SWS.jvmMaximumSize=512
    SWS.ks=<undef> (pending value: ks:69f9b5f77d6ce85c3bca303e602e5f8ca3528b39538664060bf49906a87ca66f620b473439914d52)
    SWSPath=/tmp
    SWS.StartupPingRetryCount=30
    SWS.TraceLevel=default
    SWS.ts=<undef> (pending value: ts:f4e78144c48b754e0dae50ce699e2fd25a004fcf5ad7a7105d9b8e6a6be525bc80546a7653069445)
    TraceFileMaxFiles=3
    TraceFileMaxSize=5000
    [jhqin@ecco-computer4 bin]$
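Comparing the two property listings by hand is error-prone. A small Python sketch such as the following can diff two saved "streamtool getproperty -a" outputs. It assumes the KEY=VALUE line format shown above and skips prompt and message lines; the function names are illustrative.

```python
def parse_properties(text):
    """Parse KEY=VALUE lines into a dict, ignoring anything that is not a property."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        key, sep, value = line.partition("=")
        # Property names contain no spaces; shell prompts and messages do.
        if not sep or not key or " " in key:
            continue
        props[key] = value
    return props


def diff_properties(before, after):
    """Return (added, removed, changed) between two property dicts."""
    added = {k: after[k] for k in after.keys() - before.keys()}
    removed = {k: before[k] for k in before.keys() - after.keys()}
    changed = {
        k: (before[k], after[k])
        for k in before.keys() & after.keys()
        if before[k] != after[k]
    }
    return added, removed, changed


if __name__ == "__main__":
    before = parse_properties("NameServiceUrl=DN:\nLogLevel=warn")
    after = parse_properties(
        "NameServiceUrl=DN:ecco-computer4.sharcnet.ca:42903\n"
        "LogLevel=warn\nDNA.umask=0022"
    )
    print(diff_properties(before, after))
```

Applied to the two listings above, this reports the DNA.* entries, SWS.ks, and SWS.ts as added and NameServiceUrl as changed.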

    Attachments

  • Jinhui Qin
    17 Posts

    Re: Partially Failed when starting an instance on multiple hosts in Streams 3.1

    ‏2013-06-26T19:23:47Z  

    Attachments

  • Jinhui Qin
    17 Posts

    Re: Partially Failed when starting an instance on multiple hosts in Streams 3.1

    ‏2013-06-26T19:31:26Z  

    Just uploaded the logs again, in case the previous upload did not complete.

    Attachments

  • Jinhui Qin
    Jinhui Qin
    17 Posts

    Re: Partially Failed when starting an instance on multiple hosts in Streams 3.1

    ‏2013-06-26T19:34:50Z  

    Hi DennyHatz,

    Thanks for your quick response. Yes, you are correct about the public IP address that we used.

    Following your suggestions, here is the output from the command line. I have also attached the logs for this instance, which I just created across 4 hosts. I hope you can help us find a clue. Thanks!

    [jhqin@ecco-computer4 bin]$ streamtool setproperty InfrastructureTraceLevel=trace -i streams
    CDISC0008I The InfrastructureTraceLevel property was set to "trace" for the streams@jhqin instance. The previous property value was "error".
    [jhqin@ecco-computer4 bin]$ streamtool getproperty -i streams -a
    AAS.ConfigFile=/home/jhqin/.streams/instances/streams@jhqin/config/security-config.xml
    AAS.TraceLevel=default
    ConfigVersion=5.0
    HC.MetricCollectionInterval=3
    HC.PecStartTimeout=30
    HC.PecStopTimeout=30
    HC.PEC.TraceLevel=default
    HC.TraceLevel=default
    HostLoadProtection=false
    HostLoadThreshold=100
    InfrastructureTraceLevel=trace
    InstanceId=streams@jhqin
    LLMInputPortConfigFile=%STREAMS_INSTALL%/etc/cfg/llm.inputport.properties
    LLMOutputPortConfigFile=%STREAMS_INSTALL%/etc/cfg/llm.outputport.properties
    LogFileMaxFiles=3
    LogFileMaxSize=5000
    LogLevel=warn
    LogPath=/tmp
    LogType=file
    NameServiceUrl=DN:
    NS.MaxReplication=1
    NS.NumPartitions=1
    NS.TraceLevel=default
    OrbGiopMaxMsgSize=33554432
    PamEnableKey=true
    PamService=login
    RecoveryMode=off
    SAM.TraceLevel=default
    SCH.TraceLevel=default
    SecurityPublicKeyDirectory=/home/jhqin/.streams/key
    SecuritySessionTimeout=14400
    SRM.TraceLevel=default
    StreamsServiceStartTimeout=30
    SWS.certificateAuthenticationFormat=${cn}
    SWS.enableClientAuthentication=false
    SWS.graphThreshold=2000
    SWS.httpPort=OFF
    SWS.httpsPort=0
    SWS.jvmInitialSize=256
    SWS.jvmMaximumSize=512
    SWSPath=/tmp
    SWS.StartupPingRetryCount=30
    SWS.TraceLevel=default
    TraceFileMaxFiles=3
    TraceFileMaxSize=5000
    [jhqin@ecco-computer4 bin]$ streamtool startinstance -i streams
    CDISC0059I The system is starting the streams@jhqin instance.
    CDISC0078I The system is starting the runtime services on 4 hosts.
    CDISC0056I The system is starting the distributed name service on the 199.241.160.131 host. The distributed name service has 1 partitions and 1 replications.
    CDISC0057I The system is setting the NameServiceUrl property of the instance to DN:ecco-computer4.sharcnet.ca:42903, which is the URL of the distributed name service that is running.
    CDISC0061I The system is starting in parallel the runtime services of 1 management hosts.
    CDISC0060I The system is starting in parallel the runtime services of 3 application hosts.
    CDISC0003I The streams@jhqin instance was started.
    [jhqin@ecco-computer4 bin]$ streamtool getproperty -i streams -a
    AAS.ConfigFile=/home/jhqin/.streams/instances/streams@jhqin/config/security-config.xml
    AAS.TraceLevel=default
    ConfigVersion=5.0
    DNA.distributedNameServerPartitionServerCnt=0
    DNA.distributedNameServerReplicationCnt=1
    DNA.instanceStartedLock=jhqin
    DNA.instanceStartTime=2013-06-26T18:53:49-0400
    DNA.locale=en_US.UTF-8
    DNA.umask=0022
    HC.MetricCollectionInterval=3
    HC.PecStartTimeout=30
    HC.PecStopTimeout=30
    HC.PEC.TraceLevel=default
    HC.TraceLevel=default
    HostLoadProtection=false
    HostLoadThreshold=100
    InfrastructureTraceLevel=trace
    InstanceId=streams@jhqin
    LLMInputPortConfigFile=%STREAMS_INSTALL%/etc/cfg/llm.inputport.properties
    LLMOutputPortConfigFile=%STREAMS_INSTALL%/etc/cfg/llm.outputport.properties
    LogFileMaxFiles=3
    LogFileMaxSize=5000
    LogLevel=warn
    LogPath=/tmp
    LogType=file
    NameServiceUrl=DN:ecco-computer4.sharcnet.ca:42903
    NS.MaxReplication=1
    NS.NumPartitions=1
    NS.TraceLevel=default
    OrbGiopMaxMsgSize=33554432
    PamEnableKey=true
    PamService=login
    RecoveryMode=off
    SAM.TraceLevel=default
    SCH.TraceLevel=default
    SecurityPublicKeyDirectory=/home/jhqin/.streams/key
    SecuritySessionTimeout=14400
    SRM.TraceLevel=default
    StreamsServiceStartTimeout=30
    SWS.certificateAuthenticationFormat=${cn}
    SWS.enableClientAuthentication=false
    SWS.graphThreshold=2000
    SWS.httpPort=OFF
    SWS.httpsPort=0
    SWS.jvmInitialSize=256
    SWS.jvmMaximumSize=512
    SWS.ks=<undef> (pending value: ks:69f9b5f77d6ce85c3bca303e602e5f8ca3528b39538664060bf49906a87ca66f620b473439914d52)
    SWSPath=/tmp
    SWS.StartupPingRetryCount=30
    SWS.TraceLevel=default
    SWS.ts=<undef> (pending value: ts:f4e78144c48b754e0dae50ce699e2fd25a004fcf5ad7a7105d9b8e6a6be525bc80546a7653069445)
    TraceFileMaxFiles=3
    TraceFileMaxSize=5000
    [jhqin@ecco-computer4 bin]$
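
The two `streamtool getproperty -i streams -a` dumps above differ only in a few entries (for example `NameServiceUrl` and the `DNA.*` properties that appear after `startinstance`). A minimal Python sketch for spotting such differences automatically is shown here. The function names `parse_properties` and `diff_properties` are hypothetical helpers, not part of Streams, and the sample strings are short excerpts of the output above.

```python
def parse_properties(dump: str) -> dict[str, str]:
    """Parse `name=value` lines from a pasted getproperty dump."""
    props = {}
    for line in dump.splitlines():
        line = line.strip()
        # Skip blank lines, shell prompts, and CDISC status messages.
        if not line or line.startswith("[") or line.startswith("CDISC"):
            continue
        name, sep, value = line.partition("=")
        if sep:
            props[name] = value
    return props


def diff_properties(before: dict[str, str], after: dict[str, str]):
    """Return (added, removed, changed) between two property dicts."""
    added = {k: after[k] for k in after.keys() - before.keys()}
    removed = {k: before[k] for k in before.keys() - after.keys()}
    changed = {
        k: (before[k], after[k])
        for k in before.keys() & after.keys()
        if before[k] != after[k]
    }
    return added, removed, changed


if __name__ == "__main__":
    # Excerpts copied from the dumps in this post.
    before = parse_properties("NameServiceUrl=DN:\nLogLevel=warn\n")
    after = parse_properties(
        "NameServiceUrl=DN:ecco-computer4.sharcnet.ca:42903\n"
        "LogLevel=warn\n"
        "DNA.umask=0022\n"
    )
    added, removed, changed = diff_properties(before, after)
    for name in sorted(added):
        print(f"added   {name}={added[name]}")
    for name in sorted(removed):
        print(f"removed {name}={removed[name]}")
    for name in sorted(changed):
        old, new = changed[name]
        print(f"changed {name}: {old!r} -> {new!r}")
```

Saving each dump to a file and feeding both through `parse_properties` would give the same comparison for the full property list.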

    Just uploaded the logs again, in case the previous upload did not complete.

    Attachments

  • DennyHatz
    DennyHatz
    102 Posts

    Re: Partially Failed when starting an instance on multiple hosts in Streams 3.1

    ‏2013-06-26T21:18:36Z  

    If you still have Streams 3.0 installed, could you create a similar instance using Streams 3.0 on these nodes, then turn on additional tracing by using:

     streamtool setproperty InfrastructureTraceLevel=trace -i <yourinstanceid>

    Then start the 3.0 instance again and collect the logs.

    Thank you for your patience.

     

  • Jinhui Qin
    Jinhui Qin
    17 Posts

    Re: Partially Failed when starting an instance on multiple hosts in Streams 3.1

    ‏2013-06-27T15:18:27Z  

    Hi Denny,

    Thanks for your help. Attached are the streamtool command-line output and the instance logs. Each instance runs across two hosts: one on two hosts with Streams 3.0 installed, the other on two different hosts with Streams 3.1 installed. Each host has only one version of Streams installed.

    Both instances ran in similar environments, differing only in the installed Streams version. The ulimit settings on the Streams 3.0 hosts were even lower than those on the Streams 3.1 hosts, yet we still see the hc failure on the child host when using Streams 3.1. Neither instance had any jobs running. I hope you can find some clues in these logs; please feel free to let me know if you need any other information. Your help is really appreciated, thanks!

     

    Jinhui 

  • DennyHatz
    DennyHatz
    102 Posts

    Re: Partially Failed when starting an instance on multiple hosts in Streams 3.1

    ‏2013-06-27T15:55:17Z  
    This reply was deleted by Jinhui Qin 2013-06-27T17:05:58Z. Reason for deletion: It is duplicated as being posted twice for some reason.

    Jinhui

    It looks like you have discovered a bug in the Streams 3.1 code.  We are currently working on a fix and/or a workaround.  I will post back later today with what we come up with.

    Thanks again for your patience!

    Denny

  • DennyHatz
    DennyHatz
    102 Posts

    Re: Partially Failed when starting an instance on multiple hosts in Streams 3.1

    ‏2013-06-28T12:57:44Z  

    Jinhui

    Thanks for your patience!  We have come up with a possible workaround for you.

    Before you try to start your instance:

    Issue the following streamtool command. It sets a property on the instance configured to use Streams 3.1 that should force the use of your eth1 interface card instead of eth0, which it currently seems to be selecting.

    streamtool setproperty -i <instanceid> DNA.backDoorEvs="STREAMS_CONTROL_IF=eth1"

    Now start the instance.

    I believe all your hosts should report as healthy now and things should work as normal.

    Please report back on your results.

    Thank you
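
For anyone scripting the workaround above, here is a minimal Python sketch of the same sequence: set the property, start the instance, then list the properties to confirm the setting. It uses only the `streamtool` subcommands that appear in this thread. The helper names `workaround_commands` and `run_workaround` are hypothetical, and `eth1` is just the interface from this particular cluster. By default it only prints the commands (dry run), so nothing is executed unless you ask for it.

```python
import shlex
import subprocess


def workaround_commands(instance_id: str, interface: str = "eth1") -> list[list[str]]:
    """Build the argv lists for the workaround, in the order they should run."""
    return [
        # No shell is involved, so the value needs no extra quoting here.
        ["streamtool", "setproperty", "-i", instance_id,
         f"DNA.backDoorEvs=STREAMS_CONTROL_IF={interface}"],
        ["streamtool", "startinstance", "-i", instance_id],
        # List all properties so the new setting can be checked by eye.
        ["streamtool", "getproperty", "-i", instance_id, "-a"],
    ]


def run_workaround(instance_id: str, interface: str = "eth1", dry_run: bool = True) -> None:
    """Print each command; execute it too when dry_run is False."""
    for argv in workaround_commands(instance_id, interface):
        print("$", shlex.join(argv))
        if not dry_run:
            # check=True stops the sequence if a command fails.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_workaround("streams")  # dry run: prints the three commands only
```

The property has to be set before `startinstance`, which is why the order of the list matters.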

  • Jinhui Qin
    Jinhui Qin
    17 Posts

    Re: Partially Failed when starting an instance on multiple hosts in Streams 3.1

    ‏2013-07-10T19:03:21Z  

    Denny,

    Sorry for the late reply; I have been busy with other things. Thanks for the workaround. After I did what you suggested, it solved the earlier problem of the instance becoming "Partially Failed". All the hosts are now healthy and schedulable.

    However, when I submitted a distributed job to the running instance, it was automatically deployed on multiple hosts but failed to run properly. The job I used for testing was imported from one of the sample applications, "TaskParallel", that comes with the Streams 3.1 package. Before launching the job, I set the "trace output level" to "trace" in its launch configuration.

    Judging from the job's trace logs, it seems that the job's PEs failed to communicate across hosts. If PEs cannot communicate across hosts, there is little point in running an instance across multiple hosts.

    I wonder whether this is because the application traffic is set up differently from the instance control traffic in Streams 3.1. Could you please take a close look at the logs attached to this post, especially the job logs (the "job:0.pec:*.trace" files, once you extract the attached .tar file)? I really appreciate all your help. Thanks!!

    Along with the tar file, I have attached two other files: one records the instance settings obtained with the streamtool command you suggested, and the other describes the sample application used for testing. I hope these give you enough information; please feel free to let me know if you need anything more.

    Jinhui

  • Jinhui Qin
    Jinhui Qin
    17 Posts

    Re: Partially Failed when starting an instance on multiple hosts in Streams 3.1

    ‏2013-07-11T18:11:54Z  

    Denny,

     

    I did a comparison similar to the one you suggested earlier: I ran the same job on a 4-node cluster with Streams 3.0 installed and on another 4-node cluster with Streams 3.1 installed, then recorded the job trace logs from both environments. To make the two comparable, I set DNA.backDoorEvs="STREAMS_CONTROL_IF=eth1" on both instances before starting them. I then submitted the same sample application, "TaskParallel", to both instances as the test job.

    The job runs properly on Streams 3.0 but fails on Streams 3.1. I compared one trace file, job:0.pec:0.trace, from each set of job trace logs and found differences. When the job runs on Streams 3.1, one of the NAM.LookupEntry calls returns "Got object with the partition server:10.18.20.240:36819". That address is on our admin network (eth0), which has been locked down, and this might be the reason for the failure on Streams 3.1. The same call on Streams 3.0 returns "Got NameService::not_found"; an exception is thrown and the process then continues.

    Attached are the two trace log files. I have highlighted some differences, especially the lines starting at timestamp 10 Jul 2013 13:24:50.347 in the "StreamsV3.1_job:0.pec:0.trace" file and the lines starting at timestamp 11 Jul 2013 10:33:59.591 in the "StreamsV3.0_job:0.pec:0.trace" file.

    It seems that even with DNA.backDoorEvs="STREAMS_CONTROL_IF=eth1" set, some traffic in Streams 3.1 still tries to use eth0 and gets stuck there when running jobs. Would you please take a look at the log files and give us any further suggestions?

     

    Thanks

     

    Jinhui
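
The manual comparison described above (looking for name-service lookups that resolve to the locked-down admin network) can be sketched in Python as follows. This is only an illustration under stated assumptions: the regular expression matches just the message fragment quoted in this post, the rest of the trace line format is not assumed, and the `10.18.20.0/24` netmask for the admin network is a guess, since only the single address 10.18.20.240 is known. The helper names are hypothetical.

```python
import ipaddress
import re

# Matches only the message fragment quoted in the post above.
PARTITION_SERVER = re.compile(
    r"Got object with the partition server:(\d{1,3}(?:\.\d{1,3}){3}):(\d+)"
)


def find_partition_servers(trace_text: str):
    """Return (address, port) pairs for every partition-server message found."""
    return [
        (ipaddress.ip_address(host), int(port))
        for host, port in PARTITION_SERVER.findall(trace_text)
    ]


def on_network(endpoints, network: str):
    """Keep only the endpoints whose address falls inside `network`."""
    net = ipaddress.ip_network(network)
    return [(addr, port) for addr, port in endpoints if addr in net]


if __name__ == "__main__":
    # Both fragments are quoted from the post; real trace lines carry more fields.
    sample = (
        "Got object with the partition server:10.18.20.240:36819\n"
        "Got NameService::not_found\n"
    )
    endpoints = find_partition_servers(sample)
    # ASSUMPTION: the admin (eth0) network is 10.18.20.0/24.
    for addr, port in on_network(endpoints, "10.18.20.0/24"):
        print(f"lookup resolved to admin network: {addr}:{port}")
```

Running it over the full contents of each `job:0.pec:*.trace` file would list every lookup that landed on the admin network.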

  • DennyHatz
    DennyHatz
    102 Posts

    Re: Partially Failed when starting an instance on multiple hosts in Streams 3.1

    ‏2013-07-15T18:41:10Z  

    Jinhui

    You have discovered a bug in the Streams 3.1 code.  We are currently working on a fix.  Sorry, but the workaround does not fix the problem that occurs when PEs connect.  If you have Streams 3.1 with IBM support, please contact IBM support to get a fix for this problem.  If you do not have IBM support, you will need to wait for the next Streams 3.1 fix pack to be released.

    Sorry for any confusion or delay this may cause you.

    Denny

  • Jinhui Qin
    Jinhui Qin
    17 Posts

    Re: Partially Failed when starting an instance on multiple hosts in Streams 3.1

    ‏2013-07-15T20:29:37Z  

    Denny,

    Thanks for your reply. We may just keep using our current Streams 3.0 and consider upgrading to Streams 3.1 later, when the fix pack for Streams 3.1 is available. Thanks again for your help.

    Jinhui

  • Jinhui Qin
    Jinhui Qin
    17 Posts

    Re: Partially Failed when starting an instance on multiple hosts in Streams 3.1

    ‏2013-11-20T18:48:57Z  

    I just ran some tests and found that the problem is fixed in Streams 3.2. We are now planning to upgrade our environment from Streams 3.0 to Streams 3.2.