Partially Failed when starting an instance on multiple hosts in Streams 3.1
22 replies, latest post 2013-11-20T18:48:57Z by Jinhui Qin

Jinhui Qin
2013-06-21T15:48:53Z

Hi,

With all the new features provided in Streams 3.1, we are considering upgrading from Streams 3.0 to 3.1. Recently we tested an installation of Streams 3.1 on a cluster of CentOS 6 nodes. The installation of Streams 3.1 (on all nodes) and Streams Studio (on the head node only) was successful. I was able to create and start an instance across multiple hosts, but after a few seconds the "hc" (host controller) services failed on every node except the head node, and the instance became "partially failed". I was then able to repair the instance with the new repair feature in the Streams 3.1 Web Console, but after a while the "hc" services on all child nodes failed again. All of this happened before I had submitted any jobs to the instance. The services running on the head node all appeared to be fine.

We did not have this problem when using Streams 3.0 on a cluster. Could anyone give us a clue about what might be causing the problem, or what we need to adjust in Streams 3.1? Attached are the logs for the instance, which I downloaded from the Streams Console. We really hope someone here can help us out. Thanks!!

Jinhui

  • Stan
    2013-06-24T15:42:47Z  in response to Jinhui Qin

    Please increase the process limit and see if the failure still happens

    ### 21 Jun 2013 11:08:17 END:   .. verifyInstallCompat() - rc:0
    #########################################################################################
    #####  WARNING WARNING WARNING !! ULIMIT CHECK on Host ecco-computer19.sharcnet.ca
    #####  ulimit max user processes (-u) setting of 1024 is LOW
    #####  See InfoSphere Streams Information Center for ulimit recommendations
    #########################################################################################
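To see which limit a process on a given host actually gets, a small script can read the same value that `ulimit -u` reports. The following is a minimal Python sketch and not part of Streams. The `MIN_USER_PROCESSES` threshold is a placeholder assumption; take the real recommendation from the InfoSphere Streams Information Center.

```python
import resource

# Placeholder threshold (an assumption for illustration, not an IBM value).
MIN_USER_PROCESSES = 4096


def nproc_limit_ok(soft_limit, minimum=MIN_USER_PROCESSES):
    """Return True if the soft 'max user processes' limit is at least `minimum`."""
    if soft_limit == resource.RLIM_INFINITY:
        return True  # "unlimited" always passes
    return soft_limit >= minimum


def current_nproc_limit():
    """Return the (soft, hard) 'max user processes' limits of this process."""
    return resource.getrlimit(resource.RLIMIT_NPROC)


if __name__ == "__main__":
    soft, hard = current_nproc_limit()
    print(f"max user processes: soft={soft} hard={hard}")
    if not nproc_limit_ok(soft):
        print("soft limit looks low; consider raising it on every host")
```

Run it on each host as the user that owns the instance. With the placeholder threshold, the value of 1024 from the warning above fails the check.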

    • Jinhui Qin
      2013-06-24T17:00:36Z  in response to Stan

      Stan,

      Thanks for your reply. I noticed this warning message. The same warning appeared when I installed Streams 3.0, and everything seemed to work fine there without any adjustment. Anyway, I will try raising the ulimit setting to see whether that solves the problem.

       

      Jinhui

       

    • Jinhui Qin
      2013-06-25T13:41:34Z  in response to Stan

      Stan,

      We increased the process limit from 1024 to 65536 on all nodes and the warning message is gone, but the failure still happens. A couple of minutes after the instance starts properly, and before I have submitted any jobs to it, the hc services on all the child nodes fail. Attached are the new logs that I downloaded from the instance. Could you or anyone else find a clue in them and give us more advice? Your help is really appreciated.

       

      Jinhui 

      • jingdongsun
        2013-06-26T03:57:52Z  in response to Jinhui Qin

        Please run streamtool checkhost to see whether it reports any errors, especially about network connections among the hosts.

        Also, please verify that the firewalls are disabled on all hosts.
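streamtool checkhost is the primary check here. As a complementary, tool-independent probe, a few lines of Python can test whether a TCP connection from one host to another succeeds on a given port. This is only a sketch: the function names are made up for illustration, and which ports the Streams services listen on is not stated in this thread, so the port is something you have to supply.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks, and timeouts.
        return False


def check_hosts(hosts, port, timeout=3.0):
    """Map each host name or address to the result of can_connect()."""
    return {host: can_connect(host, port, timeout) for host in hosts}
```

For example, calling `check_hosts(["nodeA", "nodeB"], 22)` from the head node (host names hypothetical) probes the SSH port of each child node. A host that maps to `False` points to a firewall or routing problem on that path.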

        • Jinhui Qin
          2013-06-26T14:24:30Z  in response to jingdongsun

          Thanks, Jingdong, for your reply. At first we thought it might be a firewall issue, so we made sure that the firewall was turned off on all hosts. However, the failure still happened.

          Following your suggestion, I ran "streamtool checkhost", and the output didn't show any errors. This was for an instance running on three hosts:


          [jhqin@ecco-computer19 ~]$ streamtool checkhost

          Date: Wed Jun 26 10:00:32 EDT 2013
          Host: ecco-computer19  
          Instance: streams@jhqin
          3 Hosts to check: 199.241.160.146,199.241.160.148,199.241.160.139
          Reference host: 199.241.160.146



          =============================================================
          Phase 1 - per-host public key ssh connectivity test...
          =============================================================

          Checking host 1 of 3: 199.241.160.146...  host OK
          Checking host 2 of 3: 199.241.160.148...  host OK
          Checking host 3 of 3: 199.241.160.139...  host OK

          Phase 1 - public key ssh connectivity test summary:
          3 OK hosts.
          0 problem hosts:



          =============================================================
          Phase 2 - per-host dependency checking...
          =============================================================

          Checking host 1 of 3: 199.241.160.146...  host OK
          Checking host 2 of 3: 199.241.160.148...  host OK
          Checking host 3 of 3: 199.241.160.139...  host OK

          Phase 2 - per host dependency checking summary:
          3 OK hosts.
          0 problem hosts:
          0 problem categories:


          =============================================================
          Detailed host results
          Verbosity level: 1
          =============================================================




          =============================================================
          Overall Summary
          =============================================================

          3 hosts checked.
          3 OK hosts.
          0 problem hosts:

           

          _________________

          Still, the hc services on the two child nodes failed, while everything on the head node was fine. Attached are the logs for this instance. Again, clicking "Repair Instance" in the Streams Console made all services healthy for a few seconds, but after a refresh the instance was "Partially failed" again for the same reason. Any other suggestions?

          • jingdongsun
            2013-06-26T16:23:18Z  in response to Jinhui Qin

            Based on the trace, all services come up all right, but all CORBA calls across hosts fail.

            I still think this is a network issue, but I do not know what to check next. Perhaps double-check the network settings on all hosts?

            Also, "streamtool checkhost --connectivity-only" does more cross-checking and may give some clues, so it is worth a try.

            Another suggestion: if you do not need multiple hosts for your current work, please try running the instance on a single host.

            Thanks.

            • Jinhui Qin
              2013-06-26T17:00:24Z  in response to jingdongsun

              "streamtool checkhost --connectivity-only" also reported no errors. This was for an instance running on 4 hosts with the same problem as before. A single-host instance has only two cores, which is not enough for running our jobs. We do have a working environment for running Streams jobs across multiple hosts in Streams 3.0. Since Streams 3.1 was released, we have been considering upgrading to it, but we ran into this problem. I agree with you that this is a network issue.

              At the "Streams 3.1 Developers Conference Webcast on June 6", one of the speakers (Denny Hatzenbihler?), who talked about the Streams runtime, mentioned that in Streams 3.1 the application network setup is done differently for performance reasons: Streams 3.1 separates control traffic from user application traffic. I am only guessing, but that could be related to the issue we encountered. We may need to adjust something when using Streams 3.1, but we don't know how or where.

              Thanks for your suggestions. We are still hoping someone here can help us solve the problem.

              Here is the output from "streamtool checkhost", and attached are the logs for this instance, which ran on 4 hosts.

              [jhqin@ecco-computer4 bin]$ streamtool checkhost --connectivity-only

              Checking connectivity between the following hosts:

              199.241.160.131,199.241.160.130,199.241.160.167,199.241.160.168

              Checking host: 199.241.160.131...
              Checking host: 199.241.160.130...
              Checking host: 199.241.160.167...
              Checking host: 199.241.160.168...

              There were no failures found validating connectivity between hosts.

              [jhqin@ecco-computer4 bin]$ streamtool checkhost -a

              Date: Wed Jun 26 12:28:43 EDT 2013
              Host: ecco-computer4  
              Instance: streams@jhqin
              4 Hosts to check: 199.241.160.131,199.241.160.130,199.241.160.167,199.241.160.168
              Reference host: 199.241.160.131



              =============================================================
              Phase 1 - per-host public key ssh connectivity test...
              =============================================================

              Checking host 1 of 4: 199.241.160.131...  host OK
              Checking host 2 of 4: 199.241.160.130...  host OK
              Checking host 3 of 4: 199.241.160.167...  host OK
              Checking host 4 of 4: 199.241.160.168...  host OK

              Phase 1 - public key ssh connectivity test summary:
              4 OK hosts.
              0 problem hosts:



              =============================================================
              Phase 2 - per-host dependency checking...
              =============================================================

              Checking host 1 of 4: 199.241.160.131...  host OK
              Checking host 2 of 4: 199.241.160.130...  host OK
              Checking host 3 of 4: 199.241.160.167...  host OK
              Checking host 4 of 4: 199.241.160.168...  host OK

              Phase 2 - per host dependency checking summary:
              4 OK hosts.
              0 problem hosts:
              0 problem categories:


              =============================================================
              Detailed host results
              Verbosity level: 1
              =============================================================




              =============================================================
              Overall Summary
              =============================================================

              4 hosts checked.
              4 OK hosts.
              0 problem hosts:

               

            • Jinhui Qin
              2013-06-26T17:40:19Z  in response to jingdongsun

              Regarding the network configuration of the hosts in our cluster: our system admin just reminded us that all the hosts in the cluster are dual-homed, with public and admin networks. The admin network is locked down and does not allow communication with any non-whitelisted servers (which these are not). This didn't seem to be an issue with Streams 3.0.
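On a dual-homed host it helps to know which interface carries which network. Here is a minimal Python sketch that reads the text printed by `ip -o -4 addr show` and lists the interfaces whose address lies in a given network. The sample text, the interface names, the 10.1.0.0/16 admin range, and the /24 prefix on the public range are all assumptions for illustration; only the 199.241.160.x addresses come from this thread.

```python
import ipaddress


def interfaces_on_network(ip_addr_output, network):
    """Return names of interfaces whose IPv4 address falls inside `network`.

    `ip_addr_output` is the text printed by `ip -o -4 addr show`.
    """
    net = ipaddress.ip_network(network)
    matches = []
    for line in ip_addr_output.splitlines():
        fields = line.split()
        # Expected layout: "<index>: <name> inet <address>/<prefix> ..."
        if len(fields) < 4 or fields[2] != "inet":
            continue
        name = fields[1]
        address = ipaddress.ip_interface(fields[3]).ip
        if address in net and name not in matches:
            matches.append(name)
    return matches


# Hypothetical sample for one dual-homed host (made-up admin address).
SAMPLE_OUTPUT = """\
1: lo    inet 127.0.0.1/8 scope host lo
2: eth0    inet 10.1.0.19/16 brd 10.1.255.255 scope global eth0
3: eth1    inet 199.241.160.146/24 brd 199.241.160.255 scope global eth1
"""

if __name__ == "__main__":
    print(interfaces_on_network(SAMPLE_OUTPUT, "199.241.160.0/24"))
```

Feeding it the real command output from each host shows whether the public addresses sit on the same interface name everywhere, which matters if a tool picks an interface by name or by order.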

              • DennyHatz
                2013-06-26T18:22:57Z  in response to Jinhui Qin

                I assume from looking at the logs that the interface you want to be using (public) is eth1, with the 199.241.160.xxx address. Is that correct?

                Can you turn on additional logging by issuing:

                 streamtool setproperty InfrastructureTraceLevel=trace -i <yourinstanceid>

                Then attach the logs after trying to start the instance.

                Thank you

                 

                • Jinhui Qin
                  2013-06-26T19:20:09Z  in response to DennyHatz

                  Hi DennyHatz,

                  Thanks for your quick response. Yes, you are correct about the public IP address that we use.

                  Following your suggestion, here is the output from the command line. I have also attached the logs for this instance, which I just created across 4 hosts. I hope you can help us find a clue. Thanks!

                  [jhqin@ecco-computer4 bin]$ streamtool setproperty InfrastructureTraceLevel=trace -i streams
                  CDISC0008I The InfrastructureTraceLevel property was set to "trace" for the streams@jhqin instance. The previous property value was "error".
                  [jhqin@ecco-computer4 bin]$ streamtool getproperty -i streams -a
                  AAS.ConfigFile=/home/jhqin/.streams/instances/streams@jhqin/config/security-config.xml
                  AAS.TraceLevel=default
                  ConfigVersion=5.0
                  HC.MetricCollectionInterval=3
                  HC.PecStartTimeout=30
                  HC.PecStopTimeout=30
                  HC.PEC.TraceLevel=default
                  HC.TraceLevel=default
                  HostLoadProtection=false
                  HostLoadThreshold=100
                  InfrastructureTraceLevel=trace
                  InstanceId=streams@jhqin
                  LLMInputPortConfigFile=%STREAMS_INSTALL%/etc/cfg/llm.inputport.properties
                  LLMOutputPortConfigFile=%STREAMS_INSTALL%/etc/cfg/llm.outputport.properties
                  LogFileMaxFiles=3
                  LogFileMaxSize=5000
                  LogLevel=warn
                  LogPath=/tmp
                  LogType=file
                  NameServiceUrl=DN:
                  NS.MaxReplication=1
                  NS.NumPartitions=1
                  NS.TraceLevel=default
                  OrbGiopMaxMsgSize=33554432
                  PamEnableKey=true
                  PamService=login
                  RecoveryMode=off
                  SAM.TraceLevel=default
                  SCH.TraceLevel=default
                  SecurityPublicKeyDirectory=/home/jhqin/.streams/key
                  SecuritySessionTimeout=14400
                  SRM.TraceLevel=default
                  StreamsServiceStartTimeout=30
                  SWS.certificateAuthenticationFormat=${cn}
                  SWS.enableClientAuthentication=false
                  SWS.graphThreshold=2000
                  SWS.httpPort=OFF
                  SWS.httpsPort=0
                  SWS.jvmInitialSize=256
                  SWS.jvmMaximumSize=512
                  SWSPath=/tmp
                  SWS.StartupPingRetryCount=30
                  SWS.TraceLevel=default
                  TraceFileMaxFiles=3
                  TraceFileMaxSize=5000
                  [jhqin@ecco-computer4 bin]$ streamtool startinstance -i streams
                  CDISC0059I The system is starting the streams@jhqin instance.
                  CDISC0078I The system is starting the runtime services on 4 hosts.
                  CDISC0056I The system is starting the distributed name service on the 199.241.160.131 host. The distributed name service has 1 partitions and 1 replications.
                  CDISC0057I The system is setting the NameServiceUrl property of the instance to DN:ecco-computer4.sharcnet.ca:42903, which is the URL of the distributed name service that is running.
                  CDISC0061I The system is starting in parallel the runtime services of 1 management hosts.
                  CDISC0060I The system is starting in parallel the runtime services of 3 application hosts.
                  CDISC0003I The streams@jhqin instance was started.
                  [jhqin@ecco-computer4 bin]$ streamtool getproperty -i streams -a
                  AAS.ConfigFile=/home/jhqin/.streams/instances/streams@jhqin/config/security-config.xml
                  AAS.TraceLevel=default
                  ConfigVersion=5.0
                  DNA.distributedNameServerPartitionServerCnt=0
                  DNA.distributedNameServerReplicationCnt=1
                  DNA.instanceStartedLock=jhqin
                  DNA.instanceStartTime=2013-06-26T18:53:49-0400
                  DNA.locale=en_US.UTF-8
                  DNA.umask=0022
                  HC.MetricCollectionInterval=3
                  HC.PecStartTimeout=30
                  HC.PecStopTimeout=30
                  HC.PEC.TraceLevel=default
                  HC.TraceLevel=default
                  HostLoadProtection=false
                  HostLoadThreshold=100
                  InfrastructureTraceLevel=trace
                  InstanceId=streams@jhqin
                  LLMInputPortConfigFile=%STREAMS_INSTALL%/etc/cfg/llm.inputport.properties
                  LLMOutputPortConfigFile=%STREAMS_INSTALL%/etc/cfg/llm.outputport.properties
                  LogFileMaxFiles=3
                  LogFileMaxSize=5000
                  LogLevel=warn
                  LogPath=/tmp
                  LogType=file
                  NameServiceUrl=DN:ecco-computer4.sharcnet.ca:42903
                  NS.MaxReplication=1
                  NS.NumPartitions=1
                  NS.TraceLevel=default
                  OrbGiopMaxMsgSize=33554432
                  PamEnableKey=true
                  PamService=login
                  RecoveryMode=off
                  SAM.TraceLevel=default
                  SCH.TraceLevel=default
                  SecurityPublicKeyDirectory=/home/jhqin/.streams/key
                  SecuritySessionTimeout=14400
                  SRM.TraceLevel=default
                  StreamsServiceStartTimeout=30
                  SWS.certificateAuthenticationFormat=${cn}
                  SWS.enableClientAuthentication=false
                  SWS.graphThreshold=2000
                  SWS.httpPort=OFF
                  SWS.httpsPort=0
                  SWS.jvmInitialSize=256
                  SWS.jvmMaximumSize=512
                  SWS.ks=<undef> (pending value: ks:69f9b5f77d6ce85c3bca303e602e5f8ca3528b39538664060bf49906a87ca66f620b473439914d52)
                  SWSPath=/tmp
                  SWS.StartupPingRetryCount=30
                  SWS.TraceLevel=default
                  SWS.ts=<undef> (pending value: ts:f4e78144c48b754e0dae50ce699e2fd25a004fcf5ad7a7105d9b8e6a6be525bc80546a7653069445)
                  TraceFileMaxFiles=3
                  TraceFileMaxSize=5000
                  [jhqin@ecco-computer4 bin]$
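Comparing the two `streamtool getproperty -a` listings above by eye is tedious. A minimal Python sketch for diffing two such listings follows. It assumes each line has the `name=value` form shown above, and the helper names are hypothetical. The sample strings are a shortened excerpt of the output above.

```python
def parse_properties(text):
    """Parse `name=value` lines into a dict, skipping lines without '='."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '=' themselves.
        name, _, value = line.partition("=")
        props[name] = value
    return props


def diff_properties(before, after):
    """Return (added, removed, changed) between two property dicts."""
    added = {k: after[k] for k in after if k not in before}
    removed = {k: before[k] for k in before if k not in after}
    changed = {
        k: (before[k], after[k])
        for k in before
        if k in after and before[k] != after[k]
    }
    return added, removed, changed


# Shortened excerpts of the listings before and after startinstance.
BEFORE = """\
InfrastructureTraceLevel=trace
NameServiceUrl=DN:
NS.NumPartitions=1
"""
AFTER = """\
DNA.umask=0022
InfrastructureTraceLevel=trace
NameServiceUrl=DN:ecco-computer4.sharcnet.ca:42903
NS.NumPartitions=1
"""

if __name__ == "__main__":
    added, removed, changed = diff_properties(
        parse_properties(BEFORE), parse_properties(AFTER)
    )
    print("added:", added)
    print("removed:", removed)
    print("changed:", changed)
```

Applied to the full listings, this isolates what startinstance changed: the new DNA.* entries, the pending SWS.ks and SWS.ts values, and the NameServiceUrl that now names the head node.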

                  • Jinhui Qin
                    2013-06-26T19:31:26Z  in response to Jinhui Qin

                    I just uploaded the logs again, in case the previous upload didn't complete.

                  • DennyHatz
                    2013-06-26T21:18:36Z  in response to Jinhui Qin

                    If you still have Streams 3.0 installed, could you create a similar instance using Streams 3.0 on these nodes, then turn on additional tracing by using:

                     streamtool setproperty InfrastructureTraceLevel=trace -i <yourinstanceid>

                    Then again start the 3.0 instance and collect the logs.

                    Thank you for your patience

                     

                    • Jinhui Qin
                      2013-06-27T15:18:27Z  in response to DennyHatz

                      Hi Denny,

                      Thanks for your help. Attached please find the streamtool command-line output and the instance logs. Both instances run across two hosts: one on two hosts with Streams 3.0 installed, and the other on another two hosts with Streams 3.1 installed. Each host has only one version of Streams installed.

                      Both instances ran in similar environments, differing only in the installed Streams version. On the Streams 3.0 hosts the ulimit setting was even lower than on the Streams 3.1 hosts. Nevertheless, we still see the hc failure on the child host when using Streams 3.1. Neither instance had any jobs running yet. I hope you can find some clues in these logs, and please feel free to let me know if you need any other information. Your help is really appreciated, thanks!

                       

                      Jinhui 

                      • DennyHatz
                        2013-06-27T15:55:17Z  in response to Jinhui Qin

                        Jinhui

                        It looks like you have discovered a bug in the Streams 3.1 code. We are currently working on a fix and/or a workaround. I will post back later today with what we come up with.

                        Thanks again for your patience!

                        Denny

                        • DennyHatz
                          2013-06-28T12:57:44Z  in response to DennyHatz

                          Jinhui

                          Thanks for your patience! We have come up with a possible workaround for you.

                          Before you try to start your instance, issue the following streamtool command. It sets a property on the instance configured to use Streams 3.1 that should force the use of your eth1 interface instead of eth0, which it now seems to be selecting.

                          streamtool setproperty -i <instanceid> DNA.backDoorEvs="STREAMS_CONTROL_IF=eth1"

                          Now start the instance.

                          I believe all your hosts should report as healthy now and things should work as normal.

                          Please report back on your results.

                          Thank you
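If the workaround has to be applied to several instances, the command can be generated rather than retyped. The sketch below only builds the command line quoted above; the helper names are hypothetical, and it assumes nothing about streamtool beyond that one command. Note that the shell removes the double quotes in the original command, so the property is passed to streamtool as a single argument.

```python
import shlex


def control_interface_command(instance_id, interface):
    """Build the argument list for the workaround command quoted in this thread."""
    return [
        "streamtool",
        "setproperty",
        "-i",
        instance_id,
        # The shell strips the quotes, so this is one plain argument.
        f"DNA.backDoorEvs=STREAMS_CONTROL_IF={interface}",
    ]


def as_shell(argv):
    """Render an argument list as a shell line, quoting only where needed."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    print(as_shell(control_interface_command("streams", "eth1")))
```

The list form can be passed directly to `subprocess.run(argv, check=True)` on a host where streamtool is on the PATH, which avoids any shell quoting issues.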

                          • Jinhui Qin
                            2013-07-10T19:03:21Z  in response to DennyHatz

                            Denny,

                            Sorry for the late reply because I was busy on something else these days. Thanks for the work around solution. After I did what you suggested, it did solve the previous problem of getting a "Partially Failed" instance. Now all the hosts were healthy and schedulable.


                            However, when I submitted a distributed job to the running instance, it was automatically deployed on multiple hosts but failed to run properly. The job I used for testing was simply imported from one of the sample applications, "TaskParallel", that came with the Streams 3.1 package. Before launching the job, I set the "trace output level" to "trace" in the job's launch configuration.




                            Judging from the trace logs of this job, it seems that the job's PEs failed to communicate across hosts. If PEs cannot communicate across hosts, there is no point in having an instance that spans multiple hosts.




                            I was wondering whether this could be because application traffic is configured differently from instance control traffic in Streams 3.1. Could you please take a close look at the logs attached to this post, especially the job logs (i.e. the "job:0.pec:*.trace" files you get after extracting the attached .tar file)? I really appreciate all your help. Thanks!!




                            Along with the tar file, I also attached two other files: one records the instance settings after running the streamtool command you suggested, and the other describes the sample application used for testing. I hope these give you enough information, and please feel free to let me know if you need anything more.



                             

                            Jinhui

                          • Jinhui Qin
                            Jinhui Qin
                            17 Posts
                            ACCEPTED ANSWER

                            Re: Partially Failed when starting an instance on multiple hosts in Streams 3.1

                            ‏2013-07-11T18:11:54Z  in response to DennyHatz

                            Denny,

                             

                            Following the kind of comparison you suggested earlier, I ran the same job on a 4-node cluster with Streams 3.0 installed and on another 4-node cluster with Streams 3.1 installed, and recorded the job trace logs from both environments. To make sure they were comparable, I set DNA.backDoorEvs="STREAMS_CONTROL_IF=eth1" on both instances before starting them. I then submitted the same sample application, "TaskParallel", to both instances as the test job.

                             

                            The job ran properly on Streams 3.0 but failed on Streams 3.1. I compared one trace file from each set of job trace logs (job:0.pec:0.trace) and found differences. When the job ran on Streams 3.1, one of the NAM.LookupEntry calls returned "Got object with the partition server:10.18.20.240:36819". That address is on our admin network (eth0), which has been locked down, so this might be the reason for the failure on Streams 3.1. On Streams 3.0 the same call returned "Got NameService::not_found", and after throwing an exception the process continued.
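For anyone who wants to repeat this check on their own trace files, here is a minimal Python sketch of the comparison described above. It is not part of Streams. It assumes the lookup result appears in the trace exactly as in the phrase quoted above ("Got object with the partition server:<ip>:<port>"), and the function name and the admin subnet you pass in (for example 10.18.20.0/24) are illustrative assumptions, not something confirmed by the product documentation.

```python
import ipaddress
import re

# Assumed message format, based only on the phrase quoted in the post above.
LOOKUP_RE = re.compile(
    r"Got object with the partition server:(\d+\.\d+\.\d+\.\d+):(\d+)"
)


def find_admin_network_lookups(lines, admin_cidr):
    """Return (line_number, ip, port) for lookups that resolved into admin_cidr.

    `lines` is any iterable of trace lines; `admin_cidr` is the locked-down
    network in CIDR notation (a value you supply for your own cluster).
    """
    network = ipaddress.ip_network(admin_cidr)
    hits = []
    for number, line in enumerate(lines, start=1):
        match = LOOKUP_RE.search(line)
        if match is None:
            continue  # not a successful lookup line
        ip_text, port = match.group(1), int(match.group(2))
        try:
            address = ipaddress.ip_address(ip_text)
        except ValueError:
            continue  # digits matched but are not a valid IPv4 address
        if address in network:
            hits.append((number, ip_text, port))
    return hits
```

To use it, open a trace file and pass the file object together with your admin subnet, e.g. `find_admin_network_lookups(open(path), "10.18.20.0/24")`. Any hit means a lookup returned an address on the network that the PEs cannot reach.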

                             

                            Please find the two trace log files attached. I have highlighted some differences, especially the lines starting at timestamp 10 Jul 2013 13:24:50.347 in the "StreamsV3.1_job:0.pec:0.trace" file and the lines starting at timestamp 11 Jul 2013 10:33:59.591 in the "StreamsV3.0_job:0.pec:0.trace" file.

                             

                            It seems that even with DNA.backDoorEvs="STREAMS_CONTROL_IF=eth1" set, some traffic in Streams 3.1 still attempted to use eth0 and got stuck there when running jobs. Would you please take a look at the log files and give us any further suggestions?

                             

                            Thanks

                             

                            Jinhui

                            • DennyHatz
                              DennyHatz
                              102 Posts
                              ACCEPTED ANSWER

                              Re: Partially Failed when starting an instance on multiple hosts in Streams 3.1

                              ‏2013-07-15T18:41:10Z  in response to Jinhui Qin

                              Jinhui

                              You have discovered a bug in the Streams 3.1 code.  We are currently working on a fix.  Sorry, but the workaround doesn't fix the problem that occurs when PEs connect.  If you have Streams 3.1 with IBM support, please contact IBM support to get a fix for this problem.  If you do not have IBM support, you will need to wait for the next Streams 3.1 fixpack to be released.

                              Sorry for any confusion or delay this may cause you.

                              Denny

                              • Jinhui Qin
                                Jinhui Qin
                                17 Posts
                                ACCEPTED ANSWER

                                Re: Partially Failed when starting an instance on multiple hosts in Streams 3.1

                                ‏2013-07-15T20:29:37Z  in response to DennyHatz

                                Denny,

                                Thanks for your reply. We will probably keep using our current Streams 3.0 installation and consider upgrading to Streams 3.1 once the fixpack is available. Thanks again for your help.

                                Jinhui

                                • Jinhui Qin
                                  Jinhui Qin
                                  17 Posts
                                  ACCEPTED ANSWER

                                  Re: Partially Failed when starting an instance on multiple hosts in Streams 3.1

                                  ‏2013-11-20T18:48:57Z  in response to Jinhui Qin

                                  I just ran some tests and found that the problem is fixed in Streams 3.2. We are now planning to upgrade our environment from Streams 3.0 to Streams 3.2.
