12 replies · Latest Post: 2007-12-21T17:34:14Z by michael-t
astdenis
11 Posts

LoadLeveler 3.4 and SMT.

2007-02-22T03:23:54Z
Hello,

I thought I should try this forum before opening a severity 4 PMR. Let's see.

We installed LL 3.4 yesterday. The new smt keyword, with its default set to "no", caught us by surprise: all our MPI jobs took a significant performance hit. We worked around the issue by setting smt = yes in the default class, but we don't understand the behaviour. We have a POWER5+ p575 cluster, and our users run jobs that use no more than 16 CPUs. smt = no should mean "do not run on the secondary thread", so on a 16-core POWER5+ p575 node a 16-thread job should run fine, right? Unless something causes it to bind to 8 cores and their secondary threads. Has anybody else experienced something similar?
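For anyone else hitting the same regression, the workaround amounts to a single keyword in the default class stanza of LoadL_admin. A sketch of our setup (the smt class keyword is new in LL 3.4 and, as I understand it, accepts yes/no/as_is; treat the exact stanza as illustrative):

```
# LoadL_admin -- default class stanza (illustrative)
default: type = class
         smt = yes
```

followed by "llctl -g reconfig" to push the change out to the cluster.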

We activated smt = yes by modifying the default class and then running "llctl -g reconfig". The effect was immediate: the performance hit disappeared. Now, however, users who try to bind threads to processors are seeing inconsistent behaviour. The processor numbering scheme escapes us. For example, "bindprocessor -s 0" reports

The available processors are: 2 3 5 7 8 1 12 4 15 13 11 9 14 10 6 0

on most of our nodes. We rebooted a node and it reverted to:

The available processors are: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30

What is causing this re-ordering? Our initial attempts at using bindprocessor always assumed that even numbers were the primary threads and odd numbers the secondary threads. The numbering shown above makes it difficult to programmatically bind to the "right" processors.
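For what it's worth, our broken assumption is easy to state as code. A small sketch (plain Python, hypothetical helper names) of the even/odd rule we were relying on, and of why the renumbered node defeats it:

```python
# Sketch of the even/odd assumption for SMT-2 logical CPU numbering.
# WARNING: this assumption breaks once AIX renumbers the logical
# processors after an SMT mode switch, as the bindprocessor output
# above shows.

def primary_threads(logical_cpus):
    """Return the logical CPUs assumed to be primary h/w threads."""
    return [cpu for cpu in logical_cpus if cpu % 2 == 0]

def core_of(cpu):
    """Map a logical CPU to its assumed physical core (SMT-2)."""
    return cpu // 2

# After a fresh reboot the numbering matches the assumption:
fresh = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30]
print(primary_threads(fresh) == fresh)   # True: all even, all "primary"

# On the renumbered node the same rule picks an arbitrary mix:
renumbered = [2, 3, 5, 7, 8, 1, 12, 4, 15, 13, 11, 9, 14, 10, 6, 0]
print(primary_threads(renumbered))       # [2, 8, 12, 4, 14, 10, 6, 0]
```

The helper silently returns only half the "primary" CPUs on the renumbered node, which is exactly the kind of failure our binding scripts ran into.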

Today, a user ran a small OpenMP job both under LL and interactively, with and without processor binding. The initial results show that the interactive runs perform much better. What could cause this?

At SP-XXL two weeks ago, Waiman Chan asked for feedback on the SMT control in LL 3.4. Good. Could we get a detailed explanation of the current SMT control implementation in LL? Our feedback would then be much more meaningful, I'm sure.

Alain St-Denis
MSC
  • SystemAdmin
    46 Posts

    Re: LoadLeveler 3.4 and SMT.

    2007-02-22T17:55:28Z  in response to astdenis
    Alain,

    LoadLeveler should not be changing the SMT mode that you have already set up for the system before starting LoadLeveler.
    There was a defect in LoadLeveler in this area; the fix is available in PTF 2.

    Which PTF level of LoadLeveler 3.4 did you install on your nodes?
    If the scenario still happens with PTF 2 or a later version, we will need to investigate this issue further.

    Changing the SMT mode can change the processor numbering scheme on a node.
    Those who depend on the numbering scheme to bindprocessor a task
    need to avoid switching SMT on/off.

    We are looking into putting out a flash to make others aware of the above.

    Thanks.

    Regards,
    Waiman
    • astdenis
      11 Posts

      Re: LoadLeveler 3.4 and SMT.

      2007-02-22T18:05:01Z  in response to SystemAdmin
      Hi,

      We are running 3.4.0.1. Are you saying that 3.4.0.1 does switch SMT on and off, and that this is fixed in 3.4.0.2? We know SMT is enabled on all our nodes. I've been monitoring the SMT status, and it does switch back and forth between enabled and disabled on what seem to be random nodes.

      Thanks.
      Alain.
      • SystemAdmin
        46 Posts

        Re: LoadLeveler 3.4 and SMT.

        2007-02-23T17:08:16Z  in response to astdenis
        Hi Alain,

        Prior to LL 3.4.0.2, the SMT mode was mistakenly set to 'no' for the job at job start when nothing related to SMT was specified in the job or in the stanza. And whenever the SMT mode is changed on a node, the processor numbering scheme can also be affected. The problem was fixed in 3.4.0.2: when you restart the node with AIX and LL 3.4.0.2, the SMT mode will not be modified, even when nothing SMT-related is specified in the job/stanza, and you will not see the processor numbering issue any more.

        A flash will be put out shortly to warn other users who are migrating to LL V3.4 and running with SMT on.

        Sorry for any inconvenience.

        Regards,
        Waiman
        • gcorneau
          6 Posts

          Re: LoadLeveler 3.4 and SMT Service Flash

          2007-04-03T22:39:41Z  in response to SystemAdmin
          FYI, this flash is posted here:
          http://www14.software.ibm.com/webapp/set2/sas/f/hps/related/hps_flash_text.html#ll070227

          Have a happy!

          Glen Corneau
          IBM System p Advanced Technical Support
          • michael-t
            28 Posts

            Re: LoadLeveler 3.4 and SMT Service Flash

            2007-04-16T21:47:52Z  in response to gcorneau
            Glen,

            does LL 3.3.2 have any "strange" interactions with SMT ON? Is it recommended to switch to LL 3.4.2 in order to take proper advantage of SMT?

            thanks!
            Mike

  • michael-t
    28 Posts

    Re: LoadLeveler 3.4 and SMT.

    2007-04-16T21:45:45Z  in response to astdenis
    Alain,

    besides the LL 3.4.0.1 issues, what is the observed performance with SMT ON for nodes with <= 16 runnable threads? We also have a POWER5+ 575 cluster, and users running POE/OpenMP code on the SMT-enabled nodes complain that the same code takes much longer to finish there (!)

    We have also observed that compute threads are unevenly dispatched to logical processors. Have you seen runnable threads being dispatched to the secondary h/w thread even though the primary h/w thread is idle?

    Any bits of your experience would be appreciated!

    thanks
    Michael Thomadakis
    SC/TAMU
    • SystemAdmin
      46 Posts

      Re: LoadLeveler 3.4 and SMT.

      2007-04-18T02:03:41Z  in response to michael-t
      Hi Michael,

      Are you using the consumable resource feature in LoadLeveler to control the number of CPUs allocated to a task? With SMT on, one has to specify TWO consumable CPUs for a task in order to get ONE physical processor for that task. Currently, there is no explicit way in the LoadLeveler job command file to ask for physical core processor(s) for a parallel task; any CPU specification in a job request is treated as a logical CPU request. One can still get a physical core allocated to a parallel task (if that is needed for best performance) through some configuration setup restrictions, until a new feature for SMT support is made available in LoadLeveler in the future.
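      To make that concrete, here is a job command file sketch for a 16-task job on a 16-core SMT-on node, requesting two logical CPUs per task via the existing consumable-resources keyword (treat the exact stanza as illustrative for your site):

```
#@ job_type    = parallel
#@ total_tasks = 16
#@ resources   = ConsumableCpus(2)
#@ queue
```

      With SMT on, the two logical CPUs per task map to one physical core, which is what keeps two tasks off the same core.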

      Please send me email at waimanc@us.ibm.com and we can set up a call, if you would like to discuss this further.

      Regards,
      Waiman
    • astdenis
      11 Posts

      Re: LoadLeveler 3.4 and SMT.

      2007-04-24T20:54:42Z  in response to michael-t
      Our users get consistent performance when bindprocessor is used. One thing that will cause erratic performance is two tasks ending up on the same core, but it depends on your codes and how they use the processor cache. I can try to obtain more details from our users if you are interested.

      I don't know if there are plans to enable processor binding through the LL JCF... Waiman?

      What do you mean by "nodes with <= 16 runnable threads"? Do you mean 16 threads are already used by a job and 16 more are available? Our users don't normally share nodes.

      I haven't observed the uneven dispatching of threads you mention, although I would tend to think this is the default behaviour. Somebody please correct me if I'm wrong.
      Alain
      MSC
      • SystemAdmin
        46 Posts

        Re: LoadLeveler 3.4 and SMT.

        2007-04-26T04:00:58Z  in response to astdenis
        Hi Alain,

        As discussed at a users meeting a couple of months ago, we do plan
        to put more enhancements into this area, to allow users to specify in the LL JCF that no two tasks may use the same physical core.

        Please send me email if you would like more information on this, or we can talk more about it when we next meet.
        Regards,
        Waiman
      • michael-t
        28 Posts

        Re: LoadLeveler 3.4 and SMT.

        2007-05-29T23:38:05Z  in response to astdenis
        I have run some experiments to investigate the performance difference of a
        compute-intensive MPI (POE) workload running on a node with SMT ON and
        one with SMT OFF. I noticed a significant performance difference when running the identical code on nodes with SMT ON vs. those with SMT OFF.

        Specifically, I ran several interactive experiments with a compute-intensive POE code with k processes (for k = 2, 3, ..., 31, 32, using the "shared-memory" POE option to avoid going over the actual HPS network). The code carries out dense matrix computation and scales all the way to k = 32. I compared the performance of this job on an SMT OFF node against an SMT ON node.

        As k increases, a scheduling anomaly appears at around k = 9 or k = 10, at which point the SMT ON node starts taking 20-40% longer than the SMT OFF node for the identical job. At k = 17, of course, the SMT OFF node's performance reverses course and becomes much less efficient relative to the SMT ON node. In all experiments for 1 <= k <= 9, both the SMT ON and SMT OFF nodes finish the computation at around the same time (+/- a few seconds). The SMT ON slowdown is a general trend for 9 < k <= 16, though there are occasional exceptions where the SMT ON job takes less time.

        We have received a workaround from IBM: set the schedo tunable tb_balance_s0=2 (the default is 0).

        This seems to have alleviated the scheduler anomaly (likely two threads being dispatched to the same core at times).
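        For anyone wanting to try the same thing, the workaround as given to us is a single AIX tunable, set with schedo (a sketch; the -p flag makes the setting persist across reboots):

```
# AIX scheduler tunable, per the workaround we received:
# bias the dispatcher toward idle primary (s0) h/w threads
schedo -p -o tb_balance_s0=2
```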

        To me the bigger issue is how to fine-tune the run-time dispatcher so as to extract the SMT ON benefits while mitigating its untoward effects. Unfortunately, I haven't been able to get information on how the other schedo parameters affect the scheduling heuristics with respect to SMT.

        • SystemAdmin
          46 Posts

          Re: LoadLeveler 3.4 and SMT.

          2007-06-04T04:46:54Z  in response to michael-t

          Hi Michael,

          Based on the information from your posting, on nodes with SMT on, you probably had two processes using the same physical core when you were running jobs with k processes where 9 <= k <= 16.

          With SMT on, there are 32 logical CPUs, even though you have only 16 physical processors on your Power5+ 575 node. If you are using consumable resource (CPU) and rset affinity features in LoadLeveler, the processes will be limited to run on the set of CPUs on the MCM assigned for the job.

          (If you can, please e-mail me the LoadLeveler config/admin file and the job command file used. I can verify if indeed that is the case.)

          If so, another 'workaround', besides setting schedo tb_balance_s0=2, is to specify in LoadLeveler that each process of the job uses 2 (logical) CPUs. This helps ensure that no two processes of the job try to use the same physical core.
          Regards,
          Waiman
          • michael-t
            28 Posts

            Re: LoadLeveler 3.4 and SMT.

            2007-12-21T17:34:14Z  in response to SystemAdmin
            Hello,

            sorry for not replying to you on this sooner. My center decided NOT to fool with SMT ON anymore (even though I think there is a place for SMT ON).

            My initial suggestion was that, since all the cluster support s/w (LAPI/POE, GPFS, RSCT, etc.) is so heavily multi-threaded, a system running HPC applications on all available processors (16 in our case) may benefit from SMT ON by minimizing the impact of the support threads on the computation threads. Unfortunately, the default AIX dispatcher heuristic of maintaining affinity at the core level among h/w SMT threads, instead of allocating runnable threads to free cores, created the impression among some folks here that SMT is not handled properly by AIX (and that IS the case with tb_balance_s0=0).

            Nevertheless, I still stand by my initial suggestion that POWER5 clusters can benefit from SMT ON. However, LL should STILL see only 16 processors per node, not 32.

            On another issue, I was wondering whether MPI/LAPI FIFO communication mode (non-RDMA) can leverage multiple SNI adapters. We have 2 SNIs per 575 node, but it appears that FIFO LAPI communication does not use both of them, regardless of
            MP_EUILIB=us, MP_EUIDEVICE=sn_all or
            network.MPI = sn_all,not_shared,US,HIGH

            Is sn_all supported in non-RDMA LAPI? I.e., round-robin use of the SNIs as FIFO packets are sent out.

            thanks

            Michael