I thought I should try this forum before opening a sev 4 PMR. Let's see.
We installed LL 3.4 yesterday. The new smt keyword, with its default set to "no", caught us by surprise: all our MPI jobs took a significant performance hit. We worked around this by setting smt = yes in the default class, but we don't understand the behaviour. We have a p5+ p575 cluster, and our users run jobs that do not use more than 16 cpus. smt = no should mean "do not run on the secondary thread", so on a 16-core p5+ p575 node a 16-thread job should run ok, right? Unless something causes it to bind to 8 cores and their secondary threads. Has anybody else experienced something similar?
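For anyone else hitting this, our workaround amounts to the following fragment in the admin file (a sketch; your stanza will have more keywords, and the class name is site-specific):

```
# LoadL_admin fragment: force smt on for jobs in the default class
default: type = class
         smt = yes
```

followed by "llctl -g reconfig" to push it out to all nodes.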
We activated smt = yes by modifying the default class and then running "llctl -g reconfig". The effect was immediate: the performance hit disappeared. Now, users who are trying to bind threads to processors are seeing inconsistent behaviour. The processor numbering scheme escapes us. For example, "bindprocessor -s 0" reports
The available processors are: 2 3 5 7 8 1 12 4 15 13 11 9 14 10 6 0
on most of our nodes. We rebooted a node and it reverted to:
The available processors are: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
What is causing this re-ordering? Our initial attempts at using bindprocessor assumed that even numbers were the primary threads and odd numbers the secondary threads. The numbering shown above makes it difficult to bind to the "right" processors programmatically.
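In the meantime, rather than assume the even/odd layout, we are deriving the core-to-logical-CPU mapping from saved smtctl output. A rough sketch (the "Bind processor N is bound with procM" lines are an assumption about the exact smtctl wording on our AIX level):

```shell
# Sketch: build the core -> logical CPU mapping from saved `smtctl` output,
# instead of assuming even IDs are primary threads and odd IDs secondary.
# Assumes smtctl lines of the form "Bind processor N is bound with procM".
core_map() {
  awk '/is bound with/ { cpu = $3; core = $NF; map[core] = map[core] " " cpu }
       END { for (c in map) print c ":" map[c] }' "$1" | sort
}
# Usage: smtctl > /tmp/smt.out; core_map /tmp/smt.out
```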
Today, a user ran a small openmp job both within LL and interactively. He ran it with and without processor binding. The initial results show that the interactive runs perform much better. What could cause this?
At SP-XXL two weeks ago, Waiman Chan asked for some feedback on the smt control in LL 3.4. Good. Could we get a detailed explanation of the current smt control implementation in LL? Our feedback would then be much more meaningful, I'm sure.
Pinned topic: LoadLeveler 3.4 and SMT
12 replies. Latest post 2007-12-21T17:34:14Z by michael-t.
Re: LoadLeveler 3.4 and SMT. 2007-02-22T17:55:28Z, in response to astdenis
Alain,
LoadLeveler should not be changing the SMT mode, which you have already set up for the system before starting LoadLeveler.
There was a defect in LoadLeveler causing this issue; the fix is available in PTF 2.
What PTF level of LoadLeveler 3.4 did you install on your nodes?
If the scenario still happens with PTF 2 or a later level, we will need to do some more investigation of this issue.
Changing the SMT mode could change the processor numbering scheme on a node.
For those who depend on the numbering scheme to bind tasks with bindprocessor,
switching SMT on/off needs to be avoided.
We are looking into putting out a flash to make others aware of the above.
Re: LoadLeveler 3.4 and SMT. 2007-02-22T18:05:01Z, in response to SystemAdmin
Hi,
We are running 3.4 PTF 1. Are you saying that 3.4 PTF 1 does switch smt on and off, and that this is fixed in PTF 2? We know smt is enabled on all our nodes. I've been monitoring the smt status, and it does switch back and forth between enabled and disabled on what seem to be random nodes.
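The monitoring I mentioned is nothing fancy; per node, cron runs something along these lines (the exact "SMT is currently enabled." wording is what our smtctl prints, so treat the pattern as an assumption):

```shell
# Sketch: scrape the SMT state from saved `smtctl` output and log it.
# Assumes smtctl prints "SMT is currently enabled." / "SMT is currently disabled."
smt_state() {
  awk '/SMT is currently/ { gsub(/\./, "", $NF); print $NF }' "$1"
}
# Usage (from cron): smtctl > /tmp/smtctl.out; echo "$(date) $(smt_state /tmp/smtctl.out)" >> /var/tmp/smt.log
```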
Re: LoadLeveler 3.4 and SMT. 2007-02-23T17:08:16Z, in response to astdenis
Hi Alain,
Prior to LL 3.4 PTF 2, the SMT mode was mistakenly set to 'no' for the job at job start, when nothing related to SMT was specified in the job or in the stanza. And whenever the SMT mode is changed on a node, the processor numbering scheme can also be affected. The problem was fixed in 3.4 PTF 2. When you restart the node with AIX and LL 3.4 PTF 2, the SMT mode will not be modified, even when nothing SMT-related is specified in the job/stanza, and you will not see the processor numbering issue any more.
A flash will be put out shortly to warn other users who are migrating to LL V3.4 and running with SMT on.
Sorry for any inconvenience.
gcorneau (accepted answer)
Re: LoadLeveler 3.4 and SMT Service Flash. 2007-04-03T22:39:41Z, in response to SystemAdmin
FYI, this flash is posted here:
Have a happy!
Glen Corneau
IBM System p Advanced Technical Support
Re: LoadLeveler 3.4 and SMT Service Flash. 2007-04-16T21:47:52Z, in response to gcorneau
Glen,
does LL 3.3.2 have any "strange" interactions with SMT ON? Is it recommended to switch to LL 3.4.2 in order to be able to take proper advantage of SMT?
> FYI, this flash is posted here:
> Have a happy!
> Glen Corneau
> IBM System p Advanced Technical Support
Re: LoadLeveler 3.4 and SMT. 2007-04-16T21:45:45Z, in response to astdenis
Alain,
besides the LL 3.4 smt issues discussed above, what is the observed performance with SMT ON for nodes with <= 16 runnable threads? We also have a Power5+ 575 cluster, and users running POE/OMP code on the SMT-enabled nodes complain that the same code takes much longer to finish there (!)
We have also observed that compute threads are unevenly dispatched to logical processors. Have you seen runnable threads being dispatched to the secondary h/w thread even though the primary h/w thread is idle?
Any bits of your experience would be appreciated!
Re: LoadLeveler 3.4 and SMT. 2007-04-18T02:03:41Z, in response to michael-t
Hi Michael,
Are you using the consumable resource feature in LoadLeveler to control the number of CPUs which are allocated for a task? With SMT on, one would have to specify TWO consumable CPUs for a task in order to get ONE physical processor for that task. Currently, there is no explicit way in the LoadLeveler Job Command File to ask for physical core processor(s) for a parallel task; any CPU specification in a job request is treated as a logical CPU request. One can still get a physical core allocated for a parallel task (if it is needed to get the best performance) through some configuration setup restriction, until a new feature for SMT support is made available in LoadLeveler in the future.
Please send me email at firstname.lastname@example.org and we can set up a call, if you would like to discuss this further.
Re: LoadLeveler 3.4 and SMT. 2007-04-24T20:54:42Z, in response to michael-t
Our users get consistent performance when bindprocessor is used. One thing that will cause erratic performance is when 2 tasks end up running on the same core. But it depends on your codes and how they use the processor cache. I can try to obtain more details from our users if you are interested.
I don't know if there are plans to enable processor binding through LL jcf... Waiman?
What do you mean by "nodes with <= 16 runnable threads"? Do you mean 16 threads are already used by a job and there are 16 available threads? Our users don't normally share nodes.
I haven't observed the uneven dispatching of threads you mention although I would tend to think this is the default behavior. Somebody please correct me if I'm wrong.
Re: LoadLeveler 3.4 and SMT. 2007-04-26T04:00:58Z, in response to astdenis
Hi Alain,
As discussed in a users meeting a couple of months ago, we do have plans
to put in more enhancements in this area, to allow users to specify in the LL JCF that no two tasks should use the same physical core.
Please send me email if you would like to get more information on this, or we can talk more about this when we meet next time.
Re: LoadLeveler 3.4 and SMT. 2007-05-29T23:38:05Z, in response to astdenis
I have run some experiments to investigate the performance difference of a
compute-intensive workload using MPI (POE) running on a node with SMT ON and
one with SMT OFF. I noticed a significant performance difference when running the same identical code on nodes with SMT ON vs those with SMT OFF.
Specifically, I ran several interactive experiments with a compute-intensive POE code with k processes (for k = 2, 3, ..., 15, 16, ..., 31, 32, using the "shared-memory" POE option to avoid going to the actual HPS network). This code carries out dense matrix computation and it exhibits scalability all the way to k=32. I compared the performance of this job running on an SMT OFF node against an SMT ON node.
As we increase k, a scheduling anomaly appears around k=9 or k=10, at which point the SMT ON node starts taking 20-40% longer than the SMT OFF node for the identical job. At k=17, of course, the trend reverses and the SMT OFF node behaves much less efficiently relative to the SMT ON node. In all experiments for 1 <= k <= 9, both SMT ON and SMT OFF nodes finish the computation at around the same time (+- a few seconds). The SMT ON slowdown for 9 < k <= 16 is a general trend, though there are occasional exceptions where the SMT ON job takes less time.
We received a workaround from IBM: set the schedo tunable tb_balance_s0=2 (the default is 0).
This seems to have alleviated the scheduler's anomaly (likely dispatching two threads to the same core at times).
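For reference, applying it looks roughly like this (the -p flag to persist across reboots is my assumption; check the schedo documentation on your AIX level):

```shell
# Set the dispatcher tunable suggested by IBM (default is 0);
# -p applies it now and persists it across reboots.
schedo -p -o tb_balance_s0=2
# Confirm the current value
schedo -o tb_balance_s0
```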
To me a bigger issue is how to fine-tune the run-time dispatcher to mostly extract the SMT ON benefits while mitigating its untoward effects. Unfortunately, I haven't been able to get info on how other schedo parameters affect scheduling heuristics vs SMT.
Re: LoadLeveler 3.4 and SMT. 2007-06-04T04:46:54Z, in response to michael-t
Based on the information from your posting, on nodes with SMT on, you probably had two processes using the same physical core when you were running jobs with k processes where 9 <= k <= 16.
With SMT on, there are 32 logical CPUs, even though you have only 16 physical processors on your Power5+ 575 node. If you are using consumable resource (CPU) and rset affinity features in LoadLeveler, the processes will be limited to run on the set of CPUs on the MCM assigned for the job.
(If you can, please e-mail me the LoadLeveler config/admin file and the job command file used. I can verify if indeed that is the case.)
If so, another 'workaround', besides setting schedo tb_balance_s0=2, is to specify the number of (logical) CPUs used by each process of the job to 2 in LoadLeveler. This will help to ensure that no more than one process of the job is trying to use the same physical core.
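In job command file terms, that suggestion would look something like this sketch (the node and task counts are only illustrative; the key line is ConsumableCpus(2), i.e. two logical CPUs per task, so 16 tasks cover all 32 logical CPUs of an SMT-on 575 node with one core each):

```
# @ job_type       = parallel
# @ node           = 1
# @ tasks_per_node = 16
# @ resources      = ConsumableCpus(2)
# @ queue
```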
Re: LoadLeveler 3.4 and SMT. 2007-12-21T17:34:14Z, in response to SystemAdmin
Hello,
sorry for not replying back to you on this. My center decided NOT to fool with SMT ON anymore (even though I think there is a place for SMT ON).
My initial suggestion was that, since all the cluster support s/w (LAPI/POE, GPFS, RSCT, etc.) is so heavily multi-threaded, running with SMT ON on a system that runs HPC applications on all available processors (16 in our case) might help by minimizing the impact of the support threads on the computation threads. Unfortunately, the silly default AIX dispatcher heuristic of maintaining affinity at the core level among h/w SMT threads, rather than allocating runnable threads to free cores, created the impression for some folks here that SMT is not handled properly by AIX (and that IS the case with tb_balance_s0=0).
Nevertheless, I still stand by my initial suggestion that power5 clusters can benefit from SMT ON. However, LL should STILL see only 16 processors per node, not 32.
On another issue: I was wondering whether MPI/LAPI FIFO communication mode (non-RDMA) can leverage multiple SNI adapters. We have 2 SNIs per 575 node, but it appears that FIFO LAPI communication does not use both of them, regardless of
MP_EUILIB=us, MP_EUIDEVICE=sn_all or
network.MPI = sn_all,not_shared,US,HIGH
Is sn_all supported in non-RDMA LAPI? I.e., round-robin use of the SNIs as FIFO packets are sent out.