We have a 40-node p5+ 575 Cluster 1600 system connected by two planes of HPS switches. Each node has 16 p5+ processors, 32GB of RAM and two SNIs. We run ppe at 4.3.1.X and rsct.lapi at 2.4.4.X levels on AIX 5.3 TL06 SP1.
Reading the PE, LL and rsct.LAPI manuals, we see that when multiple SNIs are present (we have 2 per node), setting
MP_EUIDEVICE=sn_all for interactive POE or
network.MPI=sn_all for batch LL jobs
enables striping across all SNI adapters.
1) Striping with FIFO mode LAPI over both SNIs: The manuals clearly state that RDMA mode with SN_ALL brings tangible bandwidth benefits (i.e., striping of bulk messages), but the discussion of the FIFO LAPI mode is not definitive. They mention that LAPI FIFO gains no bandwidth from striping the (smaller) FIFO POE messages, but they do not clearly state whether striping is still done (even with NO bandwidth benefit) or not done at all.
2) Load-sharing of both SNI adapters in LAPI FIFO mode: It is clear that striping each single small POE message across two SNIs would be overkill, but what about when the application sends out many smaller messages? Selecting the output SNI adapter in round-robin fashion would let multiple messages go out with at least load-sharing across the different HPS networks. Does FIFO mode use *all* SNIs when SN_ALL is requested, or not? That is, would FIFO select just ONE SNI to send out messages even though one has requested SN_ALL?
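To make the load-sharing idea in question 2 concrete, here is a minimal Python sketch of per-message round-robin adapter selection. It only illustrates the scheme being asked about and is not a description of what LAPI actually does in FIFO mode. The adapter labels "sn0" and "sn1" and both function names are hypothetical.

```python
from collections import Counter
from itertools import cycle


def assign_round_robin(message_sizes, adapters):
    """Assign each whole message to the next adapter in turn.

    Messages are never split (no striping); only the choice of
    outgoing adapter rotates. This models the idea in question 2,
    not LAPI's real behavior.
    """
    next_adapter = cycle(adapters)
    return [(size, next(next_adapter)) for size in message_sizes]


def bytes_per_adapter(assignments):
    """Sum the bytes that each adapter would carry."""
    totals = Counter()
    for size, adapter in assignments:
        totals[adapter] += size
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical burst of small messages (sizes in bytes).
    sizes = [512, 1024, 256, 2048, 128, 4096]
    plan = assign_round_robin(sizes, ["sn0", "sn1"])
    for size, adapter in plan:
        print(f"{size:5d} bytes -> {adapter}")
    print(bytes_per_adapter(plan))
```

Note that plain round-robin balances the message count per adapter but not necessarily the byte count, so a real scheme might weigh message sizes or queue depth instead.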
3) Multiple SNIs and RDMA LAPI mode: When "bulk transfer" (RDMA) is requested with
MP_USE_BULK_XFER=yes (interactive) or
#@bulkxfer=yes (batch LL)
the POE code instructs LAPI to use RDMA + striping for POE messages with size >= MP_BULK_MIN_MSG_SIZE.
For messages with size < MP_BULK_MIN_MSG_SIZE, LAPI uses FIFO mode. How are these FIFO mode messages transferred over the multiple SNIs? That is, are they handled in the same way as when LAPI has NOT been asked to use RDMA (see questions 1 and 2 above)?
The choice between SN_SINGLE and SN_ALL is critical, since the latter allocates at least one US adapter window on EACH SNI; if FIFO cannot leverage multiple SNI adapters (striping and/or RR load sharing), that is very wasteful. Allocating window resources on all SNIs when just a single adapter is going to be used makes no sense when preemption is possible and either one of the (preempting or preempted) jobs is allocated all available windows on all adapters.
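The size threshold described in question 3 can be sketched as follows. This is a minimal Python model of the rule as stated in the text (RDMA at or above MP_BULK_MIN_MSG_SIZE when bulk transfer is on, FIFO otherwise). The function names and the threshold values used in the example are illustrative assumptions, not documented defaults.

```python
def choose_transport(msg_size, bulk_xfer_enabled, bulk_min_msg_size):
    """Return "RDMA" or "FIFO" for one message.

    Mirrors the rule in the text: with bulk transfer requested,
    messages of size >= bulk_min_msg_size go by RDMA; everything
    else goes by FIFO.
    """
    if bulk_xfer_enabled and msg_size >= bulk_min_msg_size:
        return "RDMA"
    return "FIFO"


def fifo_share(message_sizes, bulk_xfer_enabled, bulk_min_msg_size):
    """Fraction of messages (by count) that would still use FIFO."""
    if not message_sizes:
        return 0.0
    fifo = sum(
        1
        for size in message_sizes
        if choose_transport(size, bulk_xfer_enabled, bulk_min_msg_size) == "FIFO"
    )
    return fifo / len(message_sizes)


if __name__ == "__main__":
    # Hypothetical message-size profile of an application, in bytes.
    profile = [1024, 8192, 65536, 262144]
    # Illustrative thresholds only; not the product defaults.
    for threshold in (131072, 32768):
        share = fifo_share(profile, True, threshold)
        print(f"threshold {threshold}: {share:.0%} of messages use FIFO")
```

Running such a model against a measured message-size profile shows how much traffic stays in FIFO mode for a given threshold, which is the portion affected by the answers to questions 1 and 2.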
4) RDMA with "small" messages: Is it worthwhile to lower MP_BULK_MIN_MSG_SIZE if a POE application sends and receives a lot of asynchronous messages? The idea is that RDMA would offload LAPI protocol processing and data copying from the processors, leaving more processor time for computation in the application.
5) If applications are allocated all processors in a node for computation, is there any benefit in having SMT_ON for those times? Given the heavy multi-threading of all the clustering support s/w (LAPI, rsct, GPFS, etc.), with SMT_ON on cluster nodes whose processors are all allocated to the computation workload, the occasional running of an rsct or GPFS thread may have LESS impact on the running compute thread. Otherwise, a compute-intensive thread would be preempted, the system s/w thread switched in, and later the application thread switched back in. I feel that context switching a compute thread is far more expensive than interference from the 2nd h/w thread.
Thanks in advance for any insightful answer ...
LAPI over HPS with multiple SNIs: Striping with FIFO or RDMA Modes
Updated on 2008-02-08T20:28:56Z by michael-t
michael-t 120000PNCE28 Posts
Re: LAPI over HPS with multiple SNIs: Striping with FIFO or RDMA Modes (2008-02-08T20:28:56Z). This is the accepted answer.
- HPC_Central 0600028D5U
We had some very interesting rounds of useful tech exchanges with several people in the HPS development team under PMR 86969,004,000.
I am really satisfied with the technical information I received (which is not readily available in documentation or papers).
There are still a few bits and pieces which are unanswered, but I didn't want to abuse these people's time.
I am still not clear on the congestion-control scheme of the HPS fabric, but I think I can infer the following from my reading and the replies:
- basically LAPI on each endpoint packetizes messages into 2KB HPS link layer packets and sends them off for routing by the HPS;
- it follows a window-based congestion control scheme where it initially sends out 1, 2, 4, 8 packets and waits for an ACK; I think that it subsequently sends out 8 packets at a time; if it times out before the ACK it retransmits; I do not know if MP_ACK_THRESH is the delayed-ACK threshold in LAPI at packet level on the receiving end; I am not sure (but I guess yes) whether MP_POLLING_INTERVAL and MP_RETRANSMIT_INTERVAL are used by the receiving LAPI to pace the polling for packet arrivals; again, there is no explanation of a global congestion control scheme, which I assume does not exist; that raises the issue of how I can monitor the switch traffic for congestion and for packets stalled in the SC central buffer memory;
- there is link-by-link flow control at the flit level ("backpressure" mechanism) from OutPort to downstream InPort, and there is enough buffer space at each InPort to keep a full pipeline of 2KB packets given the current BW x delay of each link;
- I was told the SwitchChip 4 (current version) has 32-byte flits (Switch3 had 4-byte flits but 32-byte "chunks", so I am not sure if there is a confusion between the two, but this is not so important);
- it is not clear to me, for FIFO and sn_all mode and a sender with multiple destinations (all accessible by all switch planes), how it is determined which plane to use for the current packet transmission; knowing that could help us come up with guidelines on how to request HPS resources by application profile type;
- another question is how a POE application can reuse currently allocated resources (mem buffers or SNI resources) so that the DD won't have to keep reallocating and re-attaching memory to the user address space with each new message transmission; a related question is how cache-resident sets can be leveraged so that, say, MPI communication finds the data in the cache more often than not and avoids crossing the mem bus;
- other issues deal with "affinity": how is it best to allocate POE tasks with respect to memory ownership and physical SNI GX bus attachment?
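As a rough illustration of the packetization, window ramp-up and BW x delay points in the list above, here is a small Python model. It encodes only my inferences as stated (2KB packets; bursts of 1, 2, 4, 8 and then 8 packets per ACK), so it is a sketch of my understanding, not confirmed LAPI or HPS behavior. It assumes every packet carries a full 2048 bytes of payload (headers ignored), and the link bandwidth and delay used in the example are made-up numbers.

```python
PACKET_BYTES = 2048  # 2KB link-layer packet, payload/header split ignored


def packet_count(msg_bytes, packet_bytes=PACKET_BYTES):
    """Number of packets needed to carry msg_bytes (rounded up)."""
    return (msg_bytes + packet_bytes - 1) // packet_bytes


def burst_schedule(total_packets, max_burst=8):
    """Packets sent in each ACK-delimited burst.

    The burst size starts at 1 and doubles up to max_burst,
    modeling the inferred 1, 2, 4, 8, 8, ... ramp-up.
    """
    bursts = []
    burst = 1
    remaining = total_packets
    while remaining > 0:
        sent = min(burst, remaining)
        bursts.append(sent)
        remaining -= sent
        burst = min(burst * 2, max_burst)
    return bursts


def pipeline_bytes(link_bytes_per_sec, delay_sec):
    """Bytes in flight on one link: bandwidth times delay."""
    return link_bytes_per_sec * delay_sec


if __name__ == "__main__":
    msg = 64 * 1024  # a 64KB message
    packets = packet_count(msg)
    bursts = burst_schedule(packets)
    print(f"{msg} bytes -> {packets} packets in {len(bursts)} bursts: {bursts}")
    # Made-up link figures, for illustration only.
    in_flight = pipeline_bytes(2_000_000_000, 0.000005)
    print(f"about {in_flight / PACKET_BYTES:.1f} packets of buffer per InPort")
```

Under this model, each burst costs at least one round trip, so the number of bursts gives a lower bound on the round trips needed to deliver a message of a given size.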
I clearly understand that all my questions (outside the HPS internals) are application dependent and warrant further research. I am currently working on these questions and I hope to have results that can be presented to the scientific / developer / user community.