Local, Near, Far part 11 - Why Local+Far on Lower End machines?
nagger
I have been wondering why the lower end POWER7 machines have local and far memory rather than local and near. Perhaps you wondered too! Well, at the Miami Power Technical University, I got to talk to Dr Joel Tendler, an IBMer and POWER7 processor guru, and put the question to him. He covered this sort of architectural topic in his presentation at the event and I learnt a lot in this area by listening to the "master". Below is some background and the explanation too.
The POWER7 chip has two memory controllers for maximum performance, but both are only used on the top end machines: the Power 770/780 and Power 795. On the other POWER7 machines, a single memory controller is used, as one controller can provide all the memory bandwidth they need. Each controller runs at around 68 GB per second (yes, that is gigabytes). If you use both controllers you get double this in total. In theory, at that rate the entire contents of a 256 GB machine could be loaded into the lower end machine's processors in under 4 seconds. In practice, this is far more than a regular workload needs, but high memory bandwidth, or rather very low latency, does aid performance (less time for the CPU waiting for memory to catch up).
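For anyone who wants to check the "under 4 seconds" claim, here is a quick back-of-the-envelope sketch. It assumes decimal gigabytes and that a controller could sustain the quoted 68 GB per second for the whole transfer, which real workloads will not do. The function name is made up for illustration.

```python
def seconds_to_read(memory_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time to stream memory_gb once at a sustained bandwidth (idealised)."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return memory_gb / bandwidth_gb_per_s


PER_CONTROLLER_GB_S = 68.0  # per-controller figure quoted in the text
MEMORY_GB = 256.0           # machine size used in the text's example

one = seconds_to_read(MEMORY_GB, PER_CONTROLLER_GB_S)
two = seconds_to_read(MEMORY_GB, 2 * PER_CONTROLLER_GB_S)

print(f"one controller:  {one:.2f} s")   # one controller:  3.76 s
print(f"two controllers: {two:.2f} s")   # two controllers: 1.88 s
```

So a single controller at the quoted rate does come in just under 4 seconds, and doubling the controllers halves the time.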
As we have covered in this series, sometimes the memory is not local but connected to a different POWER7 chip. When a process (program) has to access non-local memory, it needs to first check the data is not cached somewhere and, if not, read it from the POWER7 chip that has the memory controller for the RAM. At the Technical University, I also attended a session from Jeff Stuecheli, who is one of the design team, on POWER7 Cache Coherence, i.e. the protocols for determining the best copy of a cache line, how to lock it for write access and how to ensure the update gets back to the memory. Well, I can't say I understood a lot of it (maybe 10%) as it was very advanced deep stuff, but the impression I got was "it is way more complicated than I thought but we really have loads of very clever people working on the world leading technology. It explains how we get to 256-way machines with good performance". Perhaps I will re-attend his session at the Copenhagen Technical University next week and try to remember 20% of it this time :-)
On a practical side, each POWER7 has what is called a fabric bus controller (FBC) built into the chip which is used to communicate with other POWER7 chips for control (finding cache lines and access modes) and for transport (moving the cache line to the POWER7 needing it).
If you want to know more and want a diagram, take a look at the Power 795 Technical Overview and Introduction Redbook section 2.2.1.
- Power 795 Redbook direct link
Briefly, there are five fabric ports which can be configured for 8-byte, 4-byte or 2-byte width. The five fabric ports are grouped into two sets, referred to in this post as the XYZ buses and the AB buses.
In diagram form, a single POWER7 has five buses as below:
Lower End machines are connected using the AB bus as below (the Power 750 has four POWER7 chips but you can imagine the configuration easily enough):
The High End machines are connected as below (the Power 770/780 is simpler as it has only two POWER7 chips per CEC drawer/node):
I am also told the lower POWER7 model machines use the 4-byte bus width. This is largely because with only a few POWER7 chips (one or two in the POWER7 blades and the Power 710 to Power 740, and four POWER7 chips in the Power 750) the higher bandwidth is not required and there are far fewer remote processors. It also means fewer tracks across the backplane or, for the Power 750, across the CPU connectors from the CPU cards. I guess this also controls costs and complexity (better reliability). On the top end machines, where the chance of remote processors grows exponentially, they have the full 8-byte buses and the two tiers (XYZ plus AB) to maximise the bandwidth. This explains the excellent scaling we have on the top end machines, in addition to double the number of memory controllers per POWER7 chip (64 memory controllers in the Power 795 = 4.3 TB per second is fairly mind boggling).
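The 4.3 TB per second figure is simply the controller count multiplied by the per-controller rate of around 68 GB per second quoted earlier in this post. A minimal sketch of that sum, assuming decimal units and that every controller runs at the quoted peak at the same time (the helper name is hypothetical):

```python
PER_CONTROLLER_GB_S = 68      # per-controller figure quoted earlier in this post
CONTROLLERS_POWER_795 = 64    # controller count quoted in the text


def aggregate_bandwidth_gb_s(controllers: int, per_controller_gb_s: float) -> float:
    """Idealised total: every controller running at its peak simultaneously."""
    return controllers * per_controller_gb_s


total = aggregate_bandwidth_gb_s(CONTROLLERS_POWER_795, PER_CONTROLLER_GB_S)
print(f"{total} GB/s = {total / 1000:.2f} TB/s")  # 4352 GB/s = 4.35 TB/s
```

That works out to 4352 GB/s, which matches the "4.3 TB per second" above once truncated to one decimal place.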
This is also reflected in the way the POWER7 chips are mounted (if that is the right term) onto the carrier to be inserted into the machine. My UK colleague Christopher Hales is giving a presentation on some of the science and technology involved at the Copenhagen conference next week. The Power 775 uses a multi-chip module but the other machines in the range from the Power Blade to the Power 795 use the following two technologies:
The speeds of the AB and XYZ buses vary with the model and the POWER7 GHz rating. I have seen no clear documentation covering this in detail for each model as it is largely an internal feature (we don't really need to know and there is nothing we could do about it if we did know). If you spend a few hours researching on the web, you find numbers for the XYZ bus between 40 and 50 GB/s and for the AB bus from 23 to 26 GB/s. All massively more than you need, but this also minimises the latency and helps explain the scalability.
I think this is the last blog in this series - but who knows what we will learn at the next Power Technical University.