When I get to 1200 or so virtual machines in an LPAR with 8GB of RAM,
there is a high degree of memory overcommitment going on. This means
that any change to the steady-state of the environment causes a spike of
virtual memory activity in z/VM, as pages get thrown around to
accommodate the workload change. This is exactly the situation I
encountered when I wanted to view the Ganglia details of a server cage
-- since the gmetad host was sitting in the same LPAR as all the guests,
the change in state from drilling into the Ganglia web interface on the
cage gmetad host triggered a heap of z/VM paging activity. Worse
was the delay in getting the page to come up; with a large number of
drones I'd be waiting up to a minute for z/VM to shuffle enough space
for Linux to get Apache into storage so it could respond.
I had already moved some of the administrative workload into another LPAR, but figured I had to leave the cage gmetads in the same system since they had to be on the same network as the drones. Then I realised that with a shared OSA and creative VSWITCHing I could put even the cage gmetad hosts into the other LPAR. Over the course of a couple of days (and one trashed Linux system thanks to me being a little careless with disk sharing) I had the cage gmetad hosts moved to the other LPAR and I was ready to test.
Things started poorly. The drones did not get DHCP leases to start with (since the drones are effectively diskless, I use DHCP to get IP addresses into them). The cage gmetad hosts run dhcp-helper, a brilliant DHCP relay agent by Simon Kelley (of dnsmasq fame), and also act as routers for the cage drones. Once I realised that the environment's DHCP server wasn't running, I started it and tried again. Still no luck, but then I realised that the DHCP server was handing out the wrong address as the default gateway. Fixed that, but still no luck: bizarrely, it seemed as if the drones could not contact the gmetad server -- which called into question my approach of using the shared OSA.
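The gateway fix itself was nothing exotic; in ISC dhcpd terms it's just the routers option in each cage's subnet declaration, something like this (all addresses here are invented):

subnet 10.1.101.0 netmask 255.255.255.0 {
    range 10.1.101.16 10.1.101.250;
    option routers 10.1.101.1;   # the cage gmetad host, which routes for the cage
}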
At this point, a bit of a review of the network configuration might help. Each of the cages has its own network, as much for manageability as for scalability. In a startling bit of forethought (for me), even though these cage networks had no external connection I used a VSWITCH for each rather than a Guest LAN. I had been using a Hipersockets CHPID to get from the LPAR where the drones live to the LPAR where the admin systems live, but after the move every system attached to the Hipersockets would be in the same LPAR, so I could use something else. Moving the cage gmetad hosts between LPARs was then a matter of assigning each cage VSWITCH a VLAN ID and attaching it to an OSA, creating a new VSWITCH in the other LPAR attached to the same OSA, and allowing each cage gmetad host to attach to the VLAN belonging to its cage VSWITCH. Using the so-called "OSA fastpath" (the feature of the OSA Express adapters that sends traffic destined for a system sharing the OSA directly to that system, without it going over the wire to the LAN switch) would mean that the drones and their gmetad host would still have very efficient communication within the system. That was the plan...
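(If you're curious what that looks like in CP terms, it's roughly the following -- the VSWITCH names, device numbers and VLAN IDs are all invented:

In the drone LPAR:
DEFINE VSWITCH CAGE1 RDEV 0C00 ETHERNET VLAN 101
SET VSWITCH CAGE1 GRANT DRONE001 VLAN 101

In the admin LPAR:
DEFINE VSWITCH CAGEADM RDEV 0C04 ETHERNET VLAN 101
SET VSWITCH CAGEADM GRANT GMETAD1 VLAN 101

Both RDEVs are subchannels on the same shared OSA CHPID, so the two VSWITCHes end up as ports on the same physical adapter.)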
...but it wasn't working. For some reason the systems in the first LPAR couldn't contact the gmetad host -- yet they had successfully obtained a DHCP lease, which used the very same path! The fact that DHCP had worked only confused me more. I started doing ARP pings and checking ARP caches (I'm running the VSWITCHes and OSAs in Layer 2 mode to make DHCP a little easier), all the while starting to wonder whether OSA sharing worked differently in Layer 2 mode than in Layer 3. I was watching the output of a tcpdump capture against the cage gmetad host's virtual OSA run by when the answer came to me:
15:34:11.104769 arp who-has zgn2c101.zgen2.stg.ibm tell zgn2c11a.zgen2.stg.ibm
15:34:11.104805 arp reply zgn2c101.zgen2.stg.ibm is-at 02:00:00:00:00:0a (oui Unknown)
15:34:13.019845 arp who-has zgn2c101.zgen2.stg.ibm tell zgn2c11c.zgen2.stg.ibm
15:34:13.019856 arp reply zgn2c101.zgen2.stg.ibm is-at 02:00:00:00:00:0a (oui Unknown)
15:34:15.015587 arp who-has zgn2c101.zgen2.stg.ibm tell zgn2c11d.zgen2.stg.ibm
15:34:15.015606 arp reply zgn2c101.zgen2.stg.ibm is-at 02:00:00:00:00:0a (oui Unknown)
15:34:16.008693 arp who-has zgn2c101.zgen2.stg.ibm tell zgn2c113.zgen2.stg.ibm
15:34:16.008709 arp reply zgn2c101.zgen2.stg.ibm is-at 02:00:00:00:00:0a (oui Unknown)
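That trace came from nothing fancier than this, run on the gmetad host (eth1 is a stand-in for whatever the qeth interface is really called):

tcpdump -i eth1 arp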
I had a MAC addressing overlap between my two z/VM systems. The virtual networking functionality of z/VM allocates MAC addresses to virtual NICs when they are created: by default z/VM generates the MAC address for the NIC, or you can assign one yourself. The trick here is that even if you assign the MAC, you only get control over the last three of its six bytes -- the first three bytes are set by z/VM, and the default is "02:00:00". So when I moved some of the systems into the other LPAR, both my z/VM systems were allocating MAC addresses in the same range, and sure enough my first cage gmetad host got the same MAC address as a system already running in the network. It wasn't a problem before, since the guests running in Layer 2 mode were all in the same LPAR; moving some of them into a different LPAR created the overlap.
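To illustrate: in the user directory, the NICDEF statement's MACID operand pins those last three bytes, but the prefix still comes from the system. A sketch, with the device number, VSWITCH name and MACID all invented:

NICDEF 0600 TYPE QDIO LAN SYSTEM CAGE1 MACID 00000A
* with the default prefix this guest's NIC comes up as 02:00:00:00:00:0A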
To fix it, I edited SYSTEM CONFIG on the drone z/VM LPAR and set the MAC prefix:
VMLAN MACPREFIX 020100
Unfortunately the prefix can't be changed on the fly, so I had to re-IPL. I then restarted a group of drones... and they still didn't work! WTF! I checked that the MAC prefix had changed, verified the guests were registering the right MAC to the OSA... I was really running out of ideas now. I decided I had to use OSA/SF to see what the OSA MAC table actually looked like, thinking that MAC addresses weren't getting registered. The output of IOACMD option 6, "Get OSA Address Table", showed me that the MAC for the cage gmetad host wasn't in the OSA -- the guest had failed to register its MAC when it came up, because another system already had that MAC. Stopping and starting that interface fixed the MAC registration, and then everything started working as expected.
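For the record, the checking and the final bounce were along these lines (from memory, with eth1 again standing in for the real interface name):

CP QUERY VMLAN                     on the z/VM system: shows the MAC prefix in effect
vmcp query virtual nic details     from inside the guest: the NIC and the MAC it registered
/etc/init.d/net.eth1 restart       the Gentoo way to stop and start the interface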
I used to regard setting the MAC prefix as optional -- although I have used it in the past for special configurations, I didn't think it was really necessary. It certainly isn't necessary when running OSAs and VSWITCHes in Layer 3 mode, where everything hides behind the UAA of the physical OSA port. This situation has shown me that VMLAN MACPREFIX should be set in SYSTEM CONFIG any time you are using Layer 2 in z/VM -- you just never know where those MAC addresses are going to show up!
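Concretely, that means giving every z/VM system whose guests might share a LAN segment its own prefix, e.g. (values made up):

VMLAN MACPREFIX 020100     in the first system's SYSTEM CONFIG
VMLAN MACPREFIX 020200     in the second system's SYSTEM CONFIG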
A footnote on the relay agent: I originally used ISC dhcrelay, since it came with the rest of the ISC DHCP stuff in the "net-misc/dhcp" ebuild. However, I started getting weird DHCP loops that seemed to be caused by dhcrelay forwarding the responses from the DHCP server back to the server again. Sure enough, it's a known problem: dhcrelay has to listen for DHCP requests on the interface used to communicate with the server, so it gets confused by the messages coming back to it from the server and relays them again. Simon Kelley's dhcp-helper doesn't suffer the same issue, so it's got the nod.
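For completeness, dhcp-helper needs next to no configuration. From memory (do check the man page; the options here are as I recall them), the invocation is something like:

dhcp-helper -s 10.1.1.2 -e eth1

where -s names the DHCP server to relay to and -e stops it listening on the upstream interface -- which is exactly what keeps it out of the loop that bites dhcrelay.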