Tuning Virtual Ethernet adapters for (even) better backup performance.
cggibbo
Are you backing up your AIX systems over Virtual Ethernet adapters? Of course you are; who isn’t, right? Are your backup server and clients on the same physical POWER system? Then you are most likely backing up over Virtual Ethernet to another AIX LPAR that is running your enterprise backup software, such as TSM or Legato Networker. And you probably have a dedicated private virtual network (and adapters) on both the clients and the server to handle the traffic for the nightly backups. The next question is, have you tuned your Virtual Ethernet adapters?
There are several tips available for tuning your Virtual Ethernet adapters for better performance on AIX. These tips include changing settings such as MTU size, TCP window sizes, enabling largesend, etc. I highly recommend the following blog posts from Anthony English and Nigel Griffiths on this subject:
OK, so you’ve got everything humming along nicely, your backups are flying over the virtual network (across the POWER hypervisor) and everybody is happy. After a period of time, you notice that the backups have started to “slow down”. They are taking longer to finish. The overall throughput of a backup drops. Some backups start in the evening around 9pm and are still running the next morning at 7am! In some cases you need to kill the backups or even reboot the backup server LPAR for things to return to normal.
“What is going on!?” you cry.
Well, there are a number of reasons why this could be happening. For example, your shared processor pool may be overwhelmed during the backup window. As we know, Virtual Ethernet adapters require CPU to do their work. If the shared processor pool is running low on available CPU resources, this could contribute to the problem. And of course there could be tuning issues with the Virtual Ethernet adapters or the AIX OS in general. Or there may be issues with other pieces of the infrastructure, like network and SAN switches, adapters, etc. Perhaps there’s an issue with the applications and/or databases on the AIX systems? They often have their own mechanisms/tools for backing up their data to your enterprise backup software. Is the backup server sized to cope with the load, i.e. CPU, memory, disk layout and I/O, sufficient tape drives, disk storage pools, etc.?
So assuming you’ve checked all of the above (and more), then perhaps you’ve hit a problem that I encountered recently. In my particular case, backups “over the hypervisor” were slowing down, without any discernible cause. Initially the backups would be “very fast” but after a month or so, things would start to slow down dramatically.
We noticed that there were very large (and increasing) values for “Packets Dropped”, “Hypervisor Send/Receive Failures” and “No Resource Errors” in the output from the netstat -v command:
STATISTICS (ent1) :
Virtual I/O Ethernet Adapter (l-l
Time: 42 days 4 hours 3 minutes 34 seco
5978589961 Packets: 2613
0 Interrupts: 6804
Errors: 0 Receive Errors: 0
Dropped: 0 Packets Dropped: 8601
Max Collision Errors: 0 No Resource Errors: 46113807
Send Failures: 0
Receiver Failures: 0
Send Errors: 0
Hypervisor Receive Failures: 4611
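To watch a counter like this over time, it can help to pull a single value out of the adapter statistics. The following is only a minimal sketch: the function name ent_counter is made up for illustration, and it assumes the statistics are laid out as "Label: value" pairs, as in the output above.

```shell
# ent_counter LABEL
# Reads adapter statistics (e.g. from "entstat -d ent1" or "netstat -v")
# on stdin and prints the value that follows the first "LABEL:" found.
# Hypothetical helper; assumes the "Label: value" layout shown above.
ent_counter() {
  awk -v label="$1:" '
    {
      pos = index($0, label)            # position of "LABEL:" on this line
      if (pos > 0) {
        rest = substr($0, pos + length(label))
        split(rest, f, " ")             # first whitespace-separated token
        print f[1]
        exit
      }
    }
  '
}

# Example use on AIX (not run here):
#   entstat -d ent1 | ent_counter "No Resource Errors"
```

Running this from cron once a day and logging the result would show whether the counter is still climbing.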
After some discussion with IBM AIX support, we discovered that we should increase some of the buffer sizes for our Virtual Ethernet adapter (the entX device). This would alleviate the no resource issues we’d been experiencing. Looking at the output from the netstat -v command, we also noticed that the Medium, Large and Huge buffers had all reached their maximum values in the past:
Buffer Type          Tiny   Small   Medium   Large   Huge
Max Allocated         576     951      256      64     64
Lowest Registered     502     502       64      12     11
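The same check can be scripted. This is a rough sketch rather than an IBM-supplied tool: the function name buffers_at_max is hypothetical, and it assumes the statistics also contain a "Max Buffers" row (the configured ceiling for each buffer type) printed before the "Max Allocated" row, with the same five columns as the table above.

```shell
# buffers_at_max
# Reads "entstat -d entX" or "netstat -v" output on stdin and prints the
# name of every buffer type whose "Max Allocated" history value has
# reached the configured "Max Buffers" value.
# Hypothetical helper; the "Max Buffers" row is an assumption about the
# full output, since only part of it is shown in this post.
buffers_at_max() {
  awk '
    /Buffer Type/   { for (i = 3; i <= NF; i++) name[i-2] = $i }
    /Max Buffers/   { for (i = 3; i <= NF; i++) maxbuf[i-2] = $i }
    /Max Allocated/ {
      for (i = 3; i <= NF; i++)
        if (((i-2) in maxbuf) && ($i + 0 >= maxbuf[i-2] + 0))
          print name[i-2]
    }
  '
}

# Example use on AIX (not run here):
#   entstat -d ent1 | buffers_at_max
```

Any buffer type it prints has hit its ceiling at some point since the statistics were last reset, which makes it a candidate for a larger max_buf_* value.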
The advice from IBM support was to increase these buffers using the chdev command (they also advised that we should reboot for the changes to take effect):
# chdev -l ent1 -a min_buf_medium=512 -a max_buf_medium=1024 -a min_buf_large=96 -a max_buf_large=256 -a min_buf_huge=96 -a max_buf_huge=128 -P
# shutdown -Fr
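After the reboot it is worth confirming that the adapter really picked up the new values. The sketch below is one possible way to do that. The function name verify_attrs is made up for illustration, and it assumes that lsattr -El prints one attribute per line with the attribute name in the first column and its value in the second.

```shell
# verify_attrs attr=value [attr=value ...]
# Reads "lsattr -El entX" output on stdin and reports every attribute
# whose value differs from the expected one, or which is missing.
# Returns 0 if everything matches, 1 otherwise.
# Hypothetical helper; assumes "name value description..." columns.
verify_attrs() {
  awk -v want="$*" '
    BEGIN {
      bad = 0
      n = split(want, pairs, " ")
      for (i = 1; i <= n; i++) {
        split(pairs[i], kv, "=")
        expect[kv[1]] = kv[2]
      }
    }
    ($1 in expect) {
      seen[$1] = 1
      if ($2 != expect[$1]) {
        printf "%s: expected %s, found %s\n", $1, expect[$1], $2
        bad = 1
      }
    }
    END {
      for (k in expect)
        if (!(k in seen)) { printf "%s: not found\n", k; bad = 1 }
      exit bad
    }
  '
}

# Example use on AIX (not run here):
#   lsattr -El ent1 | verify_attrs max_buf_medium=1024 max_buf_large=256 max_buf_huge=128
```

No output and a zero exit status mean the attributes match what was requested with chdev.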
Since implementing this tuning change (to the adapter on the backup server), we have not had a repeat of the problem. We will continue to monitor the performance and I’ll be sure to let everyone know if we have further issues.