IBM Support

Performance study for Load Balancer

White Papers


Abstract

Instructions for handling heavy traffic with the IBM WebSphere Application Server Edge Components Load Balancer and the NAT forwarding method.

Content

The participants in the study were:
  • Lori Adington
  • Uttam Deedwani
  • Robert Hadrick
  • Steve Pelton
  • Betsy Riggins
  • Shelton Skinner

  1. Purpose
  2. Basics
    1. Stress Tools
    2. Load Balancer
  3. What is the maximum throughput I can expect?
    1. Stress tool tuning
    2. General server tuning
    3. Load Balancer tuning
    4. High availability considerations
  4. Addendum
  5. Reference




Purpose
The purpose of this document is to help you achieve the throughput that you require in a networked environment with IBM Load Balancer. This document provides background information, configuration examples, and the measured throughput from lab tests. This document does not constitute a formal performance study, but the information is valuable for anyone who uses Load Balancer.

The tests for this study were performed with Load Balancer Version 6.0.2.54 that was running on AIX® Version 5.3. The information found within this document remains valid through the current versions of both the Load Balancer for IPv4 and the Load Balancer for IPv4 and IPv6 and for current operating systems. Default OS values and OS command syntax differ with different releases of the operating system.

Basics
This study involved HTTP traffic, which uses Transmission Control Protocol (TCP).

For a simple HTTP request, 9 packets are involved in the request cycle:
  1. The client sends a SYN to start the connection request.
  2. The server receives the SYN, and sends a SYN|ACK to the client.
  3. The client acknowledges the SYN|ACK transmission.
  4. The client sends a request.
  5. The server sends a response.
  6. The connection is closed (4 packets).

A normal close cycle requires 4 packets and can be initiated by either the client or the server:
  1. "A" sends a FIN to "B."
  2. "B" acknowledges the FIN from "A."
  3. "B" sends a FIN to "A."
  4. "A" acknowledges the FIN from "B."


Stress Tools
Stress tools are diagnostic tools that generate a high number of connection requests. For example, Microsoft® offers a free stress tool called the Microsoft® Web Application Stress Tool, and the customer for this case was using HP® LoadRunner software. For this test, we used an IBM® internal tool to generate the connection requests.

Many stress tools do not properly close a TCP connection; instead, the tools send a reset command (RST) that closes the connection with 1 packet instead of 4. The tools send the reset command to increase their own capacity. This change makes little difference to the performance of Load Balancer, as the sketch that follows illustrates.
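For illustration only, the following Python sketch tallies the packet counts described above; the constants and the helper name are illustrative and are not part of the original test tooling:

    # Packets in a simple HTTP request cycle (see the list above):
    # 3 for the TCP handshake, 1 request, 1 response, plus the close.
    # Illustrative sketch; not from the original study.
    HANDSHAKE = 3
    REQUEST = 1
    RESPONSE = 1

    def packets_per_cycle(close_packets):
        return HANDSHAKE + REQUEST + RESPONSE + close_packets

    print(packets_per_cycle(4))  # normal 4-packet close: 9 packets
    print(packets_per_cycle(1))  # RST close used by many stress tools: 6 packets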


Load Balancer
Load Balancer is a product that distributes TCP/IP requests to multiple servers while presenting a single IP address for external clients to target. Load Balancer allows sites to scale from 1 to "n" servers transparently, without visible changes to external clients.

As an administrator of Load Balancer, you can choose among the following forwarding methods; this choice is the single largest factor that affects performance:
  • Media Access Control (MAC) forwarding.
  • Network Address Translation (NAT) forwarding.
  • Content-Based routing, which is referred to as kCBR to distinguish this method from the Content-Based Routing component of Load Balancer. (This forwarding method is not applicable to the Load Balancer for IPv4 and IPv6.)

MAC forwarding
The MAC forwarding method performs best in most cases and is the preferred method whenever possible. With MAC forwarding, the back-end servers reside on the same subnet as the Load Balancer machine. Load Balancer directs packets to the server by changing the MAC destination address in the packet.

By design, MAC traffic cannot cross a network router, so the back-end servers must be on the same subnet. Load Balancer does not rewrite the MAC source address of the packets, so the traffic from the server to the client goes directly to the client and is not routed through Load Balancer. Locating the servers on the same physical subnet and returning server-to-client packets directly, without passing through Load Balancer, significantly reduces network latency. Network latency is the time for a packet to traverse from sender to receiver. As a result, MAC forwarding is the best-performing forwarding method.

NAT forwarding
Typically, the Network Address Translation (NAT) method is the forwarding method that performs best after the MAC forwarding method. With the NAT method, the servers do not have to be on the same subnet, but you can locate them on the same subnet without modifying their configuration. This forwarding method also allows multiple instances of the back-end server software on the same server; you can bind each instance to a unique IP address to achieve higher utilization of the back-end servers.

With NAT forwarding, the client sends a request to the cluster address for Load Balancer. Load Balancer then opens a new connection between the return address for the Load Balancer machine and the back-end server. Load Balancer keeps a record of the association between the client request and the back-end server connection. It serves as a proxy for the data transfer between the connections. Therefore, the NAT forwarding method uses two connections for each client request. Connection counts that are shown on reports for a NAT forwarding port reflect twice the number of client connections that are present. The extra connections decrease the performance of Load Balancer when it is compared to the MAC forwarding method.

The NAT forwarding method has greater request latency than the MAC forwarding method because the return traffic from the server to the client must pass through Load Balancer. Traffic that passes through Load Balancer experiences latency that is equal to the latency of crossing a router. If the back-end servers are not on the same subnet, the overall latency increases further. For each router that the traffic must traverse, estimate an increase in latency that is equal to the specifications of that router.

The NAT forwarding method is the method that we studied in this case study. The tests involved a single instance of the server software running on each back-end server.

Content-based routing
The Content-Based Routing method, which is commonly referred to as kCBR to avoid confusion with the Content-Based Routing component of Load Balancer, works like the NAT forwarding method, but it also lets you define rules that are based on the content of the HTTP request. These rules determine how Load Balancer responds, which means that Load Balancer examines the contents of each HTTP request. This approach decreases the maximum number of packets that Load Balancer can forward because the examination requires more CPU usage. As you might expect, processing time increases for each rule that is defined.

Define the more commonly matched rules with a lower priority value so that they are evaluated first and excessive processing is avoided. Load Balancer buffers the packets that the client sends in memory until a rule is matched or Load Balancer encounters the end of the HTTP headers. Monitor the kernel memory when you use this forwarding method to ensure that sufficient memory is available.

Note: The kCBR forwarding method works with SSL or HTTP traffic.


What is the maximum throughput I can expect?
Determine how much traffic the network can handle. Normally, the largest packet is the server response to the HTTP request. This response is the page that is returned to the browser.

In this test, the response size was limited to 1KB, which includes the HTTP response, the TCP wrapper, and the Ethernet wrapper. The intent was to reach a target of 3500 requests per second and determine whether the network bandwidth was sufficient.

In this case, consider the calculation of the total size of 3500 messages, each 1 KB:
  • 3500 messages/sec * 1 KB/message * 1024 B/KB * 8 b/B * 1 Mb/10**6 b = 28.672 Mb/s

    (This calculation converts kilobytes to bytes and bytes to bits.)

3500 messages per second, each 1 KB in size, yield a total of 28.672 Mb/s. We used the NAT forwarding method, so this amount must be doubled to estimate the network load that is handled by Load Balancer, which yields 57.344 Mb/s. We conducted the tests on a 100 Mb network, so this result was a feasible capacity for the hardware that we were using.
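The following Python sketch reproduces this estimate. The message size, request rate, and the doubling for NAT forwarding come from the text above; the function and variable names are illustrative assumptions only:

    # Estimated network load for 1 KB responses at a given request rate.
    # Illustrative sketch; names are not from the original study.
    def network_load_mbps(requests_per_sec, message_kb=1, nat_forwarding=True):
        bits = requests_per_sec * message_kb * 1024 * 8   # kilobytes -> bytes -> bits
        mbps = bits / 1e6                                  # bits -> megabits per second
        # NAT forwarding proxies both directions through Load Balancer,
        # so the load that Load Balancer handles is doubled.
        return mbps * 2 if nat_forwarding else mbps

    print(network_load_mbps(3500, nat_forwarding=False))  # 28.672 Mb/s
    print(network_load_mbps(3500))                        # 57.344 Mb/s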

The next step was to ensure that the network was configured to minimize latency. As latency increases, there is more data on the network. Therefore, the network utilization increases even though the actual throughput does not increase. Latency increases every time a router or switch or other networking device is traversed. To minimize the latency, we placed our stress generation tools, Load Balancer, and the back-end servers on the same Ethernet switch.


Stress Tool Tuning
The stress tools were deployed on Linux® machines. As we increased the connection requests on Load Balancer, the stress tools began reporting failures even though Load Balancer did not report any. Upon further investigation, the problems resulted from the ephemeral port range and the number of available file descriptors.

The ephemeral port range on a client is the range of port numbers that are available to make a connection to a well-known port. For example, an HTTP server listens on well-known port 80. When a client makes an HTTP request, the client randomly chooses an available port on its machine to communicate with port 80 on the server machine. If the client closes the connection, which is an active close, the client must wait for the TIME_WAIT period before this port can be reused. This design allows packets that might be delayed or resent to complete the transaction cycle after the connection is closed.

Increase the range of ephemeral ports to the maximum value to achieve the highest throughput. The procedure for increasing the ephemeral port range is specific to the operating system. For the Linux® 2.6 kernel clients used for this test, we made the following change:
  • echo "net.ipv4.ip_local_port_range = 1024 65000" >> /etc/sysctl.conf

If the stress tool is closing the connections and has a default TIME_WAIT of 60 seconds, the client is limited to a maximum of 1066 connections per second:
  • (65000 - 1024) / 60 = 1066.3

If you decrease the TIME_WAIT period to 30 seconds, you double the maximum throughput that the ephemeral port table allows:
  • (65000 - 1024) / 30 = 2132.5 connections per second

If the server closes the connection, the client does not need to wait to reuse the port so the ephemeral ports can be reused as quickly as possible.
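A minimal Python sketch of this client-side limit, using the port range and TIME_WAIT values from the text above (the helper name is an illustrative assumption):

    # Maximum client connection rate limited by the ephemeral port range
    # when the client performs the active close and ports sit in TIME_WAIT.
    # Illustrative sketch; not from the original study.
    def max_client_cps(low_port, high_port, time_wait_seconds):
        return (high_port - low_port) / time_wait_seconds

    print(max_client_cps(1024, 65000, 60))  # ~1066 connections per second
    print(max_client_cps(1024, 65000, 30))  # ~2132 connections per second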

For each connection that the client opens, the client must have a file descriptor available. To avoid an error from the stress tool stating "no file descriptor available," increase the maximum number of open file descriptors for each process. For the Linux® 2.6 kernel clients that were used for this test, we made the following change:
  • echo "ulimit -n 200000" >> /etc/profile


General Server Tuning
We executed the test with an HTTP Server application that was deployed on AIX® servers. In testing, we decreased the timewait period and increased the number of available file descriptors. The timewait period was decreased to 15 seconds with the following command:
  • /usr/sbin/no -o tcp_timewait=15
We used smitty to increase the number of file descriptors per process to an unlimited value (-1).


Load Balancer Tuning
As previously stated, the most important performance factor for Load Balancer is the forwarding method that is chosen. Whenever possible, use the MAC forwarding method.

Load Balancer has 64510 ports per return address for its ephemeral port table. To determine the maximum theoretical connections at the Load Balancer when you use the NAT or CBR forwarding methods, you must consider the fintimeout value and the number of return addresses.

The fintimeout value is set at the executor level with the following command (on the Load Balancer for IPv4 and IPv6, the fintimeout value is configurable only at the fix pack levels that are noted below):
  • dscontrol executor set fintimeout value

The default value for fintimeout is 30 seconds. If you decrease the fintimeout value, the clients (for an active close), the servers (for a passive close), or both must have their TIME_WAIT values decreased to match the Load Balancer setting.

The Load Balancer for IPv4 and IPv6 has a static fintimeout value of 120 seconds for versions 7 through 8.5.5.11 and for 9.0.0.0, 9.0.0.1, and 9.0.0.2. Fix pack levels 8.5.5.12, 9.0.0.3, and higher use a static value of 30 seconds. Fix pack levels 8.5.5.18 and 9.0.5.4 allow the fintimeout value to be configured with the 'executor set' command.

To calculate the theoretical number of connections per second, multiply the number of ports per return address for Load Balancer by the number of return addresses, and divide the product by the fintimeout value:
  • CPS = (64510 * Number of Return addresses) / fintimeout
    where CPS = connections per second

Table 1 details how the fintimeout value and number of return addresses can affect the theoretical and actual connections per second.

Table 1: Effect of fintimeout and Return Addresses on Connections Per Second (CPS)

  Number of return addresses | fintimeout value (seconds) | Theoretical CPS | Actual CPS
  1                          | 15                         | 4300.7          | 4000
  1                          | 30                         | 2150.3          | 1800
  2                          | 15                         | 8601.3          | 7000

Note: One return address per server is allowed.
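The theoretical CPS column in Table 1 follows directly from the formula above, as the following Python sketch shows; the constant and function names are illustrative assumptions:

    # Theoretical connections per second for NAT/CBR forwarding:
    # 64510 ephemeral ports per return address, recycled after fintimeout.
    # Illustrative sketch; names are not from the original study.
    PORTS_PER_RETURN_ADDRESS = 64510

    def theoretical_cps(return_addresses, fintimeout_seconds):
        return (PORTS_PER_RETURN_ADDRESS * return_addresses) / fintimeout_seconds

    for addrs, fto in [(1, 15), (1, 30), (2, 15)]:
        print(addrs, fto, round(theoretical_cps(addrs, fto), 1))
    # 1 15 4300.7
    # 1 30 2150.3
    # 2 15 8601.3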


High Availability considerations
If you use High Availability, the active Load Balancer must notify the backup Load Balancer of each connection, which increases the network traffic. For each connection, the primary Load Balancer notifies the backup machine when the connection is created and when it is destroyed.

These notifications are batched together within a packet:
  • A packet can contain 26 updates; each update is 43 bytes.
  • The Ethernet, TCP, and GRE headers account for 54 bytes.

If you add High Availability to the Load Balancer configuration, the network traffic increases by the following amount, in Mb/s:
  • Traffic increase = ((CPS * 43) + (ceiling[CPS / 26] * 54)) * 8 / 10**6
    where CPS is connections per second.

For 3500 connections per second, ceiling[3500 / 26] = 135 packets, so the calculation is:
  • ((3500 * 43) + (135 * 54)) * 8 / 10**6 = (150,500 + 7,290) * 8 / 10**6 = 1.262 Mb/s
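A Python sketch of this estimate, using the 43-byte update size, 26 updates per packet, and 54-byte header overhead given above (the function name is an illustrative assumption):

    import math

    # Additional network traffic, in Mb/s, generated by high availability
    # connection-record updates (create and destroy notifications).
    # Illustrative sketch; not from the original study.
    def ha_traffic_increase_mbps(cps, update_bytes=43, updates_per_packet=26,
                                 header_bytes=54):
        update_traffic = cps * update_bytes
        header_traffic = math.ceil(cps / updates_per_packet) * header_bytes
        return (update_traffic + header_traffic) * 8 / 1e6

    print(ha_traffic_increase_mbps(3500))  # ~1.26 Mb/s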

Mutual high availability (this feature is not available in the Load Balancer for IPv4 and IPv6)
Do not use mutual high availability. In a mutual high availability configuration, each Load Balancer must send a heartbeat request to its partner every 0.5 seconds and respond to the heartbeat requests from the partner. The connection records from each Load Balancer must also be sent to the partner machine. This results in more packets that require processing, and these packets are often not filled to capacity with connection records. The cost of transmitting and receiving these packets decreases performance, and there is no benefit to distributing the workload among the high availability partners instead of using a primary and a backup Load Balancer machine.

In many cases, Load Balancer machines become overloaded when they run in mutual high availability configurations. When one of the Load Balancer machines fails, the other machine fails as well because the combined workload is too great for a single machine. With regular High Availability, the increase in traffic would be noticed and the deployment could be adjusted to handle it, rather than experiencing a critical failure.

Staletimeout value
The default value for staletimeout is 300 seconds (5 minutes). The staletimeout value controls when Load Balancer releases the memory for an idle connection. Back-end servers release their resources after a period of inactivity, so configure Load Balancer to use the same timeout value as the back-end servers, or a smaller one. Do not increase this setting, because doing so might result in a significant performance impact.

The only protocol that requires a higher staletimeout period is Lightweight Directory Access Protocol (LDAP). If you use LDAP, set the staletimeout value to a minimum of 3600 seconds (1 hour).

The Load Balancer for IPv4 and IPv6 has a default value of 6400 (2 hours) and must be configured to a more appropriate value.

Error reports
When you conduct performance tests, monitor the executor report for errors:
dscontrol>>e rep

The output resembles:
  • Executor Report: 
    ---------------- 
    Version level ................................. 06.00.02.00 - 20070130-111358 
    Total packets received since starting ......... 100,000 
    Packets sent to nonforwarding address ......... 5000 
    Packets processed locally on this machine ..... 0 
    Packets sent to collocated server ............. 0 
    Packets forwarded to any cluster .............. 95,000 
    Packets not addressed to active cluster/port .. 0 
    KBytes transferred per second ................. 1000 
    Connections per second ........................ 100 
    Packets discarded - headers too short ......... 0 
    Packets discarded - no port or servers ........ 50 
    Packets discarded - network adapter failure ... 25 
    Packets with forwarding errors................. 0
  • If you see a "Packets discarded - network adapter failure" error and you are using the NAT or CBR forwarding method, add more return addresses. Errors occur in this field when an ephemeral port is not available to forward the request to the back-end server.
  • If you see a "Packets discarded - no port or servers" error, examine the incoming traffic to ensure that it is on the correct port. You might want to define a wildcard port (0) and direct traffic to a server with an IP trace tool running to determine the source of the incoming traffic. A third party might be scanning your servers for a security vulnerability.
  • If you see a "Packets with forwarding errors" error, examine the report for other error counts. This count often increases along with other errors, such as network adapter failure.

For the Load Balancer for IPv4: you can also run the following command to ensure that your tests are not experiencing problems:
  • executor xm 1 

The output resembles:
  • bash-2.01# dscontrol e xm 1 
    XMISC COMMAND SHOULD BE USED FOR DEBUGGING 
    PURPOSES ONLY. THESE ARE UNSUPPORTED COMMANDS. 
    THESE COMMANDS HAVE NOT BEEN EXTENSIVELY TESTED. 
      
    Primary FSM: 
    CntRcv ............ 34407 (0x00008667) 
    CntSnd ............ 396353 (0x00060C41) 
    CntSndMisc ........ 21676 (0x000054AC) 
    CntSndDbSyncRsp ... 257 (0x00000101) 
    CntSndDbReach ..... 7 (0x00000007) 
    CntSndDbUpd ....... 374413 (0x0005B68D) 
      CntGratArp ........ 14 (0x0000000E) 
    CntNoCRKeepAlive .. 0 (0x00000000) 
    CntNoCRBreak ...... 3 (0x00000003) 
    CntNoServerInCR ... 0 (0x00000000) 
    CntStateChanges ... 19 (0x00000013) 
    CntSubStateChanges. 17 (0x00000011) 
    Timer2 ............ 0 (0x00000000) 
    Backup FSM: 
    CntRcv ............ 83327 (0x0001457F) 
    CntSnd ............ 24705 (0x00006081) 
    CntSndMisc ........ 21789 (0x0000551D) 
    CntSndDbSyncRsp ... 269 (0x0000010D) 
    CntSndDbReach ..... 6 (0x00000006) 
    CntSndDbUpd ....... 2641 (0x00000A51) 
      CntGratArp ........ 8 (0x00000008) 
    CntNoCRKeepAlive .. 0 (0x00000000) 
    CntNoCRBreak ...... 6801 (0x00001A91) 
    CntNoServerInCR ... 0 (0x00000000) 
    CntStateChanges ... 20 (0x00000014) 
    CntSubStateChanges. 17 (0x00000011) 
    Timer2 ............ 0 (0x00000000) 
    Misc debug: 
    CntSndEncap ....... 29038
    CntRcvExcap ....... 0
    CntNtStarTrue ..... 0
    CntNtStarFalse .... 0
    CntSloTimDec ...... 0
    CntSloTimInc ...... 0
    CntCPASrvMismatch . 0
    TimeLastSlowTimeout 1168854229 (0x45AB4CD5) 
    CntNoEphPortAvail . 0
    CntNoNPSAvail ..... 0
    CntNoCRAvail ...... 0
    CntCRsIntactInCT .. 0
    CntCRsFreedFromCT . 0
    CntSndRstBadPkts... 0
    CntNpsRcv ......... 89260911
    CntNpsGet ......... 965462
    (CntNpsInTotal ..... 90226373)
    CntNpsFwd ......... 89060169
    CntNpsColoc ....... 38525
    CntNpsFree ........ 309031
    (CntNpsOutTotal .... 89407725)
    CntPktReas ........ 0
    CntFrgReas ........ 0
    CntFrgGC .......... 0
    TimeAsyncRstStart.. 0 (0x00000000) 
    TimeAsyncRstStop... 0 (0x00000000) 
    CntAsyncRstQueued.. 0 (0x00000000) 
    CntAsyncRstSent.... 0 (0x00000000) 
    TCP MSS ........... 1460
    Arg27.............. 0 (0x00000000) 
    Arg28.............. 0 (0x00000000)
 
  • The CntGratArp field indicates that a Load Balancer in high availability mode went active. If you are running performance tests, ensure that takeovers are not occurring. A takeover decreases the throughput for the period in which the cluster addresses are being moved from one Load Balancer machine to the other.
  • If the CntSndEncap counter has a nonzero value, your stress clients are reusing ports before the timewait period completes. Decrease the fintimeout setting of Load Balancer to match the setting for the stress clients. Some stress clients do not honor the timewait setting that is defined. In that case, decrease the amount of stress from each client and deploy more stress tools so that the test conforms to the TCP RFC.
  • If you see the Arg28 value increasing, add more return addresses. This counter is another indicator that Load Balancer is running out of ephemeral ports.

For the Load Balancer for IPv4 and IPv6: you can run the following command to obtain forwarding statistics (the tooling is found in the Load Balancer server/bin directory):

lbcommand.sh 0 get

The output resembles:

-bash-4.2# ./lbcommand.sh 0 get
executor key: (CONFIG) 0/0/0/0 action: 3(GET) rc: 1(SUCCESS)
  children: 9.37.208.86/0/0/9.37.208.254 9.37.208.168/0/0/9.37.208.167 9.37.208.105/0/0/0 9.37.208.250/0/0/0
  addable: n/a
  setable:
    logmask: 0x254000200 ( os arp neigh ha error )
    (ha)role: 2 (PRIMARY)
    (ha)takeoverstrategy: 1 (MANUAL)
    (ha)replicatestrategy: 0 (NOREP)
    (ha)reachscore: 0
    (ha)timeout: 2
    (ha)reachaddrs: 0 0 0 0 0 0 0 0 0 0 0 0
    (ha)reachstatus: 0 0 0 0 0 0 0 0 0 0 0 0
    (ha)port: 10150
    (ha)manualtakeoverrequested: 0
    Legacy: nfa: 9.37.208.86
    Legacy: clientgateway: 9.37.208.1
    Legacy: clientgateway_ipv6: 0:0:0:0:0:0:0:0
    Legacy: cps: 0 bps: 0 maxclusters: 0 maxports: 0
    Legacy: maxservers: 0 fintimeout: 0 staletimeout: 0 stickytime: 0
    Legacy: weightbound: 0 configfilename:
  getable:
    starttime: 0
    version: 8.5.5-10 - 20161012-151943 [wsbld614] AIX/ppc64/xlc80
    stats/ipv4: in:4849112 [ fwd:1632417 err:2 notforus:47856 discard:23 ]:1680298 gen:345591
    stats/ipv6: in:2304 [ fwd:0 err:0 notforus:2304 discard:0 ]:2304 gen:0
    stats/arp:  in:334336 [ fwd:65157 err:0 notforus:269179 discard:0 ]:334336 gen:82
    stats/icmp:   in:201 [ fwd:2 err:0 notforus:176 discard:13 ]:191 gen:0
    stats/icmpv6: in:0 [ fwd:0 err:0 notforus:0 discard:0 ]:0 gen:0
    clientgwneighstat: 0x1 ( RXOK )
    clientgwneighstat_ipv6: 0x0 ( undef )
    (ha)state:    1 (ACTIVE)
    (ha)substate: 1 (NA)
    (ha)lastevent: 0
  • IP error counts shown as stats/ipv4 err: and stats/ipv6 err: are summed and shown as "Packets with errors" on the executor report. Errors occur when there is insufficient memory or if there is an error in the configuration that prevents forwarding.
  • IP discards, shown as stats/ipv4 discard: and stats/ipv6 discard:, are summed and shown as "Packets discarded" on the executor report. Discards occur when there are no available servers to receive requests, requests are received for an undefined port, ports are reused prematurely, or the packet size exceeds the MTU.
  • Generated ICMP packets (indicated by stats/icmp gen: and stats/icmpv6 gen:) indicate a problem with forwarding requests. Either the packet hop count expired (indicating a routing loop) or the incoming packet size is larger than the maximum size that Load Balancer supports.

Addendum
Many variables affect performance. This document discusses the most common causes of throughput limitations, but not all causes or cases are documented.

In this document, the calculations for the theoretical maximum number of connections per second assume that you are using a dedicated network; this is typically not the case in an actual production system.


Reference
Tuning performance, WebSphere Application Server – 8.5.5

[{"Type":"MASTER","Line of Business":{"code":"LOB45","label":"Automation"},"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSEQTP","label":"WebSphere Application Server"},"ARM Category":[{"code":"a8m50000000CdIqAAK","label":"IBM Edge Load Balancer"}],"ARM Case Number":"","Platform":[{"code":"PF002","label":"AIX"},{"code":"PF016","label":"Linux"},{"code":"PF033","label":"Windows"}],"Version":"8.5.0;8.5.5;9.0.0;9.0.5"}]

Document Information

Modified date:
27 December 2021

UID

swg27013915