HADR_sim
Added by zhuge, last edited by zhuge on Aug 05, 2009

IBM DB2 High Availability Disaster Recovery (HADR) Simulator

Introduction

HADR (High Availability Disaster Recovery) is a DB2 feature that provides high availability and disaster recovery via data replication. When enabled, database logs from the primary database are shipped to the standby database in real time. The standby database continuously replays the received logs, thus staying in sync with the primary database.

HADR is easy to use and versatile. It is fully integrated in DB2. Setup requires only a few database configuration parameters. It uses a standard TCP interface to connect the primary and standby databases. It requires no special hardware or software.

For HA, the primary and standby databases are typically located fairly close to one another. For DR, the databases are typically located at different sites. HADR provides fast failover in case the primary database fails. You can also switch the roles of the primary and standby databases easily and quickly for scenarios such as rolling upgrade or rolling maintenance, keeping downtime to a minimum.

With the IBM DB2 High Availability Disaster Recovery (HADR) Simulator, HADR planning and deployment is even easier. You can estimate HADR performance under various conditions without even starting a database. The HADR simulator simulates DB2 log write and HADR log shipping. It can:

  • Estimate HADR performance under different sync modes.
  • Measure disk performance.
  • Measure network performance using HADR specific workload.
  • Experiment with hypothetical disk speed.
  • Experiment with various network tuning options.

The simulator helps you plan, measure and diagnose HADR quickly. The simulator is very lightweight. It consists of one process running on the primary host as the primary and one process running on the standby host as the standby. No setup or installation is needed. All configuration is passed as command line arguments. The simulator itself is a standalone executable. No installation of DB2 or any other software package is needed. You can run it on any machine and get results in seconds, with no impact on your production system.

The simulator arguments closely match HADR configuration, using parameters like local host, local port, remote host, remote port, and TCP socket send/recv buffer size. Tuning parameters found with the simulator can be applied to real HADR in a straightforward way.

The simulator is created by the DB2 HADR development team to help DB2 users better understand and deploy HADR. The algorithms and even some of the source code are directly extracted from the real HADR.

Executable Download

IBM DB2 HADR Simulator and the db2flushsize script are provided free of charge to current and potential IBM customers. They are provided as is, with no explicit or implicit warranty. Read the licensing terms at IBM Software License before you proceed. Download only if you agree to the license terms.

The HADR simulator executable (simhadr) is stand-alone. It does not need any DB2 libraries. You can run it on machines without DB2 installation. Binaries for selected platforms are pre-built. If you need the simulator on other platforms, contact us.

A helper script, db2flushsize, can be used to calculate flush size from log files. The flush size can then be fed to the simulator for more accurate runs.

  • db2flushsize Perl script to calculate flush size from log files.

Command Line Options

Run simhadr with no argument for a brief description of command line options. Additional notes below.

Required Options

  • -role option specifies the role of the process. Start primary process on primary host machine and start standby process on standby host machine. The two processes will connect to each other via TCP. You can start either process first.
  • -lhost, -lport, -rhost, -rport are equivalent to DB2 configuration parameters HADR_LOCAL_HOST, HADR_LOCAL_SVC, HADR_REMOTE_HOST and HADR_REMOTE_SVC. These are the TCP addresses of primary and standby. Note that while the simulator accepts both host name and IP address for host argument, it accepts only port number for port argument (Real HADR accepts both port number and service name).

Basic Options

  • -n and -t arguments control the length of the simulation run. They are accepted only on the primary. You can control the length by number of messages (-n), or by time (-t). Once primary reaches the limit, it will disconnect from the standby and end the run. Standby will detect the disconnection and end its run too.

During a run, you may send SIGINT (usually by pressing Control-C) to the primary at any time to stop the simulation. Upon the signal, the primary will stop the run and the usual simulation result will be printed on both primary and standby. Note: SIGINT and Control-C work on both Unix and Windows.

  • -syncmode specifies the HADR synchronization mode. It is equivalent to DB2 configuration parameter HADR_SYNCMODE. Syncmode, remote catchup (RCU), and flush size are propagated from primary to standby. They are accepted only on the primary side.
  • -rcu tells simhadr to simulate remote catchup (RCU) state log shipping. The default behavior is to simulate peer state, using the specified sync mode. In RCU state, log shipping is similar to ASYNC mode, where no ack message is sent from standby to primary.

In RCU state, the primary retrieves logs from log files to send to the standby while in peer state, logs are sent directly from primary log write buffer. Because log read is usually not the bottleneck and log read speed is difficult to model as it may involve log archive device, the simulator assumes infinitely fast log read (-disk option is applicable only to log write). The standby writes received logs in both peer and RCU state, for the purpose of crash recovery and takeover readiness.

  • -flushSize specifies the size of log flushes in units of 4 KB log pages. Default is 32 for peer state and 16 for remote catchup. See flush size for more info.
  • -hadrBufSize sets the standby recv buffer size in units of 4 KB log pages. Default is 4 times the flush size. This parameter is equivalent to DB2 registry variable HADR_BUF_SIZE.
  • -target sets the target log rate to the specified MBytes/sec. Default is infinity. This is useful when you want to throttle the log rate, for example when you are testing network stability rather than throughput. Applicable on the primary only.
  • -verbose gives you detailed info for each write, send, recv, and congestion. In most cases (especially when simhadr is writing to a terminal), the extra info will significantly slow down the simulation. Do not use -verbose for performance runs. Use only for per message debugging.

Network Options

  • -sockSndBuf and -sockRcvBuf set the socket send and recv buffer sizes. These parameters are equivalent to DB2 registry variables DB2_HADR_SOSNDBUF and DB2_HADR_SORCVBUF. The default is the system assigned size.
  • -nagle turns on the Nagle algorithm (combining send requests). In real HADR, Nagle is always disabled. Nagle reduces the total amount of network traffic, at the cost of application response time. This option lets you experiment with the impact of Nagle.
  • -block tells the simulator to use blocking send and receive. Real HADR always uses nonblocking send and receive. This option lets you test network throughput in blocking mode. On properly configured systems, there should be little difference between blocking and nonblocking mode. Note: Primary and standby need not have the same -block option. You can use -block on only one side.
  • -bind forces the standby to bind the local end of its socket to the standby's local host argument. By default, the simulator does not bind the standby socket's local end when it connects to the primary. Thus while the remote end of the socket is bound to the primary's local host and port (because the primary listens on that address and the standby connects to it), the socket's local end address is assigned by the standby host machine. Normally, the standby machine assigns a local IP that is on the best route to the primary end. The assigned IP may not be the canonical address of the standby machine. In most cases, it will be the address of the standby's HADR_LOCAL_HOST, because that address should be the best local end to reach the primary. In rare cases, such as a misconfigured routing table, the assigned local end may cause inefficient network routing, and -bind lets you test an explicit binding. Note: This option is for diagnostics only. Real HADR has no option to force standby local binding.

Disk Options

  • -testDisk creates a temp file in specified path to test disk performance. It will write to the temp file and measure write speed. As in real DB2, synchronous write is used (write() does not return until data is on disk). Write size can be controlled via -flushSize option. Simulation run time can be controlled via -n or -t option. If -verbose is on, a message will be printed on each write. Upon completion, the temp file is left for user inspection and removal.
  • -disk specifies disk speed using two parameters: data rate in MBytes/sec and per write overhead in seconds. The default is infinitely fast disk (disk write takes no time). Thus if you do not specify -disk option, the simulation will only reflect network bottleneck (CPU is rarely a bottleneck in log shipping).

Simulator Timing Options

  • -spin tells simhadr to use the specified spin time to compensate for sleep overhead. By default, simhadr measures the sleep overhead on the system. Simhadr uses sleep (more precisely, nanosleep() on Unix and the millisecond Sleep() on Windows) to simulate write time. However, the OS only guarantees that the process will sleep no less than the requested time. Actual sleep time is usually longer. When the OS wakes up the process, it may take some time for the process to actually run on a CPU. For accurate timing, simhadr requests less sleep time and tops off the sleep time with CPU spinning. The automatically measured sleep overhead usually works well. If you see too many warnings from simhadr about actual sleep time being longer than requested time, increase the spin time.
  • -testSleep tests the accuracy of the specified sleep time. It tests how well the sleep-and-spin method works. This option is used to test -spin settings.
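
The sleep-and-spin idea described for -spin can be illustrated with a minimal Python sketch. This is not simhadr's source code. The function names and the way overhead is averaged are assumptions made for illustration only.

```python
import time


def sleep_and_spin(duration, spin_time):
    """Wait `duration` seconds: sleep for most of it, then busy-wait the rest.

    Returns True if the sleep alone already overshot the deadline, which is
    the situation where a larger spin time would help.
    """
    deadline = time.perf_counter() + duration
    sleep_part = duration - spin_time
    if sleep_part > 0:
        time.sleep(sleep_part)          # the OS may sleep longer than requested
    overslept = time.perf_counter() > deadline
    while time.perf_counter() < deadline:
        pass                            # spin on the CPU until the deadline
    return overslept


def measure_sleep_overhead(requested=0.001, samples=20):
    """Average time (seconds) that time.sleep() takes beyond the request."""
    total = 0.0
    for _ in range(samples):
        start = time.perf_counter()
        time.sleep(requested)
        total += (time.perf_counter() - start) - requested
    return total / samples
```

A caller could measure the overhead once and pass a slightly larger value as spin_time, trading a little CPU for more accurate simulated write times.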

Interpreting simulator output

Below is sample output of a primary process. Standby output is similar.

+ simhadr -role primary -lhost brant -lport 46000 -rhost auklet -rport 46000 
          -sockSndBuf 64000 -sockRcvBuf 64000 

Measured sleep overhead: 0.003608 second, using spin time 0.004329 second.
SyncMode = NEARSYNC
flushSize = 32 pages
Simulation run time = 4 seconds

Resolving local host brant via gethostbyname()
hostname=brant.beaverton.ibm.com
alias: brant
address_type=2 address_length=4
address: 9.47.73.49

Resolving remote host auklet via gethostbyname()
hostname=auklet.beaverton.ibm.com
alias: auklet
address_type=2 address_length=4
address: 9.47.73.42

Socket property upon creation
BlockingIO=true
NAGLE=true
SO_SNDBUF=16384
SO_RCVBUF=87380
SO_LINGER: onoff=0, length=0

Calling setsockopt(SO_SNDBUF)
Calling setsockopt(SO_RCVBUF)
Socket property upon buffer resizing
BlockingIO=true
NAGLE=true
SO_SNDBUF=128000
SO_RCVBUF=128000
SO_LINGER: onoff=0, length=0

Binding socket to local address.
Listening on local host TCP port 46000

Connected.

Calling fcntl(O_NONBLOCK)
Calling setsockopt(TCP_NODELAY)
Socket property upon connection
BlockingIO=false
NAGLE=false
SO_SNDBUF=128000
SO_RCVBUF=128000
SO_LINGER: onoff=0, length=0

Sending handshake message:
syncMode=NEARSYNC
flushSize=32
connTime=2008-07-09_17:12:25_PDT

Sending log flushes. Press Ctrl-C to stop.

NEARSYNC: Total 410255360 bytes in 4.000821 seconds, 102.542793 MBytes/sec
Total 3130 flushes, 0.001278 sec/flush, 32 pages (131072 bytes)/flush

Total 410255360 bytes sent in 4.000821 seconds. 102.542793 MBytes/sec
Total 10893 send calls, 37.662 KBytes/send, 
Total 3136 congestions, 0.848714 seconds, 0.000270 second/congestion

Total 150240 bytes recv in 4.000821 seconds. 0.037552 MBytes/sec
Total 3130 recv calls, 0.048 KBytes/recv

Distribution of log write size (unit is byte):
Total 3130 numbers, Sum 410255360, Min 131072, Max 131072, Avg 131072
Exactly     131072        3130 numbers

Distribution of log shipping time (unit is microsecond):
Total 3130 numbers, Sum 3989623, Min 1262, Max 1622, Avg 1274
From 1024 to 2047               3130 numbers

Distribution of congestion duration (unit is microsecond):
Total 3136 numbers, Sum 848714, Min 208, Max 824, Avg 270
From 128 to 255                  120 numbers
From 256 to 511                 3011 numbers
From 512 to 1023                   5 numbers

Distribution of send size (unit is byte):
Total 10893 numbers, Sum 410255360, Min 752, Max 86880, Avg 37662
From 512 to 1023                 192 numbers
From 2048 to 4095                  3 numbers
From 4096 to 8191               2661 numbers
From 8192 to 16383              1777 numbers
From 16384 to 32767             2913 numbers
From 32768 to 65535              217 numbers
From 65536 to 131071            3130 numbers

Distribution of recv size (unit is byte):
Total 3130 numbers, Sum 150240, Min 48, Max 48, Avg 48
Exactly         48        3130 numbers

Simhadr prints out details of host name resolution. The same function is used in real HADR to resolve host names. The output can be used to debug HADR configuration. For example, you can see the actual IP address a host name is resolved to.

Socket property is printed upon creation, buffer resizing and connection. This helps you track down potential socket problems. In the example, the granted socket buffer size is twice the requested size. This is normal on Linux. The OS grants more memory because it takes memory overhead into consideration. On AIX, if Interface Specific Network Option (ISNO) is enabled, socket property may change upon connection because the OS can determine the actual network interface used by the socket only upon connection.

The sample ran on gigabit Ethernet. The throughput is 102 MB/second. Note that this is NEARSYNC mode, where round trip messaging is required for each flush (primary needs an ack message from standby). Simhadr reports average log shipping time as 1.3 milliseconds. The round trip time on this network (measured by "ping") is only 0.1 millisecond, so the overhead of the ack message is insignificant. Thus NEARSYNC mode throughput is not too far behind the raw bandwidth of the network.

On the same system, when running in ASYNC mode, throughput is 115 MB/second, near the full bandwidth of the gigabit network.

The distributions of various metrics are printed. For log shipping time, there are 3130 samples (because there are 3130 flushes). The average is 1274 microseconds. All numbers are in the range of 1024 to 2047. The numbers show that log shipping time is very consistent. This is good news for transactions.

If a distribution has 8 or fewer distinct values, simhadr prints out the distinct values and their occurrence counts. For example, receive size has only one distinct value: 48. The ack message from standby to primary is exactly 48 bytes. Because the message is small, it is always received in one recv call.
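
The reporting rule can be sketched in Python as follows. This is an illustration, not simhadr's code. The power-of-two bucket boundaries and the truncated integer average are inferred from the sample output (for example "From 1024 to 2047" and "Avg 1274"), and the exact line layout here is simplified.

```python
from collections import Counter


def summarize(values, max_distinct=8):
    """Return report lines for a non-empty list of non-negative integers."""
    total = sum(values)
    lines = [
        f"Total {len(values)} numbers, Sum {total}, Min {min(values)}, "
        f"Max {max(values)}, Avg {total // len(values)}"
    ]
    counts = Counter(values)
    if len(counts) <= max_distinct:
        # Few distinct values: list each value with its occurrence count.
        for value in sorted(counts):
            lines.append(f"Exactly {value}: {counts[value]} numbers")
        return lines
    # Many distinct values: group into buckets [2^k, 2^(k+1) - 1].
    buckets = Counter()
    for value in values:
        low = 1 << (value.bit_length() - 1) if value > 0 else 0
        buckets[low] += 1
    for low in sorted(buckets):
        high = 2 * low - 1 if low > 0 else 0
        lines.append(f"From {low} to {high}: {buckets[low]} numbers")
    return lines
```

Calling summarize([48] * 3130) reproduces the recv size summary of the sample: one total line and a single "Exactly 48" line.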

For send, transient congestion is hit because the process can deliver data to the TCP layer faster than TCP can send it. TCP throttles the sender by returning a "resource temporarily unavailable" error code. Upon receiving this code, simhadr waits on select() until the OS indicates that the socket is writable again. Average send size is 37 KB, so on average each 128 KB flush took about 3.5 send calls to complete.
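
The summary figures in the sample output can be cross-checked from the raw totals. The helper below is a sketch written for this page (the function and key names are hypothetical). It assumes, as the sample numbers suggest, that simhadr reports MBytes as 10^6 bytes and KBytes as 10^3 bytes.

```python
def derived_stats(total_bytes, seconds, flushes, send_calls):
    """Recompute per-second, per-flush and per-send figures from raw totals."""
    return {
        "mbytes_per_sec": total_bytes / seconds / 1e6,
        "sec_per_flush": seconds / flushes,
        "sends_per_flush": send_calls / flushes,
        "kbytes_per_send": total_bytes / send_calls / 1e3,
    }


# Totals taken from the sample primary output.
stats = derived_stats(total_bytes=410255360, seconds=4.000821,
                      flushes=3130, send_calls=10893)
for name, value in stats.items():
    print(f"{name}: {value:.6f}")
```

With the sample totals, the throughput comes out at about 102.5 MBytes/sec and the number of send calls per flush at about 3.5.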

For distribution of congestion time, average is 270 microsecond, or .27 millisecond. The numbers are mostly in the range of 256 to 511. This is expected as the round trip time on the network is .1 millisecond. The sending socket can release its buffers for more send only when it receives ack (TCP level ack, not to be confused with HADR ack) from the receiver. Thus the time to get out of congestion is usually in the same order as the round trip time.

Transient congestion is normal and expected when using non-blocking socket. There will be more info later in this paper in section "Non-blocking IO and sender congestion". The overhead to handle congestion is insignificant. On the same system, when using larger socket send buffer, congestions are completely avoided, but throughput remains the same.

Note that the example did not specify a disk speed, so the numbers reflect the network as the sole bottleneck. In the real world, disk throughput is typically 30 to 60 MB/second, slower than the sample network. Thus on a LAN, HADR log replication usually adds only a small overhead to database logging when NEARSYNC and ASYNC modes are used (these modes use parallel log write and send). More info in the "Sync mode" section later.

Simulation Parameters and Scenarios

Bit size and Platform

DB2 HADR simulator can be compiled as either a 32 bit or a 64 bit application. The bit size is shown in the simhadr help message. Normally, there is little performance difference between the bit sizes. For best results, it is recommended that you run simhadr with the same bit size as your DB2 installation.

Real HADR requires the primary and standby host machines to be of the same OS type and machine architecture. The simulator has no requirement on OS type or machine architecture (primary and standby can be on machines of different endianness). This allows you to experiment with different platforms.

Flush size

In primary and standard databases, SQL agent EDUs produce logs and write the logs into database log buffer, whose size is controlled by database configuration parameter LOGBUFSZ. The loggw EDU (there is only one loggw per database) consumes the logs by writing them to disk. The write operation is called log flush. Each write operation (and the logs written) is called a flush. For HADR primary databases, in peer state, each flush is also replicated to the standby database. Log write is not considered complete (transactions can not commit) until the replication is complete. Depending on the sync mode, replication completion requires passing data to TCP layer (ASYNC mode), receiving ack message from standby that logs have been received (NEARSYNC mode), or receiving ack message from standby that logs have been written to disk on standby (SYNC mode).

Loggw EDU does not wait for the log buffer to be full to flush it. A transaction commit will generate a flush request. If there is no request, loggw will still wake up from time to time to flush the log buffer. The size of each log flush is non-deterministic. When there are multiple client sessions, multiple requests can be combined in one flush. The logger is designed to be self tuning. If a flush takes longer time, when the logger completes the flush, there will be more outstanding requests, and therefore stronger grouping effect on commit requests, improving performance by reducing number of writes.

The maximal flush size is limited by log buffer size and total log data from outstanding (waiting to commit) transactions. The default 32 page flush size of simhadr assumes a "typical" OLTP system. 32 pages is 128 KB, so it represents a system of 1000 client connections, each submitting 128 byte transactions. In theory, a log buffer twice as large as the outstanding log data is enough, because while loggw writes one half, sessions can write to the other half. In practice, you need a bigger buffer because the logger does not write wrap-around flushes. It will only flush to the end of the buffer and then flush from the beginning in the next flush. A bigger buffer also allows rollback to read more logs from the buffer instead of from disk.
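
The arithmetic behind the 32 page default can be written out as a short sketch. The function names are hypothetical, and the model (every outstanding session contributes one transaction's worth of log data to a single flush) is the simplified "typical OLTP" assumption from the paragraph above, not a DB2 formula.

```python
import math

PAGE_SIZE = 4096  # bytes per log page


def estimated_flush_pages(sessions, log_bytes_per_transaction):
    """Pages needed to hold one outstanding transaction from every session."""
    return math.ceil(sessions * log_bytes_per_transaction / PAGE_SIZE)


def theoretical_min_log_buffer_pages(sessions, log_bytes_per_transaction):
    """Twice the outstanding log data; in practice a larger buffer is needed."""
    return 2 * estimated_flush_pages(sessions, log_bytes_per_transaction)


print(estimated_flush_pages(1000, 128))             # 1000 sessions, 128 bytes each
print(theoretical_min_log_buffer_pages(1000, 128))
```

For 1000 sessions and 128 byte transactions this gives 32 pages per flush, matching the simhadr default.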

When running simhadr, you may use a flush size based on log write buffer size and transaction workload. This gives you the theoretical maximal flush size, therefore theoretical maximal throughput. You may also run simhadr using the flush size measured on your DB2 database.

There are two ways to measure flush size on a database:

  • Divide database snapshot element "Log pages written" by element "Number write log IOs". This is an estimation because when a partial page is flushed, it can take more than one write operation. Thus the result tends to be smaller than the actual flush size. A database snapshot can be retrieved in multiple ways, among them the "db2 get snapshot for database" command and the SNAP_GET_DB_V91 table function.
  • Use the attached Perl script db2flushsize to look for end of flush flags in log files. This is an estimation because after a partial page is flushed, its flag will be overwritten when a newer version of the page is written. The record of the partial page flush is then lost. The result from this method tends to be higher than the actual flush size. Note: On Unix systems, you may either invoke the script via the perl interpreter or add execution permission to it and invoke it directly. On Windows, you can only invoke it via the perl interpreter.
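
The snapshot method is a simple division. The sketch below assumes you have already read the two snapshot element values yourself; the function names are hypothetical. Because the snapshot estimate tends to be low and the log file estimate tends to be high, the two can be treated as a rough bracket around the real flush size.

```python
def average_flush_pages(log_pages_written, num_log_write_ios):
    """Estimate pages per flush from two database snapshot elements."""
    if num_log_write_ios <= 0:
        raise ValueError("no log write IOs recorded yet")
    return log_pages_written / num_log_write_ios


def flush_size_bracket(snapshot_estimate, log_file_estimate):
    """Return (low, high) bounds from the two estimation methods."""
    return (min(snapshot_estimate, log_file_estimate),
            max(snapshot_estimate, log_file_estimate))
```

For example, 64000 pages written in 4000 write IOs gives an estimate of 16 pages per flush.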

The methods are more accurate on busy systems, where there are fewer partial page flushes. DB2 flushes a partial page only when it is requested by a transaction commit and there are no full pages to flush. On a busy system, chances are that there are full pages to flush when a transaction requests commit. The logger will flush the full pages first. By the time the logger finishes flushing the full pages, the partial page at the log end may already be filled and the logger will again flush full pages.

If the flush size measured on a database is less than the theoretical maximal size based on log write buffer size and transaction workload, it may just mean that the system is not at full capacity yet. If the load gets heavier, the system will adapt by running with larger flush size.

The dynamic nature of log flush means that factors like system load, turning HADR on or off, changing HADR sync mode may significantly change the flush size. It is recommended that you run simhadr using a series of flush sizes to understand the impact of flush size on your system and estimate the range of HADR throughput.
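
A series of runs can be prepared with a small helper that builds one primary command line per flush size. This is only a sketch: the option names come from this page, but the "-option value" form for -syncmode, -flushSize and -t is assumed from the sample command line, and the executable name and helper name are placeholders. Check the simhadr help output for the exact syntax before using it.

```python
def primary_commands(lhost, lport, rhost, rport, flush_sizes,
                     sync_mode="NEARSYNC", seconds=4, exe="simhadr"):
    """Build one argument list per flush size for the primary process."""
    commands = []
    for pages in flush_sizes:
        commands.append([
            exe, "-role", "primary",
            "-lhost", lhost, "-lport", str(lport),
            "-rhost", rhost, "-rport", str(rport),
            "-syncmode", sync_mode,      # assumed "-option value" syntax
            "-flushSize", str(pages),    # flush size in 4 KB pages
            "-t", str(seconds),          # run length in seconds
        ])
    return commands


for cmd in primary_commands("brant", 46000, "auklet", 46000, [1, 8, 32, 128]):
    print(" ".join(cmd))
```

Because the standby ends its run when the primary disconnects, a standby process has to be started again for each run in the series.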

Flush size in remote catchup state

For simulation of remote catchup state, simhadr defaults to 16 pages for flush size. This is because in RCU state, the primary issues read requests using 16 page buffers. In most cases, the file system will return 16 pages to the call, so the primary will send 16 page blocks to standby.

Sync mode

HADR provides 3 synchronization modes to suit a diverse range of operational environment and customer requirements.

  • SYNC - Transactions on primary will commit only after relevant logs have been written to disk on both primary and standby.
  • NEARSYNC - Transactions on primary will commit only after relevant logs have been written to disk on primary and received into memory on standby.
  • ASYNC - Transactions on primary will commit only after relevant logs have been written to local disk and sent to standby.

For SYNC and NEARSYNC modes, the primary will wait for an ack message from the standby to confirm that the logs have been received and written to disk on standby (SYNC mode) or have been received on the standby (NEARSYNC mode). For ASYNC mode, primary will consider replication done as soon as the logs are delivered to the TCP layer of the primary host machine.

SYNC mode gives best protection of data. Two on-disk copies of data are required for transaction commit. The cost is the extra time for writing on standby and sending the ack message back to primary.

In SYNC mode, logs are sent to standby only after they are written to primary disk. Log write and replication happen sequentially. The total time for a log write is the sum of (primary_log_write + log_send + standby_log_write + ack_message). The cost of replication is significantly higher than that of other modes.

NEARSYNC mode is nearly as good as SYNC, at significantly less cost. Standby sends ack message as soon as it receives the logs in memory. Furthermore, sending logs to standby and writing logs to primary disk are done in parallel. On a fast network, log replication causes no or little overhead to primary log writing.

In NEARSYNC mode, you will lose data only if the primary fails and the standby also fails before it has a chance to write the received logs to disk. This is a relatively rare "double failure" scenario. Thus NEARSYNC is a good choice for many applications, providing near-sync protection at far less performance cost.

In ASYNC mode, sending logs to standby and writing logs to primary disk are done in parallel, just like NEARSYNC mode. Because ASYNC mode does not wait for ack messages from the standby, primary system throughput is min(log write rate, log send rate). ASYNC mode is well suited for WAN applications. Network transmission delay does not impact performance in this mode. But if the primary database fails, there is a higher chance that logs in transit are lost (not replicated to standby).
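
The per-flush cost of the three modes can be compared with a rough model. The SYNC sum is taken directly from the text above. The NEARSYNC and ASYNC expressions are one possible reading of "done in parallel" (the slower of the two parallel activities dominates) and are assumptions of this sketch, not HADR's actual algorithm.

```python
def flush_time(sync_mode, primary_write, send, standby_write, ack):
    """Approximate time (seconds) before a flush is considered complete."""
    if sync_mode == "SYNC":
        # Everything happens sequentially.
        return primary_write + send + standby_write + ack
    if sync_mode == "NEARSYNC":
        # Local write runs in parallel with send plus ack from standby memory.
        return max(primary_write, send + ack)
    if sync_mode == "ASYNC":
        # Local write runs in parallel with handing the data to TCP.
        return max(primary_write, send)
    raise ValueError(f"unknown sync mode: {sync_mode}")


# Hypothetical timings: 4 ms disk writes, 1 ms send, 0.1 ms ack.
for mode in ("SYNC", "NEARSYNC", "ASYNC"):
    print(mode, flush_time(mode, 0.004, 0.001, 0.004, 0.0001))
```

With these hypothetical timings, SYNC costs more than twice as much per flush as the other two modes, while NEARSYNC and ASYNC are both bounded by the local disk write.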

The simulator sends and receives log pages and ack messages with actual size, although the messages contain dummy content. Disk writes are simulated using sleep(). No data is actually written. The network workload generated by simhadr is identical to that of real HADR. By running the simulator, you can preview the performance of various sync modes and test your network before you deploy HADR.

Disk speed

Simhadr can measure disk speed and simulate disk write.

To measure disk speed, use the "-testDisk" option. To specify disk speed for simulation runs, use the "-disk" option.

Disk write time is modeled as: write_time = data_amount / write_rate + per_write_overhead. Write_rate is in units of MB/second. Per_write_overhead is in units of seconds. Theoretically, given the write times of two runs with different data amounts, you can solve the equation and get the write rate and per write overhead. To make things simple, you can do a run with a 1 page flush size. The reported write time is an approximation of the per write overhead. Then do a run with a large flush size such as 500 or 1000. The reported MB/second is an approximation of the write rate.
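
The disk model and the "two runs" solution can be sketched as follows. The sketch reads the model as data amount divided by write rate (the rate is given in MB/second) plus the per write overhead, and takes 1 MB as 10^6 bytes, as the sample output suggests. The function names are hypothetical.

```python
PAGE_SIZE = 4096  # bytes per log page


def write_time(flush_pages, rate_mb_per_sec, overhead_sec):
    """Modeled time in seconds to write one flush of `flush_pages` pages."""
    return flush_pages * PAGE_SIZE / (rate_mb_per_sec * 1e6) + overhead_sec


def solve_disk_parameters(pages1, time1, pages2, time2):
    """Solve the model for (rate in MB/s, overhead in s) from two measurements.

    time1 and time2 are the measured per-write times for the two flush sizes.
    """
    if pages1 == pages2 or time1 == time2:
        raise ValueError("need two runs with different sizes and times")
    bytes1, bytes2 = pages1 * PAGE_SIZE, pages2 * PAGE_SIZE
    rate_bytes = (bytes2 - bytes1) / (time2 - time1)
    overhead = time1 - bytes1 / rate_bytes
    return rate_bytes / 1e6, overhead
```

For example, feeding in the modeled times of a 1 page run and a 1000 page run on a hypothetical 50 MB/second disk with 2 millisecond overhead recovers those two parameters.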

When testing disk, simhadr issues synchronous writes (write() does not return until data is on disk), just like log writing in real DB2. Simhadr does not remove the temp file created for disk testing. You may examine, then delete the file. For example, you may want to examine the content of the file, or the degree of sector fragmentation of the file.

With a single disk, typical write rate is 30 to 60 MB/second. Typical per write overhead is 1 to 20 millisecond. Newer disks usually have shorter per write overhead. Disk arrays may have better performance. A device with short per write overhead is recommended as log device.

Once you have the disk speed parameters, you may feed them back to simhadr using the -disk option. When disk speed is specified, simhadr will compute the time needed to write a log flush and use sleep() to simulate the write. No actual data is written. This allows you to use hypothetical disk speeds for "what if" questions like "what if my disk is faster?".

TCP buffer size

Beginning in DB2 V8fp17, V91fp5 and V95fp2, you can use registry variables DB2_HADR_SOSNDBUF and DB2_HADR_SORCVBUF to set the HADR socket send and receive buffer sizes. For older releases, you need to set the socket buffer size at the system level (the setting applies to all applications on the machine).

In simhadr, these sizes are set via the -sockSndBuf and -sockRcvBuf options. Simhadr allows you to experiment with various sizes to find the optimal setting. Simhadr reports socket buffer sizes upon socket creation, buffer resizing, and connection. These numbers are very useful for tuning the network. The size upon socket creation is the system default. In some cases (AIX interface specific network option), the size may change upon connection (more info below).

The host machine may round up the requested size to certain sizes like a power of two or a multiple of the network packet size, or just ignore the request because of a system or user limit. In particular, Linux may grant a size twice the requested size, in order to account for buffer overhead.

In real HADR, DB2 does not fail HADR startup if actual size is smaller than requested size. When you set those registry variables, you should check db2diag.log for messages like:

HADR Socket send buffer size, SO_SNDBUF: Requested X bytes, Actual Y bytes
HADR Socket receive buffer size, SO_RCVBUF: Requested X bytes, Actual Y bytes

to find out the actual buffer size.
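
The same "requested versus actual" check can be done on any machine with standard socket calls. The sketch below uses Python's socket module; the helper name is hypothetical and the code is unrelated to DB2's or simhadr's implementation.

```python
import socket


def request_buffer_sizes(sock, send_bytes, recv_bytes):
    """Request socket buffer sizes and return the sizes the OS actually set."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, send_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, recv_bytes)
    actual_send = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    actual_recv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return actual_send, actual_recv


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    snd, rcv = request_buffer_sizes(s, 64000, 64000)
    print(f"SO_SNDBUF: Requested 64000 bytes, Actual {snd} bytes")
    print(f"SO_RCVBUF: Requested 64000 bytes, Actual {rcv} bytes")
```

The actual values depend on the operating system and its limits, so they may be larger or smaller than the requested 64000 bytes.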

On most systems, TCP socket buffer size controls TCP window size, which is a very important TCP tunable. The general rule is

TCP window size = send_rate * round_trip_time

TCP is a reliable protocol. The sender needs to keep a copy of sent data until the data is acked by the receiver. If the ack message does not come in time, the sender will need to resend. The minimal amount of time from the start of sending a piece of data to an ack message coming back is network round trip time. To fully utilize network bandwidth, the sender should be sending at full send rate while waiting for ack. Thus it needs to buffer "send_rate * round_trip_time" amount of data. Note that we are talking about the OS socket buffer and TCP ack message, which are different from the HADR buffer DB2 maintains and the HADR ack message DB2 sends. HADR buffer and messages are on a higher level in the network stack.

To determine the round trip time between two machines, the "ping" command can be used. For send rate or bandwidth test, there are many network tools available. Note that because HADR workload is different (for example, HADR uses nonblocking IO) from the load of other applications (such as plain FTP), HADR throughput may differ from others'.

On a LAN, the system default socket buffer size is usually large enough because the round trip time is short. Example: 1 Gbit/second * 0.1 ms = 100,000 bits, or about 12.5 KBytes. Most systems have a default larger than that.

On a WAN, the system default is often not large enough because of the longer round trip time. Example: 10 Mbit/second * 100 ms = 1,000,000 bits, or 125 KBytes. Many systems have a default smaller than 125 KBytes. Such systems would require setting the TCP window size. Setting a large size at the system level would consume a large amount of memory if there are many connections on the system, as is the case with many client/server connections. Thus setting the window size for HADR only is desirable.
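
The window size rule is easy to evaluate for your own network. In the sketch below, the function names are hypothetical. The 65535 byte limit is the largest window that fits in the 16 bit TCP window field without the window scaling option.

```python
def required_window_bytes(bandwidth_bits_per_sec, round_trip_sec):
    """TCP window size = send_rate * round_trip_time, converted to bytes."""
    return bandwidth_bits_per_sec * round_trip_sec / 8


def needs_window_scaling(window_bytes):
    """True when the window does not fit in the unscaled 16 bit field."""
    return window_bytes > 65535


lan = required_window_bytes(1e9, 0.0001)   # 1 Gbit/s, 0.1 ms round trip
wan = required_window_bytes(10e6, 0.1)     # 10 Mbit/s, 100 ms round trip
print(f"LAN: {lan / 1000:.1f} KBytes, scaling needed: {needs_window_scaling(lan)}")
print(f"WAN: {wan / 1000:.1f} KBytes, scaling needed: {needs_window_scaling(wan)}")
```

The two calls correspond to the LAN and WAN examples in the text: roughly 12.5 KBytes and 125 KBytes.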

When window size is too small, the network will not be able to fully utilize its bandwidth. Applications like HADR will experience throughput lower than the nominal bandwidth. A larger than necessary size usually causes no harm other than consuming more memory.

Send and receive size are usually set to the same value and applied to both primary and standby databases. There may be cases where asymmetrical configuration is desired. Consult network experts for best configuration.

Note on HADR Sync and Nearsync modes

In Sync and Nearsync modes, the HADR primary database will wait for an ack message from standby after sending out a flush. Even though the wait and ack are on the application level (not to be confused with TCP level ack messages), they have the effect of restricting the send window to the flush size. A TCP send buffer larger than the flush size will not improve performance because send is blocked after sending out a flush. In such cases, the bottleneck is the small flush size combined with the round trip time. For more info on flush size, see the "Flush size" section.

If you use async mode, then flush size has little impact on performance as the primary always sends one flush after another without waiting for ack from standby.
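
The effect of flush size and round trip time in Sync and Nearsync modes can be estimated with a simple upper bound: at most one flush can be acknowledged per round trip. This is a simplification made for this sketch (it ignores transmission and disk time), and the function names are hypothetical.

```python
import math

PAGE_SIZE = 4096  # bytes per log page


def max_acked_throughput(flush_pages, round_trip_sec):
    """Upper bound in bytes/second when each flush waits for one round trip."""
    return flush_pages * PAGE_SIZE / round_trip_sec


def flush_pages_for_rate(target_bytes_per_sec, round_trip_sec):
    """Smallest flush size (pages) whose upper bound reaches the target rate."""
    return math.ceil(target_bytes_per_sec * round_trip_sec / PAGE_SIZE)


# Hypothetical WAN: 100 ms round trip, default 32 page flushes.
print(max_acked_throughput(32, 0.1) / 1e6, "MBytes/sec at most")
print(flush_pages_for_rate(1.25e6, 0.1), "pages to reach 10 Mbit/s")
```

On a 100 ms link, the default 32 page flush caps throughput at about 1.3 MBytes/sec regardless of how large the TCP buffers are.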

TCP Window Scaling

TCP windows larger than 64 KBytes require TCP window scaling. Some systems automatically enable window scaling when the TCP buffer size is larger than 64 KBytes. Others require explicit enablement. Window scaling is also known as RFC1323.

When expected window size is greater than 64k bytes, check if you need to explicitly enable window scaling.
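As a minimal sketch of the check, the snippet below computes the smallest window scale shift that can express a given window. It relies on two facts from the TCP specification: the window field is 16 bits (maximum 65535), and the scale shift is at most 14. The function name is illustrative.

```python
MAX_UNSCALED_WINDOW = 65535  # largest value of the 16-bit TCP window field
MAX_SHIFT = 14               # largest shift count allowed by window scaling


def required_window_shift(window_bytes: int) -> int:
    """Smallest shift such that (65535 << shift) covers window_bytes.

    A result of 0 means window scaling is not needed.
    """
    for shift in range(MAX_SHIFT + 1):
        if (MAX_UNSCALED_WINDOW << shift) >= window_bytes:
            return shift
    raise ValueError("window is larger than TCP window scaling can express")


if __name__ == "__main__":
    for window in (12500, 125000, 4 * 1024 * 1024):
        print(f"{window} bytes -> shift {required_window_shift(window)}")
```

Any nonzero result means window scaling must be active on both ends of the connection for the full window to be usable.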

Non-blocking IO and sender congestion

Non-blocking IO

HADR uses non-blocking send and receive. The process sets the non-blocking flag on the socket. Thus send calls may return before all requested data has been sent. For receive, recv may return no data. To avoid futile recv calls, HADR calls recv only if select() indicates that there is data to receive. In contrast, many applications use blocking send/recv, where the application is blocked until all requested data has been sent or, for receive, at least some data has been received. The main reason HADR uses non-blocking IO is that the HADR EDU is multitasking; it cannot afford to block.

Some systems may not handle non-blocking send/recv efficiently. Thus simhadr provides a -block option to test network performance using blocking send and recv. Normally, blocking and non-blocking IO give nearly identical performance. If -block gives much better throughput, then the system has a problem processing non-blocking IO, and OS tuning or patching will be needed.

Note: Primary and standby need not have the same -block option. You can use blocking IO on one side and nonblocking on the other.

Sender congestion

With blocking IO, a send call is blocked until all requested data is sent. With non-blocking IO, it may return before all data is sent. In particular, it may send zero bytes and return an error code indicating "resource temporarily unavailable". If HADR (real or simulated) encounters this return code, it stops calling send until select() indicates that the socket is writable again. While it is waiting, it considers the network "congested". In real HADR, the "congested" state is returned as the network status. The simulator keeps statistics on the number and duration of congestion episodes.

Encountering short congestion is normal in HADR. It is a normal part of flow control. On many systems, if the system cannot copy the requested data into socket buffer (buffer is full), it returns congestion to caller. As soon as some space is available in the buffer, the OS will notify the process of the availability of the socket via select(). This may seem inefficient compared to blocking send, but allows the process to multi-task, therefore reducing the number of processes and context switching among the processes.

Theoretically, the OS can reopen the socket for send as soon as there is one byte of free space in socket buffer. In practice, it may choose to wait until a certain amount of space is available, just to avoid thrashing.
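The send-and-wait pattern described above can be sketched as follows. This is an illustration of the general technique using the standard socket and select modules, not HADR's or simhadr's actual code. The function names and the 4 MB payload are assumptions chosen for the demo.

```python
import select
import socket
import threading
import time


def send_all_nonblocking(sock, data):
    """Send data on a non-blocking socket.

    Returns (congestion_count, congested_seconds), where a congestion
    episode is a send that failed because the socket buffer was full.
    """
    sock.setblocking(False)
    view = memoryview(data)
    congestion_count = 0
    congested_seconds = 0.0
    while len(view) > 0:
        try:
            sent = sock.send(view)   # may accept only part of the data
            view = view[sent:]
        except BlockingIOError:      # "resource temporarily unavailable"
            congestion_count += 1
            start = time.monotonic()
            select.select([], [sock], [])   # wait until writable again
            congested_seconds += time.monotonic() - start
    return congestion_count, congested_seconds


def drain(sock, expected, result):
    """Blocking receiver that reads until `expected` bytes have arrived."""
    total = 0
    while total < expected:
        chunk = sock.recv(65536)
        if not chunk:
            break
        total += len(chunk)
    result.append(total)


def demo(payload_size=4 * 1024 * 1024):
    sender, receiver = socket.socketpair()
    result = []
    reader = threading.Thread(target=drain, args=(receiver, payload_size, result))
    reader.start()
    count, seconds = send_all_nonblocking(sender, b"x" * payload_size)
    reader.join()
    sender.close()
    receiver.close()
    return result[0], count, seconds


if __name__ == "__main__":
    received, count, seconds = demo()
    print(f"received {received} bytes, {count} congestion episodes, "
          f"{seconds:.4f} s congested")
```

The number of congestion episodes varies by operating system and buffer size, which is the behavior the surrounding text describes.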

Windows behaves differently from Unix systems. Windows will accept all requested data even if the socket is non-blocking and the send size is larger than the TCP socket buffer. The send call returns quickly because Windows copies the data into another buffer. The next send call will return "resource temporarily unavailable" if the previous send has not completely drained. select() will return only when the previous send has drained. Then the next send() can go out. Thus for large sends, you will see alternating congestion and send. In contrast, there are many more short congestion episodes on Unix systems.

There is another kind of longer congestion in real HADR. The standby replays logs directly from the log receive buffer. If the standby can not replay logs fast enough, its receive buffer will become full, unable to receive more logs. Once the network pipeline fills up, the primary won't be able to send. In peer state, such congestion will block transactions on primary. The congestion will last until standby replay makes progress and the standby receives logs again.

For both pipeline full congestion and transient network throttle congestion, the OS returns the same "resource temporarily unavailable" error to the primary. The primary cannot differentiate the two kinds of congestion. It just reports "congestion" as the connection status.

The duration of the congestion may help the user differentiate the two kinds of congestion. When congestion is reported, HADR reports a "congested since" time. The duration of the congestion is the snapshot time minus the "congested since" time. If the duration is relatively long, such as more than a few seconds, then it is more likely to be pipeline full congestion. If the duration is short, then it can be either kind.

A more reliable way is to issue the "db2pd -hadr -db dbName" command on standby to check standby buffer use percentage. If it's full (100%), then the congestion is caused by standby not receiving data, rather than network throttle. The buffer use percentage field is new beginning in DB2 V8fp17, V91fp5 and V95fp2. In older releases, you need to contact IBM tech support to retrieve the field via db2pd internal options.

Differentiating the two kinds of congestion is important. For pipeline full congestion, you need to tune standby replay performance. For network throttling congestion, you need to tune the network if more throughput is needed.
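The decision logic above can be summarized in a small sketch. The 5 second cutoff is an assumed stand-in for "more than a few seconds". The inputs are values you would read yourself from the snapshot and from db2pd output, and the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

LONG_CONGESTION = timedelta(seconds=5)  # assumed cutoff for "a few seconds"


def congestion_duration(snapshot_time: datetime,
                        congested_since: datetime) -> timedelta:
    """Duration is the snapshot time minus the "congested since" time."""
    return snapshot_time - congested_since


def classify_congestion(duration: timedelta,
                        standby_buffer_pct: Optional[float] = None) -> str:
    """Guess the kind of congestion from the evidence available."""
    if standby_buffer_pct is not None:
        # The standby buffer percentage is the more reliable indicator.
        if standby_buffer_pct >= 100:
            return "pipeline full"
        return "network throttle"
    if duration > LONG_CONGESTION:
        return "probably pipeline full"
    return "undetermined"


if __name__ == "__main__":
    since = datetime(2009, 8, 5, 10, 0, 0)
    snapshot = datetime(2009, 8, 5, 10, 0, 30)
    duration = congestion_duration(snapshot, since)
    print(classify_congestion(duration))
    print(classify_congestion(duration, standby_buffer_pct=40.0))
```

When the buffer percentage is available it takes precedence, because duration alone cannot rule out either kind.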

Note: The simulator does not simulate replay speed. Any received logs are instantly consumed on the standby. Thus it will not encounter pipeline full congestion.

Primary blocking without congestion

In SYNC and NEARSYNC mode, primary logging can be blocked even if network status is not reported as congested. When standby receive buffer is full, in many cases, the primary can still send out one more flush. This flush will be buffered in the network pipeline between primary and standby. Standby can not fully receive it because its buffer is full. So standby can not ack. Thus primary will be stuck waiting for the ack message. When primary transactions are stuck, check standby receive buffer usage. If it is full (100%), then the cause is slow standby. Tuning or upgrading standby is the solution.

Known TCP issues

Windows bug for non-blocking send

Windows uses delayed ack for non-blocking TCP traffic. The receiving end does not ack immediately. The default delay is 200ms. When the send size is larger than the TCP socket buffer, the sender may wait 200ms on select(). This causes a serious problem for HADR. The solution is to disable delayed ack on Windows.

See
http://support.microsoft.com/kb/823764 , "Slow performance using nonblocking socket on Windows"
http://support.microsoft.com/kb/311833 , "TcpDelAckTicks not working on some Windows versions"
http://support.microsoft.com/kb/321098 , "Possible side effect of TcpDelAckTicks"

Note: Changing TCP socket buffer size will only help with sends smaller than the buffer size. DB2 systems often have log buffer and flush size so large that it is not practical to set socket buffer to the size. Thus the recommended fix is to disable delayed ack via Windows registry.

AIX Interface Specific Network Options

AIX supports Interface Specific Network Options (ISNO). This allows setting of interface specific options to override system level network options. So a machine with multiple network interfaces can have different options for different interfaces. See also http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.commadmn/doc/commadmndita/interfaces_options.htm

When ISNO is enabled, a socket is assigned system level options upon creation. When the connection is established, the OS may reassign certain options based on the actual interface used for the connection. With simhadr, you will see socket properties changing upon connection.

When you set socket buffer size in real HADR or simhadr, the OS honors the requested size. The adjustment upon connection will grant a buffer no smaller than the requested size.

By default ISNO is enabled on AIX. Check the ISNO setting if system level configuration does not seem to work. With ISNO enabled, system level configuration may be overridden by interface level configuration. If you are using an older version of DB2 that does not support HADR specific socket buffer sizes (registry variables DB2_HADR_SOSNDBUF and DB2_HADR_SORCVBUF are not supported), and there are multiple interfaces on the host machine, you can use ISNO to limit the OS level socket buffer size setting to the interface hosting the HADR connection.

Delayed Ack on AIX

AIX TCP receiving end sends ack (TCP layer ack) at the time a thread is reading the data from its socket buffer or, in the case the data isn't read, after a 200 milliseconds timeout. See tcp_nodelayack

The delayed ack could cause problems for HADR, although no HADR slowdown due to it has been observed. This behavior is discussed here only as a potential cause of network problems in an HADR environment.

The 200 ms timeout can be tuned via AIX network option "fasttimo". Its default value is 200 milliseconds and it can be lowered to 50 milliseconds.

Another way to disable delayed ack is turning on tcp_nodelayack, which would cause immediate ack regardless of whether a thread is reading data or not.

Reducing the delay time or disabling delayed ack can cause excessive packets on the network, so tuning must be done with care.

Diagnosing Intermittent Network Problems

If your HADR system is experiencing intermittent transaction slowdowns and the network is suspect, you can specify a target log rate via the -target option on simhadr to test the stability of the network. The target option throttles simhadr so that it does not flood your network during the test. You can then run the simulator for a sustained time, such as several days. You can specify the duration via the -t option, or use a very long -t time and stop the simulation manually by sending SIGINT (usually by pressing Control-C) to the primary process. The primary and standby will stop and print out the usual statistics upon the interrupt.

Then you may analyze the statistics for anything suspicious. Look for numbers far away from average in the statistics.
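One simple way to look for numbers far from the average is a standard deviation test, sketched below. It assumes you have already copied the values of one statistic into a list of numbers. It does not parse simhadr output, and the cutoff of 3 standard deviations is an arbitrary choice.

```python
import statistics


def find_outliers(samples, max_deviations=3.0):
    """Return (index, value) pairs lying more than max_deviations
    population standard deviations away from the mean."""
    if len(samples) < 2:
        return []
    mean = statistics.fmean(samples)
    spread = statistics.pstdev(samples)
    if spread == 0:
        return []  # all samples are identical
    return [(i, value) for i, value in enumerate(samples)
            if abs(value - mean) / spread > max_deviations]


if __name__ == "__main__":
    # Hypothetical per-interval round trip times in milliseconds
    round_trips_ms = [1.0] * 20 + [50.0]
    print(find_outliers(round_trips_ms))
```

With very few samples a single spike inflates the standard deviation and can hide itself, so this check is most useful on the long runs suggested above.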

HADR receive buffer size

The -hadrBufSize option sets the standby receive buffer size. The default is 4 times the flush size. This parameter is equivalent to the DB2 registry variable HADR_BUF_SIZE. In real HADR, this buffer allows standby replay to fall behind the primary log position by up to the size of the buffer, thereby absorbing spikes in primary load. Since simhadr does not simulate replay, this option usually has little impact on performance. It is useful only when the standby socket receive buffer size is large, the flush size is small, and the sync mode is ASYNC (or remote catchup is specified). In such cases, a larger -hadrBufSize may allow the standby to receive multiple flushes at once, potentially improving performance.

On most systems, changing -hadrBufSize will have little impact on performance. It is useful only when you suspect that the system is not performing because standby is receiving data in too many small pieces. Depending on network speed and how aggressively the OS combines multiple packets for receive calls, changing -hadrBufSize may not result in larger receive size. It only allows HADR to call recv with a larger buffer. It's still up to the OS to decide how much to fill the buffer before the recv call returns.

For SYNC and NEARSYNC modes, there can be at most one outstanding flush. The primary waits for an ack message once it sends out a flush. Standby can not receive more than one flush at a time. Thus larger -hadrBufSize has no impact in these modes.

Contact

For comments and questions on HADR simulator, post on DB2 forum at http://www-128.ibm.com/developerworks/forums/thread.jspa?threadID=231243

To request a binary for an additional platform, go to HADR_sim_platform
To submit a feature request, go to HADR_sim_feature
To submit a defect report, go to HADR_sim_defect

For comments on this web page, login to developerWorks, then click "Add comment" at bottom of this page.
