Thoughts on HDR Performance
The Spice Must Flow
Half 'n Half
Minimize the Secondary Checkpoint
High-availability data replication (HDR) has been part of IDS since 6.0. It has proved successful and is deployed at many customer sites. Yet I still occasionally hear comments about performance issues when HDR is turned on. So I thought I'd share some thoughts on how to improve the performance of the HDR pair.
Frank Herbert - 'Dune'
In the novel "Dune", one phrase is constantly repeated: "The spice must flow". A similar thing must happen with HDR, except that with HDR it's "The Logs Must Flow". Let me explain by examining the following diagram.
HDR works by transferring the logs to the secondary, where the recovery component applies those logs. The secondary is in perpetual recovery mode.
The logs are transferred to the secondary by copying the log buffer into an HDR transmit buffer as part of the flush of the log buffer to disk. In synchronous mode, the HDR transmit buffer is immediately scheduled for transmission, and the thread that caused the log flush is held until the ACK of that transmission is received from the secondary.
In asynchronous mode, the HDR transmit buffer is not scheduled for transmission until either the transmit buffer is full or it has aged to the DRINTERVAL time limit.
While the HDR transmit buffer is the same size as the log buffer, the relationship is not one to one. If a log buffer is flushed with only part of the buffer filled (as is often the case with unbuffered committed transactions), then only part of the HDR transmit buffer is used. When the next log flush occurs, we can copy the new log pages into the remainder of that HDR transmit buffer.
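The filling and scheduling rules above can be sketched in a few lines of Python. This is only an illustrative model, not Informix code: the `TransmitBuffer` class, its field names, and the idea of passing the current time in as `now` are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TransmitBuffer:
    """Hypothetical model of one HDR transmit buffer (not Informix code)."""
    capacity_pages: int                       # same size as the log buffer
    pages: List[str] = field(default_factory=list)
    first_fill_time: Optional[float] = None   # arrival time of the oldest unsent page

    def free_pages(self) -> int:
        return self.capacity_pages - len(self.pages)

    def append(self, log_pages: List[str], now: float) -> List[str]:
        """Copy as many flushed log pages as fit; return the ones that did not fit."""
        room = self.free_pages()
        taken, leftover = log_pages[:room], log_pages[room:]
        if taken and self.first_fill_time is None:
            self.first_fill_time = now
        self.pages.extend(taken)
        return leftover

    def should_send(self, synchronous: bool, dr_interval: float, now: float) -> bool:
        """Decide whether the buffer should be scheduled for transmission."""
        if not self.pages:
            return False
        if synchronous:
            # Synchronous mode: schedule immediately after every log flush.
            return True
        # Asynchronous mode: wait until the buffer is full or has aged enough.
        is_full = self.free_pages() == 0
        has_aged = (now - self.first_fill_time) >= dr_interval
        return is_full or has_aged
```

A partially filled buffer stays eligible for more pages from the next log flush, which is why a single transmit buffer can carry the contents of several small flushes.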
The HDR transmit buffer is sent to the secondary using a half-duplex protocol. That means that we cannot send the next buffer until we receive an ACK for the previous transmission. While this may increase the delay in sending a log buffer, it is also necessary to ensure one of the main characteristics of HDR: the assurance that the secondary can take on the role of the primary with no loss of committed transactions.
In order to receive the HDR transmit buffer, the HDR receive thread must first obtain a buffer from the HDR receive buffer pool. If it cannot obtain a buffer, it will wait until the recovery threads place an empty buffer into the pool. When the receive thread receives a buffer from the primary, it will ACK the buffer and then queue the received buffer of log pages to the recovery component. While the reality is a bit more complex, for our purposes we will think of the recovery threads as consisting of a main recovery thread and a bunch of worker recovery threads.
The main recovery thread splits each log page into log records and then queues each log record to one of the worker recovery threads based on the partition number of that log record. All of the log records for a given partition are applied serially by the same worker recovery thread. Some log records are processed by the main recovery thread, and only after all of the worker threads have processed the log records queued to them. These can be thought of as globally serialized log records. The checkpoint log record is one such log record.
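As a rough sketch of that dispatch logic, the following Python function plans how a stream of log records would be handed out. It is a simplified model under stated assumptions: records are represented as hypothetical `(partition_number, record_type)` tuples, the worker is chosen with a simple modulo on the partition number (the text only says the choice is based on the partition number), and `"CHECKPOINT"` is an illustrative label for a globally serialized record.

```python
from collections import defaultdict

# Illustrative label; the real set of globally serialized record types is
# not listed in the text.
GLOBAL_RECORD_TYPES = {"CHECKPOINT"}


def plan_apply(log_records, num_workers):
    """Return an ordered schedule of apply steps.

    Each step is either ("workers", {worker_id: [records...]}), meaning the
    workers apply their queues in parallel, or ("main", record), meaning the
    main recovery thread applies the record after all workers have drained.
    """
    schedule = []
    queues = defaultdict(list)
    for rec in log_records:
        partnum, rec_type = rec
        if rec_type in GLOBAL_RECORD_TYPES:
            # Barrier: everything queued so far must finish first.
            if queues:
                schedule.append(("workers", dict(queues)))
                queues = defaultdict(list)
            schedule.append(("main", rec))
        else:
            # Assumed mapping: same partition always lands on the same worker,
            # so its records are applied serially and in order.
            queues[partnum % num_workers].append(rec)
    if queues:
        schedule.append(("workers", dict(queues)))
    return schedule
```

The model makes the cost of a globally serialized record visible: it splits the stream into phases, and no worker can start on the next phase until the main thread has finished with that record.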
After the main recovery component is finished with an HDR buffer, that buffer is returned to the HDR receive buffer pool.
From this we can see that if we don't process things efficiently on the secondary, then we will not be able to return an HDR receive buffer to the receive buffer pool quickly. If we can't return buffers to the HDR receive buffer pool quickly, then the receive thread will not easily get an HDR receive buffer in which to receive the data transmission. If we don't receive a transmission quickly, then we can't send the next buffer from the source, because the transmission is half-duplex. If we can't send a buffer from the source, then we can't return that full HDR transmit buffer to the HDR transmit buffer pool. If we can't return the HDR transmit buffer to the pool quickly, then the log flush logic is unable to get an empty HDR transmit buffer easily. And if the log flush logic cannot easily get an empty HDR transmit buffer, then logging will be blocked. In other words, a problem with the apply on the secondary can flow back and impact the primary. So as Mr. Herbert said, "The spice must flow".
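The chain of dependencies above can be illustrated with a toy, tick-based simulation. Everything here is an assumption made for the sketch: the `simulate` function, the idea that the primary tries to flush one full buffer per tick, that at most one buffer is transferred per tick (standing in for the half-duplex send and ACK), and that the secondary applies one buffer at a time taking `apply_ticks` ticks each. It is not a model of real Informix timing.

```python
def simulate(ticks, tx_buffers, rx_buffers, apply_ticks):
    """Count how many log flushes on the primary are blocked.

    tx_buffers  -- size of the HDR transmit buffer pool (primary)
    rx_buffers  -- size of the HDR receive buffer pool (secondary)
    apply_ticks -- ticks the secondary needs to apply one received buffer
    """
    free_tx, pending_tx = tx_buffers, 0   # primary: empty / full transmit buffers
    free_rx, queued_rx = rx_buffers, 0    # secondary: empty / waiting-for-apply buffers
    applying_left = 0                     # ticks left on the buffer being applied
    blocked = 0

    for _ in range(ticks):
        # Secondary recovery: finish the current buffer, then pick up the next.
        if applying_left > 0:
            applying_left -= 1
            if applying_left == 0:
                free_rx += 1              # buffer goes back to the receive pool
        if applying_left == 0 and queued_rx > 0:
            queued_rx -= 1
            applying_left = apply_ticks

        # Transfer: needs a full transmit buffer and an empty receive buffer.
        if pending_tx > 0 and free_rx > 0:
            pending_tx -= 1
            free_tx += 1                  # ACK received, transmit buffer is reusable
            free_rx -= 1
            queued_rx += 1

        # Primary log flush: needs an empty transmit buffer, else logging blocks.
        if free_tx > 0:
            free_tx -= 1
            pending_tx += 1
        else:
            blocked += 1

    return blocked
```

With a fast apply (`apply_ticks=1`) and two buffers in each pool, the model never blocks a log flush. Slowing the apply to three ticks per buffer with the same pools causes the primary to start missing flushes within a few ticks, even though nothing on the primary changed.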
While this does increase the log consumption, it also ensures that the index transfer does not use all of the transfer buffers, and it allows non-index log pages to be intermixed with the index log pages. This tends to equalize the impact of the index transfer with the user esql threads.
There are two main flushing algorithms that we use. The checkpoint uses 'chunk writes', which involves first sorting the pages for a specific chunk. By doing that, we are able to take advantage of write buddy bunching and thus flush the buffers more quickly. The other form of flushing, LRU writes, is not as efficient as chunk writes but is performed in between checkpoints to keep the buffers relatively clean. You can reduce the time that it takes to perform the checkpoint by decreasing the lru_min_dirty and lru_max_dirty items in the BUFFERPOOL. While this will decrease the work done during a checkpoint, it will also increase the work done in between checkpoints. Also, to maximize performance, you might want to consider making the number of lrus about 2 times the number of CPU VPs. This should make it a bit easier to perform LRU page flushing. We are currently examining ways in which we can minimize the impact of the checkpoint on the secondary server.
The number of recovery threads is determined by the onconfig parameter OFF_RECVRY_THREADS. As a rule of thumb, there should be at least 3 times as many recovery threads as there are CPU VPs. The reason is that 1) the log records are spread across all of the recovery threads and 2) there is an increased probability of having to do a read into the buffer in order to process a log record. If we only have as many recovery threads as CPU VPs, then a lot of time will be spent waiting for read completion. By increasing the number of recovery threads to at least 3 times the number of CPU VPs, we increase the probability of being able to work on another log record while waiting for an IO to complete.
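The two rules of thumb mentioned so far (lrus at about 2 times the number of CPU VPs, and OFF_RECVRY_THREADS at 3 times or more) are simple arithmetic, shown here as a small helper. The function name `suggest_settings` and the returned dictionary layout are hypothetical, and the results are starting points to tune from, not fixed recommendations.

```python
def suggest_settings(cpu_vps):
    """Apply the rules of thumb from the text to a given number of CPU VPs."""
    if cpu_vps < 1:
        raise ValueError("cpu_vps must be at least 1")
    return {
        # About 2x CPU VPs, set via the lrus item of BUFFERPOOL.
        "lrus": 2 * cpu_vps,
        # At least 3x CPU VPs; this is the minimum, more may help.
        "OFF_RECVRY_THREADS": 3 * cpu_vps,
    }
```

For example, a secondary with 4 CPU VPs would start from 8 lrus and at least 12 recovery threads.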
If there are indexes on a table, then we should make sure that each index is located in a different partition than the data (i.e., IDS 6 style indexes). With version 5 style indexes, the index is in the same partition as the data. Since the log records are handed out based on the partition, the index log records and the data log records then have to be processed by the same recovery thread, which decreases the ability to utilize all of the resources on the secondary. In other words, if the indexes are in the same partition as the data pages, the apply is done serially.
Another thing that can be done to improve apply is to partition tables which are highly updated. Again, by fragmenting both the data and the indexes, we can maximize the degree of parallelism.