IDS 11.50xC4 Enhancements
XML ATS/RIS files
New Event Alarms
Delete Wins
Background Sync / Check
Named Tasks
Cdr Stats
Parallel Sync / Check
Checks with in-flight data
Repair Verification
Role Separation
We are getting fairly close to releasing IDS 11.50xC5, and it contains
some fairly significant replication enhancements. However, before I
get into the new items in 11.50xC5, I want to spend a bit of time
discussing the things which were added in 11.50xC4. Think of this as a
'prelude' to the new stuff which will be in 11.50xC5. In this posting,
I'm only going to give a brief description of each enhancement. Later
postings will go into more detailed usage of each one.
XML ATS/RIS files
When a transaction cannot be fully applied on a node, an ATS or RIS
file can be generated. This file identifies the rows which could not
be applied and the nature of the failure. Up until now, the ATS/RIS
file has been a simple text file. It does contain the information
needed to perform analysis and repair on the failed apply, but since it
is in a text format, it is a bit tedious for a user application to
parse. In 11.50xC4, we have made it possible to create an XML document
instead of a basic text file. This makes it easier for applications,
especially Java applications, to process the ATS/RIS file.
This enhancement is activated by a new option on the cdr define server
and cdr modify server commands:
-X, --atsrisformat=[text|xml|both]   ATS and RIS file format
New Event Alarms
We have added several new event alarms to ER. These include
state change alarms as well as alarms which fire as part of the
creation of an ATS/RIS file. By combining the XML ATS/RIS
files with the new alarms, it is fairly easy to add user-written hooks
which automatically perform analysis on any apply failure.
Delete Wins
With timestamp conflict resolution, if a row is missing on a target
node and an update operation is received, the apply will convert the
update into an insert operation. This may be fine in most cases, but
in others it means that a row may reappear after it has been deleted.
To prevent a deleted row from reappearing, the replicate can be defined
with conflict resolution set to 'deletewins'. Other than the fact that
deletewins does not convert an update into an insert, it behaves the
same as timestamp conflict resolution.
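To make the difference concrete, here is a minimal Python sketch of how an incoming update might be handled under the two rules. It is only an illustration of the behavior described above, not the ER apply code: the table is modeled as a plain dict, the names Update and apply_update are made up for this example, and the "strictly newer timestamp wins" comparison is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class Update:
    key: int
    value: str
    timestamp: int


def apply_update(table, op, rule):
    """Apply a replicated update to `table`, a dict of key -> (value, timestamp).

    `rule` is either "timestamp" or "deletewins" (hypothetical labels).
    """
    if rule not in ("timestamp", "deletewins"):
        raise ValueError("unknown conflict resolution rule: " + rule)
    current = table.get(op.key)
    if current is None:
        if rule == "timestamp":
            # Row is missing: the update is converted into an insert.
            table[op.key] = (op.value, op.timestamp)
            return "inserted"
        # deletewins: the row stays deleted, no update-to-insert conversion.
        return "not_applied"
    if op.timestamp > current[1]:
        # Assumption for this sketch: the strictly newer change wins.
        table[op.key] = (op.value, op.timestamp)
        return "updated"
    return "not_applied"


late_update = Update(key=7, value="new", timestamp=100)
ts_table, dw_table = {}, {}
ts_result = apply_update(ts_table, late_update, "timestamp")
dw_result = apply_update(dw_table, late_update, "deletewins")
print(ts_result, ts_table)
print(dw_result, dw_table)
```

With the timestamp rule the deleted row comes back; with deletewins the target table stays empty.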
Background Sync/Check
The cdr check and cdr sync commands can now be executed as a background
task using the server admin component. Up until now, the cdr check and
cdr sync commands ran as foreground tasks which tied up the invoker's
session. To request that the check and/or sync be done as a background
task, the user simply includes the --background (-B) option on the cdr
check or cdr sync command. This option is available for both replicate
and replset commands. If doing a background sync/check, it is wise to
also run the check and/or sync as a named task (see below).
Named Tasks
By making the sync/check a named task, we make it possible to view the
progress of the sync or check command when it is performed as a
background task. To name the sync or check command, use the
--name=<task_name> option on the sync or check command. To monitor the
progress of the task, use the cdr stats check or cdr stats sync
command (see below).
cdr stats check/sync
The cdr stats check and cdr stats sync commands allow the user to
monitor the progress of a running named check and/or sync task. As
part of the cdr stats command, we also display an estimated completion
time based on the rate at which we are processing data and the amount
of data to be processed. Since the command has a repeat option, we can
also see a running progress indication as the work is being done.
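The idea behind an estimate like this can be shown with a few lines of Python. This is only a sketch of the arithmetic implied by the description (rate so far, applied to what is left); the function name and the row-based units are assumptions for the example, and the real cdr stats calculation may differ.

```python
def estimate_remaining_seconds(rows_done, rows_total, elapsed_seconds):
    """Estimate the time left from the average processing rate so far.

    Returns None when no rate can be computed yet.
    """
    if rows_done <= 0 or elapsed_seconds <= 0:
        return None
    rate = rows_done / elapsed_seconds          # rows per second so far
    remaining_rows = max(rows_total - rows_done, 0)
    return remaining_rows / rate


# 2,500 of 10,000 rows processed in 50 seconds: 50 rows/s, 7,500 rows left.
remaining = estimate_remaining_seconds(2500, 10000, 50.0)
print(remaining)  # 150.0
```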
Parallel sync/check of replsets
When performing a sync or check of replsets, we have added the ability
to perform the operation in parallel. We do not perform the check of
an individual replicate in parallel, but we do perform the check/sync
of multiple replicates within a replset in parallel. This enhancement
is invoked by using the --process=### (-p) option on the cdr sync
replset or cdr check replset command. The parameter to the --process
option is the number of processes which will be spawned to perform the
check or sync. As a rule of thumb, the number of processes should not
exceed the number of processors available.
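The shape of this parallelism can be sketched in Python as follows. Each replicate is still checked by a single worker, and several replicates are in progress at once. Note the assumptions: the sketch uses a thread pool rather than spawned processes, check_replset and check_one are hypothetical names, and capping the worker count at the processor count is just one way to follow the rule of thumb.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def check_replset(replicates, processes, check_one):
    """Run check_one(replicate) for every replicate, several at a time.

    A single replicate is never split up; only different replicates
    of the set run concurrently.
    """
    cpu_count = os.cpu_count() or 1
    workers = max(1, min(processes, len(replicates), cpu_count))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(check_one, replicates))
    return dict(zip(replicates, results))


# Stand-in check that just reports a made-up status per replicate.
report = check_replset(["repl_a", "repl_b", "repl_c"], 2,
                       lambda name: name + ": in sync")
print(report)
```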
In-Flight Data
One of the problems with performing cdr check has been that it did not
consider in-flight operations. It needs to be understood that on an
active system, data will always be in a state of flux and that data on
the various nodes is always somewhat 'out of sync'. The degree of this
is more or less determined by the latency between the nodes. Up to
now, cdr check did not consider in-flight operations and would report
rows as being out of sync when in fact the only problem was that the
update operation had simply not yet been received on one of the nodes.
We have added a 'recheck' for rows which we think are out-of-sync. By
default, we only recheck once and then only after one second has
passed. This recheck can be adjusted by using the --inprogress=### (-i)
option, where ### represents the number of seconds that we will keep
retrying the check before we consider the row to be truly out-of-sync.
Using this option should reduce, if not eliminate, the false failures
that would otherwise be reported from a cdr check operation.
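The recheck logic described above might look roughly like this in Python. It is a sketch under assumptions, not the cdr check implementation: row_in_sync and compare are hypothetical names, the one-second retry interval is assumed, and the clock and sleep functions are passed in only so the example can be exercised without real waiting.

```python
import time


def row_in_sync(compare, inprogress_seconds=1.0, interval=1.0,
                sleep=time.sleep, clock=time.monotonic):
    """Return True if compare() reports a match, retrying for a while.

    compare() stands in for comparing one row across two nodes. A row
    is only reported as out of sync after the retry window has passed.
    """
    if compare():
        return True
    deadline = clock() + inprogress_seconds
    while True:
        sleep(interval)        # give in-flight operations time to arrive
        if compare():
            return True
        if clock() >= deadline:
            return False       # still different: treat as truly out of sync


# Simulated clock so the example runs instantly.
now = [0.0]
answers = iter([False, False, True])   # the row matches on the third look
caught_up = row_in_sync(lambda: next(answers), inprogress_seconds=5,
                        sleep=lambda s: now.__setitem__(0, now[0] + s),
                        clock=lambda: now[0])
print(caught_up)  # True
```

With the defaults in this sketch, a mismatching row is rechecked exactly once, one second later, which mirrors the default behavior described above.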
Repair Verification
When performing a check with repair (cdr check repl/replset --repair),
we now perform verification of any out-of-sync rows which were
repaired. This is subject to the --inprogress option (see above). If
we are not able to verify some of the repair work which was done, then
we will display the rows which could not be repaired and successfully
verified.
Role Separation
Up to now, all ER administrative commands had to be performed by user
informix. With 11.50xC4, we have extended that to support any
user having a DBA role.
Categories: [ ER ]
Jul 04 2009, 02:59:43 PM EDT
Constraints and ER
One of the characteristics of ER is that it uses deferred constraint
checking. This means that constraints are checked as part of the
commit and not as part of the update operation. Deferred constraint
checking has a huge advantage for ER because it means we can
dynamically increase the parallelism of the apply while still
supporting constraints such as referential integrity. For instance,
this allows us to apply a transaction which creates a parent row and a
separate transaction which creates a child row at the same time. All
we have to do is ensure that the 'parent transaction' commits prior to
the 'child transaction'. Since ER uses deferred constraint checking,
by coordinating the order of commits, we can make sure that the parent
row will exist before the deferred constraint checking is performed by
the child transaction. This significantly improves the apply
performance of ER as it allows the apply to take maximum advantage of
the resources which are available on the target node.
The transactions on the source were governed by the application. The
application controlled the order of the activities on the source node.
That means the resource usage on the source was limited to what the
application could take advantage of. Usually this means serialized
operations. However, the target has no such restriction and thus can
take advantage of all resources. That's why ER can generally catch up
after an outage in a much shorter time than the original activity
took.

Let's see how this works. Figure 1 shows what will generally occur on
a source node when a session performs two transactions. The first
inserts a parent row into one table and then commits. The session then
opens a subsequent transaction in which it inserts a child row into
another table which has a relationship with the parent table
(referential integrity). This requires that two serialized
transactions be executed.
On the other hand, figure 2 describes what will happen on the target
node. ER does not know that the original work was done by a single
session within two transactions, nor does it care. The goal is to
apply the operations on the target as quickly as possible. That means
that the ER apply transactions must take advantage of all of the
resources available. So what we do is sense that there is a
referential integrity relationship between the parent and child tables
and then guarantee that the parent transaction will commit prior to the
child transaction. Since ER is using deferred constraint checking, the
constraint rules are checked during the commit. Therefore, the apply
is able to overlap the transactions and only serialize the commits.
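Here is a small Python sketch of "overlap the work, serialize the commits". It is an illustration only and not how the ER apply is implemented: apply_transactions is a made-up name, threads stand in for apply threads, and appending to a list stands in for the deferred constraint check plus commit.

```python
import threading
import time


def apply_transactions(txns):
    """txns: list of (name, work_fn, parent_name_or_None).

    The work of all transactions runs concurrently; a transaction only
    commits after the transaction it depends on has committed.
    """
    committed = {name: threading.Event() for name, _, _ in txns}
    commit_log = []
    lock = threading.Lock()

    def run(name, work, parent):
        work()                           # row operations overlap freely
        if parent is not None:
            committed[parent].wait()     # hold our commit behind the parent's
        with lock:
            commit_log.append(name)      # stand-in for constraint check + commit
        committed[name].set()

    threads = [threading.Thread(target=run, args=t) for t in txns]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return commit_log


# The child finishes its work first, but still commits second.
order = apply_transactions([
    ("child", lambda: None, "parent"),
    ("parent", lambda: time.sleep(0.05), None),
])
print(order)  # ['parent', 'child']
```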
It is generally best that systems using ER use constraints in general -
especially if there are unique indexes besides the primary key.
By making those unique indexes into unique constraints, ER is
better able to ensure that the transaction will be successfully applied
when the unique columns are being updated. In recent versions
of ER, we have added startup warnings if we detect that a unique index
exists which is not part of a unique constraint.
Categories: [ ER ]
Jul 01 2009, 11:19:11 AM EDT
HDR Performance Thoughts
The Spice Must Flow
Indexed Spices
Half 'n Half
Minimize the Secondary Checkpoint
Maximize Parallelism
High-Availability Data Replication (HDR) has been part of IDS since
6.0. It has proved successful and is deployed at many customer sites.
Yet I still occasionally hear comments about performance issues when
HDR is turned on. So I thought I'd discuss some thoughts about how to
improve the performance of the HDR pair.
"The Spice Must Flow"
Frank Herbert - 'Dune'
In the novel "Dune", there is a phrase which is constantly repeated -
"The spice must flow". Well - a similar thing must happen with HDR,
except with HDR it's "The Logs Must Flow". Let me explain by examining
the following diagram.
HDR works by transferring the logs to the secondary, where the recovery
component applies those logs. The secondary is in perpetual recovery
mode.
The logs are transferred to the secondary by copying the log buffer
into an HDR transmit buffer as part of the flush of the log buffer to
disk.
If using synchronous mode of HDR, then the HDR transmit buffer is
immediately scheduled for transmission and the thread which caused the
log flush is held until the ACK of that transmission is received from
the secondary.
If using asynchronous HDR, then the HDR transmit buffer is not
scheduled for transmission until either the transmit buffer is full or
it has aged to the DRINTERVAL time limit.
While the HDR transmit buffer is the same size as the log buffer, it is
not a 1 to 1 relationship. If a log buffer is flushed with only part
of the buffer filled (as is often the case with unbuffered committed
transactions), then only part of the HDR transmit buffer is used. When
the next log flush occurs, we can then copy the new log pages into the
remainder of that HDR transmit buffer.
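The scheduling rules above can be summarized in a short Python model. This is a sketch based only on the description in this post: the class and method names are invented for the example, sizes are counted in pages, and the real server logic is certainly more involved.

```python
class HdrTransmitBuffer:
    """Toy model of one HDR transmit buffer (all names are hypothetical)."""

    def __init__(self, capacity_pages, drinterval_seconds):
        self.capacity_pages = capacity_pages
        self.drinterval_seconds = drinterval_seconds
        self.used_pages = 0
        self.oldest_copy_time = None   # when the first unsent page arrived

    def copy_from_log_flush(self, pages, now):
        """Copy flushed log pages in; return how many pages did not fit."""
        fit = min(pages, self.capacity_pages - self.used_pages)
        if fit > 0 and self.used_pages == 0:
            self.oldest_copy_time = now
        self.used_pages += fit
        return pages - fit

    def ready_to_send(self, now, synchronous=False):
        if self.used_pages == 0:
            return False
        if synchronous:
            return True                # scheduled immediately
        full = self.used_pages >= self.capacity_pages
        aged = now - self.oldest_copy_time >= self.drinterval_seconds
        return full or aged            # asynchronous: full or aged out


buf = HdrTransmitBuffer(capacity_pages=8, drinterval_seconds=30)
buf.copy_from_log_flush(3, now=0)            # a partly filled log buffer
async_ready_early = buf.ready_to_send(now=10)
sync_ready_early = buf.ready_to_send(now=10, synchronous=True)
leftover = buf.copy_from_log_flush(6, now=12)  # next flush fills the rest
async_ready_full = buf.ready_to_send(now=12)
print(async_ready_early, sync_ready_early, leftover, async_ready_full)
```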
The HDR transmit buffer is sent to the secondary using a half-duplex
protocol. That means that we cannot send the next buffer until we
receive an ACK for the previous transmission. While this may increase
the delay in sending a log buffer, it is also necessary to ensure one
of the main characteristics of HDR: the assurance that the secondary
can take on the role of the primary with no loss of committed
transactions.
In order to receive the HDR transmit buffer, the HDR receive thread
must first obtain a buffer from the HDR receive buffer pool. If it
cannot obtain a buffer, it will wait until the recovery threads place
an empty buffer into the pool. When the receive thread receives a
buffer from the primary, it will ACK the buffer and then queue the
received buffer of log pages to the recovery component. While it is a
bit more complex, for our purposes we will think of the recovery
threads as consisting of a main recovery thread and a bunch of worker
recovery threads.
The main recovery thread will split the log pages into log records and
then queue each log record to one of the worker recovery threads based
on the partition number of that log record. All of the log records for
a given partition will be applied serially by the same worker recovery
thread. Some log records are processed by the main recovery thread
only after all of the worker threads have processed the log records
queued to them. These can be considered globally serialized log
records. The checkpoint log record is one such log record.
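The dispatching can be pictured with a few lines of Python. This is a sketch only: the post says the worker is chosen "based on the partition number", and the modulo mapping used here is an assumption made for the example, as are the names dispatch and worker_count.

```python
from collections import defaultdict


def dispatch(log_records, worker_count):
    """Queue (partition_number, payload) records to workers by partition.

    Records for the same partition always land on the same worker, in
    their original order, so they are applied serially.
    """
    queues = defaultdict(list)
    for partnum, payload in log_records:
        worker = partnum % worker_count   # assumed mapping, for illustration
        queues[worker].append((partnum, payload))
    return dict(queues)


records = [(10, "insert"), (11, "insert"), (10, "update"), (13, "delete")]
queues = dispatch(records, worker_count=3)
print(queues)
```

In this run both records for partition 10 end up, in order, on the same worker, while partition 11 is handled by a different one.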
After the main recovery component is finished with an HDR buffer, that
buffer is placed in the HDR receive buffer pool.
From this we can see that if we don't process things efficiently on the
secondary, then we will not be able to return an HDR receive buffer to
the HDR receive buffer pool quickly. If we can't return receive
buffers to the pool quickly, then we will not be able to easily get an
HDR receive buffer in which to receive the data transmission. If we
don't receive a transmission quickly, then we can't send the next
buffer from the source due to the half-duplex transmission. If we
can't send a buffer from the source, then we can't return that full HDR
transmit buffer to the HDR transmit buffer pool. If we can't return
the HDR transmit buffer to the HDR transmit buffer pool quickly, then
the log flush logic is unable to get an empty HDR transmit buffer
easily. If the log flush logic is not able to easily get an empty HDR
transmit buffer, then logging will be blocked. A problem with the
apply on the secondary can back flow and impact the primary. So as Mr.
Herbert said, "The spice must flow".
Indexed Spices
There is an additional consideration, and that's what happens when an
index is created. At the end of the index creation, we transfer the
index to the secondary server. This is done by having the thread which
created the index copy the newly created index into an HDR transfer
buffer and send the index to the secondary. Because of the increased
usage of the transmission buffers by the index transfer, there can be
some degradation in the log transfer when an index is created. One of
the things that we did with IDS 11 was to implement a feature called
Index Page Logging. This allows the transfer of the index to be done
by placing the index pages into the log itself. To avoid a long
transaction, the index page logging is actually done within multiple
transactions.
While this does increase the log consumption, it also ensures that the
index transfer does not use all of the transfer buffers and allows
non-index log pages to be intermixed with the index log pages. This
tends to equalize the impact of the index transfer with the user esql
threads.
Half 'n Half
As was mentioned earlier, the transmission of HDR buffers to the
secondary uses a half-duplex protocol. This is absolutely critical in
order to support failover with no loss of committed transactions, but
is also critical if we want to do a 'flip-flop'. That is the case when
the secondary becomes the primary and the primary becomes the
secondary. With IDS 11, we implemented the RSS secondary, which does
not use the half-duplex protocol, but instead uses full duplex with
flow control. This eliminates the impact of the half-duplex protocol.
Additionally, the RSS node does not block the log flush threads.
Because of this, an RSS node is worth considering if you are willing to
use asynchronous mode with HDR.
Minimize the Secondary Checkpoint Time
Since the checkpoint is a globally serialized operation on the
secondary, care must be taken that the checkpoint does not cause a back
flow. When the checkpoint is applied on the secondary, it is a
blocking checkpoint. Most of the work of the secondary checkpoint is
involved in flushing all of the dirty buffers. There are two main
flushing algorithms that we use. The checkpoint uses 'chunk writes',
which involves first sorting the pages for a specific chunk. By doing
that, we are able to take advantage of write buddy bunching - and thus
flush the buffers quicker. The other form of flushing is called LRU
writes. These are not as efficient as chunk writes, but are performed
in between checkpoints to keep the buffer pool relatively clean. You
can improve the time that it takes to perform the checkpoint by
decreasing the lru_min_dirty and lru_max_dirty items in the BUFFERPOOL.
While this will decrease the work done during a checkpoint, it will
also increase the work done in between checkpoints. Also, to maximize
performance, you might want to consider making the number of lrus
about 2 times the number of CPUVPS. This should make it a bit easier
to perform LRU page flushing. We are currently examining ways in which
we can minimize the impact of the checkpoint on the secondary server.
Maximize Parallelism
We might get the log pages to the secondary really quickly and with
very few bottlenecks, but if we can't apply the log records on the
secondary as fast as on the primary, then we will run into back flow.
So it is very important to take advantage of all of the resources that
the secondary has so that its performance will at least match the
primary. That means that if we have 12 CPUVPS on the primary, we
probably need to have 12 CPUVPS on the secondary. But there is an
additional consideration. We need to be able to utilize the parallel
recovery apply to make it easier to maintain a balance.
The number of recovery threads is determined by the onconfig parameter
OFF_RECVRY_THREADS. As a rule of thumb, there should be at least 3
times as many recovery threads as there are CPUVPS. The reason is that
1) the log records are spread across all of the recovery threads and 2)
there is an increased probability of having to do a read into the
buffer in order to process a log record. If we only have as many
recovery threads as CPUVPS, then there is going to be a lot of time
spent waiting for read completion. By increasing the number of
recovery threads to at least 3X the number of CPUVPS, we increase the
probability of being able to work on another log record while waiting
for the IO completion on another.
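As a quick worked example of the two rules of thumb in this post (recovery threads at least 3 times the CPUVPS, lrus about 2 times the CPUVPS), here is a tiny Python helper. The function name is made up, and the numbers it returns are starting points based on those rules of thumb, not tuned values.

```python
def secondary_tuning_hints(cpu_vps, recovery_factor=3, lru_factor=2):
    """Turn the rules of thumb into starting values for a secondary."""
    if cpu_vps < 1:
        raise ValueError("cpu_vps must be at least 1")
    return {
        # Minimum suggested value for the OFF_RECVRY_THREADS parameter.
        "OFF_RECVRY_THREADS": cpu_vps * recovery_factor,
        # Approximate lrus value for the BUFFERPOOL setting.
        "lrus": cpu_vps * lru_factor,
    }


hints = secondary_tuning_hints(12)
print(hints)  # {'OFF_RECVRY_THREADS': 36, 'lrus': 24}
```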
If there are indexes on a table, then we should make sure that the
index is located in a different partition than the data (i.e. IDS 6
style of indexes). The reason for this is that with version 5 style
indexes the index is in the same partition as the data. Since the
passing out of the log records is based on the partition, if the index
is in the same partition as the data pages, then the index log records
and the data log records have to be processed by the same recovery
thread - which decreases the ability to utilize all of the resources on
the secondary. If the indexes are in the same partition as the data
pages, then the apply is done serially.
Another thing that can be done to improve the apply is to partition
tables which are highly updated. Again, by fragmenting both the data
and the indexes, we can maximize the degree of parallelism.
Categories: [ HDR | MACH11 ]
Jun 24 2009, 09:53:51 PM EDT