
The Replication Roundtable: replication solutions available with the Informix Dynamic Server

Madison is a Senior Technical Staff Member (STSM) with IBM and is the replication architect for the Informix Dynamic Server. He has not only been responsible for the development of much of the ER and MACH11 functionality, but he has also played significant roles for non-replication functionality such as large chunk and network encryption. He lives in Flower Mound, Texas with his wife, Colleen.



Tuesday May 06, 2008

Cheetah 2 is Released





By now you probably have heard about the release of IDS 11.50 (Cheetah 2).  The release was announced at the recent IIUG user conference which was held in Lenexa, Kansas  on April 28 - April 30.

I have been heavily involved in the development of Cheetah2, which is the main reason that I have not been active in the blogosphere recently.

There are many cool features which are part of Cheetah2, not the least of which is the expansion of work done in IDS 11 as part of MACH11.  I will be describing many of these new features in detail over the next few weeks, but thought that I'd do a quick overview in this blog entry.

Check out some of the new functionality at the web site ....  http://publib.boulder.ibm.com/infocenter/idshelp/v115/index.jsp

While mentioning external sites, I need to mention that IBM is currently taking a survey which should give us insight into your business priorities and experiences with IBM software.  To take this survey check out http://www-306.ibm.com/software/data/info/consumability-survey/


Expanded Ports


IDS 11.50 has now been ported to the Mac.  This was announced at Macworld, held in January.

New Communication Protocols

We have extended support for SSL.  In the past we supported encrypted communications, but we now support the complete SSL suite.  Also, we now have support for DRDA and JCC.  This will enable IDS to more easily support the same clients as are currently supported by DB2.

Single Sign On

We now have the ability to support a common authentication for multiple IDS servers.

Updatable Secondary Nodes

With IDS 11, we expanded the secondary types from the single HDR secondary node to include multiple secondary nodes (RSS)  as well as a secondary running on top of a shared disk (SDS).  This created the MACH11 cluster.  In IDS 11.50, we have added to the usability of the secondary node by making it possible to perform update activity on those secondary nodes - be they HDR, RSS, or SDS nodes.  This means that the  investment that has been made in availability solutions can  be used in much the same way as the normal primary node.

Expanding the Isolation of the Secondary Node


In the past, the read isolation on the secondary node was restricted to dirty-read isolation.  With IDS 11.50, we have expanded this to include committed read and last committed read isolation.  With the release, this is restricted to SDS nodes only, but will be expanded to RSS and HDR in the near future as more testing is done.

Connection Virtualization

We have added support for connection virtualization by implementing a connection manager.  The connection manager monitors the various nodes within the MACH11 cluster to determine the type of node, workload, availability, etc.  The customer can configure the connection manager by describing a class of service.  When the client application connects to the cluster, it connects to the defined class of service rather than to a specific server.  The connection manager will then route the connection to the best choice for that class of service, based on current workload.
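To make the routing idea concrete, here is a minimal sketch in Python. It is not the Connection Manager's actual algorithm or configuration syntax; the `Node` class, the `workload` metric, and the "report" class of service are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    node_type: str      # "PRIMARY", "HDR", "RSS", or "SDS"
    workload: float     # hypothetical load metric; lower means less busy
    available: bool = True


def route(nodes, allowed_types):
    """Return the least-loaded available node whose type is allowed
    by the class of service, or None if no node qualifies."""
    candidates = [n for n in nodes
                  if n.available and n.node_type in allowed_types]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n.workload)


cluster = [
    Node("serv1", "PRIMARY", 0.80),
    Node("serv2", "SDS", 0.35),
    Node("serv3", "RSS", 0.20, available=False),  # down, so never chosen
    Node("serv4", "HDR", 0.50),
]

# Hypothetical class of service "report": any secondary will do.
report_target = route(cluster, {"SDS", "RSS", "HDR"})
```

The client only names the class of service; which server it lands on is decided at connect time from the node type, availability, and current load.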

Failover Arbitrator

Part of the connection manager's job is to perform failover detection and transfer of functionality.  This is done by a simple set of rules which are part of the Connection Manager configuration.

OAT Enhancements

The Open Admin Tool has had quite a few enhancements.  It has had a design makeover of the overall layout and presentation.  It really looks a lot better than in the past.  In addition to the normal monitoring interfaces, it also now has some autonomics such as update statistics automation and alert management.

Addition to SQL

There are several new things which we have added to the SQL engine.  One of the key things is a row versioning indicator which can be used to support optimistic locking techniques.  Also we have added the ability to support dynamic query construction within basic SPL.    
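For readers unfamiliar with optimistic locking, the sketch below shows the general technique in Python. It does not use the IDS row versioning indicator itself: SQLite and a hand-maintained `version` column stand in for it, and the table and function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)"
)
conn.execute("INSERT INTO account VALUES (1, 100, 1)")


def update_balance(conn, acct_id, new_balance, seen_version):
    """Apply the update only if the row still carries the version the
    caller read earlier; return True on success, False if it was stale."""
    cur = conn.execute(
        "UPDATE account SET balance = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_balance, acct_id, seen_version),
    )
    return cur.rowcount == 1


# Two clients both read the row at version 1 and then try to update it.
first = update_balance(conn, 1, 150, seen_version=1)   # wins
second = update_balance(conn, 1, 90, seen_version=1)   # stale, rejected
```

No lock is held between the read and the write; the losing client simply notices that the row changed underneath it and can re-read and retry.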





Categories : [   IDS  ]

May 06 2008, 05:00:00 PM EDT Permalink



Friday February 15, 2008

Running MACH11 on a single machine


 
Well I owe everyone an apology.  It's been way too long since my last post.  I've been rather busy lately trying to get all of the stuff into Cheetah2 (the upcoming release), but still I should have posted something.  Sorry...

Many of you may have heard in a recent "Chat with the Labs" that we are in the beta process for Cheetah2.  Also, Jerry Keesee (the director of IDS development) mentioned that we would be starting an open beta shortly.  Some of you may have even already joined the open beta and are currently testing with Cheetah2.  So I thought that I'd spend a bit of time today describing how you can set up the MACH11 environment on a single server, be it HDR, RSS, or SDS.  This might be a good thing to discuss at this time because there is some new functionality for the MACH11 environment in Cheetah2.

I do most of my development on a Linux workstation.  It does not have a fancy shared disk subsystem.  It simply has the factory installed IDE disk.  I also do quite a bit of development on my laptop using VMWare running RedHat 4.  (Yes - you can run a MACH11 cluster on a virtual server.)  I can guarantee you that my laptop doesn't have any shared disk subsystem.

OK - what is the trick to make this work?  Well it's fairly simple - use relative path names.  The same technique will work on basic HDR on IDS versions released prior to IDS 11.

Let's see how I set up my environment for a primary node which supports an SDS node and an HDR secondary.
Directory Layout
First of all, let's look at the directory layout.  On my development system, all of my chunks are located under /db/IDS.  Under that directory, I set up a different directory for each of my servers.  In my case, I name my servers serv1, serv2, serv3, serv4, etc.  That means that I have a directory /db/IDS/serv1, /db/IDS/serv2, /db/IDS/serv3, etc.  Then within each of these directories, I set up my chunk files.  For instance, /db/IDS/serv1 would look something like what is displayed to the right.  (Of course I'm lazy and have a script which sets all of this up.)

So far there doesn't appear to be anything unusual about this. This is pretty much how most people will have set up a testing system.  But then let's examine the onconfig and see what it looks like.
Onconfig File


The first thing that we might notice is the ROOTPATH parameter.  I'm not using the fully qualified path name /db/IDS/serv1/rootchk.  Instead I'm only using the name of the file.  OK - so what does this mean?  Well to make this work, the only restriction is that when starting oninit, I must first be in the instance's directory of /db/IDS/serv1.  So in order to bring up the server, I first execute cd /db/IDS/serv1 and then execute oninit -iyv.  By using relative path names, I'm able to run with any of the MACH11 server types, be it HDR, RSS, or SDS.
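If you script your startup, the same "cd first, then oninit" rule can be expressed by setting the working directory of the child process. This is only a sketch: the `start_instance` helper and its `dry_run` switch are hypothetical, and it defaults to a dry run because oninit -i initializes the root dbspace.

```python
import subprocess
from pathlib import Path


def start_instance(instance_dir, flags="-iyv", dry_run=True):
    """Build (and optionally run) oninit with the instance directory as
    the working directory, so relative chunk paths such as a bare
    ROOTPATH file name resolve inside that directory."""
    cwd = Path(instance_dir)
    cmd = ["oninit", flags]
    if not dry_run:
        # Equivalent to: cd <instance_dir> && oninit <flags>
        subprocess.run(cmd, cwd=cwd, check=True)
    return cmd, cwd


cmd, cwd = start_instance("/db/IDS/serv1")  # dry run: nothing is started
```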


Onconfig Items Needing Change For Relative Paths

Now let's examine some of the other key parameters which might need to be modified.  To support shared disk secondary nodes (SDS), we might want to  modify SDS_TEMPDBS and SDS_PAGING to use relative path names.  In this example, I'm using the file sdstemp as my temporary DBSpace for shared disks and the files page1/page2 for my SDS paging files.  Also notice that I set my message log file to the file name log.  

OK - now let's see how the shared disk is set up.  I actually have the option to simply use the exact same path name on the SDS node as I use on the primary.  If all I want to do is to set up a primary and SDS nodes, then there is no reason to use relative path names.  I simply have to use the same entry for ROOTPATH on the SDS node as I do on the primary.  On Windows, this might be the easiest way to set up a MACH11 cluster.  However in my environment, I want to be able to set up both SDS and HDR/RSS.  Since running HDR and RSS on the same machine will require using relative paths, I will also set up SDS using relative path names.  So let's see how the 'instance directory' of the SDS node is set up.
Instance Directory for SDS node

Basically the only thing that we have to do to use relative path names for the SDS node and to have a SDS instance directory is to use links to point to where the primary chunks are located.  The SDS temporary dbspace chunks, paging file, and message log files are dynamically created as the SDS node is started.
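A script along the following lines could build such an SDS instance directory. It is a sketch under assumptions: it runs against a temporary directory rather than /db/IDS, the `make_sds_instance_dir` helper is hypothetical, and `datachk` is an invented chunk name (only `rootchk` appears in the text above).

```python
import tempfile
from pathlib import Path


def make_sds_instance_dir(primary_dir, sds_dir, chunk_names):
    """Create sds_dir and fill it with symbolic links that point at the
    primary's chunk files, one link per chunk name."""
    primary_dir, sds_dir = Path(primary_dir), Path(sds_dir)
    sds_dir.mkdir(parents=True, exist_ok=True)
    for name in chunk_names:
        link = sds_dir / name
        if not (link.is_symlink() or link.exists()):
            link.symlink_to(primary_dir / name)
    return sorted(p.name for p in sds_dir.iterdir())


# Stand-in for /db/IDS so the sketch can run anywhere.
base = Path(tempfile.mkdtemp())
primary = base / "serv1"
primary.mkdir()
for chunk in ("rootchk", "datachk"):   # datachk is a made-up chunk name
    (primary / chunk).touch()

linked = make_sds_instance_dir(primary, base / "serv2", ["rootchk", "datachk"])
```

Only the primary's chunks are linked; the SDS temporary dbspace, paging files, and message log are left for the server to create when the SDS node starts.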




It is a little more tricky to use relative path names on Windows because there the database server is run as a Windows service.  So what must be done is to bring up the instance by using the starts command in the correct directory rather than using the auto start functionality of the Windows service manager.  You can use the Windows instance manager to get things set up, but will not want to actually initialize the server.  Instead you will want to edit the instance's onconfig file to use the relative path names, get into the correct instance directory, and then run starts <instance_name> -iy.  This will have the same effect as having run oninit -iy on a UNIX type of system.

Additionally, on Windows there is an issue with setting up HDR and/or RSS because the physical recovery of the server will also try to start up the engine.  Physical recovery is performed by running ontape -p and is a normal step used to initialize the HDR/RSS secondary.  Since ontape -p will automatically start the server, there can be a problem with oninit not being in the correct directory because it is not started in quite the same way on Windows as on UNIX.  To get around this issue, I've used the following technique in the past to instantiate the HDR secondary on Windows.

On the Primary:
  1. onmode -d primary <secondary node>
  2. onmode -c
  3. onmode -ky

On the Secondary:
  1. Copy chunks from the primary instance directory to the secondary instance directory
  2. Perform a physical recovery (oninit -PHY)
  3. onmode -d secondary <primary node>

We don't document the oninit -PHY option and don't encourage its usage in a normal production environment.  It performs a physical recovery of the server which means that we only recover up to the checkpoint.  We do not perform any roll forward of the logical logs.  So in normal production environments, its misuse can cause problems and possible loss of data - so if you should attempt to use this technique to set up an HDR environment on Windows, be aware of this.


Categories : [   HDR  |  MACH11  |  RSS  |  SDS  ]

Feb 15 2008, 10:20:00 AM EST Permalink



Tuesday December 11, 2007

cdr check correction

Correction to the cdr check Document in developerWorks

In the developerWorks article for enabling the cdr check functionality, there is an error in the compile script for AIX.  This article is located at http://www.ibm.com/developerworks/db2/library/techarticle/dm-0604pruet/ and is titled "Enable 'cdr check' functionality within IBM Informix Dynamic Server".  To enable cdr check in IDS 10, you must first install some UDRs to enable checksum generation.  It is not necessary to do this in IDS 11 because those UDRs are built into the product itself.

The part in error is  the following statement in the make file for AIX.  Instead of having...

 $(NM) -X64 -g checksum.o | sed "/ U /d" | cut -f1 -d" " | sed "/.o:/d" | sed -e "s/^\.//" | sort -u  < checksum.exp

we should have

 $(NM) -X64 -g checksum.o | sed "/ U /d" | cut -f1 -d" " | sed "/.o:/d" | sed -e "s/^\.//" | sort -u  > checksum.exp

That would mean that the correct make file for AIX64 would be:



#
# Compiler/Linker flags specific to AIX
#

CC = cc
LD = ld
NM = nm
CFLAGS = -q64 -shared -qchars=signed -D_H_LOCALEDEF -DINFX_ANSI -D_LARGE_FILES
PICFLAGS = -lm
SOFLAGS = -G -b64 -bnoentry

LIBSO = checksum.so

TARGETS =${LIBSO}

.SUFFIXES: .c .o


all: $(TARGETS)

checksum.so: checksum.c
	$(CC) $(CFLAGS) $(PICFLAGS) -I${INFORMIXDIR}/incl/public -I${INFORMIXDIR}/incl/ -c checksum.c
	$(NM) -X64 -g checksum.o | sed "/ U /d" | cut -f1 -d" " \
	| sed "/.o:/d" | sed -e "s/^\.//" | sort -u > checksum.exp
	$(LD) $(SOFLAGS) -bE:checksum.exp -o checksum.so checksum.o -lm -lc
	chmod 755 checksum.so

clean::
	rm -f ${LIBSO} *.o




Categories : [   cdr  ]

Dec 11 2007, 09:00:00 AM EST Permalink



Tuesday November 20, 2007

Identifying the Server Type


With the introduction of SDS and RSS, an IDS cluster can have a complex topology. DBA scripts as well as applications need to find out whether the server is stand-alone, primary, or secondary. That makes it important to understand the programmatic interfaces available for finding the type of server being accessed.

The following are ways to check the type of the server:

Administrative Utilities

The output from 'onstat -' prints the type of the server. For all secondary types, it will say Read-Only with the type HDR, RSS or SDS. For each secondary server type, there are onstat options to get more details:

  • 'onstat -g dri' prints HDR information
  • 'onstat -g sds' prints SDS information
  • 'onstat -g rss' prints RSS information
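A DBA script can pull the mode out of the onstat banner line. The sketch below is a Python illustration, not an official parser: the primary banner is copied from the DDRBLOCK entry further down this page, while the exact wording of the secondary banner ("Read-Only (RSS)") is an assumption based on the description above, so check it against your own server's output.

```python
def parse_onstat_header(line):
    """Split an onstat banner of the form
    '<product> Version <v> -- <mode> -- Up <hh:mm:ss> -- <n> Kbytes'."""
    parts = [p.strip() for p in line.split(" -- ")]
    if len(parts) != 4:
        raise ValueError("unexpected onstat banner: %r" % line)
    product, mode, uptime, memory = parts
    return {
        "version": product.split("Version")[-1].strip(),
        "mode": mode,
        "uptime": uptime.split()[-1],
        "kbytes": int(memory.split()[0]),
    }


def is_secondary(header):
    """Assumes secondaries report a mode beginning with 'Read-Only'."""
    return header["mode"].startswith("Read-Only")


primary_line = ("IBM Informix Dynamic Server Version 11.10.F       "
                "-- On-Line -- Up 00:26:03 -- 78772 Kbytes")
# Hypothetical secondary banner, for illustration only.
secondary_line = ("IBM Informix Dynamic Server Version 11.10.F       "
                  "-- Read-Only (RSS) -- Up 00:10:00 -- 78772 Kbytes")

primary = parse_onstat_header(primary_line)
secondary = parse_onstat_header(secondary_line)
```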

Sysmaster Database

On the primary, the sysha_nodes view contains all the server names with their types. On all secondary type servers, it has a single row with the primary server's entry.

Esql/c Client

A warning is set in SQLCA when the client connects to any secondary type server. The sqlwarn.sqlwarn6 flag is set to 'W'. Also the SQLSTATE is set to '01I06'. An application can look at this warning flag to determine whether the server is read-only or not.

JDBC Client

The Informix JDBC driver provides more direct APIs to check the server status. The Connection object supports three methods isReadOnly(), isHDREnabled() and getHDRtype().

  • isReadOnly() : Returns true if the active server is a secondary server
  • isHDREnabled() : Returns true if both servers in the HDR pair are available. Returns false if one of the servers is unavailable.
  • getHDRtype() : Returns primary or standard for a primary server, secondary for a secondary server

UDRs

C UDRs can use the mi_hdr_status() API to check the type of the server where the UDR is being executed. The return value should be checked for the bits MI_HDR_PRIMARY and MI_HDR_SECONDARY. These macros are defined in $INFORMIXDIR/public/milib.h. There is no direct way to do this from SPL or Java routines. One can instead query the sysmaster view mentioned above.

Sysdbopen() UDR

IDS 11.10 supports two DBA controlled routines, sysdbopen() and sysdbclose(). These procedures are run by the server on behalf of users when they connect to or disconnect from a database. One can create a sysdbopen() routine that checks the server type (using the mi_ API or a sysmaster query) and restricts databases or users on secondary servers.




Nov 20 2007, 06:09:14 PM EST Permalink



Friday November 16, 2007

DDRBLOCK


It sometimes happens that quite useful fixes and enhancements make it into a release but remain little-known. A few such fixes and enhancements made it into the 11.10xC2 server; together, these enhancements make the management of the CDR_QDATA_SBSPACE configuration and of DDRBLOCK mode much easier and more tenable than in the past.

The IDS server writes to logical log files in a circular fashion, overwriting older log files when a new log file needs to be written to and more than LOGFILES files (as specified in the $INFORMIXDIR/etc/$ONCONFIG configuration file) have been written to. DDRBLOCK occurs when new transactions writing to the log come dangerously close to wrapping the log space around and overwriting old logs that Enterprise Replication has yet to process. In older servers, if the system ever entered DDRBLOCK mode, it could be very difficult to get the system out of DDRBLOCK mode without restarting oninit.

More recent releases of Enterprise Replication -- certainly, version 10 and later -- should rarely enter DDRBLOCK mode, unless the system is severely misconfigured. An example of a dangerously misconfigured system would be one with too few log files, especially if some of the log files are quite large while others are quite small. With such a configuration, even a small hiccup when Enterprise Replication processes log entries can cause DDRBLOCK mode, or even worse, log wrap. If log wrap occurs, that is, if new transactions overwrite entries that Enterprise Replication has yet to process, Enterprise Replication shuts down and data becomes unsynchronized among servers in the replication system.

One condition in which Enterprise Replication can still enter DDRBLOCK mode even in an otherwise well-configured system is when a destination site remains inaccessible for an extended period of time. If this happens, the Reliable Queue Manager (RQM) send queue will save transactions that include that site in its destination list in stable storage. If the spool space fills, the oninit server will likely enter DDRBLOCK mode, because Enterprise Replication cannot stably store transactions in its send queue and therefore can no longer advance the replay position, the oldest point in the logs that Enterprise Replication needs to access.

As an example, I have configured a small two-server replication system. I configured the IDS instance at which I will be generating transactions with too few logs and too little send queue stable storage and used the 'cdr suspend serv' command to suspend the other server. Since transactions cannot flow to the destination server, transactions quickly start to accumulate in the send queue:

[pinch-cdrtempmurre] (pinch)  110 %  onstat -g rqm sendq | egrep '^ Txns'
 Txns in queue:             18
 Txns in memory:            7
 Txns in spool only:        11
 Txns spooled:              11
and as I configured very little send queue spool space, the spool space immediately fills up, as shown in the message log:
10:44:47  CDR QUEUER: Send Queue space is FULL - waiting for space in CDR_QDATA_SBSPACE
In this case, Enterprise Replication will also raise an alarm of severity 4 and class 31.

Since Enterprise Replication cannot advance the replay position, the IDS instance also enters DDRBLOCK state, as shown by the "Blocked:DDR" line in the following output:

[pinch-cdrtempmurre] (pinch)  129 % onstat -g ddr | head -10

IBM Informix Dynamic Server Version 11.10.F       -- On-Line -- Up 00:26:03 -- 78772 Kbytes
Blocked:DDR 

DDR -- Running --  

# Event  Snoopy   Snoopy   Replay   Replay   Current  Current 
Buffers   ID      Position  ID      Position   ID     Position
2064      4       1ee4454   3       74f018   12       2ad000 
We can see that the replay log id is 3, whereas the current log id to which IDS is writing transactions is 12. The fact that log 12 is the current log is also displayed by the onstat -l command:
[pinch-cdrtempmurre] (pinch)  132 % onstat -l | grep C | grep -v CDR
451f2c30         2        U---C-L  12       1:31763              9000      685     7.61
I configured my example instance to have only 10 logical log files, so if we cannot reuse logical log 3 and are already at log 12, we need 12 - 3 + 1 or all 10 logical log files. Small wonder the server is in DDRBLOCK mode!
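The arithmetic above is easy to script. The following Python sketch parses the data row of the onstat -g ddr output shown earlier and repeats the 12 - 3 + 1 calculation. The field names and the reading of the position columns as hexadecimal are assumptions made for illustration, and the helper functions are hypothetical.

```python
def parse_ddr_row(row):
    """Parse the data row printed under the 'Event Buffers / Snoopy /
    Replay / Current' header of onstat -g ddr."""
    f = row.split()
    return {
        "event_buffers": int(f[0]),
        "snoopy_id": int(f[1]),
        "snoopy_pos": int(f[2], 16),   # assumed to be hexadecimal
        "replay_id": int(f[3]),
        "replay_pos": int(f[4], 16),
        "current_id": int(f[5]),
        "current_pos": int(f[6], 16),
    }


def logs_pinned(replay_id, current_id):
    """Logical logs that cannot be reused yet: everything from the
    replay log through the current log, inclusive."""
    return current_id - replay_id + 1


ddr = parse_ddr_row("2064      4       1ee4454   3       74f018   12       2ad000")
needed = logs_pinned(ddr["replay_id"], ddr["current_id"])
LOGFILES = 10                      # the value used in this example instance
out_of_logs = needed >= LOGFILES   # True here: every log file is pinned
```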

The send queue stable storage area is configured via the CDR_QDATA_SBSPACE configuration parameter. 11.10xC2 and later include an addition to onstat that allows the sbspaces configured to CDR_QDATA_SBSPACE to be monitored very easily. The command is onstat -g rqm sbspaces:

onstat -g rqm sbspaces

IBM Informix Dynamic Server Version 11.10.F       -- On-Line -- Up 00:29:41 -- 78772 Kbytes
Blocked:DDR 


RQM Space Statistics for CDR_QDATA_SBSPACE:
-------------------------------------------
name/addr      number    used        free        total       %full   pathname
0x46581c58     5         311         1           312         100     /tmp/amsterdam_sbsp_base
amsterdam_sbsp_base5     311         1           312         100     

0x46e54528     6         295         17          312         95      /tmp/amsterdam_sbsp_2
amsterdam_sbsp_26        295         17          312         95      

0x46e54cf8     7         310         2           312         99      /tmp/amsterdam_sbsp_3
amsterdam_sbsp_37        310         2           312         99      

0x47bceca8     8         312         0           312         100     /tmp/amsterdam_sbsp_4
amsterdam_sbsp_48        312         0           312         100     
In the past, the information returned via the onstat -g rqm sbspaces command was available, but you had to gather it by looking at the CDR_QDATA_SBSPACE values and then manually extracting the information relevant to the CDR_QDATA_SBSPACE spaces from the onstat -d output. Imagine doing this in a "real" system with dozens of dbspaces!
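Because the new output is regular, it is also easy to monitor from a script. The Python sketch below reads the address lines (the ones starting with 0x) from the output shown above and totals the free space. The function name and dictionary keys are hypothetical, and the code makes no claim about what unit the used/free/total columns are in.

```python
SAMPLE = """\
0x46581c58     5         311         1           312         100     /tmp/amsterdam_sbsp_base
0x46e54528     6         295         17          312         95      /tmp/amsterdam_sbsp_2
0x46e54cf8     7         310         2           312         99      /tmp/amsterdam_sbsp_3
0x47bceca8     8         312         0           312         100     /tmp/amsterdam_sbsp_4
"""


def parse_rqm_sbspaces(text):
    """Collect one record per '0x...' line of onstat -g rqm sbspaces."""
    rows = []
    for line in text.splitlines():
        f = line.split()
        if len(f) == 7 and f[0].startswith("0x"):
            rows.append({
                "number": int(f[1]),
                "used": int(f[2]),
                "free": int(f[3]),
                "total": int(f[4]),
                "pct_full": int(f[5]),
                "path": f[6],
            })
    return rows


rows = parse_rqm_sbspaces(SAMPLE)
total_used = sum(r["used"] for r in rows)
total_free = sum(r["free"] for r in rows)
nearly_full = [r["path"] for r in rows if r["pct_full"] >= 99]
```

A check like this could run periodically and warn well before the "Send Queue space is FULL" message appears in the log.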

If CDR_QDATA_SBSPACE space starts to run low, you can either add more chunks to an sbspace already in the CDR_QDATA_SBSPACE list, or, starting with the 11.10xC2 release, you can add a new sbspace to the CDR_QDATA_SBSPACE list.

For example, say I have created (via onspaces) a new sbspace mynewcdrsbsp:

[pinch-cdrtempmurre] (configparam)  157 % onstat -d | grep mynewcdrsbsp
47bce508         12       0x68001    12       1        2048     N SB     informix mynewcdrsbsp
47bce6a0         12     12     0          1000       702        702        POSB  /tmp/mynewcdrsbsp
I can then add that space to the list of CDR_QDATA_SBSPACE spaces via the cdr add config command.
[pinch-cdrtempmurre] (configparam)  158 % userid informix cdr add config "CDR_QDATA_SBSPACE mynewcdrsbsp"
 WARNING: The value specifed updated in-memory only.
I can easily verify what sbspaces are configured via onstat. As you can see, mynewcdrsbsp is there:
[pinch-cdrtempmurre] (configparam)  159 % onstat -g cdr config CDR_QDATA_SBSPACE 

IBM Informix Dynamic Server Version 11.10.F       -- On-Line -- Up 00:39:38 -- 86964 Kbytes
Blocked:DDR 
CDR_QDATA_SBSPACE configuration setting:
              amsterdam_sbsp_base
                 amsterdam_sbsp_2
                 amsterdam_sbsp_3
                 amsterdam_sbsp_4
                     mynewcdrsbsp
and Enterprise Replication is spooling transactions to the new sbspace. In fact, it's already 99% full.
[pinch-cdrtempmurre] (configparam)  162 % onstat -g rqm sbspaces

IBM Informix Dynamic Server Version 11.10.F       -- On-Line -- Up 00:51:59 -- 86964 Kbytes
Blocked:DDR 


RQM Space Statistics for CDR_QDATA_SBSPACE:
-------------------------------------------
name/addr      number    used        free        total       %full   pathname
0x46581c58     5         311         1           312         100     /tmp/amsterdam_sbsp_base
amsterdam_sbsp_base5     311         1           312         100     

0x46e54528     6         312         0           312         100     /tmp/amsterdam_sbsp_2
amsterdam_sbsp_26        312         0           312         100     

0x46e54cf8     7         310         2           312         99      /tmp/amsterdam_sbsp_3
amsterdam_sbsp_37        310         2           312         99      

0x47bceca8     8         312         0           312         100     /tmp/amsterdam_sbsp_4
amsterdam_sbsp_48        312         0           312         100     

0x47bce6a0     12        696         6           702         99      /tmp/mynewcdrsbsp   
mynewcdrsbsp   12        696         6           702         99      

So what about DDRBLOCK mode? In practice, by far the likeliest cause for entering DDRBLOCK mode is that a destination server remains unavailable for an extended period of time. (In this example, I have simulated that condition by suspending the destination server.) If you expect the destination server to become available in a reasonable amount of time and you have enough disk space, you can add more space to the CDR_QDATA_SBSPACE parameter as in this example. Because Enterprise Replication raises an alarm of severity 4 and class 31 when it runs out of send queue spool space, you could even write an alarm handler to automate this task.
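As a rough idea of what such an alarm handler might decide, here is a Python sketch. It only builds the cdr add config command shown earlier and does not run it. How the severity and class reach your handler depends on your alarm program setup, so the function signature is an assumption, and the spare sbspace must already have been created with onspaces.

```python
def plan_alarm_action(severity, class_id, spare_sbspaces):
    """Return the command to run for a 'send queue spool full' alarm
    (severity 4, class 31), or None if no action applies."""
    if int(severity) == 4 and int(class_id) == 31 and spare_sbspaces:
        return ["cdr", "add", "config",
                "CDR_QDATA_SBSPACE %s" % spare_sbspaces[0]]
    return None


# mynewcdrsbsp is the pre-created sbspace from the example above.
action = plan_alarm_action("4", "31", ["mynewcdrsbsp"])
ignored = plan_alarm_action("3", "31", ["mynewcdrsbsp"])
```

A real handler would pass the returned list to something like subprocess.run and log the result; keeping the decision separate from the execution makes it easy to test.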

What if you expect a destination server to become unavailable for an extended period of time, a period longer than you expect can be handled by spooling the send queue to disk? You will have little choice other than to remove the unavailable server from the replication system and to resynchronize data once it becomes available again; but that is the topic of a future blog entry.




Nov 16 2007, 12:53:56 PM EST Permalink



Monday November 12, 2007

An Always-On HDR


IDS’s HDR technology is the cornerstone to every high-availability environment. If you need your data available at all times (and who doesn’t?) you must plan for unexpected outages (e.g. network, hardware or operating system failure). HDR addresses this by allowing you to have a copy of your primary server. With DRINTERVAL set to -1, you can guarantee that your primary and secondary servers are in complete synchronization. Problem solved.

But what happens if one node of your HDR pair goes down? You’re no longer operating with high-availability protection. How much more risk can you tolerate? Sure you’ve got logs saved, but clients need the data faster than a log restore.

With the release of version 11, IDS can be configured in such a way that you can have HDR always on. In other words, you can create an environment where you step back into HDR as soon as a failure causes you to step out of it. How do you do this? Use a cluster of an HDR pair plus RSS.

A Remote Standalone Secondary (or RSS) node operates very similarly to an HDR secondary except it is not in sync with the primary. It offers many advantages when used at a remote location (i.e. one with high network latency), but in our context let’s use one locally. One characteristic of an RSS server is that it can become an HDR secondary while online! An HDR secondary in turn can become an RSS node. Now we’ve got all the pieces in place, so let’s explain the ring.

The simplest cluster has three nodes: an HDR primary, an HDR secondary, and an RSS node. Our goal is to always have HDR on. So when an event occurs that causes our HDR pair to break - one of the nodes fails - that must trigger a second event that reestablishes an HDR pair. The second event can occur manually or programmatically. Since the cluster has only three nodes, let’s consider the three failures that could occur and what to do.

Scenario 1: Primary fails

  1. Make the HDR secondary your new HDR primary
  2. Make the RSS node your new HDR secondary
  3. Fix your old primary and bring it online as an RSS node

Scenario 2: Secondary fails

  1. Make the RSS node your new HDR secondary
  2. Fix your old HDR secondary and bring it online as an RSS node

Scenario 3: RSS node fails

  1. Fix your RSS node
OR
  1. Add a new RSS node

Regardless of which node fails, we have the means of reestablishing an HDR environment. Further, since RSS technology is one-to-N, multiple RSS nodes can be added to the cluster giving you more options for each scenario.
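The three scenarios can be written down as a small role-rotation function. This Python sketch only models the bookkeeping of who plays which role after a failure; it issues no server commands, and the role names and the `recover` function are hypothetical.

```python
def recover(cluster, failed):
    """cluster maps the roles 'primary', 'hdr_secondary' and 'rss' to
    node names. Return (new_cluster, steps) after `failed` goes down."""
    role = next(r for r, n in cluster.items() if n == failed)
    if role == "primary":
        new = {"primary": cluster["hdr_secondary"],
               "hdr_secondary": cluster["rss"],
               "rss": failed}
        steps = ["make %s the new HDR primary" % new["primary"],
                 "make %s the new HDR secondary" % new["hdr_secondary"],
                 "fix %s and bring it online as an RSS node" % failed]
    elif role == "hdr_secondary":
        new = {"primary": cluster["primary"],
               "hdr_secondary": cluster["rss"],
               "rss": failed}
        steps = ["make %s the new HDR secondary" % new["hdr_secondary"],
                 "fix %s and bring it online as an RSS node" % failed]
    else:
        new = dict(cluster)
        steps = ["fix %s or add a new RSS node" % failed]
    return new, steps


ring = {"primary": "A", "hdr_secondary": "B", "rss": "C"}
after_primary_loss, primary_steps = recover(ring, "A")
after_secondary_loss, _ = recover(ring, "B")
```

In every case the HDR pair is re-formed from the surviving nodes first, and the repaired node rejoins in the RSS slot, which is what keeps the "ring" turning.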

What about scenarios when more than one failure occurs at a time? These are obviously more complex and their solutions depend on what types of failures occur. Redundant machine parts and network infrastructure, interconnected network nodes, and our high-availability “ring” will most likely play significant parts.

Adding an RSS to an HDR pair can give at least a second layer of high data availability, and as explained above can at best make HDR always on.



Categories : [   HDR  |  RSS  ]

Nov 12 2007, 12:23:01 PM EST Permalink



Thursday November 08, 2007

Catching and Cleaning

Catching and Cleaning a Grouper
Getting Info about the Grouper

Now that we know about the grouper in general, let's look at how we can get specific information about what it's doing.

How do we do that? Well if you answered "onstat" … you're right! onstat -g grp, with an optional modifier, is the gateway to the inside of the grouper. In typical onstat style, running the basic command, onstat -g grp, gives you a sampling of various information also accessible from other subcommands. Let's pick just one piece to focus in on. The line "Eval thread interface ring buffer pending entries" indicates how much work is outstanding for the evaluator threads. The fanout thread puts items on the "ring buffer" and the evaluator threads take things off. This can help you decide the best number of evaluator threads for your systems.

For information about the evaluation phase two commands are particularly good. onstat -g grp E gives information about each evaluator thread including the number of updates they have processed. Secondly, onstat -g grp P shows for which tables the grouper is evaluating rows.

For information about the compression phase, check out onstat -g grp M. This keeps a running average of the time being spent on compression and shows what compression strategies are currently being used. onstat -g grp Mz resets these statistics.

Lastly, for info about the copy phase try running onstat -g grp T. This command tells you details about the last transaction copied out as well as the total number of transactions processed.

Keep these onstat commands in your tackle box for the next time you wanna catch and clean a grouper!




Nov 08 2007, 06:42:27 PM EST Permalink



Monday November 05, 2007

Fishing Around

Grouper



No - this is not about some fish.  Rather, this is about the process within IDS which regroups the logical log records into a transaction for replication, evaluates the rows to determine what should be replicated and where it should be replicated to, and then places the replicated transaction into the send queue for transmission to the target servers.


Grouper Threads

The grouper is composed of two parts.  The first part consists of the grouper fanout thread (CDRGfan).  The purpose of the grouper fanout thread is to
  • Receive reconstituted log records from the log snooper (ddr_snoopy)
  • Regroup the transaction (i.e. attach the log record to the appropriate transaction)
  • Pass the log records to the grouper evaluator for evaluation
  • Determine if the transaction is consuming too many resources and needs special treatment such as its own memory pool and/or needs to be paged
  • Place the transaction into the grouper serial list.  This is done when the commit record is processed and is used to ensure that the transaction is placed into the send queue by commit order.
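The regrouping and commit-ordering steps in the list above can be sketched in a few lines of Python. This is a toy model, not the CDRGfan implementation: the record tuples, the "op"/"commit" kinds, and the `regroup` function are hypothetical.

```python
from collections import defaultdict


def regroup(log_records):
    """log_records yields (txn_id, kind, payload) in log order, where
    kind is 'op' or 'commit'. Return the serial list: transactions with
    their operations, ordered by when their commit record was seen."""
    open_txns = defaultdict(list)   # txn_id -> operations seen so far
    serial_list = []
    for txn_id, kind, payload in log_records:
        if kind == "commit":
            serial_list.append((txn_id, open_txns.pop(txn_id, [])))
        else:
            open_txns[txn_id].append(payload)
    return serial_list


# Two interleaved transactions; transaction 2 commits first.
records = [
    (1, "op", "a"),
    (2, "op", "b"),
    (1, "op", "c"),
    (2, "commit", None),
    (1, "commit", None),
]
serial = regroup(records)
```

Transaction 1 started first but lands second in the serial list, because the list is built when commit records arrive, not when transactions begin.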

The second part of the grouper is the grouper-evaluator.  The grouper-evaluator consists of several threads whose names begin with "CDRGeval__".  The purpose of the grouper evaluator is to
  • Evaluate the log record to determine if it is a candidate for propagation
  • Reconstitute the transaction from the logical logs
  • Compress the transaction by the removal of any duplicate operations on the same row
  • Determine the original 'before image' and the final 'after image' of any update operation
  • Queue the replicated transaction for transmission to the various targets
  • Record any deleted rows in the shadow delete table

The grouper evaluator is clearly a critical component of ER, so it must be as streamlined as possible.  Otherwise, it would not be able to process the log records quickly, which would cause a back flow into the log snooping process.  A back flow into the log snooping would in turn have a significant impact on overall latency.  So it is important that the grouper be able to process the log records quickly and avoid having to do disk IO.


Phases of the Grouper Evaluator


Evaluation Phase

The first phase of the grouper evaluator is the evaluation phase.  During this process one of the grouper evaluator threads will examine the log record to determine if it is a candidate for replication.  If it is not, then the row is immediately released.  Generally the grouper will evaluate rows as the transaction log buffers are being flushed to disk, so there is generally no physical IO involved in obtaining the rows.  It also means that if the transaction performs operations on multiple rows, the grouper may have evaluated the log records before the commit for the transaction has occurred.  This would generally be the case if the commit of the transaction is in a different log buffer than the other operations of the transaction.  However, the grouper does not place the transaction into the send queue until it has processed the commit record and all rows of the transaction have been evaluated.

The grouper evaluation is performed in parallel.  By that I mean that one log record of the original transaction might be evaluated by one of the grouper threads while another log record of the same transaction can be evaluated by another thread.  This makes it possible for the evaluator to remain fairly current with the current log position.  

Compression Phase

Once the commit record has been processed, the grouper goes through a compression phase.  This involves determining all of the operations for a given row within the transaction and eliminating any unnecessary operations.  For instance, if a row was updated multiple times within the transaction, the duplicate operations will be eliminated and only the original before and after image will be saved.  If a row was inserted in a transaction and then deleted within that same transaction, then it will not even be replicated.  This process reduces the overall size of the transaction which will be placed into the send queue.  
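
The compression idea can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the (op, key, before, after) tuple layout and the name compress are hypothetical and are not the server's actual data structures.

```python
def compress(ops):
    """Collapse all operations on the same row into at most one net operation.

    ops is a list of (op, key, before, after) tuples in log order (a
    hypothetical layout).  An insert has before=None, a delete has after=None.
    Returns a dict mapping key -> (net_op, original_before, final_after).
    """
    net = {}    # key -> (before image of first op, after image of last op)
    for _op, key, before, after in ops:
        if key in net:
            first_before, _ = net[key]
            net[key] = (first_before, after)   # keep original before, newest after
        else:
            net[key] = (before, after)

    result = {}
    for key, (before, after) in net.items():
        if before is None and after is None:
            continue                           # inserted and deleted: nothing to send
        if before is None:
            result[key] = ("insert", None, after)
        elif after is None:
            result[key] = ("delete", before, None)
        else:
            result[key] = ("update", before, after)
    return result


ops = [
    ("update", 1, {"s": 1}, {"s": 2}),
    ("update", 1, {"s": 2}, {"s": 3}),
    ("insert", 2, None, {"s": 9}),
    ("delete", 2, {"s": 9}, None),
]
print(compress(ops))   # row 1 becomes a single update, row 2 disappears
```

Row 1 keeps only its original before image and final after image, and row 2 is dropped entirely, which is the size reduction described above.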

Additionally, the compression phase is a requirement for transmitting the correct operation.  There are many examples where the operation can not be transmitted by using the same operation as was performed on the source.  For instance, suppose a replicate was defined with a filter - say "select * from payroll where  status_column = 4".   Now suppose the following command was issued

update payroll set status_column = 4 where emp_no = 23412;

Unless the before image of the row already had a status_column of 4, the target would not have the row at all, because the before image was not a member of the replicated set of data.  Therefore, when the update operation is replicated, we need to replicate it as an insert, not as an update.

Likewise, suppose the following statement was issued:

update payroll set status_column = 3 where emp_no = 23412;

If the before image of the row had the status_column set to 4, then the update operation would be removing the row from the set of replicated data, because status_column no longer has the value 4 required by the replicate's filter.  That means that the update operation would need to be transmitted as a delete operation.
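
The two payroll cases can be summarized as a small decision function.  This is a sketch in Python, not the server's code: the function name replicated_op and the dictionary row representation are assumptions made for illustration, and the filter is the "status_column = 4" example from the text.

```python
def replicated_op(before, after, in_filter):
    """Decide how a source UPDATE must be transmitted to the target.

    before / after are the row images; in_filter is a predicate that says
    whether a row image belongs to the replicated set of data.
    """
    was_in = in_filter(before)
    now_in = in_filter(after)
    if was_in and now_in:
        return "update"    # row stays in the replicated set
    if now_in:
        return "insert"    # row enters the set: target has no before image
    if was_in:
        return "delete"    # row leaves the set: target must drop it
    return None            # row was never in the set: nothing to replicate


def payroll_filter(row):
    # Hypothetical encoding of: select * from payroll where status_column = 4
    return row["status_column"] == 4


print(replicated_op({"status_column": 2}, {"status_column": 4}, payroll_filter))
print(replicated_op({"status_column": 4}, {"status_column": 3}, payroll_filter))
```

The first call corresponds to the update that sets status_column to 4 and the second to the update that sets it to 3.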

Copy Phase

The final phase of the grouper evaluator is the copy phase.  During this time, the replicated transaction is placed into the send queue for transmission to the target nodes.  Although the transaction may be transmitted to multiple targets, it is placed into the send queue only once.   The transaction is placed into the queue in a 'stream' format - which basically means that it is put into a network-independent format.  That means that objects such as user defined types are converted into a stream for transmission to the target nodes.

Configuring the Grouper Threads

The onconfig parameter used to configure the grouper evaluator threads is CDR_EVALTHREADS  x,y where 'x' is the number of threads per CPUVP and 'y' is a number of extra threads.  The default is 1,2.  I personally think that 1,2 is a good setting for the number of evaluator threads.  The theory is that we want to evaluate the row as quickly as possible.  Since the majority of work that the grouper evaluator threads do is very light weight simple evaluation of the log records, there is little cost with having one per CPUVP.  Also this makes it easier to maintain a balance between the logging work and the consumer of the logs.  

However, there is still the problem of having to maintain the local shadow delete table.  Maintaining the delete table does involve some blocking activity because we have to perform IO to the delete table itself.  That will cause the grouper evaluator thread to go into a wait state, which can lead to lagging behind the consumption of the logs.  That's why it still makes sense to have a couple of extra evaluator threads.
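
As a quick illustration of how the two numbers combine, here is a small Python sketch.  It assumes, based only on the description above, that the total is x threads for each CPUVP plus y extra threads; the function name is hypothetical.

```python
def evaluator_thread_count(cdr_evalthreads, num_cpu_vps):
    """Return the number of grouper evaluator threads implied by the setting.

    cdr_evalthreads is the onconfig value as a string, e.g. "1,2".
    Assumption: total = x * (number of CPUVPs) + y.
    """
    x, y = (int(part) for part in cdr_evalthreads.split(","))
    return x * num_cpu_vps + y


print(evaluator_thread_count("1,2", 4))   # 1 per CPUVP on 4 CPUVPs, plus 2 extra
```

With the default of 1,2 on a four CPUVP instance this gives six evaluator threads: four that track the CPUVPs and two spares to cover threads blocked on delete table IO.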







Categories : [   ER  |  Grouper  ]

Nov 05 2007, 11:00:00 PM EST Permalink



Monday October 22, 2007

Monitoring the Queue

Monitoring the Queue

Overview
onstat -g rqm sendq output
Current Statistics
Historical Statistics
Progress Table
RQM Handle


Overview


Well - made it safely back home from the IOD conference in Las Vegas without losing too much money in the casino.  The really cool thing was that while waiting at the airport, I played some slot machines and won $45.00.  Guess I should have taken a later flight so I could have played longer... ;-)

I mentioned last week that I would be posting some pictures from the conference.  Unfortunately, my pictures were not very good, so I would suggest checking out these pictures instead.  The conference was great.  There were around 8000 folks attending and IDS had a strong presence.

Well - back to business.  In a previous entry I gave some thoughts about sizing the queue.  In this entry I'm going to describe how to monitor the queue.  There are two main ways to monitor the queues: through onstat and through the sysmaster database.  In this blog entry we will focus on the onstat -g rqm command.

The Reliable Queue Manager (rqm) is the subcomponent of ER which is responsible for the physical management of the queue.  It is responsible for things such as determining when an item can be removed from the queue, what thread is referencing a queue item, cursors on the queue (rqm handles), when an item must be spooled to disk  (smartblob), etc.  

There are several options which can be used with the onstat -g rqm command.  The following table describes these options:


(Options to onstat -g rqm)
Option | Description
<nothing> (i.e. onstat -g rqm) | Display information about all queues.
SENDQ | Display information about the send queue.  The send queue is used to transmit transactions to target servers.  These transactions might originate on the local node or might have originated on a remote server, as is the case with hierarchical routing.
RECVQ | Display information about the receive queue.  The receive queue is used to hold a replicated transaction that has been received on the target but has not yet been applied to the target table by the datasync threads.
CNTRLQ | Display information about the control queue.  This queue is used to manage control messages such as replicate definitions, server definitions, start replicates, etc.  Items placed in the control queue are always copied into stable storage.
ACKQ | Display information about the ACK queue.  This queue is used to hold acknowledgments before they are sent to the source node.
SYNCQ | Display information about the sync queue.  This queue is only used as part of the define server and then only to transmit the syscdr database to the newly defined node.
SBSPACES | Display information about the sbspaces used to contain the stable storage of the queues.
FULL | Display the transaction headers for each of the transactions within the queue which are currently in memory.
VERBOSE | In addition to the transaction headers, display information about each of the rows for transactions in the queues.
BRIEF | Display a short summary of what is contained in the queue.

In this posting, we are going to limit ourselves to onstat -g rqm SENDQ.

onstat -g rqm sendq output

There are several sections in the output of the onstat -g rqm sendq command.  The following list describes these sections.

  1. The Summary Section.

    This section contains a summary of the queue.  It is further broken into two sections.

    1. The current summary
      This section contains the current statistics about the queue. 

    2. The historical summary
      This section contains the historical totals of the queue. It contains information such as the total number of transactions which have been queued as well as the maximum size that the queue has grown to.

  2. The Progress Table Section


    This section contains the 'progress' of the queue.  By that we mean that it tracks what has been sent to each remote server and what has been ACKed by that remote server.  This section is further broken into two sections.

    1. The progress table summary
      This describes which table on disk is used to contain the progress table.  It also describes how often the progress table is flushed to disk.

    2. The target/replicate progress information
      This contains information on which transactions have been sent to the target nodes and what the target nodes have acknowledged.  Also this contains the number of bytes per target/replicate combination which are currently in the queue.
       
  3. The Transaction Section

    This section contains information about the first and last transactions which are in memory in the queue.  

  4. The Handle Section.

    This section contains a list of each of the handles which has been allocated to each of the users of the queue.  The handle can be thought of as a cursor into the queue.  It is used to track the position within the queue.

The Current Statistics Section

The current statistics section is the first section in the onstat -g rqm sendq command.  It contains information about the current contents of the queue such as how many bytes are contained in the queue, how many transactions are in the queue, how many transactions are currently in memory, how many have been spooled to disk, how many exist only on disk, etc.

[Figure: current statistics of onstat -g rqm sendq]

When a new transaction is placed into the queue, the transaction is given a stamp.  This stamp is used to maintain the order of the transactions within the queue.  This is a bit different from the commit order because the original commit order is only useful within the context of the server on which the transaction is originally committed.  In the case of  a system using hierarchical routing, it is possible that the send queue will have transactions which originated on other servers.  That would be the case of a replicated transaction which must be forwarded to another node.  In order to maintain the insert order, when a transaction is inserted into the send queue, it receives a stamp.  The stamp is a 64 bit integer which is maintained as part of the queue.  In this example, the next transaction to be inserted will be 638.

In this example, the send queue currently contains 611 transactions, of which 268 are in memory, 343 (611-268) are not in memory at all, and 42 (611-569) are only in memory.  The reason that some of the spooled transactions are also in memory is that we spawn a group of spooling threads when we sense that we are getting close to running out of memory.  The spooled transaction is not immediately removed from memory, however.  Instead the spooled transaction will be removed from memory only when the memory limits are reached.  The reason for this pre-spooling is to avoid having to do a lot of work when we reach the memory limits.  Once a transaction has been spooled and the in-memory copy of the transaction has been removed, then the transaction is never completely reloaded back into memory.  Instead we transmit the transaction directly from the spooled disk copy of the transaction.
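
The arithmetic behind those counts can be written out as a short Python sketch.  The function and key names are hypothetical; the only assumption is that every queued transaction is in memory, spooled, or both.

```python
def queue_breakdown(total_txns, in_memory, spooled):
    """Split the queue counts into three disjoint groups."""
    only_on_disk = total_txns - in_memory      # spooled and already freed from memory
    only_in_memory = total_txns - spooled      # not spooled yet
    in_both = in_memory - only_in_memory       # pre-spooled but still held in memory
    return {
        "only_on_disk": only_on_disk,
        "only_in_memory": only_in_memory,
        "in_memory_and_spooled": in_both,
    }


# Figures from the example: 611 queued, 268 in memory, 569 spooled.
print(queue_breakdown(611, 268, 569))
# {'only_on_disk': 343, 'only_in_memory': 42, 'in_memory_and_spooled': 226}
```

The 226 transactions that are both spooled and still in memory are the pre-spooled ones that can be freed cheaply once the memory limit is reached.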

The Size of Data in queue is the size of the queue when combining the in-memory transactions with the spool-only transactions.  The Pending Txn Buffers contains information about transactions which are in the process of being queued into the queue.

The Historical Statistics Section

Starting with Max Real memory data used, we enter the historical section.  This section contains a summary of what has been placed in the queue in the past.

[Figure: rqm -g sendq historical section]

The Max Real memory data used contains the largest in-memory size of the queue.  In this case, it reached 1,544,060 bytes.  The queue limit is currently configured to be 1,536,000 bytes, so when the transaction that caused the limit to be reached went into the queue, it triggered activity to flush the in-memory transactions which had already been spooled.  If no in-memory transactions had been spooled, then the thread placing transactions into the queue would have had to also spool the transactions.  That's why we spawn separate spooling threads to perform the actual spooling.  We try to get the spooling done before we actually have to remove the transaction from memory.

There have been 638 transactions which have been queued to this queue.  That should match up with the insert stamp of the queue.  Of those 638 transactions, 569 have also been spooled.  At this point none of the spooled transactions have been restored.  The reason for that is that the only reason that the transactions were spooled is that I brought down one of the targets.  Since the target is down, then we will not be attempting to restore those transactions.  When that server is brought back up, then we would attempt to restore those transactions and send them to the target.

Recovered transactions are the transactions which existed only in the spool when the instance was started.  They are not recovered by re-reading from the logical log, but are simply recovered from the disk storage when the engine is started.  They would have been snooped from the logical log at some time in the past, but now are found in the stable queue.

Total Txns deleted is the number of transactions that have been removed from the queue.  They may have been only in memory, only on disk in the stable queue, or in both.  The Total Txns duplicated contains the number of times that we attempted to queue a transaction which had already been processed.  This can occur when ER is first starting up as part of the instance startup, or as part of a cdr start command.  The Total Txn Lookups is simply a counter of the number of times that an ER thread attempted to read a transaction.

The Progress Tables Section

The progress table section contains information on what is currently queued, which server it is queued for, and what has been ACKed from each of the participants of the replicate.

[Figure: onstat -g rqm progress table section]

The first part of the progress table section is a summary.  The information in the receive queue progress table is written to disk as part of each transaction that the datasync thread applies.  This is not, however, the case with the send queue progress table.  Instead the send queue progress table is copied to disk every so often.  In this example we see that the progress table is flushed to the table spttrg_send every 30 seconds.  Another thing which might trigger the flushing of the progress table is if over 1000 entries are dirtied.
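
The two flush triggers for the send queue progress table can be expressed as a single condition.  The following Python sketch is only a model of the rule described above: the function name and parameters are hypothetical, and the 30 second interval and 1000 entry threshold are the values from this example, not necessarily fixed constants.

```python
def should_flush_progress_table(seconds_since_flush, dirty_entries,
                                flush_interval=30, dirty_limit=1000):
    """Return True when the send queue progress table should be written to disk.

    Either trigger is sufficient: the flush interval has elapsed, or more
    than dirty_limit entries have been dirtied since the last flush.
    """
    return seconds_since_flush >= flush_interval or dirty_entries > dirty_limit


print(should_flush_progress_table(seconds_since_flush=5, dirty_entries=1200))
print(should_flush_progress_table(seconds_since_flush=5, dirty_entries=10))
```

The first call flushes early because of the dirty entry count; the second waits for the timer.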

Below the summary section is a list of the servers and group entries which show what is currently queued for each server, what has been sent to the remote server, and what has been ACKed from the remote server.  The term Group is a carry-over from the 7.31 days when the replicate could be part of a replicate group.  It should really be "Replicate" in post-7.31 instances.  The ACKed and Sent columns contain the key of the last transaction which was acknowledged from the remote server or sent to that server.  The KEY is a multi-part number consisting of <source_node>/<unique_log_id>/<logpos>/<incremental number>.  From this we can see that the last transaction which we sent to server 3 was transaction 0x2f/0x1934c8 and the last transaction which has been acknowledged is 0x28/0x684c8.

By examining the progress table we can discover which server is tending to lag behind.  In this example, server 2 is completely current, but server 3 is lagging somewhat behind.

At the very bottom of this example, we see the start of the transaction section.  This contains the first and last transaction in the queue which is currently in memory.

The RQM Handle Section

The last section contains the handles.  The RQM handle can be thought of as being much like a cursor.  It contains the position within the queue that any thread is currently processing.

[Figure: onstat -g rqm sendq - RQM handle section]

Each thread that attempts to read a transaction from the queue, or to place a transaction into the queue, must first allocate a handle.  This handle is used to maintain the positioning within the queue.  By examining the RQM handle section, you can get an idea of what each of the threads is doing.  For instance in this example, we see that CDRNsA2 (Send Thread to server 2) is at the end of the queue.  We also see that CDRNsT3 (Send Thread to server 3) is in the process of sending transaction 1/42/0xbc4c8.

It might be a bit surprising to see which threads have handles on the send queue.  The network send threads make sense.  These would be the CDRNsxxx threads.  However, it is a bit surprising to see that the receive threads (CDRNrxxx) have handles on the send queue.  The reason for this is routing.  When a transaction is received which must be forwarded to another server, the receive thread will need to place that transaction into the send queue.  Therefore, it is not unusual to see that the receive threads have a handle on the send queue.

The other handles make sense.  The grouper evaluator (CDRGeval##) has to have a handle on the send queue because it is placing transactions originating on this node into the send queue for transmission to a remote server.  The ACK threads (CDRACK##) have a handle on the send queue because they must update the progress table and potentially delete a transaction when an ACK is received from a remote server.



Categories : [   ER  |  Queue  |  RQM  ]

Oct 22 2007, 09:18:00 AM EDT Permalink



Tuesday October 09, 2007

Sizing the Queue

Sizing the Queue


Overview

In general there are not many configuration items for Enterprise Replication.  One of the things which can be configured is the in-memory max size of the queues.  This is configured by the onconfig parameter CDR_QUEUEMEM.  The default value for this is 4096, which is probably too small.

There are two main queues used by ER - the send queue and the receive queue.  Transactions which have been retrieved from the logical log file and have been evaluated for replication are placed in the send queue for transmission to the target nodes.  A given transaction is placed in the send queue only once, even if it is to be sent to multiple target nodes.  When the transaction is received on the target node, it is placed in the receive queue where it waits its turn to be applied.

If the ER domain is defined to be using some form of a hierarchy, it is possible that the received replicated transaction will also be placed in the send queue so that it can be forwarded to other nodes.  In fact it is possible that the replicated transaction is only placed in the send queue.  That would be the case where the transaction might need to be forwarded, but the intermediate node is not a participant in replication.  However, for the purpose of this discussion, we will consider only a single source with a single target node.

First of all, the value of CDR_QUEUEMEM is not a preallocated block of memory which is used to store transactions.  It is a limit on the maximum memory size that an ER queue can expand to.  If this limit is reached then the replicated transaction may exist in the disk overflow space within a smartblob.  Also, the value of CDR_QUEUEMEM is not the max size of all of the queues.  Rather it is the max size of any specific queue.  That means that if CDR_QUEUEMEM is set to the default 4096, then both the send queue and the receive queue can grow up to 4 meg each.

Impact on the Send Queue

When the send queue approaches CDR_QUEUEMEM size,  spooling threads will be spawned to flush transactions to the configured smartblob space.  These spooled transactions are not immediately freed from memory however.  Instead we will not free the spooled transactions from memory until the CDR_QUEUEMEM limit is reached.   If that limit is reached, then the spooled transactions will be freed from memory and thus will exist only in the smartblob storage of the queue.  

When it comes time to send a transaction to the target, if the transaction exists only in the smartblob portion of the queue, then the transaction is transmitted directly from the spooled transaction to the target.  We do not reload the transaction totally into memory once it has been spooled and has been removed from main memory.

Impact on the Receive Queue

As the transaction is received on the target, it is placed in the receive queue where it remains until it is applied by the datasync threads.
We only spool a transaction in the receive queue if it exceeds 1/2 of the total queue memory size.  If the receive queue should reach the  CDR_QUEUEMEM limit, then the target will activate flow control by causing a NIF block.  The purpose of this is to prevent the source from sending any additional transactions until the receive queue drains a bit.
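
The two receive queue rules can be written down as two independent checks.  This Python sketch only restates the rules above; the function names are hypothetical, and it assumes CDR_QUEUEMEM is expressed in kilobytes (consistent with the default of 4096 corresponding to 4 meg).

```python
def should_spool_received_txn(txn_bytes, cdr_queuemem_kb):
    """A received transaction is spooled only if it exceeds half the queue memory."""
    return txn_bytes > (cdr_queuemem_kb * 1024) / 2


def flow_control_needed(queue_bytes, cdr_queuemem_kb):
    """The target blocks the source (NIF block) once the queue reaches the limit."""
    return queue_bytes >= cdr_queuemem_kb * 1024


print(should_spool_received_txn(3 * 1024 * 1024, 4096))   # 3 meg txn, 4 meg limit
print(flow_control_needed(4096 * 1024, 4096))             # queue is at the limit
```

Note the difference from the send queue: ordinary transactions are not spooled on the receive side; the receive queue pushes back on the source instead.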

Sizing the Transaction

In order to correctly size the queues, it is important to know how much memory is required to store the transaction as it is in transit to the target server.  Each row within the replicated transaction contains a fixed header which contains information about the row.  Also there can be a series of options which contain specific information about the row.  Probably the most common option is a hash value which is used to support apply parallelism.  Finally each replicated transaction will have a transaction header.  The current (IDS 11) size of the fixed row buffer header is 52 bytes on a 32-bit machine, and the size of the transaction header is 258 bytes.  For 64-bit machines, the size of the row buffer header is 60 bytes and the transaction header is 292 bytes.  The options form a variable-length list and can be of variable sizes.  However, for this discussion we will consider only the hash used in the apply parallelism, which is 4 bytes.

The last part of the formula is the rowsize as taken from the systables table.  If we examine the customer table of the stores database, we see that the rowsize is 134 bytes.  That means that the queue memory needed to contain a single row insert transaction is 448 bytes (258 + 52 + 4 + 134 on a 32-bit machine).  We could therefore queue 9581 single row insert transactions or a single transaction of 22591 inserts before we reached the CDR_QUEUEMEM limit of 4 Meg.

[Figure: Warehouses Schema Definition]

Care has to be taken, however, if the replicated table contains variable length columns such as varchars or lvarchars, because it is the expanded size of the row that is used by Enterprise Replication.  If we examine the warehouses table (right) in the stores demo database, we see that the table has three columns (warehouse_name, warehouse_id, and warehouse_spec).  Two of the columns are lvarchar column types of 2K size.

However, if we examine the size of the warehouses table from systables (below), we discover that the size of the row is 4106.  This means that we could only perform 971 single row insert transactions on the warehouses table or a single transaction of 1031 inserts before reaching the CDR_QUEUEMEM limit.

The fact that ER uses the expanded size of the row can be a surprise, especially if the lvarchar columns in the original row only contained short character strings.  This has even more of an impact if the environment is such that there is a lot of activity on the tables with the lvarchars.  In such a situation, spooling might occur if the value of CDR_QUEUEMEM is set too low.
[Figure: Systable for warehouses table]
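
The sizing arithmetic above can be sketched in Python.  The header and option sizes are the IDS 11 figures quoted in the text; the function names are hypothetical.  One caveat from working the numbers: the counts quoted above (9581, 22591, 971, 1031) are reproduced when the limit is taken as 4192 KB, while exactly 4096 KB (4,194,304 bytes) gives somewhat lower counts with the same formula, so the limit is left as an input here rather than a constant.

```python
ROW_HEADER_BYTES = {32: 52, 64: 60}     # fixed row buffer header, by platform word size
TXN_HEADER_BYTES = {32: 258, 64: 292}   # transaction header, by platform word size
HASH_OPTION_BYTES = 4                   # hash option used for apply parallelism


def txn_queue_bytes(rowsize, rows=1, bits=32):
    """Queue memory needed for one transaction containing `rows` rows."""
    per_row = ROW_HEADER_BYTES[bits] + HASH_OPTION_BYTES + rowsize
    return TXN_HEADER_BYTES[bits] + rows * per_row


def max_single_row_txns(rowsize, limit_bytes, bits=32):
    """How many single row transactions fit under the queue memory limit."""
    return limit_bytes // txn_queue_bytes(rowsize, 1, bits)


def max_rows_in_one_txn(rowsize, limit_bytes, bits=32):
    """How many rows fit in one transaction under the queue memory limit."""
    per_row = ROW_HEADER_BYTES[bits] + HASH_OPTION_BYTES + rowsize
    return (limit_bytes - TXN_HEADER_BYTES[bits]) // per_row


limit = 4096 * 1024                       # CDR_QUEUEMEM of 4096, taken as KB
print(txn_queue_bytes(134))               # customer table, single row insert
print(txn_queue_bytes(4106))              # warehouses table, single row insert
print(max_single_row_txns(134, limit), max_rows_in_one_txn(134, limit))
print(max_single_row_txns(4106, limit), max_rows_in_one_txn(4106, limit))
```

Whatever limit is used, the ratio is the point: a warehouses row costs roughly ten times the queue memory of a customer row, even when the lvarchar columns hold short strings.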

Sizing Strategy

The most common strategy for configuring Enterprise Replication is generally to try to obtain the lowest possible latency.  In order to do that it is important to avoid spooling transactions to disk.  This means that the CDR_QUEUEMEM limits need to be fairly large.  Remember, ER is going to have to process all of the replicated tables which all of the client transactions are updating.  To do that, ER needs to be able to hold as many replicated transactions in memory as possible.  It might be that we want to size the queue memory based on the total allowed memory size.  As an example, for an update anywhere system, we might want to consider 1/8-1/6 of the total memory available since with update anywhere, we will have activity on both the send and receive queues.  That would mean that the total active queue memory would be between 1/4 and 1/3 of all available memory.

For instance if the total virtual memory size is configured to be 1 Gig, then we probably want to consider letting the CDR_QUEUEMEM be sized somewhere around 150 Meg for an update anywhere configuration or 200-250 Meg for a source/target configuration.  Don't forget that ER will have to process all of the activity that the client transactions are processing and to do that is going to require memory.  Otherwise, ER will have to spool transactions, and spooling will affect the latency of the apply.
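
For the update anywhere case, the rule of thumb is easy to turn into numbers.  This Python sketch is just that rule restated; the function name is hypothetical, the 1/8 and 1/6 fractions are the ones suggested above, and it assumes the result is wanted in kilobytes, the unit CDR_QUEUEMEM appears to use.

```python
def update_anywhere_queuemem_range_kb(total_memory_kb):
    """Suggested CDR_QUEUEMEM range (in KB) for an update anywhere system.

    Uses 1/8 to 1/6 of total memory per queue, so that the send and receive
    queues together stay between 1/4 and 1/3 of total memory.
    """
    return total_memory_kb // 8, total_memory_kb // 6


low_kb, high_kb = update_anywhere_queuemem_range_kb(1024 * 1024)   # 1 Gig
print(low_kb, high_kb)   # 131072 174762, i.e. roughly 128 to 170 Meg
```

The 150 Meg suggested above for a 1 Gig configuration sits in the middle of this range.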

 




Oct 09 2007, 12:00:00 AM EDT Permalink



Friday October 05, 2007

MACH the Knife


The MACH11 Cluster


Overview

Setup of the HDR secondary
The Remote Standalone Secondary (RSS)
Setup of the RSS node
The Shared Disk Secondary (SDS)
Setup of the SDS node
Failover within the MACH11 Cluster
Special Case with SDS only Clusters
Promotion of the RSS node into an HDR Secondary
Demotion of the HDR Secondary into an RSS Node


Overview

The MACH11 cluster, introduced with IDS 11,  is an extension of the traditional HDR.  It provides a fully integrated solution  for multiple levels of availability, and is the foundation for Continuous Availability.  

The MACH11 cluster introduces two new types of secondary server which complement the existing HDR secondary. The first is the Remote Standalone Secondary (RSS) and the other is the Shared Disk Secondary (SDS).  The main difference between the RSS node and the SDS node is that while the RSS node maintains a physical copy of data on disk, the SDS node maintains only the shared memory buffer pool.  As the name implies, the SDS node is also attached to the same physical disks as the primary node by using a shared disk subsystem.

There can be only one primary node within the cluster.  Also there can be only one HDR secondary.  However, there can be any number of RSS  and/or SDS nodes within the cluster.  Also it is important to understand that only logged data is replicated within the MACH11 cluster.

For additional information check out "Availability Solutions with Informix Dynamic Server 11"

The High Availability Data Replication Secondary (HDR)

High Availability Data Replication (HDR) has been part of the Informix Dynamic Server since IDS 6.  It provides support for a hot backup system which is also available for dirty read processing.  HDR works by shipping the logs from the primary node to the secondary where they are applied to the physical chunks on the secondary.  HDR is a member of the MACH11 cluster and much of the technology which is used to implement the rest of the MACH11 cluster is based on HDR technology.

Setup of the HDR Secondary

There are six steps to bring up an HDR secondary.  The first two steps are often overlooked, yet are fairly important.  They involve making sure that the chunk files exist on what will become the secondary server and making certain that any UDR/Datablade executable is installed on the secondary node.  The files must have the same path as they do on the primary node and the UDR/Datablade executable must be in the same location.  It may be that this involves nothing more than issuing the unix 'touch' command, or the establishment of links to the appropriate directory.  Also, care must be taken to ensure that the chunk files have the proper owner, group and permissions.  Generally these will be owner - informix, group - informix, permissions owner and group r+w.

The following chart describes the steps to create an HDR secondary.

Step | Description | Primary | Secondary
1 | Create chunk files on the secondary | - | This is a manual step which must be performed on the secondary node.
2 | Install UDRs and datablades on the secondary | - | This is a manual step which must be performed on the secondary node.
3 | Update the reserved pages to mark this node as the primary and set the identity of the secondary node | onmode -d primary <secondary_node> | -
4 | Perform a backup of the primary | ontape -s -L 0 (onbar -b -L 0) | -
5 | Perform a physical restore on the secondary | - | ontape -p (onbar -r -p)
6 | Mark the secondary as a secondary and point to the primary instance | - | onmode -d secondary <primary_node>

table 1

In step 3, we set a flag in the reserved pages which identifies this node as an HDR primary and also identify the network connection to the HDR secondary.  In step 4, we perform a full system backup of the primary and then (step 5) perform a physical restore on the HDR secondary node.  (A physical restore does not perform the rollforward of the logical log files.  That means when the ontape/onbar command is finished, the restored instance is positioned at the backup checkpoint.  The rollforward of the logs is done by transmitting the logical logs from the primary node.)

The HDR secondary must run the same executable binary oninit as the primary.  The host of the HDR secondary must be similar to the primary but does not have to be identical.  For instance, the primary might be a 24 processor system and the HDR secondary might be a 4 processor system.  However, it is important not to undersize the HDR secondary system, because if the HDR secondary is unable to process the log records as fast as the primary creates them, then backflow can occur.  If this should happen then user activity on the primary can block until the HDR secondary can catch up.

While the HDR secondary can be used for report processing, its primary purpose is to provide failover support in the event that the primary node is lost.  To make the secondary into a primary node, simply run 'onmode -d primary <old_primary_node>'.

The Remote Standalone Secondary (RSS)

The primary purpose of the RSS node is to act as a backup for the HDR secondary.  If the primary node is down, the HDR secondary is normally promoted into the primary.  However, if the original primary is going to be down for an extended period of time, it is possible to promote the RSS node into the HDR secondary.

Unlike the HDR secondary, the RSS node communicates with the primary using a full duplexed model.  This means that it is not necessary for the secondary to acknowledge every message sent from the primary before the next message is sent.  As a result, the RSS node can normally make better use of the available network bandwidth than the HDR secondary.

The full duplexed model makes the RSS node better able to handle long distance communication networks than the HDR secondary. However, this comes at a cost. The RSS node can only work in asynchronous mode. Even the checkpoint is asynchronous. Because of this, the RSS node cannot be promoted directly into a primary node. However, it can be promoted into the HDR secondary and then subsequently be promoted into the primary node.

Not only can the RSS node be promoted to the HDR secondary node, but also the HDR secondary node can be demoted into an RSS node.  This might be desired during some periods of time to take advantage of the full duplexed communications model.
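The role changes described above can be summarized as a small transition table. The sketch below is illustrative only: the role names and helper functions are hypothetical, and it covers just the transitions mentioned in this entry.

```python
from collections import deque

# Role changes described in the text (assumption: no others are modelled here).
ALLOWED_CHANGES = {
    "RSS": ["HDR_SECONDARY"],             # RSS can be promoted to HDR secondary
    "HDR_SECONDARY": ["PRIMARY", "RSS"],  # promote to primary, or demote to RSS
    "PRIMARY": [],
}


def can_change_role(current, target):
    """True if the text describes a direct change from current to target."""
    return target in ALLOWED_CHANGES.get(current, [])


def promotion_path(start, goal):
    """Shortest sequence of roles from start to goal, or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in ALLOWED_CHANGES.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


if __name__ == "__main__":
    print(can_change_role("RSS", "PRIMARY"))
    print(promotion_path("RSS", "PRIMARY"))
```

The point of the table is that an RSS node reaches the primary role only by way of the HDR secondary role.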

Setup of the RSS node

RSS requires that the server utilize Index Page Logging.  Normally when an index is created, we only log the create index operation, not the work that is done by the create index itself.  With traditional HDR, the index pages from the index build are directly transmitted to the secondary as part of the index creation.  With RSS, we felt that the cost of attempting to transfer the index to multiple RSS nodes would impact user activity too much, so we chose instead to simply place those pages into the logical log. We do not place all of the index into the log as a single transaction.  Instead we may generate multiple transactions to log the index creation so as to avoid any long transaction during the index build.  This feature is activated by setting LOG_INDEX_BUILDS to 1 in the onconfig.   
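The idea of spreading an index build over several transactions can be sketched as simple batching. This is a hedged illustration only: the function name and the pages_per_transaction parameter are hypothetical, and this entry does not describe how the server actually decides where one transaction ends and the next begins.

```python
def split_into_transactions(page_numbers, pages_per_transaction):
    """Group index pages into batches, one batch per logged transaction.

    Assumption: a fixed batch size. The real server's rule is not described here.
    """
    if pages_per_transaction <= 0:
        raise ValueError("pages_per_transaction must be positive")
    return [
        page_numbers[i:i + pages_per_transaction]
        for i in range(0, len(page_numbers), pages_per_transaction)
    ]


if __name__ == "__main__":
    index_pages = list(range(1, 11))  # ten hypothetical index pages
    for batch in split_into_transactions(index_pages, 4):
        print(batch)
```

Because each batch is logged and committed on its own, no single transaction has to span the whole index.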

The setup of an RSS node is very similar to the setup of the HDR secondary node. 

Step | Description | Primary | RSS Node
1 | Set up chunk files on the secondary | | Manual process to create any chunk files on the secondary node
2 | Install any UDRs and DataBlades | | Manual process to install any UDRs and/or DataBlades on the secondary node
3 | Register RSS node in the sysha database | onmode -d add RSS <node> <password> |
4 | Perform a backup on the source | ontape -s -L 0 (onbar -b -L 0) |
5 | Perform physical restore on the secondary | | ontape -p (onbar -r -p)
6 | Connect to the primary | | onmode -d RSS <primary> <password>
Table 2

We establish the potential RSS node in step three and also set an optional password for the initial connect request.  If the sysha database does not yet exist on the primary node, it will automatically be created in the root chunk.  The optional password is only used in the initial connection from the RSS node to the primary.  After taking a full system backup on the primary and restoring it on the secondary (again a physical restore), the setup is completed by issuing onmode -d RSS on the RSS node.  This will cause a network connection to be established with the primary and replication to be established.  If a password was used as part of  the onmode -d add RSS command on the primary, then the same password is required as part of the onmode -d RSS command.
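The command steps from table 2 can be assembled into an ordered plan. The sketch below only builds the command strings and does not run anything. The function name and the node names are hypothetical, and it uses the ontape variants from the table rather than onbar.

```python
def rss_setup_commands(rss_node, primary_node, password=None):
    """Return (where to run it, command) pairs for steps 3 to 6 of table 2.

    The password is optional. If it is given on the primary, the same
    password has to be given on the RSS node.
    """
    suffix = f" {password}" if password else ""
    return [
        ("primary", f"onmode -d add RSS {rss_node}{suffix}"),  # step 3
        ("primary", "ontape -s -L 0"),                         # step 4
        ("rss", "ontape -p"),                                  # step 5
        ("rss", f"onmode -d RSS {primary_node}{suffix}"),      # step 6
    ]


if __name__ == "__main__":
    # Hypothetical node names and password, for illustration only.
    for where, command in rss_setup_commands("rss_node1", "primary_node", "secret"):
        print(f"{where:8s} {command}")
```

Steps 1 and 2 (chunk files, UDRs and DataBlades) are manual and are left out of the plan.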

The Shared Disk Secondary (SDS)

While the HDR secondary and RSS nodes maintain both the buffer cache and a disk copy of the database chunks, the shared disk secondary node only maintains the buffer cache. Instead of maintaining a copy of the chunks on local disk, the SDS node uses the same physical disks as the primary on a shared disk subsystem such as Veritas or GPFS. We implemented the Shared Disk Secondary to take advantage of newer disk technology. For instance, the customer might want to have a standby instance but use disk mirroring or some other hardware availability solution to provide the disk redundancy.

The setup of the SDS node is a different process than the setup of the HDR secondary or RSS nodes. Instead of performing a backup of the primary node and a physical restore on the secondary node, the SDS node is instantiated by simply issuing a checkpoint on the primary node and having the SDS node start the roll-forward of the logs as of that checkpoint LSN. As the primary is flushing logs to disk, it sends the LSN that it has flushed to the SDS node. The SDS node will then read and process the logs up to that LSN. As the SDS node is processing log records, it sends a notification to the primary as to how far in the logs it has processed. That way, the primary is able to determine when it is s