In October 2009, IBM announced a new technology, aimed primarily at online transaction processing (OLTP) scale-out clusters, called IBM DB2 pureScale. DB2 pureScale is a new feature that provides scale-out active-active services for IBM DB2 running on AIX on Power Systems servers. It's designed to deliver the highest levels of distributed availability and scalability, wrapped in a well-thought-out, up-and-running path that's much easier to operate than other clustered database systems. In this article, we'll give you the basics of DB2 pureScale from a technology perspective, showing how it delivers both transparent application scalability and extreme availability.
I see what you did there
If you're familiar with data sharing on DB2 for z/OS, then the DB2 pureScale architecture may look very similar; that's because it is! IBM took the fundamental tenets of DB2 for z/OS data sharing and coupled them with the most current distributed technologies to deliver unprecedented availability and scalability services to distributed platforms. We'd like to note one thing here: DB2 running on System z servers already delivers first-rate availability. For example, Toronto Dominion Bank (TD Bank) has had 100 percent availability of customer information for 10 consecutive years, including two DB2 for z/OS upgrades during that timeframe. Even the CEO of our biggest competitor said of DB2 for z/OS: "It's a first-rate piece of technology."
Figure 1 shows an example of a DB2 pureScale environment. A DB2 server that belongs to a pureScale cluster is called a member. Each member can simultaneously access the same database for both read and write operations. Currently, the maximum number of members in a pureScale cluster is 128.
Figure 1: In a DB2 pureScale cluster, each member has direct memory-based access to the centralized locking and caching services of the PowerHA pureScale server
The IBM PowerHA pureScale server provides centralized lock management services, a centralized global cache for data pages (known as the group buffer pool), and more. Each member in a DB2 pureScale data-sharing group can interact directly with the PowerHA pureScale server through an InfiniBand network using the User Direct Access Programming Library (uDAPL), a non-message-based protocol that provides each member with point-to-point connectivity to the centralized locking and caching services.
Local agents, cluster-wide reach
Transparent application scaling means that applications don't have to be cluster-aware to truly take advantage of the scale-out architecture. To deliver this scaling, DB2 pureScale uses remote direct memory access (RDMA) technology along with PowerHA pureScale technology to eliminate communication between members for lock management and global caching services.
RDMA allows each member in the cluster to directly access memory in the PowerHA pureScale server, and vice versa, in microseconds. For example, assume that Member 1 in Figure 1 wants to read a data page that isn't in its local buffer pool. DB2 assigns an agent (or thread) to perform this transaction. The agent then uses RDMA to directly write into the memory of the PowerHA pureScale server to indicate that it has interest in a given page (this is called a read-and-register request). If the page that Member 1 wants to read is already in the global centralized buffer pool, the PowerHA pureScale server will push that page directly into Member 1's memory instead of having the agent on that member perform the I/O operation to read it from disk. Effectively, RDMA allows a member's agent to simply perform what appears to be a local memory copy operation, when in fact the target is the memory address of a remote machine.
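The read-and-register flow described above can be sketched in a few lines of Python. This is a minimal simulation of the idea, not IBM's implementation: the class and method names (`PowerHAServer`, `read_and_register`, and so on) are illustrative, and plain dictionaries stand in for what is really direct RDMA access to remote memory.

```python
class PowerHAServer:
    """Stands in for the PowerHA pureScale server's centralized services."""
    def __init__(self):
        self.group_buffer_pool = {}   # page_id -> page image
        self.interest = {}            # page_id -> set of interested members

    def read_and_register(self, member_id, page_id):
        # Register the member's interest in the page, then return the page
        # image if the group buffer pool already holds it (else None).
        self.interest.setdefault(page_id, set()).add(member_id)
        return self.group_buffer_pool.get(page_id)


class Member:
    """A DB2 member with its own local buffer pool and disk access."""
    def __init__(self, member_id, server, disk):
        self.member_id = member_id
        self.server = server
        self.disk = disk              # page_id -> page image on shared disk
        self.local_buffer_pool = {}

    def read_page(self, page_id):
        if page_id in self.local_buffer_pool:
            return self.local_buffer_pool[page_id]
        # Read-and-register: one round trip to the centralized server.
        page = self.server.read_and_register(self.member_id, page_id)
        if page is None:
            page = self.disk[page_id]  # group-buffer-pool miss: disk I/O
        self.local_buffer_pool[page_id] = page
        return page
```

Note the two outcomes of `read_page`: a hit in the group buffer pool avoids the disk I/O entirely, while a miss falls back to the shared disk — but in both cases the server ends up knowing which members hold the page.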
These lightweight remote memory calls, along with a centralized buffer pool and lock management facilities, mean that an application does not have to connect to the member where the data already resides to achieve scalability. It is just as efficient for any member in the cluster to receive a data page from the global buffer pool, regardless of the size of the cluster. Most RDMA calls are so fast that the DB2 agent making the call doesn't even need to yield the CPU while waiting for the response. For example, to notify the PowerHA pureScale server that a row is about to be updated (and therefore an X lock is required), a member's agent performs a Set Lock State (SLS) request by writing the lock information directly into the PowerHA pureScale server's memory. The entire round-trip for this SLS operation can take less than 15 microseconds and therefore the agent likely doesn't need to yield the CPU.
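To make the Set Lock State idea concrete, here is a hypothetical sketch of a centralized lock table. The names are illustrative, not DB2 internals, and the single method call stands in for what is really one RDMA write into the PowerHA pureScale server's memory.

```python
class GlobalLockManager:
    """Toy model of the centralized lock table on the PowerHA pureScale server."""
    def __init__(self):
        self.locks = {}   # resource -> (mode, owning member)

    def set_lock_state(self, member_id, resource, mode):
        # Grant the lock if the resource is free or already owned by the
        # requester; otherwise refuse and let the caller retry.
        held = self.locks.get(resource)
        if held is None or held[1] == member_id:
            self.locks[resource] = (mode, member_id)
            return True
        return False


glm = GlobalLockManager()
granted = glm.set_lock_state(1, "row:42", "X")       # member 1 gets the X lock
contended = glm.set_lock_state(2, "row:42", "X")     # member 2 is refused
```

Because the real round trip completes in roughly 15 microseconds, a refused agent can simply poll (spin) for the grant rather than yield the CPU, which is exactly the behavior the article describes.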
Does your cluster know where your pages are?
DB2 pureScale takes availability to a whole new level. If a member in a DB2 pureScale cluster fails, DB2 provides full access to every page of data that doesn't need recovery. What's more, without performing a single I/O operation, DB2 is aware at all times of the specific pages that are in need of recovery. How does this happen? Every time a member reads a page into its buffer pool, the PowerHA pureScale server keeps track of this "interest," along with requests from members to update rows on those pages. Whenever an application commits a transaction, dirty pages are written directly into the PowerHA pureScale server. If any member fails, the PowerHA pureScale server has a list of pages that the failed member was in the process of updating, as well as the pages that were updated and committed by the failed member but weren't yet written to disk.
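The bookkeeping just described can be sketched as a small data structure. Again, this is an assumed, simplified model of the behavior the article describes, with invented names (`RecoveryTracker`, `begin_update`, and so on), not DB2 source code.

```python
class RecoveryTracker:
    """Toy model of the server-side bookkeeping that makes instant
    failure assessment possible without any I/O."""
    def __init__(self):
        self.updating = {}      # member -> pages with in-flight updates
        self.dirty_pages = {}   # page -> committed image not yet on disk

    def begin_update(self, member_id, page_id):
        # The member declares update interest before touching the page.
        self.updating.setdefault(member_id, set()).add(page_id)

    def commit(self, member_id, page_id, image):
        # At commit, the dirty page is pushed into the group buffer pool,
        # so it is safe even if the member dies before writing to disk.
        self.updating.get(member_id, set()).discard(page_id)
        self.dirty_pages[page_id] = image

    def pages_needing_recovery(self, failed_member):
        # Only pages with in-flight (uncommitted) updates from the failed
        # member need recovery; committed dirty pages are already cached.
        return set(self.updating.get(failed_member, set()))
```

The key consequence, which the next paragraphs rely on, is that the recovery set is known in memory the instant a member fails: no log scan or disk read is needed just to find out *which* pages are at risk.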
When a failure occurs in a shared-disk cluster, it's critical that no other node in the cluster reads or updates any pages on disk that haven't yet been recovered. Because the PowerHA pureScale server knows which pages were in the process of being updated by the failed node, and it already holds that member's dirty committed pages in its centralized buffer pool, DB2 pureScale doesn't need to block other members from continuing to process transactions while it locks only the pages that need recovery.
What's more, the act of recovery in DB2 pureScale happens very quickly. Each member has processes that sit idle but are ready if a failure occurs. Should a member fail, one of these recovery processes is activated. Since these processes already exist, the operating system doesn't waste valuable time creating a process, allocating memory to it, and so on. The recovery process instantly begins to prefetch dirty pages from the centralized buffer pool into its own local buffer pool. In the majority of cases, this recovery won't require additional I/O operations, because the pages that need recovery are probably already in the centralized buffer pool and can be transferred in microseconds using RDMA. Meanwhile, applications on all other members continue to process transactions against any page that doesn't need recovery, and can still read pages from disk, because the PowerHA pureScale server knows which pages on disk are clean and which need recovery.
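The "idle but ready" recovery process can be illustrated with a pre-created worker thread that blocks until a failure is reported, so no process-creation cost lands on the recovery path. This is a hedged sketch of the concept using Python's standard `threading` and `queue` modules; the structure and names are assumptions for illustration only.

```python
import queue
import threading

failures = queue.Queue()   # failure notifications: (member_id, pages to recover)
recovered = []             # pages pulled back into a surviving member

def recovery_worker(group_buffer_pool):
    # Created ahead of time; blocks here until a failure is reported.
    member_id, pages = failures.get()
    for page_id in pages:
        # Prefetch each page image from the centralized buffer pool
        # (a microsecond-scale RDMA transfer in the real system).
        recovered.append((page_id, group_buffer_pool[page_id]))
    failures.task_done()

# Pre-spawn the worker while the cluster is healthy.
gbp = {"P2": "latest committed image"}
worker = threading.Thread(target=recovery_worker, args=(gbp,), daemon=True)
worker.start()

# A member fails; the pre-existing worker wakes and recovers its pages.
failures.put((1, ["P2"]))
failures.join()
```

After `failures.join()` returns, `recovered` holds the page images pulled from the centralized pool — without a single disk read, matching the common case the article describes.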
For typical transactional workloads, the time from the member failure until the pages are recovered and available to other transactions is usually 20 seconds or less. Note that this recovery time includes the failure detection time, which many vendors exclude when quoting recovery times.
Finally, it is worth noting that components in the cluster, including the PowerHA pureScale server itself, are redundant. DB2 pureScale allows duplexing of the PowerHA pureScale server capability, so that locking information and shared cache information are stored in two separate locations in case the primary fails.
We've only scratched the surface of pureScale here. A lot of engineering is going on behind the scenes, so be sure to check out the links in the Resources sidebar. The bottom line is that for applications that need the highest levels of availability in a scale-out active-active configuration, DB2 pureScale delivers leading-edge capabilities to enhance your business continuity. And, with transparent application scalability, you no longer need to build cluster-aware applications in order to scale out to larger numbers of servers.