pureScale on Linux
pureScale delivers a highly available, scalable database clustering solution on commodity hardware. It is mainly aimed at OLTP workloads, and I believe it delivers on the promise that Oracle RAC has been making for several years without quite delivering. It provides these benefits without the need for applications to be made cluster aware.
I think a lot of companies are looking to their IT to be scalable and flexible these days. Imagine you could buy a small number of commodity servers to run your application(s) on, then simply add more when you need more resources. Swap servers in easily to help out with the processing load of end of financial year number crunching or the holiday rush on your online store. Pay the license fees and running costs when you use extra servers, and then easily shut them down and stop paying when they are not needed.
I guess this kind of flexibility is (arguably) available as follows:
1) Mainframe with virtualization. This is a great solution for those who have the skill set and budget for it. In my experience many companies are not ready for this. It's also difficult to shut down part of a mainframe!
2) Cloud computing. A great solution if you can "cloud enable" your applications and you trust the cloud service provider (a lot). Again many are not ready for this.
pureScale (especially on Linux / System x, as it is available now) can give this kind of flexibility in a much more accessible package.
As part of my job here in Dublin with IBM I'm currently building a 6 node pureScale cluster on System x with SLES 10. Watch this space for more on this...
...oh yes and please comment if you want to see more.
Just a brief look at the architecture of a pureScale cluster at a very high level. Questions welcome.
A DB2 pureScale cluster is made up of a number of servers which are connected together, a shared area of disk and some software, all working together to provide a high performance and resilient database.
The cluster is made up of a number of "controllers", known as cluster caching facilities (CFs), and a number of members.
There are normally 2 or more members. The number of members can be increased to add more processing power to the cluster.
I guess you might ask "why is this relevant?". Well, 10 microseconds is approximately the time taken for a pureScale member to communicate with the central cache to look for a piece of data. Let's call this a "pureScale communication" for the sake of simplicity. More on the technicalities of pureScale communication, Remote Direct Memory Access (which facilitates this communication) etc. in the next blog entry, but for now...
...have you ever stopped to think what length of time 10 microseconds represents?
A microsecond is 10 to the minus 6 seconds, or one second divided by a million. I think this is so small a number that it is hard for us to understand. I looked for some examples to illustrate just how fast this is, and there are some here on Wikipedia, but nothing that is intuitively understandable (at least not to me).
I thought I could find something to explain this, and here are a couple of things that are quite quick:
I give up, all I can say is 10 microseconds - that's fast, very fast!
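For what it's worth, a little arithmetic helps put the number in perspective. The sketch below is only an illustration: the 10 microsecond figure comes from above, while the comparison with the speed of light is my own choice of yardstick and says nothing about how pureScale itself works.

```python
# Back-of-the-envelope arithmetic for a 10 microsecond communication.
PURESCALE_COMM_SECONDS = 10e-6          # figure quoted in this post
SPEED_OF_LIGHT_M_PER_S = 299_792_458    # metres per second in a vacuum

# How many back-to-back 10 microsecond communications fit in one second?
comms_per_second = round(1 / PURESCALE_COMM_SECONDS)

# How far does light travel during one such communication?
light_distance_m = SPEED_OF_LIGHT_M_PER_S * PURESCALE_COMM_SECONDS

print(f"{comms_per_second:,} communications per second")
print(f"light travels about {light_distance_m / 1000:.1f} km in that time")
```

In other words, a hundred thousand of these communications fit into a single second, and even light only covers about 3 km in the time one of them takes.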
A quick word on what circumstances pureScale is best suited to.
First, what it is not suited to: data warehouse type applications. It is a shared disk solution and as such is not really suitable for data warehousing, because large transactions tend to be the main workload in such an environment.
It is suited to OLTP loads.
Do you need to come up with a database solution for your application? This could be a new build or replacing old hardware and software.
Do you have an application that generates a lot of small or smallish transactions?
Do you need continuous availability and built-in resilience?
Do you need to be able to ramp up the capacity of your system easily in the future rather than buying all of the hardware and licenses you might need over the next 2 - 5 years now?
If the answer is yes to most of these questions then pureScale is for you.
The question of preventing a split-brain scenario comes up again and again with regard to pureScale (PS).
The scenario is as follows:
In a standard PS setup we have a primary and a standby CF. If the connection between these two machines fails but both keep going, then the standby CF would "think" that the primary has failed and perform a failover. Now both CFs would take control of the shared data (the database) and the database would end up in a big mess. This would happen if the networking between the two machines broke down or if one got really busy and couldn't respond to the other fast enough.
Of course, if this were true then we would be in big trouble, but fortunately it is not. A technology called I/O fencing is used to ensure the above scenario can't happen.
I/O fencing is implemented via SCSI-3 Persistent Reserve technology. The core of the technology involves "registration" and "reservation" rights to disk partitions. Registration allows access to data. Many nodes (members and CFs) can have "registration" access but only one can hold "reservation" on a partition. Registered nodes can eject others. Ejection is a final and atomic action. An ejected node cannot eject another node.
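The rules above can be sketched as a toy model. This is not the SCSI-3 command set and not how DB2 cluster services is written; the class, method and node names are all hypothetical, and releasing the reservation when its holder is ejected is an assumption of the sketch.

```python
class FencedPartition:
    """Toy model of the registration / reservation / ejection rules."""

    def __init__(self):
        self.registered = set()   # nodes currently allowed to access the data
        self.reserved_by = None   # at most one node holds the reservation

    def register(self, node):
        self.registered.add(node)

    def reserve(self, node):
        # Only a registered node may reserve, and only if no other node holds it.
        if node not in self.registered:
            return False
        if self.reserved_by not in (None, node):
            return False
        self.reserved_by = node
        return True

    def eject(self, by, target):
        # A node that has itself been ejected can no longer eject anyone.
        if by not in self.registered:
            return False
        self.registered.discard(target)
        if self.reserved_by == target:
            self.reserved_by = None   # assumption: ejection frees the reservation
        return True

    def can_access(self, node):
        return node in self.registered


disk = FencedPartition()
for name in ("cf_primary", "cf_standby", "member_1"):
    disk.register(name)

disk.reserve("cf_primary")
disk.eject(by="cf_primary", target="cf_standby")          # fence the standby
print(disk.can_access("cf_standby"))                       # False
print(disk.eject(by="cf_standby", target="cf_primary"))    # False
```

The last line is the important one: once the standby has been fenced it cannot fence the primary in return, so the two CFs can never both end up owning the shared data.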
Cluster services software on each node manages the various failover scenarios in the cluster. There are numerous failover scenarios and these things are worked out to the nth degree. In outline, if any failures are detected then all nodes work out what to do in a similar way.
First of all, what is a quorum? A quorum is a group of nodes in a cluster that can communicate with each other. The number of nodes in a quorum must be more than half of the total in the cluster or, if it is exactly half, the group must hold "reserve" on the tie break partition. If I am part of a quorum I can continue and take part in failover and recovery, the first part of which is to eject or fence any nodes that are not part of the quorum. This prevents the "bad" nodes from updating the shared data. If I am a "bad" node, i.e. not part of a quorum, I wait to regain access to the other nodes, and when I regain access I must undo anything I have done locally since the problem started (tidy up). I can then rejoin the cluster.
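The quorum rule is simple enough to write down as a small function. This is a minimal sketch of the rule as described above; the function and parameter names are hypothetical, not part of any DB2 interface.

```python
def has_quorum(reachable_nodes, total_nodes, holds_tiebreaker=False):
    """Return True if a group of nodes that can all talk to each other is a quorum.

    reachable_nodes  -- size of the group, counting this node itself
    total_nodes      -- number of nodes in the whole cluster
    holds_tiebreaker -- True if the group holds "reserve" on the tie break partition
    """
    if not 0 < reachable_nodes <= total_nodes:
        raise ValueError("reachable_nodes must be between 1 and total_nodes")
    if 2 * reachable_nodes > total_nodes:      # strictly more than half
        return True
    if 2 * reachable_nodes == total_nodes:     # exactly half: tie break decides
        return holds_tiebreaker
    return False                               # fewer than half: never a quorum


print(has_quorum(3, 4))                          # True
print(has_quorum(2, 4, holds_tiebreaker=True))   # True
print(has_quorum(2, 4, holds_tiebreaker=False))  # False
print(has_quorum(1, 4, holds_tiebreaker=True))   # False
```

Because at most one group can be more than half the cluster, and only one group can hold the tie break partition, two halves of a split cluster can never both decide that they are the quorum.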
There are two kinds of workload balancing (WLB):
1) Connection based WLB:
In summary this is based on routing new connections to the servers with the lowest load.
A list of information about servers is maintained.
This is updated regularly and when this is done a coordinating member will construct the server list.
Active members return their load information to the coordinating member. This includes hostname, port number, CPU load and memory load.
The coordinating member then sends the server list to the other members.
For each server, its weight is calculated by an algorithm based on that server's load information and the total number of servers.
A higher weight means there is a lower workload on that machine, so more workload should be sent to it.
The % workload being handled by a server is approximated as the number of connections the server is currently serving, as a share of the total number the entire cluster is serving.
% of workload to be sent to member = this member's weight / (total of other member weights).
New connections are sent to servers where the "% workload being handled" is under the "% of workload to be sent to member"
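The routing decision can be sketched roughly as follows. This is an illustration under stated assumptions, not the driver's actual algorithm: the function and member names are hypothetical, the target share is normalized by the total of all member weights (so that the shares add up to 100%), and when several members are under their target the sketch simply picks the one furthest below it.

```python
def pick_member(weights, connections):
    """Choose the member that should receive the next new connection.

    weights     -- member name -> weight from the server list (higher = less loaded)
    connections -- member name -> number of connections currently served
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one member needs a positive weight")
    total_connections = sum(connections.values())

    best_member, best_gap = None, None
    for member, weight in weights.items():
        # Share of the workload this member should be getting.
        target_share = weight / total_weight
        # Share of the workload it is handling now, approximated by connections.
        if total_connections:
            current_share = connections.get(member, 0) / total_connections
        else:
            current_share = 0.0
        gap = target_share - current_share   # positive = under its target
        if best_gap is None or gap > best_gap:
            best_member, best_gap = member, gap
    return best_member


weights = {"member1": 60, "member2": 30, "member3": 10}
connections = {"member1": 40, "member2": 40, "member3": 20}
print(pick_member(weights, connections))   # member1
```

Here member1 should be taking 60% of the work but is only handling 40% of the connections, so it gets the next one.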
2) Transaction based WLB.
This works in a similar way to the above and involves the server list.
Because we are not dealing purely with new connections as above, existing connections need to be actively rerouted to different members to rebalance the workload.
This works as follows.
A transport pool is maintained on each member. Each connection can be moved from one member to another (by disassociating it from a transport on the first server and associating it with a transport on the second server).
After every 8 transactions or 2 seconds whichever comes first, each server will attempt to re-balance workloads by moving the logical connections if necessary.
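The "8 transactions or 2 seconds, whichever comes first" rule can be sketched as a small trigger object. This is only an illustration: the class and method names are hypothetical, and as a simplification the time limit is only checked when a transaction completes.

```python
import time


class RebalanceTrigger:
    """Signals that a rebalance check is due after N transactions or T seconds."""

    def __init__(self, max_transactions=8, max_seconds=2.0, clock=time.monotonic):
        self.max_transactions = max_transactions
        self.max_seconds = max_seconds
        self.clock = clock            # injectable so the timing can be tested
        self._reset()

    def _reset(self):
        self.count = 0
        self.started = self.clock()

    def transaction_done(self):
        """Record one finished transaction; return True if a rebalance is due."""
        self.count += 1
        elapsed = self.clock() - self.started
        if self.count >= self.max_transactions or elapsed >= self.max_seconds:
            self._reset()             # start counting towards the next check
            return True
        return False


trigger = RebalanceTrigger()
results = [trigger.transaction_done() for _ in range(8)]
print(results[-1])   # True: the eighth transaction triggers a rebalance check
```

When the trigger fires, the logical connection would be considered for a move to a less loaded member, using the same server list and weights as connection based WLB.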
WLB for pureScale involving J2EE is configured in the J2EE driver file.
db2pd -serverlist shows the currently cached serverlist on this member (note priority and weight are synonymous).
We are currently building nanoclusters in several locations worldwide, including here in Dublin. Somehow the lab guys have come up with a way to get pureScale to work on about $500 worth of hardware! Respect!
The nanocluster is a pureScale cluster built on 3 Intel Atom boxes, with one acting as storage and running the demo software, and the other two each hosting one CF and one member. Gigabit Ethernet is used between the nodes.
The nanoclusters (obviously) do not consist of a supported configuration and are not for production use.
Nanoclusters are a great way to get pureScale out there for demos and for clients, ISVs and partners to try it out for themselves. The cluster comes pre-packaged with demos and instructional software.
If you want to get your hands on one of these little beauties, please contact your friendly local sales team, avalanche team or me for more information or a demo.
Unfortunately we can't give the nanoclusters out on loan for extended periods but we will also be releasing instructions and code to allow you to build your own.
Watch this space...
CiaranDeB 2700033FRG 1,573 Visits
Please see here for a form to request access.
We are currently setting up the TPC-C benchmark on the cluster. TPC-C is the standard benchmark for OnLine Transaction Processing ( http://www.tpc.org/tpcc/ ). We will be doing some test runs on the pureScale cluster and some tuning to see what kind of throughput we can get out of the cluster for typical OLTP workloads. We will start out with 4 nodes with the default parameters, then start tuning and tweaking. Please let me know if you want to know more.
Outline spec of what we are currently working on...
Make the cluster setup as "production like" as possible, and be able to walk clients through its various features.
Be able to easily adjust the workload profile to have different mixes of short read, update and insert workload and longer ad-hoc queries, to be similar to a client's real workload.
Base the demo workload on a well known benchmark for OLTP systems (TPC-C).
Use the Technology Explorer front end.
Demonstrate workload balancing scenarios:
Client affinity for ad-hoc queries
Mixture of the above
Demonstrate failover scenarios and demonstrate continuous availability:
Member fail, various reasons
Standby CF fail, various reasons
Multiple failures, various reasons
Reproduce failover scenario and demonstrate minimal downtime:
Main CF fails
Monitoring information on DB2 on members. Show graphically second by second:
OPM dashboard is a useful overview on an aggregate (minute by minute) basis.
Workload/throughput information. Show graphically second by second:
Transactions per member for read and non-read (insert, update, delete) workloads
Overall throughput of transactions.
Automate the setup and tear down as far as possible.
Document the setup and running of the demo so that we can reproduce and also so that others can.