DB2 10: Multi Temperature Data Management
Comments (6) Visits (6811)
A few years ago I sat in a customer briefing and the customer representative summed up their database philosophy as "pack rats".
What she meant was that the data warehouse is ever growing at a high rate and nothing ever gets deleted.
Over the years DB2 has introduced numerous features which allow for ever growing databases such as:
DB2 welcomes all pack rats.
Being a pack rat however does have downsides however: Storage is expensive.
Or rather storage is expensive if you care for performance.
This PC that I use to write this article on uses an solid state drive (SSD).
Its pricey, which is why I only use it for the OS where I really appreciate the drive's fast random I/O.
In addition I have a much larger regular hard drive (HDD) which is cheaper, slower and stores any programs and documents.
Beyond that I also have a network attached storage (NAS) on which I store media files and backups.
This is my slowest storage since it's accessed via my wireless router.
My personal setup somewhat reflects what you see in IT shops, except that the price tags are much higher.
No company can afford to store all the data that is collected over the years on the very best technology.
The question is: How do you make the best of what you can afford?
Multi Temperature Data
It is reality that some data within a the database is more frequently accessed that other.
There may be tables that are rarely used compared to other that are central to every transaction and get accessed tens of thousands of times a second.
What you then do is to place those "cold" tables in cheaper, slower storage while placing the central "hot" tables onto faster, expensive storage.
In between there may be moderately "warm" tables which you place appropriately.
Prior to DB2 10 you did this by creating hot, warm and cold table spaces and then define tables within these tablespaces.
In DB2 10 a new concept of a STORAGE GROUP slides between the storage and the tablespace.
This makes management easier.
You can see which storage group the database has using the ADMI
IBMSTOGROUP is the default storage group.
In this case DB2 is using my D: as default because that's where DB2 itself was installed.
So this is my HDD.
I want to reserve that for WARM data.
So the first I do is rename the storage group to reflect this.
Further more I want to specify some data I have on the drive.
The average latency is 4.17ms and it's said to handle 300MB/sec.
(I have no clue whether my BUS can handle this, hardware is not my strength)
RENAME STOGROUP IBMSTOGROUP TO WARM;
Next I want to hook up my SSD drive and the NAS:
CREATE STOGROUP HOT ON 'C:' OVERHEAD 0.1 DEVICE READ RATE 200;
I'm making up the 20ms latency.
When I now create table spaces I merely specify the storage group.
Then I define my tables.
CREATE TABLESPACE H_tsp MANAGED BY AUTOMATIC STORAGE USING STOGROUP HOT; CREATE TABLESPACE W_tsp MANAGED BY AUTOMATIC STORAGE USING STOGROUP WARM; CREATE TABLESPACE C_tsp MANAGED BY AUTOMATIC STORAGE USING STOGROUP COLD;
Really what we have done so far is not all that special.
We could have achieved the same using the storage paths of CREATE TABLESPACE.
Let's add another wrinkle.
Range partitioning and multi temperature data
I have been speaking about tables which need to be placed on different quality storage.
That is a simplistic environment however.
A much more interesting scenario is to have data with very different access patterns within the same table.
Imagine your bank account information.
Every time you do online banking you are likely to look at your recent transactions as well as pay some bills.
Once in a while you may want to retrieve some older information.Another time perhaps you are being audited and you need to get to some really old information.
Perhaps some company claims you didn't pay a bill on time and you want to see what happened.
Your bank is likely to preserve even these old transactions.
But there is no expectation that they can be retrieved as quickly as your recent history.
Therefore it makes sense to place the data onto different classes of storage.
If we use range partitioning this can be achieved:
CREATE TABLESPACE Q4_2010_tsp MANAGED BY AUTOMATIC STORAGE USING STOGROUP COLD; CREATE TABLESPACE Q1_2011_tsp MANAGED BY AUTOMATIC STORAGE USING STOGROUP WARM; CREATE TABLESPACE Q2_2011_tsp MANAGED BY AUTOMATIC STORAGE USING STOGROUP WARM; CREATE TABLESPACE Q3_2011_tsp MANAGED BY AUTOMATIC STORAGE USING STOGROUP WARM; CREATE TABLESPACE Q4_2011_tsp MANAGED BY AUTOMATIC STORAGE USING STOGROUP HOT; CREATE TABLESPACE Q1_2012_tsp MANAGED BY AUTOMATIC STORAGE USING STOGROUP HOT;
Let's fill this table with some random data.
BEGIN DECLARE i INTEGER DEFAULT 0; WHILE i < 1000000 DO INSERT INTO transactions VALUES (i, RAND() * 1000, TIME
Now we check the performance for some of the quarters.
We test how often we can sum up all the amounts for a random account within a quarter.
CREATE OR REPLACE PROCEDURE drive(from TIMESTAMP, to TIMESTAMP) BEGIN DECLARE vi INTEGER DEFAULT 0; DECLARE vstart TIMESTAMP; DECLARE vstop TIMESTAMP; DECLARE vsum DECFLOAT; DECLARE vcount INTEGER; DECLARE vaccount INTEGER; WHILE vi < 5 DO SET vcount = 0; SET vstart = CURRENT TIMESTAMP; WHILE vstart + 5 second > CURRENT TIMESTAMP DO SET vaccount = RAND() * 1000; SELECT SUM(amount) INTO vsum FROM transactions WHERE stamp BETWEEN from AND to AND account = vaccount; SET vcount = vcount + 1; END WHILE; CALL DBMS
First the last quarter of 2011 which is on a HOT storage group.
To get good information we must disable self tuning of the bufferpool and keep it fairly small.
ALTER BUFFERPOOL IBMDEFAULTBP SIZE 250; SET SERVEROUTPUT ON; CALL drive('2011-10-01', '2011-12-31'); Time: 2012
Over five iterations of five seconds we averaged 120 iterations.
How does this compare to a quarter on warm storage?
CALL drive('2011-07-01', '2011-09-30'); Time: 2012
Much slower. Now for the data in the cold storage.
CALL drive('2010-10-01', '2010-12-31'); Time: 2012
So depending on the timestamp of the transaction we get different performance.
This is what we wanted
But... time keeps progressing.
Just like in fashion what's hot now is no so hot in three months
Managing the data life cycle
A quarter has passed and we must make room for second quarter 2012.
Space on the SSD drive is limited.
So we need to make room.
This is where storage groups really shine.
And this is why DB2 10 provides so much value in this area.
Since DB2 has slipped an abstraction layer between the storage path and the tablespace it is possible to move tablespace from one storage path to another.
And best of all, this is done online and in the background with a simple ALTER TABLESPACE statement
ALTER TABLESPACE Q4_2011_tsp USING STOGROUP WARM; COMMIT; CALL drive('2011-10-01', '2011-12-31'); Time: 2012
I was hoping to make the drop visible. But it happened to quickly.
Housekeeping slowed down the first few seconds and then performance was already where it should be for WARM data.
Perhaps we have more luck pushing out Q1 of 2011 into COLD storage.
ALTER TABLESPACE Q1_2011_tsp USING STOGROUP COLD; COMMIT;
The effect is immediate.
Clearly the REBALANCE that is happening in the background has an impact.
The fact that my hard drive has only a single spindle and this is a PC to begin with is exaggerating the overhead.
Luckily the effect doesn't last long.
CALL drive('2011-01-01', '2011-03-31'); Time: 2012
The system is back to normal and all quarters have been shifted.
The free room on the HOT storage can now be re-used to ATTACH a new partition for a new quarter.