Programmers only: New ORDER BY information, Part 2

The impact of using the RANDOM index option on ORDER BY sort avoidance

Bonnie Baker helps you hone your SQL skills for DB2 for z/OS. In this column, she reviews sort avoidance concepts and sheds light on the costs and benefits of the RANDOM option for indexes.

This article was originally published in IBM Data magazine.

Share:

Bonnie Baker, Owner

Bonnie Baker specializes in teaching on-site classes for corporations, agencies, and DB2 user groups. She is an IBM DB2 Gold Consultant, an IBM Information Champion, a five-time winner of the IDUG Best Speaker award, and a member of the IDUG Speakers' Hall of Fame. She is best known for her ability to demystify complex concepts through analogies and war stories.



02 May 2011

Also available in Chinese

In the last issue, I began a series of columns concerning new aspects of ORDER BY. This column—Part 2—covers the impact of using the CREATE/ALTER INDEX RANDOM order option on sort avoidance. It also covers the issue of coding onlyGROUP BY COL1, COL2 versus coding bothGROUP BY COL1, COL2 and a follow-up ORDER BY COL1, COL2 in a SQL statement.

A brief review of sort avoidance

Sort syntax such as ORDER BY and GROUP BY does not necessarily cause a data sort. With the appropriate access path, the ORDER BY or GROUP BY requirement can be met without sorting. (An appropriate access path means that an index is available and used so that the index drives the access to the data. DB2 can then return qualified rows to the program in the desired order, the order of the index.)

But what happens when the order of the index column is random? Before DB2 9 for z/OS, we could CREATE indexes with columns that had ascending key values and/or descending key values. As of DB2 9, we can also CREATE indexes with columns that we designate as RANDOM. When we use this option, key values are encoded so that the values are stored randomly. Note that identical key values will encode the same and will therefore be contiguous. Let’s look at an example of index data where one of the columns is RANDOM. Assume an index on DEPTNO ASCENDING, LASTNAME RANDOM, EMPNO ASCENDING in which the decoded values are as follows:

ASC DEPTNORANDOM LASTNAMEASC EMPNO
A01 Yothers12345
A01 Josten21234
A01 Josten21235
A01 Miller14567
A01 Miller14568
A01 Miller14569
B01 Zagelow16657
B01 Bossman15678
B01 Bossman15679

Notice that DEPTNO is in ascending order. LASTNAME is in random order within each DEPTNO, but like values are contiguous. EMPNO is in ascending order within the like values of LASTNAME.


The costs and benefits of RANDOM

Why would anyone want to keep index data in random order? Currently, online transaction processing (OLTP) workloads in a data sharing environment can experience contention on index pages, especially during INSERTs, when an application program INSERTs into indexes with columns that contain current timestamps or ever-increasing values. These column types often create insertion hot spots on index pages. Then applications must wait to acquire busy index pages.

We can use randomized index key columns to reduce contention, but at what cost? There may be more CPU usage and getpage operations, as well as more index page read and write I/Os. The RANDOM option is useful when ascending insertions or hot spots cause this contention but the resulting cost is not prohibitive.

Another possible issue is that while each distinct column value is stored randomly, like-kind values are contiguous. Therefore a randomized index column can relieve contention problems for sets of similar or sequential values, but it's no help with identical values. Identical values encode the same, and each is inserted at the same place on the index tree.

Here are some interesting facts (some beneficial and some not) about the RANDOM column option.

It allows equality predicate lookups, such as LASTNAME = :LN, but it does not support matching range lookups, such as LASTNAME > :LN. The option can be used in non matching index scans in a screening fashion and as part of index-only access. Even though values are stored as random encoded values, we can retrieve the original, decoded value of the column. The option causes RUNSTATS to populate HIGH2KEY and LOW2KEY with the original, decoded value of the column. Finally, the RANDOM column option cannot be specified in the following cases:

  • For an index key column that is of variable length
  • If the index is created with the NOT PADDED option
  • As part of the GENERATE KEY USING clause
  • If the RANDOM column is used to determine the partition location of a table row

Also, DB2 cannot use random order index columns as part of a sort merge join. If a join is between one table that has an ascending index on the join column and a second table that has a randomized index column, the indexes are not in the same order and cannot be merged.


Sort avoidance with RANDOM columns

Now, what happens if the index that would have been ideal for DB2 to use to avoid a data sort is not an ordered list, but rather random?

For the examples that follow, we will use our three-column index on DEPTNO, LASTNAME RANDOM, EMPNO to determine how DB2 might (or might not) use an index for sort avoidance.

1)SELECT DEPTNO, LASTNAME, … FROM BIGTABLE
WHERE DEPTNO in ('A01', 'B01')
ORDER BY DEPTNO

For this ORDER BY, DB2 has no problem with the RANDOM column. The RANDOM column is not part of our ORDER BY. We can match on one column, and if we use the index the traditional way, the data will come back in DEPTNO order.

A01  	Yothers    …
A01  	Josten     …
A01  	Josten     …
A01 	Miller     …
A01 	Miller     …
A01 	Miller     …
B01  	Zagelow	   …
B01  	Bossman	   …
B01  	Bossman	   …

2)SELECT EMPNO, … FROM BIGTABLE
WHERE DEPTNO = 'A01'
AND LASTNAME = 'MILLER'
ORDER BY EMPNO

Again, there is no problem here with the RANDOM column. Because like-kind values of the RANDOM column are stored together, DB2 can do an equal lookup, matching on two columns of our index. And, if DB2 uses the index the traditional way, the data will come back in EMPNO order.

14567  	…
14568  	…
14569  	…

3)SELECT LASTNAME, EMPNO, … FROM BIGTABLE
WHERE DEPTNO = 'A01'
ORDER BY LASTNAME, EMPNO

Here DB2 can match on only one column, and since LASTNAME is random, DB2 must do a data sort to bring our rows back in LASTNAME, EMPNO order.

Josten  21234      …
Josten  21235      …
Miller  14567      …
Miller  14568      …
Miller  14569      …
Yothers 12345      …

4)SELECT LASTNAME, EMPNO, … FROM BIGTABLE
WHERE DEPTNO = 'A01'
AND LASTNAME BETWEEN 'MILLER' AND 'YOTHERS'
ORDER BY LASTNAME, EMPNO

In this example, DB2 can match only on DEPTNO and must do an index scan, screening on LASTNAME within that one value of DEPTNO to find all of the index entries where LASTNAME is between our low value and our high value. The index rows will not be in LASTNAME order within DEPTNO A01. Therefore, DB2 must sort to satisfy the ORDER BY.

Miller  14567      …
Miller  14568      …
Miller  14569      …
Yothers 12345      …

5)SELECT LASTNAME
FROM BIGTABLE
WHERE DEPTNO = 'A01'

In this example, there is no ORDER BY. In what order will our rows be returned? Since both LASTNAME and DEPTNO are part of the index, we will get index-only access. We have one matching predicate on DEPTNO. Since DB2 will only use the index the “traditional” way for index-only access, the rows will be returned just like our index data, in groups of identical LASTNAMEs in random order.

Yothers
Josten
Josten 
Miller  
Miller  
Miller

6)SELECT DEPTNO, LASTNAME, COUNT(*)
FROM BIGTABLE
GROUP BY DEPTNO, LASTNAME

In this example, DB2 will not have to sort. Our index is in DEPTNO order, and within that order, in groups of identical (but unordered) values of LASTNAME. Therefore, using control break logic, DB2 can return distinct DEPTNO, LASTNAME combinations without sorting. The rows will be in DEPTNO order but will not be in LASTNAME order within each DEPTNO. The answer is accurate because we did not ask that the rows be returned in ORDER, only GROUPed.

A01  	Yothers    1
A01  	Josten     2
A01 	Miller     3
B01  	Zagelow	   1
B01  	Bossman	   2

Because LASTNAME is random in the index, if we want our rows to be in LASTNAME order within DEPTNO, we must add an ORDER BY clause. This requirement to add an ORDER BY clause in addition to the GROUP BY clause is new with the RANDOM option. Before DB2 9, we needed to add an ORDER BY only if we had more than one index that could be chosen to do the grouping. Remember, we do not have an ASC or DESC option on GROUP BY. Therefore, if we have an INDEX on COL1 ASC, COL2 DESC, as well as an index on COL1 DESC, COL2 DESC, either index can be used to avoid the GROUP BY sort, and the one chosen will determine the order of the output rows. To avoid unpredictable output in this situation, we would need to code the ORDER BY clause (even before DB2 9).


A more realistic example

Consider the following realistic example of a situation where the RANDOM option might be preferable.

A user defines an index on the INVOICE_NUMBER column of the INVOICE_MASTER table in ASCENDING order. The user then inserts rows with an ascending sequence of values for the INVOICE_NUMBER column (0000100, 0000200, 0000300, ...). The index rows will be inserted at the end of the index, creating a hot spot. In this particular index, the user looks up specific invoices only by their INVOICE_NUMBER—in other words, only equality predicates are applied to the INVOICE_NUMBER column. To reduce contention, DROP and re-CREATE the index using the RANDOM option.

With the RANDOM specification, the unique INVOICE_NUMBER values will now be stored randomly throughout the index, preventing the contention that results from always inserting values at the end. Because the common use of this index is for looking up specific invoices by INVOICE_NUMBER, the new RANDOM ordering option provides a good solution.


A caution regarding RANDOM

We must be careful with the RANDOM option and ensure that our solution is not creating more cost from increased CPU, I/O, and sort overhead. The option is best used when we are actually solving a contention problem, when our predicates on the random column are EQUAL predicates or, if not EQUAL, the random columns do not often appear in ORDER BY clauses. The benefit must outweigh the additional cost or the RANDOM option will cause more problems than it solves. And, once we begin using the RANDOM option, we must be very careful to code both a GROUP BY and an ORDER BY when order matters.

Resources

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=650754
ArticleTitle=Programmers only: New ORDER BY information, Part 2
publish-date=05022011