Migrating Recursive SQL from Oracle to DB2 UDB

After describing how recursive SQL works and what kinds of problems it solves, the author explains how to solve a problem which may occur when you migrate applications that use recursive SQL from Oracle to DB2 UDB.

Torsten Steinbach, Senior Technical Consultant for DB2 and WebSphere, IBM Germany

Photo: Torsten SteinbachTorsten Steinbach works as a Senior Technical Consultant for DB2 and WebSphere® for IBM Partner Enablement in EMEA. He can be contacted at torsten@de.ibm.com.



17 July 2003

©2003 International Business Machines Corporation. All rights reserved.

Important: Read the disclaimer before reading this article.

Introduction

IBM DB2 e-kit for Database Professionals

Learn how easy it is to get trained and certified for DB2 for Linux, UNIX, and Windows with the IBM DB2 e-kit for Database Professionals. Register now, and expand your skills portfolio, or extend your DBMS vendor support to include DB2.

Recursive SQL is a powerful means to query hierarchies of data. Organizational structures (department, sub-departments, sub-sub-departments and so on), discussion forums (posting, response, response to response etc.), bills-of-material, product classifications and document hierarchies are all examples of hierarchical data.

IBM® DB2® Universal DatabaseTM (UDB) is one of several relational database products that have implemented recursive SQL. In general the DB2 approach can be considered to be a highly powerful and flexible implementation. An example of DB2's recursive strength is the ability to query multiple hierarchies in a single DB2 table. (For more details about this, refer to Srini Venigalla's article Expanding Recursive Opportunities with SQL UDFs in DB2 v7.2.

If you are migrating your data from one RDBMS to another, it's important to know that the recursive SQL implementation varies from product to product. Specifically, in the section on The difference between Oracle and DB2 UDB, I'll explain a problem which regularly arises in migration projects from Oracle to DB2 involving recursive SQL.

The basic problem is that the default sort order of such queries is different in Oracle than in DB2. This doesn't sound important since a default sort order (without using an ORDER BY clause) should normally not be something an application should rely on. However in reality the Oracle-provided default sort order is used to solve a number of problems, such as displaying discussion threads. A lot of applications are based on the assumption of Oracle's sort order, and this must be understood when those applications are being migrated to DB2 UDB.

Besides merely explaining this problem I will of course also outline the solution for this dilemma in DB2. This is done in the section on Emulating the Oracle behavior in DB2 UDB.

To give the reader the necessary background information about recursion in general and recursive SQL specifically, I'll start by giving a short introduction to DB2 recursive SQL.


How does recursive SQL work?

A recursion is typically characterized by three basic steps:

  1. Initialization
  2. Recursion, or repeated iteration of the logic through the hierarchy
  3. Termination

In the initialization step, the work area is prepared and variables are set with initial values. One recursion consists of the business logic operating within this work area and then invoking the next recursion in a nested fashion. And finally the termination step is applied to limit the recursion. This can for instance mean to just count the nesting levels and stop execution when a certain level has been reached.

This principle also applies to recursive SQL in DB2. Recursive SQL is a query that can be divided into three execution phases:

  1. Create initial result set
  2. Recursion based on the existing result set
  3. Final query to return final result set

The initial result set is based on a regular SQL query on the base tables, which is the first part of a common table expression (CTE). This CTE is the vehicle to enable the recursion since the second part of refers to the CTE itself and joins it with the base tables. The query that selects from the CTE is the termination step.

The following example illustrates this process. Imagine the following table that contains some information about a department:

CREATE TABLE departments (deptid INT, 
				deptname VARCHAR(20), 
				empcount INT, 
				superdept INT)

The contents of this table represent a hierarchy. Figure 1 below is an example:

Figure 1. Example of a table hierarchy
Table hierarchy

To retrieve the number of employees of a given department including all sub-departments a recursive query is required:

WITH temptab(deptid, empcount, superdept) AS 
(		SELECT root.deptid, root.empcount, root.superdept 
		FROM departments root 
		WHERE deptname='Production' 
UNION ALL 
	SELECT sub.deptid, sub.empcount, sub.superdept 
	FROM departments sub, temptab super 
	WHERE sub.superdept = super.deptid 
) 
SELECT sum(empcount) FROM temptab

In this example the CTE is called temptab and it is growing as the query execution progresses. All three recursion elements are present:

  1. The initial result set is established in temptab. It contains the employee count of the department 'Production':

    SELECT root.deptid, root.empcount, root.superdept
    FROM departments root
    WHERE deptname='Production'
  2. The recursion takes place by joining each row in temptab with their sub-departments. The result of one execution of this recursion is added to temptab via UNION ALL:
    SELECT sub.deptid, sub.empcount, sub.superdept 
    FROM departments sub, temptab super 
    WHERE sub.superdept = super.deptid
  3. 3. The final query extracts the required information out of the CTE. In our example a summary aggregation is done:
    SELECT sum(empcount) 
    FROM temptab

Here is the result of the example query:

1 
----------- 
SQL0347W The recursive common table expression "TORSTEN.TEMPTAB" 
may contain an infinite loop. SQLSTATE=01605

		50

1 record(s) selected with 1 warning messages printed.

How DB2 executes this recursive query can be examined via the DB2 explain facility. The nested loop join (NLJOIN) is based on a temporary result table (TEMP) and the result of this join is put into this temporary table again via UNION.

Figure N. Explain of recursive SQL
Explain of recursive SQL

The difference between Oracle and DB2 UDB

With CONNECT BY PRIOR Oracle provides a similar feature. The above example would have been implemented in Oracle like this:

SELECT sum(empcount) 
FROM 
STRUCREL CONNECT BY PRIOR superdept = deptid 
START WITH deptname = 'Production';

Beside differing in syntax, DB2 and Oracle differ in functionality. Oracle provides the built-in pseudo column level when using CONNECT BY PRIOR. The following query in Oracle provides all departments and the hierarchy level they are in:

SELECT deptname, level 
FROM departments 
CONNECT BY PRIOR superdept = deptid 
START WITH deptname = 'Samples& Co.';

DEPTNAME LEVEL -------------------- -----------

Samples & Co. 			1 
Production 				2 
QA 					3 
Manufacturing 				3
Prebuilding 				4 
Finalbuilding 				4 
Sales 					2 
North 					3 
East 					3
South 					3 
West 					3 
IT 					2

This pseudo column is typically used to limit the recursion depth of such queries. For example, to retrieve only the direct sub-departments of the 'Sales' department, the following query could be used in Oracle:

SELECT deptname 
FROM departments 
CONNECT BY PRIOR superdept = deptid 
START WITH deptname = 'Sales' 
AND level=2;


DEPTNAME --------------------

North East South West

This feature can easily be emulated in DB2 by maintaining a custom pseudo-column in the CTE like this:

WITH temptab(deptid, deptname, superdept, level) AS 
( SELECT root.deptid, root.deptname, root.superdept, 1 
 FROM departments root 
 WHERE deptname='Sales' 
 UNION ALL 
 SELECT sub.deptid, sub.deptname, sub.superdept, super.level+1 
 	FROM departments sub, temptab super 
 	WHERE sub.superdept = super.deptid 
) 
SELECT deptname FROM temptab WHERE level=2;

Beside the level pseudo-column, another very important difference is the search order of the result set generated by recursive queries in DB2 and Oracle. In Oracle, the hierarchy is created by a search-depth-first algorithm. This provides a result set like the following when retrieving the entire example hierarchy:

SELECT deptname, level 
FROM departments 
CONNECT BY PRIOR superdept = deptid 
START WITH deptname = 'Samples & Co.;


DEPTNAME LEVEL -------------------------------

Samples & Co. 			1 
Production 				2 
QA 					3 
Manufacturing 				3
Prebuilding 				4 
Finalbuilding 				4 
Sales 					2 
North 					3 
East 					3
South 					3 
West 					3 
IT 					2

The result set demonstrates that each sub-node is first browsed completely before the query continues with the neighbor node. In DB2 however, the hierarchy is created via a search-width-first algorithm:

WITH temptab(deptid, deptname, superdept, level) AS
( SELECT root.deptid, root.deptname, root.superdept, 1
FROM departments root WHERE deptname='Samples &
Co.' UNION ALL SELECT sub.deptid, sub.deptname,
sub.superdept, super.level+1 FROM departments sub,
temptab super WHERE sub.superdept = super.deptid )
SELECT deptname, level FROM temptab;

DEPTNAME LEVEL -------------------- -----------

Samples & Co. 1 Production 2 Sales 2 IT 2 QA 3
North 3 East 3 South 3 West 3 Manufacturing 3
Prebuilding 4 Finalbuilding 4

This means the result set is created level-by-level. In this example this difference might not be a problem. But there are use cases of recursive SQL where the default sort order is of crucial importance. Imagine a table containing a discussion forum:

CREATE TABLE discussion (postid INTEGER, 
superid INTEGER, 
title VARCHAR2(100), 
text VARCHAR2(1000) )

To retrieve an overview over all discussion threads, the table could be queried like this in Oracle:

SELECT RPAD('', level-1, '--') || title FROM discussion 
CONNECT BY PRIOR superid = postid START WITH postid = 0;


1 
-------------------------------------

Install Problem 
--Re: Install Problem 
----Re: Install Problem 
------Re: Install Problem 
--------Got it
--General comment 
----Re: General Comment 
Cannot fin file 
--Re: Cannot find file 
----Re: Cannot find file
--Re: Cannot find file 
Help! Documentation missing!

With plain recursive SQL in DB2 this result set cannot be reproduced in this order. An attempt to do so would reproduce the following:

WITH temptab(superid, postid, title, text, level) 
AS
( SELECT root.superid, root.postid, root.title,root.text, 1 
	FROM discussion root 
	WHERE postid=0 
UNION ALL 
	SELECT sub.superid, sub.postid, sub.title,sub.text, 
	super.level+1 
FROM discussion sub, temptab super 
WHERE sub.superid = super.postid 
) 
SELECT
	VARCHAR(REPEAT('--', level-1) || title , 60) 
FROM temptab;


1
 ------------------------------------

Problem Discussions 
--Install Problem 
--Cannot find file 
--Help! Documentation missing! 
----Re: Install Problem 
----General comment 
----Re: Cannot find file
----Re: Cannot find file 
------Re: Install Problem
------Re: General Comment 
------Re: Cannot find file
--------Re: Install Problem 
----------Got it

It is obvious that this result set is completely useless for the user since the correlation of the discussion postings is lost.


Emulating the Oracle behavior in DB2 UDB

The solution to generate the search-depth-first order of Oracle in DB2 is based on the introduction of an additional pseudo-column that can be used in an ORDER BY attribute. This column is of type VARCHAR and contains a path to each node in the format like '1.3.1'. Additionally a user defined table function is introduced that returns all sub-nodes of a given node. It is also responsible to maintain the pseudo column code by concatenating the number of the sub-node to the path of the super-node. The RANK() function of DB2 is used to retrieve the number of a sub-node. The recursive query then selects from this function and provides the current node id and its path as the input.

The following example will create the exact same result set as the query in Oracle in the example above:

CREATE FUNCTION GetResponses(code VARCHAR(100),superid INTEGER) 
RETURNS TABLE(code VARCHAR(100),superid INTEGER, postid INTEGER, 
		title VARCHAR(100),text VARCHAR(1000)) 
READS SQL DATA DETERMINISTIC NO EXTERNAL ACTION 
RETURN SELECT GetResponses.code || '.'|| 
RTRIM(CHAR(RANK() OVER (ORDER BY postid))),
	T.superid , T.postid, T.title, T.text 
FROM discussion T
WHERE T.superid = GetResponses.superid;

WITH TEMPTAB(code, superid, postid, title, text, level)
AS ( VALUES(CAST('1' AS VARCHAR(100)), 
	CAST(NULL ASINTEGER), 0, 
	CAST(NULL AS VARCHAR(100)), CAST(NULL AS VARCHAR(1000)), 0) 
UNION ALL 
	SELECT t.code, t.superid,t.postid, t.title, t.text, level+1 
	FROM TEMPTAB,
	TABLE(GetResponses(TEMPTAB.code, TEMPTAB.postid)) AS T
) 
SELECT VARCHAR(REPEAT('--', level-1) || title , 60)
				FROM TEMPTAB T WHERE t.superid is not null 
ORDER BY code;


1 
-------------------------------------

Install Problem 
--Re: Install Problem 
----Re: Install Problem 
------Re: Install Problem 
--------Got it
--General comment 
----Re: General Comment Cannot find file 
--Re: Cannot find file 
----Re: Cannot find file
--Re: Cannot find file 
Help! Documentation missing!

To keep the statements in your application simple it is a good idea to wrap this recursive statement again into a UDF.


A better approach using DB2 Node type

You have to be aware that using a pseudo column based on a character string to enforce an order of the result set only guarantees you the hierarchy order in total. It does not necessarily provide the correct ordering of direct sub-nodes of a certain node if their number exceeds 9. This is because a string like "1.2.13" has a lower order than "1.2.8". But semantically the order is the other way around in this case. If you have dependencies on this and cannot guarantee to have 9 direct sub-nodes at most, then you must not use a string for the pseudo-column.

Instead you can use the DB2 Node type, which is a DB2 extension currently available on developerWorks DB2 (Using the Node Data Type to Solve Problems with Hierarchies in DB2 Universal Database by Jacques Roy). Make sure you are using at least version 1.1 of the Node type extension. The function nodeVersion() can be used to check this. If the function doesn't exist, you have an older version of the DB2 Node type.

So instead of maintaining the pseudo column code as VARCHAR, we use now the user defined type Node. The following example demonstrates this. It creates the same result as the example with VARCHAR above:

CREATE FUNCTION GetResponsesN(code Node, superid INTEGER) 
RETURNS TABLE(code Node, superid INTEGER,postid INTEGER, 
	title VARCHAR(100), text VARCHAR(1000))
READS SQL DATA DETERMINISTIC NO EXTERNAL ACTION 
RETURN SELECT nodeInput(nodeOutput(GetResponsesN.code) || '.'|| 
	RTRIM(CHAR(RANK() OVER (ORDER BY postid)))),
	T.superid , T.postid, T.title, T.text
 FROM discussion T
WHERE T.superid = GetResponsesN.superid;

WITH TEMPTAB(code, superid, postid, title, text, level)
AS 
( 
	VALUES(nodeInput('1.1'), CAST(NULL AS INTEGER), 0,
	CAST(NULL AS VARCHAR(100)), CAST(NULL ASVARCHAR(1000)), 0) 
UNION ALL 
SELECT t.code, t.superid,t.postid, t.title, t.text, 
level+1 FROM TEMPTAB,
TABLE(GetResponsesN(TEMPTAB.code, TEMPTAB.postid)) AS T
				
) 
SELECT VARCHAR(REPEAT('--', level-1) || title , 60)
	FROM TEMPTAB T 
	WHERE t.superid is not null 
	ORDER BY code;


1 
-------------------------------------

Install Problem 
--Re: Install Problem 
----Re: Install Problem 
------Re: Install Problem 
--------Got it
--General comment 
----Re: General Comment Cannot find file 
--Re: Cannot find file 
----Re: Cannot find file
--Re: Cannot find file 
Help! Documentation missing!

In order to create a Node value we have to use the function nodeInput() and provide a string like '1.2' as the input. For the root-node the input is '1.1' (Due to implementation specifics of the DB2 node type we have to start with 1.1 and not with 1). For all the other nodes we use again the RANK() function of DB2 to number direct sub-nodes. This is done in our function GetResponsesN(). This number is then concatenated to the character representation of the super-node (retrieved via nodeOutput()) and then the resulting string is taken as the input to create the new Node value via nodeInput().


Conclusion

The DB2 UDB approach to recursive SQL provides a very flexible way to process hierarchies. As this document demonstrates, the behaviors of other vendors can easily be emulated due to the easy extensibility of DB2 via user-defined functions. Furthermore, DB2 UDB provides the means to also process very advanced recursive queries such as those involving multiple hierarchies in a single table. Using the techniques I've described, you can take advantage of the strengths of DB2 when migrating your applications.


Disclaimer

This article contains sample code. IBM grants you ("Licensee") a non-exclusive, royalty free, license to use this sample code. However, the sample code is provided as-is and without any warranties, whether EXPRESS OR IMPLIED, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT. IBM AND ITS LICENSORS SHALL NOT BE LIABLE FOR ANY DAMAGES SUFFERED BY LICENSEE THAT RESULT FROM YOUR USE OF THE SOFTWARE. IN NO EVENT WILL IBM OR ITS LICENSORS BE LIABLE FOR ANY LOST REVENUE, PROFIT OR DATA, OR FOR DIRECT, INDIRECT, SPECIAL, CONSEQUENTIAL, INCIDENTAL OR PUNITIVE DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF THE USE OF OR INABILITY TO USE SOFTWARE, EVEN IF IBM HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.


Acknowledgements

My thanks to Serge Rielau (IBM) for pointing to the right way to a solution; Jacques Roy (IBM) for reviewing and providing valuable input to this document; and Jacques Terrasse (Aldata) for the opportunity to work out this solution in a customer environment.


Download

DescriptionNameSize
Code samplerecursive.zip2 KB

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=13510
ArticleTitle=Migrating Recursive SQL from Oracle to DB2 UDB
publish-date=07172003