Expanding Recursive Opportunities with SQL UDFs in DB2 v7.2

Recursion is an extremely efficient technique to resolve tree-like relationships. In relational databases, tree-like models are used to represent data for Bill of Materials (BOM), document hierarchies, and organizational structures. This article solves a complex problem involving multiple data hierarchies by combining recursive SQL with SQL user-defined functions.

Srini Venigalla (svenigalla@netsetgo.com), Chief Engineer, Netsetgo

Srini Venigalla is a Chief Engineer at NetSetGo, Inc. In February, NetSetGo was the winner of an IBM Beacon Award for the best WebSphere® e-business Solution. Srini is currently working in the areas of Content Management, Knowledge Delivery and B-2-B solutions using WebSphere, DB2, XML and Java technologies. He has a Bachelors Degree in Electrical Engineering and a Masters Degree in Computer Technology. He is a member of ACM. He can be reached at svenigalla@netsetgo.com.



01 March 2002

Introduction

Recursion is an extremely efficient technique to resolve tree-like relationships. In relational databases, tree-like models are used to represent data for Bill of Materials (BOM), document hierarchies, organizational structures, etc. Without recursion, querying these data models would require tedious iterative programming in the host applications.

DB2® SQL programmers are familiar with the recursive construct available in IBM DB2 Universal Database. DB2 SQL provides the common table expressions (the WITH clause) to implement recursion. DB2 also provides user-defined functions (UDFs), with which you can incorporate application specific functions in the database server. Starting with DB2 Version 7.2, UDFs can be developed in DB2 SQL (they were previously able to be written in other high-level languages, like C and Java.)

In this article, I explain how you can combine the traditional recursive SQL in DB2 with SQL-based user-defined functions (SQL UDFs), to cut host program implementation time and to enhance application performance.

To provide the necessary background, I will start by giving an example of a tree structure. I will follow it with an example of a recursive query to resolve the tree structure. Later, I use a content management problem to illustrate the combined use of the DB2 recursive SQL and a SQL UDF.


Example of a recursive tree structure

Figures 1, 2 and 3 describe the logical view of a tree structure, its representation in a table and the result of a recursive query, respectively.

Figure 1 The logical view of a tree structure
Logical view of a tree structure

In Figure 1, a content hierarchy is described as a tree containing a set of nodes. The tree is represented by a root node, Library, and its child nodes, news, sports, etc.

The above relationships can be stored in a relational table as shown in Figure 2.

Figure 2. Representing the tree node as a relational table
Parent_idItem_idItem_Name
-11Library
12News
23World News
24Politics
25Business
26Science
27Technology
18Sports
89Local
810Collegiate
811Professional
912Soccer
1013Soccer
1114Soccer
915Football
1016Football
1117Football

The Parent_id and Item_id columns in the table can be used to represent the parent-child relationship by using the Item_id of a node as the Parent_id of its children. (Note: The root node, Library, has no real parent; hence -1 is used as its parent.) The inherent cyclical nature of the data, manifested through the ability to consider any node as a child as well as a parent, facilitates recursion.

Given data such as above, recursion is useful to develop queries that "walk" the tree to produce various results, such as all the descendants (children and children's children and so on) of any given node.

For example, a recursive query to find the descendants of the node called "sports" produces the result set shown in Figure 3.

Figure 3. Results of a recursive query
Parent_idItem_idItem_Name
89Local
810Collegiate
811Professional
912Soccer
915Football
1013Soccer
1016Football
1114Soccer
1117Football

A recursive query in DB2 is implemented using the WITH clause, also known as the common table expression. Common table expressions are available in DB2 on the workstation beginning with Version 5.

A common table expression is very similar to a temporary view. However, the temporary view is valid only for the duration of the single SQL statement in which it is defined. The common table expression takes the form of a WITH clause at the beginning of a SELECT statement. A common table expression can be used several times in the query that uses it, and it can also be joined to itself through aliasing, both features which make it useful for implementing recursion.

A recursive query typically has three parts:

  • A virtual table in the form of a common table expression.
  • An initialization table.
  • A secondary table that does a full inner join with the virtual table.

All of the above tables are merged using UNION ALL. A final SELECT yields the required rows from the recursive output.

The recursive query that produced the results in Figure 3 is shown in Figure 4.

Figure 4. The recursive query that produced the results in Figure 3
WITH RPL (Parent_ID, Item_ID, Item_name) AS
(
							
SELECT ROOT.Parent_ID, ROOT.Item_ID, ROOT.Item_Name
FROM Subject ROOT
WHERE ROOT.Parent_ID = 8
							
UNION ALL
							
SELECT CHILD.Parent_ID, CHILD.Item_ID, CHILD.Item_Name
FROM RPL PARENT, Subject CHILD
WHERE PARENT.Item_ID = CHILD.Parent_ID
							
)
							
SELECT DISTINCT Parent_ID, Item_ID, Item_Name
FROM RPL
ORDER BY Parent_ID, Item_ID, Item_Name

Let's look at the components of the query:

  • RPL acts as a virtual table with three columns: Parent_ID, Item_ID and Item_name.
  • The first SELECT statement inside the WITH clause is the initialization table. It is executed only once. Its results form the initial contents of the virtual table to act as the seed of the recursion. In the above example, the seed is the row or rows whose Parent_ID is 8.
  • The second SELECT statement is executed multiple times. The seed is passed as input (the secondary table in the JOIN) to the second SELECT statement to produce the next set of rows. The results of the JOIN are added (UNION ALL) to the current contents of the virtual table and put back in it to form the input for the next pass. This process continues as long as rows are produced.
  • The final SELECT on the virtual table allows us to select all or only a portion of the rows yielded by the recursive query, if desired.

The above process is depicted graphically in Figure 5.

Figure 5. Graphic access plan for the recursive query in Figure 4
Graphic access plan for the recursive query in Figure 4

Here is how to read the figure:

  • The right side branch is the initialization query that scans the primary index file of the table Subject and places the selected rows in Temp (5) using Union (6).
  • Temp (5) is the virtual table.
  • The left side branch is the recursive part of the query. It joins, shown as NLJOIN (7), the current contents of Temp (5) to SUBJECT and places the results back in TEMP (5). This process repeats till there are no more rows to put in TEMP (5).
  • A final sort and select operation yields the exact results needed.

This technique works well for any type of tree structure - divergent, convergent, balanced, unbalanced, and recursive types.

The references included in this article, [Birchall], [Chamberlin], and [IBM], provide excellent descriptions of common table expressions.


Problem: Handling multiple hierarchies

The recursive construct -- based on the common table expressions -- comes with several caveats. The most important caveat is that you cannot feed the row-level results of one recursive construct to another, in a correlated manner. The following statements are quoted from the IBM DB2 Universal Database SQL Reference [IBM].

If more than one common table expression is defined in the same statement, cyclic references between the common table expressions are not permitted (SQLSTATE 42835). A cyclic reference occurs when two common table expressions dt1 and dt2 are created such that dt1 refers to dt2 and dt2 refers to dt1.

As described above, multiple table expressions in the statement cannot reference one another in a correlated manner. A correlated subquery is the query that is executed once for every row in the result set.

Let me illustrate this problem with an example:

Consider a set of documents in a hierarchy of Subjects, such as that shown in Figure 1. Each subject contains a set of documents. Each document in the Subjects-Hierarchy points to an eligible user group, which is in a separate hierarchy of groups. That means there are two tree structures -- Subjects and User Groups -- both connected through the documents.

Figure 6 illustrates the scenario.

Figure 6. A scenario describing multiple hierarchies - Subjects and User Groups
Scenario describing multiple hierarchies

Now let's define a problem: We want to find out for a given a user, Chris (a leaf in the User Group hierarchy), all eligible Subjects and their child documents. Such a query needs to expand two hierarchies at once; for every document in the Subject hierarchy, you need to expand the corresponding user hierarchy.

The query should produce results to show that user Chris is eligible to view Document 2 under Professional Football and Document 3 under Professional Soccer, but that he is not eligible to view Document 1. The expected results are shown in Figure 7.

Figure 7 Expected results of the defined problem
PARENT_IDITEM_IDITEM_NAMEDOCUMENT_IDNAMEELIGIBLE
1117Football2Document 21
1114Soccer3Document 31
1117Football1Document 10

It is not possible to write a query to return this result using table expressions.Simultaneous expansion of the two hierarchies requires cyclical referencing of the table expressions, which is not currently supported directly in DB2. A cyclical reference occurs when the Subjects hierarchy refers to the Users hierarchy and vice versa.

A possible solution -- described by [Birchall] -- is to maintain an exploded table of the Groups. A set of INSERT,UPDATE, and DELETE triggers keeps the hierarchical table and the associated exploded table in sync. This way, you will be dealing with only one recursive hierarchy, and the other hierarchy is simply joined to the remaining product. Until recently, perhaps, it was the only solution.

With the availability of SQL user-defined functions (UDFs), this problem can be solved more elegantly as described in the later parts of this article. With SQL functions, we can use parameters to get past the correlation problem. Before going to the solution, I will briefly describe UDFs.


Overview of SQL user-defined functions (SQL UDFs)

User-defined functions, as the name suggests, are functions that you can add to the database. They can be of several types - scalar functions (like the Length() function), a row function, a column function, or a table function. Based on the principle of orthogonality [Chamberlin], the UDFs can be used in any SQL statement in accordance with the type of their return value or values.

UDFs can be developed in high-level languages, such as C or Java. Starting with DB2 UDB 7.2, UDFs can also be developed in DB2 SQL. The primary advantage of the SQL-based UDFs is that, unlike their high-level language counterparts, they can contain regular SQL statements to read the tables. As of DB2 UDB 7.2, UDFs in the high-level languages cannot read tables.

Use the CREATE FUNCTION statement to register a UDF to DB2. Consult the DB2 SQL Reference Manual for a detailed treatment of the CREATE FUNCTION statement [IBM].

The following CREATE FUNCTION statement creates a scalar UDF called "GETBONUS" which gets an employee as input. The actual operation performed by the UDF is embedded as the SQL statement inside the CREATE FUNCTION statement. It joins a set of tables, applies a formula over the relevant columns and returns a scalar value.

CREATE FUNCTION GETBONUS (EMP CHAR (32))
RETURNS Integer
LANGUAGE SQL
READS SQL DATA
NO EXTERNAL ACTION
DETERMINISTIC
RETURN
Select (... .) as "bonus" from TimeSheets JOIN Â… where
TimeSheets.Emp_ID=EMP

After the UDF is created, it is available as a function to other SQL statements. For example, you can create a query to get the bonus entitlement of all employees in the EMP table as shown below.

Select Emp_name, Emp_Dept, Emp_ID, GETBONUS (EMP_ID) from EMP

You might argue that everything a UDF can provide can also be achieved by using subqueries or by joining tables. Actually, UDFs are a way to adopt object-oriented programming principles in SQL.

In the above example, the GETBONUS UDF isolates the bonus calculations, which may change every year, into a single function. By isolating the calculations into a single UDF, changes in the bonus policy can be efficiently implemented, without having to make SQL changes to every report or a query that deals with the bonus amount.

Simple SQL functions, which contain only a RETURN statement, are effectively behaving like macros and are subject to DB2's query rewrite rules. Query rewriting is a pre-processing stage in the SQL Compiler in which SQL statements are transformed into forms that can be optimized more easily to improve the query performance.


Solution to our problem: multi-hierarchy recursion using UDFs

Now, let us revisit the problem described in Figure 6.

The solution is as follows:

  1. Expand the Subjects hierarchy.
  2. For each leaf (document) in the Subjects hierarchy, expand the Groups hierarchy.
  3. Find if the user (input) is in the assigned hierarchy (Groups).

The query is developed in two parts:

The first part is a scalar UDF with two inputs - the Document ID and the User ID. The UDF expands the Group hierarchy, starting from the node or nodes attached to the Document and checks whether the user is part of the hierarchy or not, and returns a value of 1 if it is or 0 if it is not.

The second part is a regular recursive query over the Subject hierarchy and makes use of the scalar UDF developed in the first part.

Let us first create a test environment to implement the solution.

Create the data model and test data

The following diagram illustrates the logical model of the database we are going to use.

Figure 8. The logical data model of the database to implement the scenario
Logical data model of the database

To set up the tables, create a database and execute the SQL scripts described in the Appendix.

Create the SQL UDF

Let us develop the SQL UDF as outlined in the solution.

The UDF, which we name ISELIGIBLE, determines whether a user is eligible to view a document or not. It yields a scalar value 0 or 1.

It takes two inputs - document_id and user name -- and recursively resolves the nodes in the Group hierarchy starting with the nodes assigned to the input document_id.

Finally, it counts the row(s) that contain the input user name. If the user does not exist under any of the assigned user hierarchies, the count is zero; hence, the user is not eligible. If the count is one or more, the user is eligible to view the document.

Use the following CREATE FUNCTION statement to create the UDF. Bold is used to show the SQL query.

CREATE FUNCTION ISELIGIBLE
(Doc Integer, Uname character (64))
RETURNS INTEGER LANGUAGE SQL
READS SQL DATA
NO EXTERNAL ACTION NOT DETERMINISTIC
RETURN
With
Main (grp, subgrp, grpname) AS
(SELECT grp, subgrp, grpname
FROM Group
WHERE (grp, subgrp) IN
(Select grp, subgrp from Group_Document where document_id=Doc)
UNION ALL
SELECT C.grp, C.subgrp, C.grpname
FROM Group C, Main M
WHERE M.subgrp = C.grp
)
Select count(*) from main where Ucase(grpname)=Ucase(Uname)

If you encounter any problem executing the CREATE FUNCTION statement from the Command Center, close the Command Center and use the DB2 Command Line Processor.

Create the recursive query

Let us complete the solution by completing the second step. In this step, we develop a recursive query that resolves the Subject hierarchy and the documents associated with each node. Then, we make use of the ISELIGIBLE function to determine whether the user is eligible to access the document or not.

The following SQL statement accomplishes what we want to do.

WITH RPL (Parent_ID, Item_ID, Item_name) AS
( SELECT ROOT.Parent_ID, ROOT.Item_ID, ROOT.Item_Name
FROM Subject  ROOT
WHERE ROOT.Parent_ID =1
UNION ALL
SELECT CHILD.Parent_ID, CHILD.Item_ID, CHILD.Item_Name
FROM RPL PARENT, Subject CHILD
WHERE PARENT.Item_ID = CHILD.Parent_ID
)
SELECT DISTINCT rpl.Parent_ID, rpl.Item_ID, rpl.Item_name,
document.document_id, document.name, ISELIGIBLE
	(document.document_id, char('Chris')) eligible
From (rpl join Subject_Document sd on rpl.parent_id =
sd.parent_id and rpl.item_id=sd.item_id)
Join document on document.document_id = sd.document_id
Order by parent_id, Item_id
  1. The outer WITH expands the Subjects hierarchy and yields document_id.
  2. We join these results with the table Document to get the document name.
  3. The SQL UDF, ISELIGIBLE, runs another correlated recursive query to find whether "Chris" is under any of the groups assigned to doc_id.

Figure 9 illustrates the results obtained by running the above query.

Figure 9. Results obtained from the solution
PARENT_IDITEM_IDITEM_NAMEDOCUMENT_IDNAMEELIGIBLE
1114Soccer3Document 31
1117Football1Document 10
1117Football2Document 21

The results are indeed as expected. User "Chris" can access documents 2 and 3 under Professional Football and Professional Soccer, respectively.

When you run the recursive queries, it is normal to expect the SQL Warning, "SQL0347W The recursive common table expression may contain an infinite loop. SQLSTATE=01605".


Conclusion

Recursion is a powerful concept in any programming language. The recursive construct in DB2 SQL provides powerful means to handle real-life hierarchical structures elegantly and efficiently. By combining the power of SQL UDFs with the recursive construct, one can extend the flexibility and power of DB2 SQL to new frontiers.

These powerful features also significantly improve the performance of client programs by eliminating costly iterative application logic.


Acknowledgments

My sincere thanks to Steven Hiese (IBM) for a thorough technical review; Serge Rielau (IBM) for several thoughtful remarks and for helping me debug the CREATE FUNCTION situation in the early stages of development; Dirk Wollscheid (IBM), Sailesh Krishnamurthy (UC Berkeley), NDR Sarma (Texas AMU), Darrin Nelson (NetSetGo) and Mike Sieber (NetSetGo) for encouragement.


Appendix. Creating and implementing the data model

Creating the Database

CREATE DATABASE UDFTEST USING CODESET UTF-8
			
TERRITORY US

Implementing the Data Model

			CREATE TABLE Subject (
       item_id              INTEGER NOT NULL,
       parent_id            INTEGER NOT NULL,
       item_name            CHAR(127)
);
ALTER TABLE Subject
       ADD PRIMARY KEY (item_id, parent_id);
CREATE TABLE Group (
       grp                  INTEGER NOT NULL,
       isuser               string2 NOT NULL,
       subgrp               INTEGER NOT NULL,
       grpname              CHAR(32)
);
ALTER TABLE Group
       ADD PRIMARY KEY (grp, subgrp);
CREATE TABLE Document (
       document_id          INTEGER NOT NULL,
       name                 VARCHAR(20)
);
ALTER TABLE Document
       ADD PRIMARY KEY (document_id);
CREATE TABLE Subject_Document (
       item_id              INTEGER NOT NULL,
       parent_id            INTEGER NOT NULL,
       document_id          INTEGER NOT NULL
);
ALTER TABLE Subject_Document
       ADD PRIMARY KEY (item_id, parent_id, document_id);
CREATE TABLE Group_Document (
       grp                  INTEGER NOT NULL,
       subgrp               INTEGER NOT NULL,
       document_id          INTEGER NOT NULL
);
ALTER TABLE Group_Document
       ADD PRIMARY KEY (grp, subgrp, document_id);
ALTER TABLE Subject_Document
       ADD FOREIGN KEY (document_id)
                             REFERENCES Document
                             ON DELETE RESTRICT
                             ON UPDATE RESTRICT;
ALTER TABLE Subject_Document
       ADD FOREIGN KEY (item_id, parent_id)
                             REFERENCES Subject
                             ON DELETE RESTRICT
                             ON UPDATE RESTRICT;
ALTER TABLE Group_Document
       ADD FOREIGN KEY (document_id)
                             REFERENCES Document
                             ON DELETE RESTRICT
                             ON UPDATE RESTRICT;
				
				
ALTER TABLE Group_Document
       ADD FOREIGN KEY (grp, subgrp)
                             REFERENCES Group
                             ON DELETE RESTRICT
                             ON UPDATE RESTRICT;

Creating the Test Data

After creating the tables, execute the following SQL statements to create the data.

Insert into Subject
(Item_id, Parent_id, Item_Name)
values
(1, -1,'Library'), (2,1,'News'),
(3,2,'WorldNews'), (4,2,'Politics'),
(5,2,'Business'), (6,2,'Science'),
(7,2,'Technology'), (8,1,'Sports'),
(9,8,'Local'), (10,8,'Collegiate'),
(11,8,'Professional'),(12,9,'Soccer'),
(13,10,'Soccer'), (14,11,'Soccer'),
(15,9,'Football'), (16,10,'Football'),
(17,11,'Football')
				
Insert into Group
(grp, subgrp, grpname, isuser)
values
(-1,1,'Users','N'), (1,2,'Group A','N'),
(1,3,'Group B','N'), (3,31,'Group B1','N'),
(3,32,'Group B2','N'), (31,311,'Chris','Y'),
(2,21,'Pat','Y'), (2,22,'Jane','Y'),
(2,23,'Tom','Y')
				
Insert into Document (document_id, name)
values
(1,'Document 1'), (2,'Document 2'),
(3,'Document 3')
				
				
Insert into Subject_Document
(Item_id, Parent_id, document_id)
values
(17,11,1), (17,11,2), (14,11,3)
				
				
Insert into Group_Document (grp, subgrp, document_id)
values (1,2,1), (1,3,2), (3,31,3)

Resources

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=14100
ArticleTitle=Expanding Recursive Opportunities with SQL UDFs in DB2 v7.2
publish-date=03012002