Recursion is an extremely efficient technique to resolve tree-like relationships. In relational databases, tree-like models are used to represent data for Bill of Materials (BOM), document hierarchies, organizational structures, etc. Without recursion, querying these data models would require tedious iterative programming in the host applications.
DB2® SQL programmers are familiar with the recursive construct available in IBM DB2 Universal Database. DB2 SQL provides the common table expressions (the WITH clause) to implement recursion. DB2 also provides user-defined functions (UDFs), with which you can incorporate application specific functions in the database server. Starting with DB2 Version 7.2, UDFs can be developed in DB2 SQL (they were previously able to be written in other high-level languages, like C and Java.)
In this article, I explain how you can combine the traditional recursive SQL in DB2 with SQL-based user-defined functions (SQL UDFs), to cut host program implementation time and to enhance application performance.
To provide the necessary background, I will start by giving an example of a tree structure. I will follow it with an example of a recursive query to resolve the tree structure. Later, I use a content management problem to illustrate the combined use of the DB2 recursive SQL and a SQL UDF.
Example of a recursive tree structure
Figures 1, 2 and 3 describe the logical view of a tree structure, its representation in a table and the result of a recursive query, respectively.
Figure 1 The logical view of a tree structure
In Figure 1, a content hierarchy is described as a tree containing a set of nodes. The tree is represented by a root node, Library, and its child nodes, news, sports, etc.
The above relationships can be stored in a relational table as shown in Figure 2.
Figure 2. Representing the tree node as a relational table
The Parent_id and Item_id columns in the table can be used to represent the parent-child relationship by using the Item_id of a node as the Parent_id of its children. (Note: The root node, Library, has no real parent; hence -1 is used as its parent.) The inherent cyclical nature of the data, manifested through the ability to consider any node as a child as well as a parent, facilitates recursion.
Given data such as above, recursion is useful to develop queries that "walk" the tree to produce various results, such as all the descendants (children and children's children and so on) of any given node.
For example, a recursive query to find the descendants of the node called "sports" produces the result set shown in Figure 3.
Figure 3. Results of a recursive query
A recursive query in DB2 is implemented using the WITH clause, also known as the common table expression. Common table expressions are available in DB2 on the workstation beginning with Version 5.
A common table expression is very similar to a temporary view. However, the temporary view is valid only for the duration of the single SQL statement in which it is defined. The common table expression takes the form of a WITH clause at the beginning of a SELECT statement. A common table expression can be used several times in the query that uses it, and it can also be joined to itself through aliasing, both features which make it useful for implementing recursion.
A recursive query typically has three parts:
- A virtual table in the form of a common table expression.
- An initialization table.
- A secondary table that does a full inner join with the virtual table.
All of the above tables are merged using UNION ALL. A final SELECT yields the required rows from the recursive output.
Figure 4. The recursive query that produced the results in Figure 3
WITH RPL (Parent_ID, Item_ID, Item_name) AS ( SELECT ROOT.Parent_ID, ROOT.Item_ID, ROOT.Item_Name FROM Subject ROOT WHERE ROOT.Parent_ID = 8 UNION ALL SELECT CHILD.Parent_ID, CHILD.Item_ID, CHILD.Item_Name FROM RPL PARENT, Subject CHILD WHERE PARENT.Item_ID = CHILD.Parent_ID ) SELECT DISTINCT Parent_ID, Item_ID, Item_Name FROM RPL ORDER BY Parent_ID, Item_ID, Item_Name
Let's look at the components of the query:
- RPL acts as a virtual table with three columns: Parent_ID, Item_ID and Item_name.
- The first SELECT statement inside the WITH clause is the initialization table. It is executed only once. Its results form the initial contents of the virtual table to act as the seed of the recursion. In the above example, the seed is the row or rows whose Parent_ID is 8.
- The second SELECT statement is executed multiple times. The seed is passed as input (the secondary table in the JOIN) to the second SELECT statement to produce the next set of rows. The results of the JOIN are added (UNION ALL) to the current contents of the virtual table and put back in it to form the input for the next pass. This process continues as long as rows are produced.
- The final SELECT on the virtual table allows us to select all or only a portion of the rows yielded by the recursive query, if desired.
The above process is depicted graphically in Figure 5.
Figure 5. Graphic access plan for the recursive query in Figure 4
Here is how to read the figure:
- The right side branch is the initialization query that scans the primary index file of the table Subject and places the selected rows in Temp (5) using Union (6).
- Temp (5) is the virtual table.
- The left side branch is the recursive part of the query. It joins, shown as NLJOIN (7), the current contents of Temp (5) to SUBJECT and places the results back in TEMP (5). This process repeats till there are no more rows to put in TEMP (5).
- A final sort and select operation yields the exact results needed.
This technique works well for any type of tree structure - divergent, convergent, balanced, unbalanced, and recursive types.
The references included in this article, [Birchall], [Chamberlin], and [IBM], provide excellent descriptions of common table expressions.
Problem: Handling multiple hierarchies
The recursive construct -- based on the common table expressions -- comes with several caveats. The most important caveat is that you cannot feed the row-level results of one recursive construct to another, in a correlated manner. The following statements are quoted from the IBM DB2 Universal Database SQL Reference [IBM].
If more than one common table expression is defined in the same statement, cyclic references between the common table expressions are not permitted (SQLSTATE 42835). A cyclic reference occurs when two common table expressions dt1 and dt2 are created such that dt1 refers to dt2 and dt2 refers to dt1.
As described above, multiple table expressions in the statement cannot reference one another in a correlated manner. A correlated subquery is the query that is executed once for every row in the result set.
Let me illustrate this problem with an example:
Consider a set of documents in a hierarchy of Subjects, such as that shown in Figure 1. Each subject contains a set of documents. Each document in the Subjects-Hierarchy points to an eligible user group, which is in a separate hierarchy of groups. That means there are two tree structures -- Subjects and User Groups -- both connected through the documents.
Figure 6 illustrates the scenario.
Figure 6. A scenario describing multiple hierarchies - Subjects and User Groups
Now let's define a problem: We want to find out for a given a user, Chris (a leaf in the User Group hierarchy), all eligible Subjects and their child documents. Such a query needs to expand two hierarchies at once; for every document in the Subject hierarchy, you need to expand the corresponding user hierarchy.
The query should produce results to show that user Chris is eligible to view Document 2 under Professional Football and Document 3 under Professional Soccer, but that he is not eligible to view Document 1. The expected results are shown in Figure 7.
Figure 7 Expected results of the defined problem
It is not possible to write a query to return this result using table expressions.Simultaneous expansion of the two hierarchies requires cyclical referencing of the table expressions, which is not currently supported directly in DB2. A cyclical reference occurs when the Subjects hierarchy refers to the Users hierarchy and vice versa.
A possible solution -- described by [Birchall] -- is to maintain an exploded table of the Groups. A set of INSERT,UPDATE, and DELETE triggers keeps the hierarchical table and the associated exploded table in sync. This way, you will be dealing with only one recursive hierarchy, and the other hierarchy is simply joined to the remaining product. Until recently, perhaps, it was the only solution.
With the availability of SQL user-defined functions (UDFs), this problem can be solved more elegantly as described in the later parts of this article. With SQL functions, we can use parameters to get past the correlation problem. Before going to the solution, I will briefly describe UDFs.
Overview of SQL user-defined functions (SQL UDFs)
User-defined functions, as the name suggests, are functions that you can add to the database. They can be of several types - scalar functions (like the Length() function), a row function, a column function, or a table function. Based on the principle of orthogonality [Chamberlin], the UDFs can be used in any SQL statement in accordance with the type of their return value or values.
UDFs can be developed in high-level languages, such as C or Java. Starting with DB2 UDB 7.2, UDFs can also be developed in DB2 SQL. The primary advantage of the SQL-based UDFs is that, unlike their high-level language counterparts, they can contain regular SQL statements to read the tables. As of DB2 UDB 7.2, UDFs in the high-level languages cannot read tables.
Use the CREATE FUNCTION statement to register a UDF to DB2. Consult the DB2 SQL Reference Manual for a detailed treatment of the CREATE FUNCTION statement [IBM].
The following CREATE FUNCTION statement creates a scalar UDF called "GETBONUS" which gets an employee as input. The actual operation performed by the UDF is embedded as the SQL statement inside the CREATE FUNCTION statement. It joins a set of tables, applies a formula over the relevant columns and returns a scalar value.
CREATE FUNCTION GETBONUS (EMP CHAR (32)) RETURNS Integer LANGUAGE SQL READS SQL DATA NO EXTERNAL ACTION DETERMINISTIC RETURN Select (... .) as "bonus" from TimeSheets JOIN Â where TimeSheets.Emp_ID=EMP
After the UDF is created, it is available as a function to other SQL statements. For example, you can create a query to get the bonus entitlement of all employees in the EMP table as shown below.
Select Emp_name, Emp_Dept, Emp_ID, GETBONUS (EMP_ID) from EMP
You might argue that everything a UDF can provide can also be achieved by using subqueries or by joining tables. Actually, UDFs are a way to adopt object-oriented programming principles in SQL.
In the above example, the GETBONUS UDF isolates the bonus calculations, which may change every year, into a single function. By isolating the calculations into a single UDF, changes in the bonus policy can be efficiently implemented, without having to make SQL changes to every report or a query that deals with the bonus amount.
Simple SQL functions, which contain only a RETURN statement, are effectively behaving like macros and are subject to DB2's query rewrite rules. Query rewriting is a pre-processing stage in the SQL Compiler in which SQL statements are transformed into forms that can be optimized more easily to improve the query performance.
Solution to our problem: multi-hierarchy recursion using UDFs
Now, let us revisit the problem described in Figure 6.
The solution is as follows:
- Expand the Subjects hierarchy.
- For each leaf (document) in the Subjects hierarchy, expand the Groups hierarchy.
- Find if the user (input) is in the assigned hierarchy (Groups).
The query is developed in two parts:
The first part is a scalar UDF with two inputs - the Document ID and the User ID. The UDF expands the Group hierarchy, starting from the node or nodes attached to the Document and checks whether the user is part of the hierarchy or not, and returns a value of 1 if it is or 0 if it is not.
The second part is a regular recursive query over the Subject hierarchy and makes use of the scalar UDF developed in the first part.
Let us first create a test environment to implement the solution.
Create the data model and test data
The following diagram illustrates the logical model of the database we are going to use.
Figure 8. The logical data model of the database to implement the scenario
To set up the tables, create a database and execute the SQL scripts described in the Appendix.
Create the SQL UDF
Let us develop the SQL UDF as outlined in the solution.
The UDF, which we name ISELIGIBLE, determines whether a user is eligible to view a document or not. It yields a scalar value 0 or 1.
It takes two inputs - document_id and user name -- and recursively resolves the nodes in the Group hierarchy starting with the nodes assigned to the input document_id.
Finally, it counts the row(s) that contain the input user name. If the user does not exist under any of the assigned user hierarchies, the count is zero; hence, the user is not eligible. If the count is one or more, the user is eligible to view the document.
Use the following CREATE FUNCTION statement to create the UDF. Bold is used to show the SQL query.
CREATE FUNCTION ISELIGIBLE (Doc Integer, Uname character (64)) RETURNS INTEGER LANGUAGE SQL READS SQL DATA NO EXTERNAL ACTION NOT DETERMINISTIC RETURN With Main (grp, subgrp, grpname) AS (SELECT grp, subgrp, grpname FROM Group WHERE (grp, subgrp) IN (Select grp, subgrp from Group_Document where document_id=Doc) UNION ALL SELECT C.grp, C.subgrp, C.grpname FROM Group C, Main M WHERE M.subgrp = C.grp ) Select count(*) from main where Ucase(grpname)=Ucase(Uname)
If you encounter any problem executing the CREATE FUNCTION statement from the Command Center, close the Command Center and use the DB2 Command Line Processor.
Create the recursive query
Let us complete the solution by completing the second step. In this step, we develop a recursive query that resolves the Subject hierarchy and the documents associated with each node. Then, we make use of the ISELIGIBLE function to determine whether the user is eligible to access the document or not.
The following SQL statement accomplishes what we want to do.
WITH RPL (Parent_ID, Item_ID, Item_name) AS ( SELECT ROOT.Parent_ID, ROOT.Item_ID, ROOT.Item_Name FROM Subject ROOT WHERE ROOT.Parent_ID =1 UNION ALL SELECT CHILD.Parent_ID, CHILD.Item_ID, CHILD.Item_Name FROM RPL PARENT, Subject CHILD WHERE PARENT.Item_ID = CHILD.Parent_ID ) SELECT DISTINCT rpl.Parent_ID, rpl.Item_ID, rpl.Item_name, document.document_id, document.name, ISELIGIBLE (document.document_id, char('Chris')) eligible From (rpl join Subject_Document sd on rpl.parent_id = sd.parent_id and rpl.item_id=sd.item_id) Join document on document.document_id = sd.document_id Order by parent_id, Item_id
- The outer WITH expands the Subjects hierarchy and yields document_id.
- We join these results with the table Document to get the document name.
- The SQL UDF, ISELIGIBLE, runs another correlated recursive query to find whether "Chris" is under any of the groups assigned to doc_id.
Figure 9 illustrates the results obtained by running the above query.
Figure 9. Results obtained from the solution
The results are indeed as expected. User "Chris" can access documents 2 and 3 under Professional Football and Professional Soccer, respectively.
When you run the recursive queries, it is normal to expect the SQL Warning, "SQL0347W The recursive common table expression may contain an infinite loop. SQLSTATE=01605".
Recursion is a powerful concept in any programming language. The recursive construct in DB2 SQL provides powerful means to handle real-life hierarchical structures elegantly and efficiently. By combining the power of SQL UDFs with the recursive construct, one can extend the flexibility and power of DB2 SQL to new frontiers.
These powerful features also significantly improve the performance of client programs by eliminating costly iterative application logic.
My sincere thanks to Steven Hiese (IBM) for a thorough technical review; Serge Rielau (IBM) for several thoughtful remarks and for helping me debug the CREATE FUNCTION situation in the early stages of development; Dirk Wollscheid (IBM), Sailesh Krishnamurthy (UC Berkeley), NDR Sarma (Texas AMU), Darrin Nelson (NetSetGo) and Mike Sieber (NetSetGo) for encouragement.
Appendix. Creating and implementing the data model
Creating the Database
CREATE DATABASE UDFTEST USING CODESET UTF-8 TERRITORY US
Implementing the Data Model
CREATE TABLE Subject ( item_id INTEGER NOT NULL, parent_id INTEGER NOT NULL, item_name CHAR(127) ); ALTER TABLE Subject ADD PRIMARY KEY (item_id, parent_id); CREATE TABLE Group ( grp INTEGER NOT NULL, isuser string2 NOT NULL, subgrp INTEGER NOT NULL, grpname CHAR(32) ); ALTER TABLE Group ADD PRIMARY KEY (grp, subgrp); CREATE TABLE Document ( document_id INTEGER NOT NULL, name VARCHAR(20) ); ALTER TABLE Document ADD PRIMARY KEY (document_id); CREATE TABLE Subject_Document ( item_id INTEGER NOT NULL, parent_id INTEGER NOT NULL, document_id INTEGER NOT NULL ); ALTER TABLE Subject_Document ADD PRIMARY KEY (item_id, parent_id, document_id); CREATE TABLE Group_Document ( grp INTEGER NOT NULL, subgrp INTEGER NOT NULL, document_id INTEGER NOT NULL ); ALTER TABLE Group_Document ADD PRIMARY KEY (grp, subgrp, document_id); ALTER TABLE Subject_Document ADD FOREIGN KEY (document_id) REFERENCES Document ON DELETE RESTRICT ON UPDATE RESTRICT; ALTER TABLE Subject_Document ADD FOREIGN KEY (item_id, parent_id) REFERENCES Subject ON DELETE RESTRICT ON UPDATE RESTRICT; ALTER TABLE Group_Document ADD FOREIGN KEY (document_id) REFERENCES Document ON DELETE RESTRICT ON UPDATE RESTRICT; ALTER TABLE Group_Document ADD FOREIGN KEY (grp, subgrp) REFERENCES Group ON DELETE RESTRICT ON UPDATE RESTRICT;
Creating the Test Data
After creating the tables, execute the following SQL statements to create the data.
Insert into Subject (Item_id, Parent_id, Item_Name) values (1, -1,'Library'), (2,1,'News'), (3,2,'WorldNews'), (4,2,'Politics'), (5,2,'Business'), (6,2,'Science'), (7,2,'Technology'), (8,1,'Sports'), (9,8,'Local'), (10,8,'Collegiate'), (11,8,'Professional'),(12,9,'Soccer'), (13,10,'Soccer'), (14,11,'Soccer'), (15,9,'Football'), (16,10,'Football'), (17,11,'Football') Insert into Group (grp, subgrp, grpname, isuser) values (-1,1,'Users','N'), (1,2,'Group A','N'), (1,3,'Group B','N'), (3,31,'Group B1','N'), (3,32,'Group B2','N'), (31,311,'Chris','Y'), (2,21,'Pat','Y'), (2,22,'Jane','Y'), (2,23,'Tom','Y') Insert into Document (document_id, name) values (1,'Document 1'), (2,'Document 2'), (3,'Document 3') Insert into Subject_Document (Item_id, Parent_id, document_id) values (17,11,1), (17,11,2), (14,11,3) Insert into Group_Document (grp, subgrp, document_id) values (1,2,1), (1,3,2), (3,31,3)
- Don Chamberlin, "A Complete Guide to DB2 Universal Database", Morgan-Kaufman Publishers. Inc., available at bookstores or at http://www1.fatbrain.com/asp/BookInfo/BookInfo.asp?theisbn=1558604820.
- IBM, IBM DB2 Universal Database SQL Reference V 7.2 at http://www.ibm.com/software/data/db2/library/.
- Graeme Birchall, "DB2 UDB V7.2 SQL Cookbook", at http://ourworld.compuserve.com/homepages/Graeme_Birchall/HTM_COOK.HTM.