Hierarchical Queries with DB2 Connect By
A new method for recursively processing data relationships
According to Wikipedia: Recursion (lat. recurrere -> currere = run and re = return) is the process of repeating items in a self-similar way. In mathematics or computer sciences recursion refers to a method of defining functions based on their own definition.
Recursive data stored in a table (or a physical file) point to data in another column/field in a different row/record within the same table/physical file.
Relationships between recursive data can be either hierarchical (a single direction, e.g. parent -> child, manager -> subordinate) or bi-directional or cyclic (e.g. flight connections Frankfurt -> New York, but also New York -> Frankfurt).
Tables containing Recursive Data
For all subsequent examples one of the tables STAFF (Organization Chart) or FLIGHTS (Flight connections) will be used.
The following figure shows a small organization chart and the table STAFF where these data are stored. The table STAFF consists of 3 columns:
- EMPLOYEE - Employee number
- NAME - Employee's name
- BOSS - Employee no of the direct supervisor
Figure 1: Hierarchical Relationship - Organization Chart - STAFF
The next figure shows a chart containing flight connections. When analyzing bi-directional data a cycle may occur that will result in an infinite loop (marked in red). The figure also displays an excerpt of the table FLIGHTS containing the flight connections. Within this sample data you will detect there is not only a connection between Berlin and Frankfurt, but also between Frankfurt and Berlin.
The Table FLIGHTS consists of the following columns:
- DEPARTURE - Departure city
- ARRIVAL - Arrival city
- PRICE - Costs of the connection
Figure 2: Bi-Directional Relationship - Flight Connections – FLIGHTS
Recursive Common Table Expression – RCTE
According to the SQL Standard, the method to evaluate hierarchical or bi-directional data is the Recursive Common Table Expression (RCTE).
The RCTE can be divided into 2 parts:
- The first part, the initial select statement, predefines the starting point (e.g. the manager in an organization chart or the departure in a flight connection).
- The second part represents the iteration or recursion. This second select statement is joined with the Common Table Expression (CTE) itself by linking parent to child data. The result of the current iterative select statement is merged with the result of the initial select statement and all results returned by the previous iterations using a UNION ALL clause.
Additional clauses that must be specified between the RTCE and final SELECT statement allow predefining the sequence in which the result is to be returned.
- SEARCH BREADTH FIRST is the default sequence and returns all data on the same level first before returning any data on the next level.
- SEARCH DEPTH FIRST must be specified if all child data (over all levels) must be returned first before returning the data of the next parent.
To avoid infinite loops in bi-directional relationships, a CYCLE clause must be specified between RCTE and final select statement. As soon as a cycle is detected the appropriate flag defined in the CYCLE clause is set and the recursive query process for the branch is stopped.
Note: To detect a cycle, only the results returned by the iterative SELECT statement are checked. The results returned by the initial SELECT statement are not considered.
The following figure shows the structure of an RCTE for querying recursive data.
Figure 3: Recursive Common Table Expression (RCTE) - Structure
In the next example, a RTCE is used to return all subordinates of manager Bauer (not only those on the next level).
The RCTE and the final SELECT statement both return four columns, representing the level and the employee number of the subordinate and his name and finally the employee number of the direct manager.
In the initial SELECT statement, the level is fixed to 1 and the name of manager Bauer is hardcoded in the WHERE clause, to predefine the starting point.
On the iterative select statement, the RTCE STAFFLIST is joined with the table STAFF by comparing the employee number (Column EMPLOYEE) out of the RCTE with the supervisor number (Column BOSS) out of the table STAFF. The level value is increased by one on each recursive iteration of the data. Additionally, the SEARCH DEPTH FIRST clause is specified. The temporarily created field SORT is used within the ORDER BY clause, to get the result returned in the depth-first sequence.
Listing 1. Example RCTE - Determine all subordinates
With StaffList (Level, Employee, Name, Boss) -- Initialization as (Select 1, Employee, Name, Bos From Staff Where Name = 'Bauer' -- Iteration Union All Select Level + 1, e.Employee, e.Name, e.Boss from StaffList s join Staff e on s.Employee = e.Boss) -- Search Clause Search Depth First By Employee set Sort -- Final Select Select * From StaffList Order By Sort;
Hierarchical Query Clause
The hierarchical query clause is available on the IBM i 7.1 release with a PTF (PTF SF99701 Version 9) as advertised in the DB2 for i Technology Updates wiki. This new support provides a different way to recursively navigate data relationships. The hierarchical query clause is easier to understand than the RCTE syntax. The hierarchical query support also corresponds to the non-standard method that is available with Oracle databases.
The hierarchical query clause is not implemented as common table expression, but the clause is specified as part of a sub-select and must be specified after the WHERE, GROUP BY or HAVING clause.
The hierarchical query clause consists of two portions:
- START WITH predicate defines the starting point and can be compared with the initial SELECT statement in a RCTE. It is possible to define multiple starting points by using SQL Predicates (such as IN or LIKE) or logical operators to combine several conditions.
- CONNECT BY predicate defines the join conditions between parent and child elements. For bi-directional relationships the keyword NOCYCLE must be specified within the CONNECT BY part. If NOCYCLE is not specified for a bi-directional relationship, the query is not executed but an error is returned (contrary to a RCTE which may end up in an infinite loop without appropriate CYCLE clause).
The following example shows a SELECT statement with a hierarchical query clause to return all direct subordinates for the manager Meier (employee number 101).
Listing 2. Hierarchical Query Clause - Next Level
Select 101 as Boss, Employee, Name From Staff Start with Boss = 101 Connect By Employee = Boss;
You may be disappointed because this kind of query easily could have been done with a simple WHERE clause, i.e., WHERE BOSS = 101.
To get all subordinates over all levels, the PRIOR operator must be added to the CONNECT BY part of the hierarchical query clause. The PRIOR operation must prefix the column to be resolved. For example if you want to determine all subordinates (Column EMPLOYEE) for a specific manager (Column BOSS), PRIOR must be placed in front of the EMPLOYEE Column. The PRIOR operator can be located left or right of the comparative operator.
In the following example, all subordinates (not only on the next level) for manager Bauer (employee number 202) are returned. In the first statement, PRIOR is used to the left of the equal sign while in the second statement it is specified on the other side. Both statements return the same result.
Listing 3. Hierarchical Query - Operator PRIOR
Select 202 Boss, Employee, Name From Staff Start with Boss = 202 Connect By Prior Employee = Boss;
Select 202 Boss, Employee, Name From Staff Start with Boss = 202 Connect By Boss = Prior Employee;
If the join key is a composite key, each key column must be prefixed with the PRIOR operator:
CONNECT BY PRIOR a.Department = b.Department AND PRIOR Employee = Boss
When comparing the result of the previous query with the data in table STAFF, the result is returned in depth-first order, that means all child rows are returned first before the next parent is returned (contrary to RCTEs whose default for returning data is breadth-first order, i.e., level by level).
Order Siblings By
Even though the data in the previous query is returned in search depth first sequence there is no specific order among the siblings, neither on the employee number 405, 404, 403 nor on the name Hund, Schmidt, Jäger.
Adding an ORDER SIBLINGS BY clause which is only allowed in composition with the hierarchical query clause and will return the siblings in the predefined sequence. Like in a regular ORDER BY clause multiple columns can be specified in either ascending or descending sequence.
In the next example, the previous query is modified, by adding an ORDER BY SIBLINGS clause to get the siblings in the result sorted by NAME. Previously Becker (employee number 303) was returned first, while in the modified query Ackermann (employee number 304) is returned in the first place and the subordinates Hund, Jäger and Schmidt are listed in alphabetical sequence, too.
Listing 4. Hierarchical Query - ORDER SIBLINGS BY
Select 202 Boss, Employee, Name From Staff Start with Boss = 202 Connect By Prior Employee = Boss; Order Siblings by Employee;
In the previous examples, the first column (i.e., the top manager) was specified as constant value. Hard coding or even worse duplicating hardcoded values (BOSS in Select and Start with clause) is never a good idea. And as soon as several starting points (e.g. multiple managers) are defined the hardcoded workaround will no longer work.
To avoid hard coding and duplicating hardcoded values, the CONNECT_BY_ROOT operator can be used, which always returns the value of its argument as it was during the initial step.
In the following example, all subordinates of the managers with the employee numbers 201 (Huber) and 202 (Bauer) or all subordinates for all managers with an employee number higher than 303, which will only be 304 (Ackermann) must be determined.
The CONNECT_BY_ROOT operator is used in this example to return the manager's employee number for the appropriate subordinate. To include the name of the manager, a sub-select is added, in which the name is determined out of the table STAFF by connecting the manager's employee number (column BOSS) with the employee number (column EMPLOYEE).
Listing 5. Hierarchical Query - CONNECT_BY_ROOT Operator
Select Connect_By_Root (b.Boss concat ' ' concat (Select Name From Staff a where a.Employee = b.Boss)) "boss", Employee, Name From Staff b Start With Boss in (201, 202) or Name like '%mann%' Connect By Prior Employee = Boss Order Siblings By Employee;
For hierarchical queries several pseudo columns are provided. A pseudo column is an identifier with a predefined name that holds information about the hierarchical query and that can be used like any column in a table.
- LEVEL – The pseudo column LEVEL returns an integer value that represents the recursive step in the hierarchy at which the row was produced.
- For all rows produced by the start with clause LEVEL is set to 1.
- For all rows produced by the iterations of the CONNECT BY clause LEVEL is raised by 1.
The pseudo column LEVEL can be used in the ORDER BY clause to return the result set in search breadth first sequence.
- CONNECT_BY_ISCYCLE returns a small integer value that is set to either 0 (no cycle detected) or 1 (cycle detected).
The pseudo column can not only be used to detect cycles but also to exclude them from the result.
- CONNECT_BY_ISLEAF returns a small integer value that is set either to 0 or 1.
The value of 1 is returned if the row is a leaf in the hierarchy, otherwise a value of 0 is returned. In hierarchical relationships, the CONNECT_BY_ISLEAF value is set to 1 for all elements that reside at the lowest level. In bi-directional relationships CONNECT_BY_ISLEAF returns the same value as CONNECT_BY_ISCYCLE.
In the following example, the pseudo column LEVEL is specified on the select list. The pseudo column value is also used to sort the query results in breadth-first sequence.
Listing 6. Hierarchical Query - Pseudo Column LEVEL
Select Connect_By_Root Boss "Boss", Level, Employee, Name From Staff Start With Boss = 101 Connect By Prior Employee = Boss Order by Level, Employee;
Another way to use the LEVEL pseudo column is shown in the following example. The pseudo column LEVEL is used in a string to make the hierarchy visible. Depending on the current level value, the name will be indented based on the number of blanks associated with the current level.
Listing 7. Hierarchical Query - Make Hierarchy visible
Select Employee, Name, Level, Space((Level - 1) * 3) concat '/' concat Employee concat ' - ' concat Trim(Name) Hierarchy from Staff Start With Boss in (201, 202) Connect By Prior Employee = Boss Order Siblings By Employee;
|301||Fischer||1||/301 - Fischer|
|401||Schulze||2||/401 - Schulze|
|302||Müller||1||/302 - Müller|
|303||Becker||1||/303 - Becker|
|403||Jäger||2||/403 - Jäger|
|404||Schmidt||2||/404 - Schmidt|
|405||Hund||2||/405 - Hund|
|304||Ackermann||1||/304 - Ackermann|
|406||Hermann||2||/406 - Hermann|
|407||Straub||2||/407 - Straub|
Scalar Function SYS_CONNECT_BY_PATH
The scalar function, SYS_CONNECT_BY_PATH, can only be used in composition with hierarchical queries. It builds a string containing all elements from the root to the current row. The result of the function is returned as a CLOB (Character Large Object) data type that can hold up to 1 MB of data.
The function SYS_CONNECT_BY_PATH expects 2 string expressions.
- The first expression must be the row value that must be concatenated
- The second expression is the separator that will be added to separate the elements.
In the following example, all flight connections between Frankfurt and Berlin are shown.
The keyword NOCYCLE specified within the START WITH … CONNECT BY clause is required because of the bi-directional data. Connections that result in a cycle are eliminated by specifying the pseudo column CONNECT_BY_ISCYCLE in the WHERE clause.
Based on the pseudo column LEVEL a maximum of two connections (LEVEL <= 3) is returned.
Because the starting point is not checked for cyclic data and 'Frankfurt' must not be connected again, it is eliminated in the CONNECT BY clause (Arrival <> 'Frankfurt').
The departure (always Frankfurt) is determined with CONNECT_BY_ROOT and concatenated with the string, generated by the SYS_CONNECT_BY_PATH function that concatenates all ARRIVALs from all iterations separated by a dash (' – ').
Listing 8. Function SYS_CONNECT_BY_PATH
Select Connect_By_Root Departure "Departure", Arrival, Level, Connect_By_Root Departure concat Sys_Connect_By_Path(Arrival, ' - ') "Connection" from flights Where Arrival = 'Berlin' and connect-By-isCycle = 0 and Level <= 3 Start With Departure = 'Frankfurt' Connect By NoCycle Prior Arrival = Departure and Arrival <> 'Frankfurt' Order By Level;
|Frankfurt||Berlin||1||Frankfurt - Berlin|
|Frankfurt||Berlin||2||Frankfurt - Hannover - Berlin|
|Frankfurt||Berlin||2||Frankfurt - Stuttgart - Berlin|
|Frankfurt||Berlin||2||Frankfurt - München - Berlin|
|Frankfurt||Berlin||3||Frankfurt - Hannover - Hamburg - Berlin|
|Frankfurt||Berlin||3||Frankfurt - Hannover - Bremen - Berlin|
|Frankfurt||Berlin||3||Frankfurt - Köln - Bremen - Berlin|
|Frankfurt||Berlin||3||Frankfurt - Köln - Hannover - Berlin|
|Frankfurt||Berlin||3||Frankfurt - Karlsruhe - Stuttgart - Berlin|
|Frankfurt||Berlin||3||Frankfurt - Stuttgart - München - Berlin|
|Frankfurt||Berlin||3||Frankfurt - München - Hannover - Berlin|
|Frankfurt||Berlin||3||Frankfurt - München - Stuttgart - Berlin|
Concatenating numeric values
Unfortunately SYS_CONNECT_BY_PATH can only be used for concatenating character values. Sometimes you would like to concatenate numeric values - for example, calculating the total costs of the flight connections.
The bad news is that there is neither a special function nor an operator for doing it directly in composition with the hierarchical query clause. The good news is there are two ways to achieve this objective, i.e., to calculate the total costs:
- The first option is to use an RCTE
- The second option is using the function SYS_CONNECT_BY_PATH to build a character string concatenating all costs as character values separated by a plus sign (+). And writing a simple SQL User Defined Function (UDF) that uses dynamic SQL to convert the formula passed in the string and calculate the value.
UDF to calculate the Formula passed as String
The SYS_CONNECT_BY_PATH function can only be used to concatenate strings (i.e., it cannot be used to calculate totals).
In SQL programming, a string containing any (mathematical) formula can be converted into an executable SQL statement with the SQL command PREPARE. This dynamically prepared statement can be executed using the SQL command EXECUTE.
The following SQL script contains the source code for creating the Calculate user-defined function (UDF). This UDF expects a Character Large Object (CLOB) containing a mathematical formula (for example 3+5*17) as an input parameter. This incoming parameter value will be embedded in another string to get the result of the formula calculated and returned.
The VALUES statement provides a method for executing a SQL statement without accessing a table or view. The VALUES … INTO statement allows an SQL expression to be executed and to return the result into variables. The ? (question mark) literal in the statement string is a parameter marker that represents the variable in which the result is returned.
The complete string (i.e., the incoming formula embedded in the VALUES … INTO, VALUES combination) is converted into an executable SQL statement using the SQL PREPARE statement. The prepared statement is executed with the EXECUTE statement. The variable to retrieve the result, RtnVal, is associated with the parameter marker in the USING clause.
If an error occurs, i.e., the formula contains invalid data, a continue handler will be activated and a negative default value (-999999999999.99) will be returned.
Listing 9. Calculate String UDF - Example
Create Function MYSCHEMA/CALCULATE (FormulaToCalc CLOB(1048576)) Returns Decimal (15, 2) Language SQL Not Fenced Set Option DbgView = *Source Begin Declare RtnVal Decimal(15, 2); Declare DynSQLStmt CLOB(1048576); Declare Continue Handler for SQLEXCEPTION Return -999999999999,99; If FormulaToCalc is NULL or FormulaToCalc = '' Then Return 0; End If; Set DynSQLStmt = 'Values(Values(' concat Trim(FormulaToCalc) concat ')) into ?'; Prepare DynSQL from DynSQLStmt; Execute DynSQL using RtnVal; Return RtnVal; End; |
As soon as the UDF is generated it can be used in composition with the function SYS_CONNECT_BY_PATH to calculate totals.
In the following example, the column LISTPRICES shows the string with the concatenated costs separated by a + (plus) sign. To display the result in the Cost column, this concatenated string is passed to and processed by the Calculate UDF.
Listing 10. UDF CALCULATE and function SYS_CONNECT_BY_PATH
Select Connect_By_Root Departure "Departure", Arrival, Level, Connect_By_Root Departure concat Sys_Connect_By_Path(Arrival, ' - ') Connections, Sys_Connect_By_Path(VarChar(Price), '+') ListPrices, Calculate(Sys_Connect_By_Path(VarChar(Price), '+')) "Cost" from flights Where Arrival = 'Berlin' and Connect_By_isCycle = 0 and Level <= 3 Start With Departure = 'Frankfurt' Connect By NoCycle Prior Arrival = Departure and Arrival <> 'Frankfurt' Order By Level;
|Frankfurt||Berlin||1||Frankfurt - Berlin||+80,00||80,00|
|Frankfurt||Berlin||2||Frankfurt - Hannover - Berlin||+55,00+90,00||145,00|
|Frankfurt||Berlin||2||Frankfurt - Stuttgart - Berlin||+40,00+85,00||125,00|
|Frankfurt||Berlin||2||Frankfurt - München - Berlin||+65,00+90,00||155,00|
|Frankfurt||Berlin||3||Frankfurt - Hannover - Hamburg - Berlin||+55,00+50,00+80,00||185,00|
|Frankfurt||Berlin||3||Frankfurt - Hannover - Bremen - Berlin||+55,00+40,00+85,00||180,00|
|Frankfurt||Berlin||3||Frankfurt - Köln - Bremen - Berlin||+40,00+45,00+85,00||170,00|
|Frankfurt||Berlin||3||Frankfurt - Köln - Hannover - Berlin||+40,00+70,00+90,00||200,00|
|Frankfurt||Berlin||3||Frankfurt - Karlsruhe - Stuttgart - Berlin||+50,00+45,00+85,00||180,00|
|Frankfurt||Berlin||3||Frankfurt - Stuttgart - München - Berlin||+40,00+45,00+90,00||175,00|
|Frankfurt||Berlin||3||Frankfurt - München - Hannover - Berlin||+85,00+80,00+90,00||235,00|
|Frankfurt||Berlin||3||Frankfurt - München - Stuttgart - Berlin||+65,00+45,00+85,00||195,00|
You should now be able to create all kinds of recursive queries with either hierarchical or bi-directional relationships using:
- The hierarchical query clause (START WITH … CONNECT BY) in composition with
- The operators PRIOR, NOCYCLE and CONNECT_BY_ROOT
- The special ORDER SIBLINGS BY clause
- The function SYS_CONNECT_BY_PATH
- The Pseudo Columns LEVEL, CONNECT_BY_ISCYCLE and CONNECT_BY_ISLEAF
- Recursive common table expressions (RCTE)
Additionally you should be able to use a UDF in composition with function SYS_CONNECT_BY_PATH to calculate totals within recursive relationships.
And now have fun experimenting with hierarchical query clauses.
IBM i - DB2 for i SQL Reference - 7.1: Chapter 5 – Queries