Hierarchical Queries with DB2 Connect By

A new method for recursively processing data relationships

According to the SQL Standard, hierarchical (e.g. organization charts, bill of materials) or bi-directional data (e.g. flight connections) can be evaluated by using a recursive common table expression (RCTE). RCTEs were delivered with release V5R4 of DB2 for i. Other databases such as Oracle use a non-standard method for querying data called hierarchical query clause. To allow maximum portability the hierarchical query clause is introduced with PTF SF99701 Version 9 in DB2 for i. This article will explain the syntax of the hierarchical query clause, how it can be implemented in composition with new operators, pseudo columns and special scalar functions.

Share:

Birgitta Hauser (BHA@Toolmaker.de), Software Engineer, Toolmaker Advanced Efficiency GmbH

Birgitta Hauser photoBirgitta Hauser has been a Software Engineer since 2008, focusing on RPG, SQL and Web development on IBM i at Toolmaker Advanced Efficiency GmbH in Germany. She graduated with a business economics diploma, and started programming on the AS/400 in 1992. She also works in consulting and education as a trainer for RPG and SQL developers. Since 2002 she has frequently spoken at the COMMON User Groups in Germany, Switzerland and USA. In addition, she is co-author of two IBM Redbooks and also the author of several articles and papers focusing on RPG and SQL for a German magazine.



22 August 2011

Also available in Chinese

Recursion

According to Wikipedia: Recursion (lat. recurrere -> currere = run and re = return) is the process of repeating items in a self-similar way. In mathematics or computer sciences recursion refers to a method of defining functions based on their own definition.

Recursive data stored in a table (or a physical file) point to data in another column/field in a different row/record within the same table/physical file.

Relationships between recursive data can be either hierarchical (a single direction, e.g. parent -> child, manager -> subordinate) or bi-directional or cyclic (e.g. flight connections Frankfurt -> New York, but also New York -> Frankfurt).

Tables containing Recursive Data

For all subsequent examples one of the tables STAFF (Organization Chart) or FLIGHTS (Flight connections) will be used.

The following figure shows a small organization chart and the table STAFF where these data are stored. The table STAFF consists of 3 columns:

  • EMPLOYEE - Employee number
  • NAME - Employee's name
  • BOSS - Employee no of the direct supervisor
Figure 1: Hierarchical Relationship - Organization Chart - STAFF
Figure 1: Hierarchical Relationship - Organization Chart - STAFF

The next figure shows a chart containing flight connections. When analyzing bi-directional data a cycle may occur that will result in an infinite loop (marked in red). The figure also displays an excerpt of the table FLIGHTS containing the flight connections. Within this sample data you will detect there is not only a connection between Berlin and Frankfurt, but also between Frankfurt and Berlin.

The Table FLIGHTS consists of the following columns:

  • DEPARTURE - Departure city
  • ARRIVAL - Arrival city
  • PRICE - Costs of the connection
Figure 2: Bi-Directional Relationship - Flight Connections – FLIGHTS
Figure 2: Bi-Directional Relationship - Flight Connections – FLIGHTS

Recursive Common Table Expression – RCTE

According to the SQL Standard, the method to evaluate hierarchical or bi-directional data is the Recursive Common Table Expression (RCTE).

The RCTE can be divided into 2 parts:

  • The first part, the initial select statement, predefines the starting point (e.g. the manager in an organization chart or the departure in a flight connection).
  • The second part represents the iteration or recursion. This second select statement is joined with the Common Table Expression (CTE) itself by linking parent to child data. The result of the current iterative select statement is merged with the result of the initial select statement and all results returned by the previous iterations using a UNION ALL clause.

Additional clauses that must be specified between the RTCE and final SELECT statement allow predefining the sequence in which the result is to be returned.

  • SEARCH BREADTH FIRST is the default sequence and returns all data on the same level first before returning any data on the next level.
  • SEARCH DEPTH FIRST must be specified if all child data (over all levels) must be returned first before returning the data of the next parent.

To avoid infinite loops in bi-directional relationships, a CYCLE clause must be specified between RCTE and final select statement. As soon as a cycle is detected the appropriate flag defined in the CYCLE clause is set and the recursive query process for the branch is stopped.

Note: To detect a cycle, only the results returned by the iterative SELECT statement are checked. The results returned by the initial SELECT statement are not considered.

The following figure shows the structure of an RCTE for querying recursive data.

Figure 3: Recursive Common Table Expression (RCTE) - Structure
Figure 3: Recursive Common Table Expression (RCTE) - Structure

In the next example, a RTCE is used to return all subordinates of manager Bauer (not only those on the next level).

The RCTE and the final SELECT statement both return four columns, representing the level and the employee number of the subordinate and his name and finally the employee number of the direct manager.

In the initial SELECT statement, the level is fixed to 1 and the name of manager Bauer is hardcoded in the WHERE clause, to predefine the starting point.

On the iterative select statement, the RTCE STAFFLIST is joined with the table STAFF by comparing the employee number (Column EMPLOYEE) out of the RCTE with the supervisor number (Column BOSS) out of the table STAFF. The level value is increased by one on each recursive iteration of the data. Additionally, the SEARCH DEPTH FIRST clause is specified. The temporarily created field SORT is used within the ORDER BY clause, to get the result returned in the depth-first sequence.

Listing 1. Example RCTE - Determine all subordinates
With StaffList (Level, Employee, Name, Boss)

     -- Initialization
     as (Select 1, Employee, Name, Bos
            From Staff
            Where Name = 'Bauer'

     -- Iteration
        Union All
        Select Level + 1, e.Employee, e.Name, e.Boss
           from StaffList s join Staff e
             on s.Employee = e.Boss)

     -- Search Clause
     Search Depth First By Employee set Sort

-- Final Select
Select *
   From StaffList
   Order By Sort;
Level Employee Name Boss
1 202 Bauer 101
2 303 Becker 202
3 403 Jäger 303
3 404 Schmidt 303
3 405 Hund 303
2 304 Ackermann 202
3 406 Hermann 304
3 407 Straub 304

Hierarchical Query Clause

The hierarchical query clause is available on the IBM i 7.1 release with a PTF (PTF SF99701 Version 9) as advertised in the DB2 for i Technology Updates wiki. This new support provides a different way to recursively navigate data relationships. The hierarchical query clause is easier to understand than the RCTE syntax. The hierarchical query support also corresponds to the non-standard method that is available with Oracle databases.

The hierarchical query clause is not implemented as common table expression, but the clause is specified as part of a sub-select and must be specified after the WHERE, GROUP BY or HAVING clause.

The hierarchical query clause consists of two portions:

  • START WITH predicate defines the starting point and can be compared with the initial SELECT statement in a RCTE. It is possible to define multiple starting points by using SQL Predicates (such as IN or LIKE) or logical operators to combine several conditions.
  • CONNECT BY predicate defines the join conditions between parent and child elements. For bi-directional relationships the keyword NOCYCLE must be specified within the CONNECT BY part. If NOCYCLE is not specified for a bi-directional relationship, the query is not executed but an error is returned (contrary to a RCTE which may end up in an infinite loop without appropriate CYCLE clause).

The following example shows a SELECT statement with a hierarchical query clause to return all direct subordinates for the manager Meier (employee number 101).

Listing 2. Hierarchical Query Clause - Next Level
Select 101 as Boss, Employee, Name
   From Staff
   Start with Boss = 101
   Connect By Employee = Boss;
Boss Employee Name
101 201 Huber
101 202 Bauer

You may be disappointed because this kind of query easily could have been done with a simple WHERE clause, i.e., WHERE BOSS = 101.

PRIOR Operator

To get all subordinates over all levels, the PRIOR operator must be added to the CONNECT BY part of the hierarchical query clause. The PRIOR operation must prefix the column to be resolved. For example if you want to determine all subordinates (Column EMPLOYEE) for a specific manager (Column BOSS), PRIOR must be placed in front of the EMPLOYEE Column. The PRIOR operator can be located left or right of the comparative operator.

In the following example, all subordinates (not only on the next level) for manager Bauer (employee number 202) are returned. In the first statement, PRIOR is used to the left of the equal sign while in the second statement it is specified on the other side. Both statements return the same result.

Listing 3. Hierarchical Query - Operator PRIOR
Select 202 Boss, Employee, Name
   From Staff
   Start with Boss = 202
   Connect By Prior Employee = Boss;
Boss Employee Name
202 303 Becker
202 405 Hund
202 404 Schmidt
202 403 Jäger
202 304 Ackermann
202 407 Straub
202 406 Hermann
Select 202 Boss, Employee, Name
   From Staff
   Start with Boss = 202
   Connect By Boss = Prior Employee;
Boss Employee Name
202 303 Becker
202 405 Hund
202 404 Schmidt
202 403 Jäger
202 304 Ackermann
202 407 Straub
202 406 Hermann

If the join key is a composite key, each key column must be prefixed with the PRIOR operator:

CONNECT BY     PRIOR a.Department = b.Department
           AND PRIOR Employee     = Boss

When comparing the result of the previous query with the data in table STAFF, the result is returned in depth-first order, that means all child rows are returned first before the next parent is returned (contrary to RCTEs whose default for returning data is breadth-first order, i.e., level by level).

Order Siblings By

Even though the data in the previous query is returned in search depth first sequence there is no specific order among the siblings, neither on the employee number 405, 404, 403 nor on the name Hund, Schmidt, Jäger.

Adding an ORDER SIBLINGS BY clause which is only allowed in composition with the hierarchical query clause and will return the siblings in the predefined sequence. Like in a regular ORDER BY clause multiple columns can be specified in either ascending or descending sequence.

In the next example, the previous query is modified, by adding an ORDER BY SIBLINGS clause to get the siblings in the result sorted by NAME. Previously Becker (employee number 303) was returned first, while in the modified query Ackermann (employee number 304) is returned in the first place and the subordinates Hund, Jäger and Schmidt are listed in alphabetical sequence, too.

Listing 4. Hierarchical Query - ORDER SIBLINGS BY
Select 202 Boss, Employee, Name
   From Staff
   Start with Boss = 202
   Connect By Prior Employee = Boss;
   Order Siblings by Employee;
Boss Employee Name
202 303 Becker
202 403 Jäger
202 404 Schmidt
202 405 Hund
202 304 Ackermann
202 406 Hermann
202 407 Straub

CONNECT_BY_ROOT Operator

In the previous examples, the first column (i.e., the top manager) was specified as constant value. Hard coding or even worse duplicating hardcoded values (BOSS in Select and Start with clause) is never a good idea. And as soon as several starting points (e.g. multiple managers) are defined the hardcoded workaround will no longer work.

To avoid hard coding and duplicating hardcoded values, the CONNECT_BY_ROOT operator can be used, which always returns the value of its argument as it was during the initial step.

In the following example, all subordinates of the managers with the employee numbers 201 (Huber) and 202 (Bauer) or all subordinates for all managers with an employee number higher than 303, which will only be 304 (Ackermann) must be determined.

The CONNECT_BY_ROOT operator is used in this example to return the manager's employee number for the appropriate subordinate. To include the name of the manager, a sub-select is added, in which the name is determined out of the table STAFF by connecting the manager's employee number (column BOSS) with the employee number (column EMPLOYEE).

Listing 5. Hierarchical Query - CONNECT_BY_ROOT Operator
Select Connect_By_Root (b.Boss concat ' ' concat
                       (Select Name
                           From Staff a
                           where a.Employee = b.Boss)) "boss",
       Employee, Name
  From Staff b
  Start With Boss in (201, 202) or Name like '%mann%'
  Connect By Prior Employee = Boss
  Order Siblings By Employee;
Boss Employee Name
201 Huber 301 Fischer
201 Huber 401 Schulze
201 Huber 302 Müller
202 Bauer 303 Becker
202 Bauer 403 Jäger
202 Bauer 404 Schmidt
202 Bauer 405 Hund
202 Bauer 304 Ackermann
202 Bauer 406 Hermann
202 Bauer 407 Straub
304 Ackermann 406 Hermann

Pseudo Columns

For hierarchical queries several pseudo columns are provided. A pseudo column is an identifier with a predefined name that holds information about the hierarchical query and that can be used like any column in a table.

  • LEVEL – The pseudo column LEVEL returns an integer value that represents the recursive step in the hierarchy at which the row was produced.
    • For all rows produced by the start with clause LEVEL is set to 1.
    • For all rows produced by the iterations of the CONNECT BY clause LEVEL is raised by 1.

The pseudo column LEVEL can be used in the ORDER BY clause to return the result set in search breadth first sequence.

  • CONNECT_BY_ISCYCLE returns a small integer value that is set to either 0 (no cycle detected) or 1 (cycle detected).

The pseudo column can not only be used to detect cycles but also to exclude them from the result.

  • CONNECT_BY_ISLEAF returns a small integer value that is set either to 0 or 1.

The value of 1 is returned if the row is a leaf in the hierarchy, otherwise a value of 0 is returned. In hierarchical relationships, the CONNECT_BY_ISLEAF value is set to 1 for all elements that reside at the lowest level. In bi-directional relationships CONNECT_BY_ISLEAF returns the same value as CONNECT_BY_ISCYCLE.

In the following example, the pseudo column LEVEL is specified on the select list. The pseudo column value is also used to sort the query results in breadth-first sequence.

Listing 6. Hierarchical Query - Pseudo Column LEVEL
Select Connect_By_Root Boss "Boss",
       Level, Employee, Name
   From Staff
   Start With Boss = 101
   Connect By Prior Employee = Boss
   Order by Level, Employee;
Boss Level Employee Name
101 1 201 Huber
101 1 202 Bauer
101 2 301 Fischer
101 2 302 Müller
101 2 303 Becker
101 2 304 Ackermann
101 3 401 Schulze
101 3 403 Jäger
101 3 404 Schmidt
101 3 405 Hund
101 3 406 Hermann
101 3 407 Straub

Another way to use the LEVEL pseudo column is shown in the following example. The pseudo column LEVEL is used in a string to make the hierarchy visible. Depending on the current level value, the name will be indented based on the number of blanks associated with the current level.

Listing 7. Hierarchical Query - Make Hierarchy visible
Select Employee, Name, Level,
       Space((Level - 1) * 3) concat  '/'  concat Employee
                              concat ' - ' concat Trim(Name) Hierarchy
   from Staff
   Start With Boss in (201, 202)
   Connect By Prior Employee = Boss
   Order Siblings By Employee;
Employee Name Level Hierarchy
301 Fischer 1 /301 - Fischer
401 Schulze 2      /401 - Schulze
302 Müller 1 /302 - Müller
303 Becker 1 /303 - Becker
403 Jäger 2      /403 - Jäger
404 Schmidt 2      /404 - Schmidt
405 Hund 2      /405 - Hund
304 Ackermann 1 /304 - Ackermann
406 Hermann 2      /406 - Hermann
407 Straub 2      /407 - Straub

Scalar Function SYS_CONNECT_BY_PATH

The scalar function, SYS_CONNECT_BY_PATH, can only be used in composition with hierarchical queries. It builds a string containing all elements from the root to the current row. The result of the function is returned as a CLOB (Character Large Object) data type that can hold up to 1 MB of data.

The function SYS_CONNECT_BY_PATH expects 2 string expressions.

  • The first expression must be the row value that must be concatenated
  • The second expression is the separator that will be added to separate the elements.

In the following example, all flight connections between Frankfurt and Berlin are shown.

The keyword NOCYCLE specified within the START WITH … CONNECT BY clause is required because of the bi-directional data. Connections that result in a cycle are eliminated by specifying the pseudo column CONNECT_BY_ISCYCLE in the WHERE clause.

Based on the pseudo column LEVEL a maximum of two connections (LEVEL <= 3) is returned.

Because the starting point is not checked for cyclic data and 'Frankfurt' must not be connected again, it is eliminated in the CONNECT BY clause (Arrival <> 'Frankfurt').

The departure (always Frankfurt) is determined with CONNECT_BY_ROOT and concatenated with the string, generated by the SYS_CONNECT_BY_PATH function that concatenates all ARRIVALs from all iterations separated by a dash (' – ').

Listing 8. Function SYS_CONNECT_BY_PATH
Select Connect_By_Root Departure "Departure", Arrival, Level,
       Connect_By_Root Departure concat
                       Sys_Connect_By_Path(Arrival, ' - ') "Connection"
   from flights
   Where     Arrival = 'Berlin'
         and connect-By-isCycle = 0
         and Level <= 3
   Start With Departure = 'Frankfurt'
   Connect By NoCycle Prior     Arrival  = Departure
                            and Arrival <> 'Frankfurt'
   Order By Level;
Departure Arrival Level Connection
Frankfurt Berlin 1 Frankfurt - Berlin
Frankfurt Berlin 2 Frankfurt - Hannover - Berlin
Frankfurt Berlin 2 Frankfurt - Stuttgart - Berlin
Frankfurt Berlin 2 Frankfurt - München - Berlin
Frankfurt Berlin 3 Frankfurt - Hannover - Hamburg - Berlin
Frankfurt Berlin 3 Frankfurt - Hannover - Bremen - Berlin
Frankfurt Berlin 3 Frankfurt - Köln - Bremen - Berlin
Frankfurt Berlin 3 Frankfurt - Köln - Hannover - Berlin
Frankfurt Berlin 3 Frankfurt - Karlsruhe - Stuttgart - Berlin
Frankfurt Berlin 3 Frankfurt - Stuttgart - München - Berlin
Frankfurt Berlin 3 Frankfurt - München - Hannover - Berlin
Frankfurt Berlin 3 Frankfurt - München - Stuttgart - Berlin

Concatenating numeric values

Unfortunately SYS_CONNECT_BY_PATH can only be used for concatenating character values. Sometimes you would like to concatenate numeric values - for example, calculating the total costs of the flight connections.

The bad news is that there is neither a special function nor an operator for doing it directly in composition with the hierarchical query clause. The good news is there are two ways to achieve this objective, i.e., to calculate the total costs:

  • The first option is to use an RCTE
  • The second option is using the function SYS_CONNECT_BY_PATH to build a character string concatenating all costs as character values separated by a plus sign (+). And writing a simple SQL User Defined Function (UDF) that uses dynamic SQL to convert the formula passed in the string and calculate the value.

UDF to calculate the Formula passed as String

The SYS_CONNECT_BY_PATH function can only be used to concatenate strings (i.e., it cannot be used to calculate totals).

In SQL programming, a string containing any (mathematical) formula can be converted into an executable SQL statement with the SQL command PREPARE. This dynamically prepared statement can be executed using the SQL command EXECUTE.

The following SQL script contains the source code for creating the Calculate user-defined function (UDF). This UDF expects a Character Large Object (CLOB) containing a mathematical formula (for example 3+5*17) as an input parameter. This incoming parameter value will be embedded in another string to get the result of the formula calculated and returned.

The VALUES statement provides a method for executing a SQL statement without accessing a table or view. The VALUES … INTO statement allows an SQL expression to be executed and to return the result into variables. The ? (question mark) literal in the statement string is a parameter marker that represents the variable in which the result is returned.

The complete string (i.e., the incoming formula embedded in the VALUES … INTO, VALUES combination) is converted into an executable SQL statement using the SQL PREPARE statement. The prepared statement is executed with the EXECUTE statement. The variable to retrieve the result, RtnVal, is associated with the parameter marker in the USING clause.

If an error occurs, i.e., the formula contains invalid data, a continue handler will be activated and a negative default value (-999999999999.99) will be returned.

Listing 9. Calculate String UDF - Example
Create Function MYSCHEMA/CALCULATE
      (FormulaToCalc CLOB(1048576))
       Returns Decimal (15, 2)
       Language SQL
       Not Fenced
       Set Option DbgView = *Source
Begin
   Declare RtnVal     Decimal(15, 2);
   Declare DynSQLStmt CLOB(1048576);

   Declare Continue Handler for SQLEXCEPTION
           Return -999999999999,99;

   If FormulaToCalc is NULL or FormulaToCalc = ''
      Then Return 0;
   End If;

   Set DynSQLStmt = 'Values(Values(' concat Trim(FormulaToCalc)
                                     concat ')) into ?';
   Prepare DynSQL from DynSQLStmt;
   Execute DynSQL using RtnVal;
   Return RtnVal;
End; |

As soon as the UDF is generated it can be used in composition with the function SYS_CONNECT_BY_PATH to calculate totals.

In the following example, the column LISTPRICES shows the string with the concatenated costs separated by a + (plus) sign. To display the result in the Cost column, this concatenated string is passed to and processed by the Calculate UDF.

Listing 10. UDF CALCULATE and function SYS_CONNECT_BY_PATH
Select Connect_By_Root Departure "Departure", Arrival, Level,
        Connect_By_Root Departure concat Sys_Connect_By_Path(Arrival, ' - ') Connections,
        Sys_Connect_By_Path(VarChar(Price), '+') ListPrices,
        Calculate(Sys_Connect_By_Path(VarChar(Price), '+')) "Cost"
    from flights
    Where     Arrival = 'Berlin'
          and Connect_By_isCycle = 0
          and Level <= 3
    Start With Departure = 'Frankfurt'
    Connect By NoCycle Prior     Arrival  = Departure
                             and Arrival <> 'Frankfurt'
   Order By Level;
Departure Arrival Level Connections ListPrices Cost
Frankfurt Berlin 1 Frankfurt - Berlin +80,00   80,00
Frankfurt Berlin 2 Frankfurt - Hannover - Berlin +55,00+90,00 145,00
Frankfurt Berlin 2 Frankfurt - Stuttgart - Berlin +40,00+85,00 125,00
Frankfurt Berlin 2 Frankfurt - München - Berlin +65,00+90,00 155,00
Frankfurt Berlin 3 Frankfurt - Hannover - Hamburg - Berlin +55,00+50,00+80,00 185,00
Frankfurt Berlin 3 Frankfurt - Hannover - Bremen - Berlin +55,00+40,00+85,00 180,00
Frankfurt Berlin 3 Frankfurt - Köln - Bremen - Berlin +40,00+45,00+85,00 170,00
Frankfurt Berlin 3 Frankfurt - Köln - Hannover - Berlin +40,00+70,00+90,00 200,00
Frankfurt Berlin 3 Frankfurt - Karlsruhe - Stuttgart - Berlin +50,00+45,00+85,00 180,00
Frankfurt Berlin 3 Frankfurt - Stuttgart - München - Berlin +40,00+45,00+90,00 175,00
Frankfurt Berlin 3 Frankfurt - München - Hannover - Berlin +85,00+80,00+90,00 235,00
Frankfurt Berlin 3 Frankfurt - München - Stuttgart - Berlin +65,00+45,00+85,00 195,00

Conclusion

You should now be able to create all kinds of recursive queries with either hierarchical or bi-directional relationships using:

  • The hierarchical query clause (START WITH … CONNECT BY) in composition with
    • The operators PRIOR, NOCYCLE and CONNECT_BY_ROOT
    • The special ORDER SIBLINGS BY clause
    • The function SYS_CONNECT_BY_PATH
    • The Pseudo Columns LEVEL, CONNECT_BY_ISCYCLE and CONNECT_BY_ISLEAF
  • Recursive common table expressions (RCTE)

Additionally you should be able to use a UDF in composition with function SYS_CONNECT_BY_PATH to calculate totals within recursive relationships.

And now have fun experimenting with hierarchical query clauses.


Resources

IBM i - DB2 for i SQL Reference - 7.1: Chapter 5 – Queries

IBM developerWorks DB2 for i - Forum

IBM developerWorks - IBM i Technology Updates

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into IBM i on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=IBM i, Information Management
ArticleID=752135
ArticleTitle=Hierarchical Queries with DB2 Connect By
publish-date=08222011