Develop with XQuery: A better programming language for the database programmer

Reap the benefits of XQuery with quicker development and easier maintenance

Most programmers think the XQuery language was developed to satisfy a niche market: A data querying and transformation language designed to handle XML data. In the case of relational databases, the prevailing practice is to use SQL for non-XML data and use XQuery for XML. This article makes the case that the powerful programming constructs available in the XQuery language make it a better programming language than SQL, and that this improvement in expressiveness and ease of use is enough to warrant the design of databases with an increasing emphasis on XML data types.

Share:

Kenneth Stephen (kstephe@us.ibm.com), Software Engineer, IBM

Photo of Kenneth StephenKenneth Stephen is an application architect who has 20 years of experience designing and implementing applications on platforms ranging from the PC to the mainframe. He has lots of experience designing and implementing applications using XML technologies, including XSLT and XQuery. Kenneth currently works in IBM Software Services for WebSphere.



14 February 2012

Also available in Chinese Russian Japanese Spanish

Overview

SQL was invented so programmers can abstract out code algorithms meant to handle, manipulate, or transform the data, and not lose the implementation of the algorithm in the details of getting to the data or persisting it. In this goal, SQL has been tremendously successful. It allows the creation of very complex OLTP applications, reporting systems, data warehouses, and business analytics. SQL and the relational model of database design go hand in hand, and the power and flexibility of SQL have made the relational model the dominant data model type for the past many decades.

Frequently used acronyms

  • OLTP: Online transaction processing
  • RDBMS: Relational database management systems
  • SQL: Structured Query Language

Lost in this story of success is the story of hierarchical databases. These databases had the data structured into forests of data—where record types could have parent-child relationships with each other. The programming languages used with these databases weren't as successful as SQL in allowing programmers to abstract out the mechanics of data access and persistence, and this limitation led to their decline.

XML documents represent relationships between entities using a hierarchy. XQuery is used to work with XML data in one or more documents. And XQuery lets you abstract away the mechanics of data access and persistence just as well as SQL, while providing programming facilities that are better than those of SQL. Technology has come full circle in terms of how to work with data at a high level. XQuery has some interesting aspects that make it more attractive to the programmer. In this article, I explore these benefits.

Working with hierarchical data

First, look at some situations where XQuery has an obvious advantage over SQL. In an RDBMS, it is common to find parent-child relationships represented in two separate tables of a database. For example, a purchase order with multiple items being purchased might be represented in the database through a purchase_order table and an items table. (The purchase_order table includes both order numbers and customer numbers. The items table includes order numbers and item numbers.) The items table also has a foreign key relationship on the purchase_order table. SQL handles this sort of relationship very well. See Figure 1 and Listing 1.

Figure 1. Parent-child table design for purchase order
Image showing the parent (purchase_order) and child (items) tables
Listing 1. SQL statement to find all items in a given purchase order
select item_no, item_desc 
from purchase_order po, items i 
where ord_no = 'A12345' 
and po.ord_no = i.ord_no

SQL shows its weakness when the data has a chain of relationships that are of unpredictable length. For example, a manager can have many employees, who might, in turn, each have many employees themselves, and so on. In this case, representing the data in separate parent-child tables is not practical. For one, you have to design the tables to the maximum length of the relationship—if the management chain can be at most six levels tall, then you must design six tables. This practice is ugly. What is worse is that to get to all the employees at level n from a manager at level x, you must do a join of the tables from level x through level n. This approach will be expensive in terms of the resources used.

The typical solution adopted is to have a single table where the rows of the table have a parent-child relationship with one another as in Figure 2. Columns for employee ID, manager ID, and employee name indicate the relationships. In this example, Jack Brown manages 2 employees, John Silver and Ron McDonald. John Silver manages Jon Carino. Unfortunately, querying data like this in XML results in queries that are hard to write and maintain. See Listing 2 for the code to create a SQL statement used to find employees reporting directly or indirectly to a manager.

Figure 2. Single table design for employee data with interrelated rows
Image showing the single table design for employee data
Listing 2. SQL statement to find employees reporting directly or indirectly to a manager
with reporting_to(emp_id, emp_name) as 
( 
     select emp_id, emp_name 
     from employees 
     where emp_name = 'John Silver' 

     union all 

     select direct_reports.emp_id, direct_reports.emp_name 
     from employees direct_reports, reporting_to 
     where reporting_to.emp_id = direct_reports.mgr_id 
) 
select emp_id, emp_name 
from reporting_to

As you can see, this query uses recursion and requires significant effort to understand. This problem is solved much better using XML data types and XQuery. In this case, you represent the entire organization in a single XML document in one row of the table. Listings 3 and 4 show the resulting implementation, and you get to judge which one is programmatically easier to construct and maintain.

Listing 3. XML representation of employee data
<?xml version="1.0"?> 
<org> 
     <employee id="0001"> 
          <name>Jack Brown</name> 
          <employee id="0002"> 
               <name>John Silver</name> 
               <employee id="0004"> 
                         <name>John Silver</name> 
               </employee> 
          </employee> 
          <employee id="0003"> 
               <name>Ron McDonald</name> 
          </employee> 
     </employee>
</org>
Listing 4. Finding the list of employees when using an XML data type
select emp.emp_id, emp.emp_name 
from employees, xmltable( 
     '$ORG/org//employee[name = "John Silver"]/descendant-or-self::employee' 
     columns 
          emp_id char(4) path '@id', 
          emp_name varchar(254) path 'name/text()' 
) emp

Also note that hierarchical data is more common than you might think. For example, in the previous example, purchase orders can't contain other purchase orders, so on the surface, it seems that this example would never be hierarchical. A purchase order can fail; that is, out of a list of 20 items purchased, 2 items might not be available or might not be available in sufficient quantity. Thus, the existing purchase order closes with the available items being supplied and a new purchase order is generated for the unavailable items. In such situations, it is desirable for the new purchase order to contain a link to the old one. Then you have a chain of purchase orders whose relationship links look very much like the employee data just discussed.

Working with non-hierarchical data

Non-hierarchical data represents the larger chunk of data that is modeled in the relational world. It is important that any programming language used to access the database or to write to the database be able to handle these data entities very well. In the examples that follow, I describe how XQuery handles such data very well and, from a programmer's standpoint, provides a superior solution to SQL.

Case study 1: Finding the closest match and using it

This example deals with the case in which programming tasks are organized into competitions ("events"). For each such event, the creator or manager of the event assigns a certain amount of "points" to it. A look-up table of price points is provided, with each price point given an integer label called a "level." The problem is to find the price point that has the closest match to the "points" value of the event and to use that value as the "price" of the level (how much the winning competitor of the contest gets paid). If two levels become the closest match, then you use the higher level to determine the price point that applies.

Listings 5 and 6 show the standard database design and a sampling of the data.

Listing 5. Database table design using non-XML data types
[db2pe@lc4eb4168274532 code]$ db2 describe table pricing 
                                Data type                     Column 
Column name                     schema    Data type name      Length     Scale Nulls 
------------------------------- --------- ------------------- ---------- ----- ------ 
EVENT_ID                        SYSIBM    INTEGER                      4     0 No 
POINTS                          SYSIBM    INTEGER                      4     0 Yes 
PRICE                           SYSIBM    INTEGER                      4     0 Yes 

     3 record(s) selected. 

[db2pe@lc4eb4168274532 code]$ db2 describe table pricing_tier 
                                Data type                     Column 
Column name                     schema    Data type name      Length     Scale Nulls 
------------------------------- --------- ------------------- ---------- ----- ------ 
LEVEL                           SYSIBM    INTEGER                      4     0 No 
PAYMENT                         SYSIBM    INTEGER                      4     0 Yes 

     2 record(s) selected.
Listing 6. Sample data from the tables
[db2pe@lc4eb4168274532 code]$ db2 "select * from pricing fetch first 5 rows only" 

EVENT_ID    POINTS      PRICE 
---------- ---------- ----------- 
      10472         640           0 
      10471         220           0 
      10470         190           0 
      10469         180           0 
      10466         780           0 

     5 record(s) selected. 

[db2pe@lc4eb4168274532 code]$ db2 "select * from pricing_tier" 

LEVEL       PAYMENT 
----------- ----------- 
          0          60 
          1         120 
          2         240 
          3         360 
          4         480 
          5         600 
          6         720 
          7         840 

     8 record(s) selected.

Listing 7 shows the required SQL update statement.

Listing 7. Update statement to update the pricing table using standard SQL
update pricing o 
    set (price) = ( 
        select payment 
        from ( 
            select pricing_info.event_id event_id , max(level) matched_level 
            -- "min_values" will have the smallest difference between the current 
            -- price and all the pre-defined price points for each event. "pricing_info" 
            -- will contain an index of all the price point differences tabulated 
            -- by the price point level. A join of these two tables by event_id 
            -- and price point difference, should get you the price point level that 
            -- you are seeking for each event. 
            from ( 
                select event_id, min(absdiff) 
                from ( 
                    -- For each event_id, calculate the absolute difference 
                    -- between the existing price and the price points 
                    select event_id, abs(points - payment) absdiff, level 
                    from pricing, pricing_tier 
                ) 
                group by event_id 
            ) as min_values(event_id, closest_match), ( 
                -- For each event_id, calculate the absolute difference 
                -- between the existing price and the price points 
                select event_id, abs(points - payment) absdiff, level 
                from pricing, pricing_tier 
            ) as pricing_info(event_id, absdiff, level)
            where min_values.event_id = pricing_info.event_id 
            and closest_match = absdiff 
            group by pricing_info.event_id 
        )x , pricing_tier y 
        where x.matched_level = y.level 
        and x.event_id = o.event_id 
    )

Now look at an implementation using XML data types. The pricing_tier table and event details can be represented with the XML documents in Listings 8 and 9.

Listing 8. Pricing information as an XML document
<pricing> 
     <pricingtier level="0"><price>60</price></pricingtier> 
     <pricingtier level="1"><price>120</price></pricingtier> 
     <pricingtier level="2"><price>240</price></pricingtier> 
     <pricingtier level="3"><price>360</price></pricingtier> 
     <pricingtier level="4"><price>480</price></pricingtier> 
     <pricingtier level="5"><price>600</price></pricingtier> 
     <pricingtier level="6"><price>720</price></pricingtier> 
     <pricingtier level="7"><price>840</price></pricingtier> 
</pricing>
Listing 9. Event details as an XML document
<event id="9083" eventstate="Cancelled: Client request"> 
     <title>UCD research 5</title> 
     <points>170</points> 
</event>

Listing 10 shows the table design used.

Listing 10. Database table design when using XML data types
[db2pe@lc4eb4168274532 code]$ db2 describe table events 
                                Data type                     Column 
Column name                     schema    Data type name      Length     Scale Nulls 
------------------------------- --------- ------------------- ---------- ----- ------ 
EVENT_ID                        SYSIBM    INTEGER                      4     0 No 
EVENT                           SYSIBM    XML                          0     0 No 

     2 record(s) selected. 

[db2pe@lc4eb4168274532 code]$ db2 describe table pricing 
                                Data type                     Column 
Column name                     schema    Data type name      Length     Scale Nulls 
------------------------------- --------- ------------------- ---------- ----- ------ 
PRICING                         SYSIBM    XML                          0     0 Yes 

     1 record(s) selected.

This design assumes that the pricing information is inserted into the document as a peer element of the "points" element. Listing 11 shows the update statement that does this.

Listing 11. Update statement using XML data types and XQuery
update events o 
   set (event) = ( 
      select xmlquery(' 
         (: 
          : First build a series of "pair" elements which map a pricing level number 
          : to a value that is the absolute value of the difference between the 
          : event points and the price point. Once you have this sequence, sort by 
          : the difference and pick the lowest value. This results in you picking the 
          : closest price match. Because you break ties for the closest value by picking 
          : the higher level, you add a second sort key (level) in descending order.
          :) 
         let $closestMatch := 
            ( 
               for $pair in ( for $p in $PRICING/pricing/pricingtier 
                  let $lp := xs:int($EVENT/event/points) 
                  let $absdiff := abs($lp - xs:int($p/price)) 
                  return 
                     <pair> 
                        <level>{$p/@level}</level> 
                        <diff>{$absdiff}</diff> 
                     </pair> 
               ) 
               order by xs:int($pair/diff/text()), 
                  xs:int($pair/level/text()) descending 
               return $pair 
            )[1] 
         return 
            transform 
               copy $e := $EVENT 
            modify 
               do insert 
               <pricing>{ 
                  $PRICING/pricing/pricingtier[@level = $closestMatch/level/@level]/ 
                     price/text() 
               }</pricing> after $e/event/points 
            return $e 
         ') 
         from events a, (select pricing from pricing fetch first row only) b 
         where a.event_id = o.event_id 
      )

The main thing to note here is that the programming style very much resembles procedural programming—where intermediate calculations are done and the results are assigned to variables, which are then reused elsewhere. There are even nested for loops, which come naturally to a procedural programmer. The SQL statement in Listing 7 is identical in function and methodology but is harder to follow because the programming style is far removed from the normal procedural style. The XQuery style is quite easy to construct and maintain, leading to improved programmer productivity.

Case study 2: Finding related events in temporal data

Temporal data is a representation of the state of an entity as one or more aspects of it that vary over time. There are many ways to capture this changing state. For the sake of simplicity, focus on change in a single variable in the example that follows. Look at a data operation that is quite common in the world of data mining: At a high level, you are trying to identify a set of entities whose changing state has certain characteristics of interest. This set might be, for example, a customer who purchases detergent soon after purchasing soup. In this case, you will look at the extension of an earlier example. You are looking for employees who had more than one promotion during the 2000 through 2009 time period. See Listing 12.

Listing 12. Table structure capturing employee promotion history
[db2pe@lc4eb4168274532 code]$ db2 describe table pay_history 

                                Data type                     Column 
Column name                     schema    Data type name      Length     Scale Nulls 
------------------------------- --------- ------------------- ---------- ----- ------ 
EMP_ID                          SYSIBM    CHARACTER                    4     0 No 
START_TIME                      SYSIBM    TIMESTAMP                   10     6 No 
END_TIME                        SYSIBM    TIMESTAMP                   10     6 Yes 
JOB_LEVEL                       SYSIBM    INTEGER                      4     0 No 

     4 record(s) selected.

Listing 13 shows the SQL statement that you need.

Listing 13. Select statement to identify employees with more than one promotion in specified period
select emp_id, count(emp_id) 
from pay_history 
where start_time > '2000-01-01-00.00.00.000000' 
and end_time < '2010-01-01-00.00.00.000000' 
group by emp_id 
having count(emp_id) > 1

Listing 14 shows the XML datatype-based table design.

Listing 14. Table structure capturing employee promotion history using XML data types
[db2pe@lc4eb4168274532 code]$ db2 describe table pay_history 
                                Data type                     Column 
Column name                     schema    Data type name      Length     Scale Nulls 
------------------------------- --------- ------------------- ---------- ----- ------ 
EMP_ID                          SYSIBM    CHARACTER                    4     0 No 
HISTORY                         SYSIBM    XML                          0     0 No 

     2 record(s) selected.

Listing 15 shows the XML data representing the job history.

Listing 15. Table structure capturing employee promotion history using XML data types
<employee id="0001"> 
     <pay startDate="2001-11-23T00:00:00.000000Z" endDate="2002-10-07T00:00:00.000000Z" 
          level="4">70500</pay> 
     <pay startDate="2002-10-07T00:00:00.000000Z" endDate="2005-06-18T00:00:00.000000Z" 
          level="5">81500</pay> 
     <pay startDate="2005-06-18T00:00:00.000000Z" endDate="2007-06-01T00:00:00.000000Z" 
          level="6">96700</pay> 
     <pay startDate="2007-06-01T00:00:00.000000Z" level="7">120000</pay> 
</employee>

Listing 16 shows the XQuery-based select statement that is the equivalent of the select statement in Listing 13.

Listing 16. Table structure capturing employee promotion history using XML data types
select emp_id 
from pay_history 
where xmlexists(' 
     let $numPromotions := count( 
               $HISTORY/employee/pay[@startDate gt "2000-01-01T00:00:00.000000Z" 
                    and @endDate lt "2010-01-01T00:00:00.000000Z"] 
          ) 
     return 
          if($numPromotions gt 1)then 
          ( 
               true() 
          )else() 
')

There doesn't seem to be any improvement in the XQuery form. Let's consider an additional requirement. Employees can be demoted or promoted. Each job history record might represent a demotion as well as a promotion. If you focused only on promotions and excluded all demotions, the XQuery statement now changes to the code in Listing 17.

Listing 17. XQuery focusing only on promotions processing the job history
select emp_id 
from pay_history 
where xmlexists(' 
   let $i := 0 
   (: 
    : You have to work around the fact that the DB2 pureXML implementation of XQuery 
    : does not support the preceding-sibling axis. To do this, iterate through the 
    : "pay" elements for a given employee, and tack on the position of the element. 
    : Later on, use this position to determine the previous "pay" element. 
    :) 
   let $pairs := ( 
      for $jobChanges in $HISTORY/employee/pay[@startDate gt 
         "2000-01-01T00:00:00.000000Z" and @endDate lt "2010-01-01T00:00:00.000000Z"] 
      let $i := $i + 1 
      return 
         <pair><position>{$i}</position>{$jobChanges}</pair> 
   ) 
   let $numPromotions := count( 
      for $p in $pairs 
      let $currentPos := xs:int($p/position) 
      return 
         (: If this is the first "pay" element, its not a demotion :) 
         if($currentPos eq 1)then 
         ( 
            $p/pay 
         )else( 
            if($pairs[$currentPos - 1]/data(@level) lt $p/data(@level))then 
            ( 
               $p/pay 
            )else() 
         ) 
   ) 
   return 
      if($numPromotions gt 1)then 
      ( 
         true() 
      )else() 
')

The equivalent SQL statement, using non-XML data types, now requires a self-join to implement. Even then it will be quite complex and hard to construct. A mitigation might be to add a "job history row number" column to the pay_history table and then allow the self-join to be done on this column. This approach is left as an exercise for the reader.

Summary

The ideas conveyed in this article are not meant to be an enumeration of what is possible with XQuery. Rather, the examples are meant to illustrate how the powerful facilities provided by the language enable you to develop applications faster and to make them easier to maintain. This is a good reason, in this author's opinion, to encourage the increasing adoption of XML data types in databases.

Resources

Learn

Get products and technologies

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into XML on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=XML, Open source
ArticleID=792889
ArticleTitle= Develop with XQuery: A better programming language for the database programmer
publish-date=02142012