This article shows how to parse data strings and it also illustrates some possibilities for the subsequent processing of the data.

Share:

Knut Stolze, IBM Information Integration, IBM Germany

Knut Stolze started his work with DB2 when he joined IBM as a visiting scientist at the Silicon Valley Lab where he worked on the DB2 Image Extender. He moved on to the DB2 Spatial Extender Version 8 and was responsible for several enhancements to improve the usability, the performance, and the standard-conformance of the Extender for more than two years. Currently, he works as a teaching assistant at the University of Jena, Germany, and continues his work for IBM in the area of federated databases. He can be reached through the newsgroups comp.databases.ibm-db2 and ibm.software.db2.udb.spatial or at stolze@de.ibm.com.



27 March 2003

This article was written for IBM® DB2® Universal DatabaseTM v7.1, v7.2 and v8.1 for Linux, UNIX®, and WindowsTM.

Important: Read the disclaimer before reading this article.

Introduction

IBM DB2 e-kit for Database Professionals

Learn how easy it is to get trained and certified for DB2 for Linux, UNIX, and Windows with the IBM DB2 e-kit for Database Professionals. Register now, and expand your skills portfolio, or extend your DBMS vendor support to include DB2.

A common customer situation is the need to process strings that are stored in a database managed by DB2 Universial Database (UDB). The strings contain some form of concatenated data that is to be parsed and analyzed further. This article is shows how to parse the data and illustrates some possibilities for the subsequent processing of the results.

To further clarify the scenario, let's assume that we want to deal with strings like the ones that are accessible in the column FUNC_PATH of the system catalog view SYSCAT.CHECKS. This column lists the function path that was used when the check constraint was created. Listing 1 shows some sample content of this column.1

Listing 1. Querying the DB2 system catalog view SYSCAT.CHECKS
SELECT func_path FROM syscat.checks; 
 
FUNC_PATH 
-------------------------------------------------------- 
"SYSIBM","SYSFUN","SYSPROC","STOLZE" 
"SYSIBM","SYSFUN","SYSPROC","MYSCHEMA" 
"SYSIBM","SYSFUN","SYSPROC" 
"SYSIBM","SYSFUN","SYSPROC" 
 
  4 record(s) selected.

As you can see, each string contains several elements that are separated by a comma, which is the delimiter. The parsing I describe now will retrieve the single elements from each string. Please note that strings using other delimiters can be handled in exactly the same way with only the minor change of the delimiter itself.

There are several ways to implement the described parsing. Obviously, a user-defined table function implemented in Java or C/C++ can do the job. Such a function would take each string as input parameter and return the single elements in separate rows. Although a function implemented in C/C++ would provide the best performance, using an external programming language might not be desired as it imposes several requirements on the development environment; for example, the library implementing the user-defined function (UDF) has to be deployed for each of the target systems. SQL as a programming language is already quite capable of handling the specified task and it can be used by the database user.


Iterating over a string

The basic task to parse a string is to iterate over all the characters in the string. SQL offers two ways to do this iteration:

  • FOR loops
  • Recursive queries

Inline SQL PL for partinioned databases (ESE) was added in DB2 Version 8.1. Loops, as a part of inline SQL PL, are available in Version 8.1 but not yet for partinioned databases in Version 7.2. Additionally, it is more difficult to represent the extracted elements of the string in a table, which would simplify the further processing in SQL, if loops were used. Therefore, this article focuses on a general solution based on recursive queries.

The iteration in the recursive query will be used to find all the delimiters in the given string. Based on the position of the delimiters, I can then extract the substrings of the single elements. To simplify the steps, we use SQL functions for each of the tasks that need to be performed. The first function shown in Listing 2 will take a string as input parameter and find all the delimiters in the string. The function returns ascending numbers that identify each delimiter as well as its position (index) in the string. The identifier/position pairs are returned as a table with two columns.

Listing 2. A function to return the identifier and index of all delimiters
CREATE FUNCTION elemIdx ( string CLOB(64K) ) 
   RETURNS TABLE ( ordinal INTEGER, index INTEGER ) 
   LANGUAGE SQL 
   DETERMINISTIC 
   NO EXTERNAL ACTION 
   CONTAINS SQL 
   RETURN 
      WITH t(ordinal, index) AS 
         ( VALUES ( 0, 0 ) 
           UNION ALL 
           SELECT ordinal+1, COALESCE(NULLIF( 
                     -- find the next delimiter ',' 
                     LOCATE(',', string, index+1), 0), 
                     LENGTH(string)+1) 
           FROM   t 
                  -- to prevent a warning condition for infinite 
                  -- recursions, we add the explicit upper 
                  -- boundary for the "ordinal" values 
           WHERE  ordinal < 10000 AND 
                  -- terminate if there are no further delimiters 
                  -- remaining 
                  LOCATE(',', string, index+1) <> 0 ) 
      SELECT ordinal, index 
      FROM   t 
      UNION ALL 
      -- add indicator for the end of the string 
      SELECT MAX(ordinal)+1, LENGTH(string)+1 
      FROM   t 
;

Note that the function does not only return the index for each delimiter but also adds two rows, one for the index 0 (zero) indicating the beginning of the string, and another that indicates the end of the string. This will allow for an easy extraction of the first and last elements.

Listing 3 illustrates the usage and the results of the function for the string "abc, def, ghi, 123".

Listing 3. Sample output of the function elemIdx()
SELECT * 
FROM TABLE ( elemIdx('abc, def, ghi, 123') ) AS t 
ORDER BY 1; 
 
ORDINAL     INDEX 
----------- ----------- 
          0           0 
          1           4 
          2           9 
          3          14 
          4          19 
 
  5 record(s) selected.

The results show that the string is 19 characters long (index 4) and that a delimiting comma can be found at the positions 4, 9, and 14. As mentioned before, the first row only indicates the beginning of the string.


Extracting the substrings

Using the index information returned by the function elemIdx(), I can now extract the elements from the string. To that end, the indexes are determined and then used as parameters for the DB2 built-in function SUBSTR. Listing 4 shows how the function could look.

Listing 4. Function to return the single elements
CREATE FUNCTION elements ( string CLOB(64K) ) 
   RETURNS TABLE ( elements CLOB(64K) ) 
   LANGUAGE SQL 
   DETERMINISTIC 
   NO EXTERNAL ACTION 
   CONTAINS SQL 
   RETURN 
      WITH t(ordinal, index) AS 
         ( SELECT ordinal, index 
           FROM   TABLE ( elemIdx(string) ) AS x ) 
      SELECT SUBSTR(string, t1.index+1, t2.index - t1.index - 1) 
      -- the join below makes sure that we have the lower and 
      -- upper index where we can find each of the ',' delimiters 
      -- that are separating the elements.  (For this, we exploit 
      -- the additional indexes pointing to the beginning and end 
      -- of the string.) 
      FROM   t AS t1 JOIN t AS t2 ON 
                ( t2.ordinal = t1.ordinal+1 ) 
;

The join condition ensures that we combine the correct upper and lower bounds for each element. The lower bound then determines where the element starts, and the upper bound where it ends. The SUBSTR() function expects the length of the string to be extracted, and the length is calculated from the upper bound and the lower bound using simple subtraction. The same string as in Listing 3 is now used again. Listing 5 shows a sample query where we extract the single elements from the input string. The second query illustrates the use of the DB2 built-in functions RTRIM() and LTRIM() to get rid of the leading and trailing whitespaces. That way, the later processing will not consider two elements to be different, purely based on the whitespaces.

Listing 5. Sample output of the function elements()
SELECT VARCHAR(elem, 20) 
FROM   TABLE ( elements('abc, def, ghi, 123') ) AS t(elem); 
 
1 
-------------------- 
abc 
 def 
 ghi 
 123 
 
  4 record(s) selected. 
 
 
 
 SELECT VARCHAR(LTRIM(RTRIM(elem)), 20), 
        LENGTH(LTRIM(RTRIM(elem))), LENGTH(elem) 
 FROM   TABLE ( elements('abc, def , ghi, 123') ) AS t(elem); 
 
 1                    2           3 
 -------------------- ----------- ----------- 
 abc                            3           3 
 def                            3           5 
 ghi                            3           4 
 123                            3           4 
 
   4 record(s) selected.

As you can see, all of the elements are returned in separate rows. Any spaces in each element are preserved.


Usage scenarios

To close this article, I present a few sample queries that show how the results from the function elements() can be included in your queries to actually perform the required analysis of the strings and, thus achieve the original goal.

The examples will all use the table and data shown in Listing 6. After you have created the two functions described above, you can run the queries and will receive the results presented in this section.

Listing 6. Creating the table and inserting example data
CREATE TABLE strings ( 
   id   INTEGER       NOT NULL  PRIMARY KEY, 
   str  VARCHAR(128)  NOT NULL 
); 
 
INSERT 
INTO   strings 
VALUES ( 1, 'abc, def, ghi, 123' ), 
       ( 2, '123,456789,abc,123' ), 
       ( 3, 'a,b,c,a,b,c,a' ), 
       ( 4, 'string' ); 
 
SELECT * FROM strings; 
 
ID          STR 
----------- ------------------------------------------------- 
          1 abc, def, ghi, 123 
          2 123,456789,abc,123 
          3 a,b,c,a,b,c,a 
          4 string 
 
  4 record(s) selected.

Counting how many distinct elements there are in a string can be accomplished with the query shown in Listing 3. The first query uses a string consisting of four elements where two are identical. In other words, there are three distinct elements. The second query in Listing 7 shows the results for the same query run against the table strings.

Listing 7. Counting distinct elements in a string
SELECT COUNT(DISTINCT VARCHAR(RTRIM(LTRIM(elem)), 1000)) 
FROM   TABLE ( elements('abc, def,abc ,ghi') ) AS x(elem); 
 
1 
----------- 
          3 
 
  1 record(s) selected. 
 
 
SELECT id, COUNT(DISTINCT VARCHAR(RTRIM(LTRIM(elem)), 20)) 
FROM   strings, TABLE ( elements(str) ) AS t(elem) 
GROUP BY id; 
 
ID          2 
----------- ----------- 
          1           4 
          2           3 
          3           3 
          4           1 
 
  4 record(s) selected.

In another situation, you might want to know how often the element '123' appears in the strings. Listing 8 shows how such a query could be formulated.

Listing 8. Counting the occurrences of a given string

SELECT id, ( SELECT COUNT(*) 
             FROM   TABLE ( elements(str) ) AS x(elem) 
             WHERE  VARCHAR(RTRIM(LTRIM(elem)), 100) = '123' ) 
FROM   strings; 
 
ID          2 
----------- ----------- 
          1           1 
          2           2 
          3           0 
          4           0 
 
  4 record(s) selected.

As is apparent in the queries, I usually have a join between the table containing the strings to be parsed, and the table function elements(). The processing applied by DB2 for those table functions needs some more elaboration as they work differently from normal joins based on columns from two base tables or views. During the join processing, DB2 will do the following:

  1. Fetch one row from the table named strings
  2. Give the STR values from that row to the table function elements()
  3. All rows returned by elements() for this single input parameter are now related to the row fetched in step (1) and not to any other row
  4. Go to step 1 and fetch the next row until all rows are processed

Summary

This article showed how strings that consist of several elements separated by a delimiter can be parsed and processed using only SQL constructs. To that end, we defined two functions, elemIdx() and elements(), which will calculate the positions of the delimiters in the string and, based on that information, extract the actual elements as substrings. Several examples are shown to illustrate the results and how they can be used in more complex SQL statements.


Disclaimer

This article contains sample code. IBM grants you ("Licensee") a non-exclusive, royalty free, license to use this sample code. However, the sample code is provided as-is and without any warranties, whether EXPRESS OR IMPLIED, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT. IBM AND ITS LICENSORS SHALL NOT BE LIABLE FOR ANY DAMAGES SUFFERED BY LICENSEE THAT RESULT FROM YOUR USE OF THE SOFTWARE. IN NO EVENT WILL IBM OR ITS LICENSORS BE LIABLE FOR ANY LOST REVENUE, PROFIT OR DATA, OR FOR DIRECT, INDIRECT, SPECIAL, CONSEQUENTIAL, INCIDENTAL OR PUNITIVE DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF THE USE OF OR INABILITY TO USE SOFTWARE, EVEN IF IBM HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.


Footnotes

1All SQL statements shown in this article use a semicolon as statement terminator. If you execute the statement from the DB2 command line, then you have to start the command line using db2 -t. When using the DB2 Command Center, you can adjust the statement terminator in the menu Tools -> Tools Settings -> Use statement termination character. Alternatively, you can change the terminator for statements to be executed.

Resources

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=14095
ArticleTitle=Parsing Strings in SQL
publish-date=03272003