Parsing Strings in SQL
This article was written for IBM® DB2® Universal DatabaseTM v7.1, v7.2 and v8.1 for Linux, UNIX®, and WindowsTM.
Important: Read the disclaimer before reading this article.
A common customer situation is the need to process strings that are stored in a database managed by DB2 Universial Database (UDB). The strings contain some form of concatenated data that is to be parsed and analyzed further. This article is shows how to parse the data and illustrates some possibilities for the subsequent processing of the results.
To further clarify the scenario, let's assume that we want to deal with strings like the ones that are accessible in the column FUNC_PATH of the system catalog view SYSCAT.CHECKS. This column lists the function path that was used when the check constraint was created. Listing 1 shows some sample content of this column.1
Listing 1. Querying the DB2 system catalog view SYSCAT.CHECKS
SELECT func_path FROM syscat.checks; FUNC_PATH -------------------------------------------------------- "SYSIBM","SYSFUN","SYSPROC","STOLZE" "SYSIBM","SYSFUN","SYSPROC","MYSCHEMA" "SYSIBM","SYSFUN","SYSPROC" "SYSIBM","SYSFUN","SYSPROC" 4 record(s) selected.
As you can see, each string contains several elements that are separated by a comma, which is the delimiter. The parsing I describe now will retrieve the single elements from each string. Please note that strings using other delimiters can be handled in exactly the same way with only the minor change of the delimiter itself.
There are several ways to implement the described parsing. Obviously, a user-defined table function implemented in Java or C/C++ can do the job. Such a function would take each string as input parameter and return the single elements in separate rows. Although a function implemented in C/C++ would provide the best performance, using an external programming language might not be desired as it imposes several requirements on the development environment; for example, the library implementing the user-defined function (UDF) has to be deployed for each of the target systems. SQL as a programming language is already quite capable of handling the specified task and it can be used by the database user.
Iterating over a string
The basic task to parse a string is to iterate over all the characters in the string. SQL offers two ways to do this iteration:
- FOR loops
- Recursive queries
Inline SQL PL for partinioned databases (ESE) was added in DB2 Version 8.1. Loops, as a part of inline SQL PL, are available in Version 8.1 but not yet for partinioned databases in Version 7.2. Additionally, it is more difficult to represent the extracted elements of the string in a table, which would simplify the further processing in SQL, if loops were used. Therefore, this article focuses on a general solution based on recursive queries.
The iteration in the recursive query will be used to find all the delimiters in the given string. Based on the position of the delimiters, I can then extract the substrings of the single elements. To simplify the steps, we use SQL functions for each of the tasks that need to be performed. The first function shown in Listing 2 will take a string as input parameter and find all the delimiters in the string. The function returns ascending numbers that identify each delimiter as well as its position (index) in the string. The identifier/position pairs are returned as a table with two columns.
Listing 2. A function to return the identifier and index of all delimiters
CREATE FUNCTION elemIdx ( string CLOB(64K) ) RETURNS TABLE ( ordinal INTEGER, index INTEGER ) LANGUAGE SQL DETERMINISTIC NO EXTERNAL ACTION CONTAINS SQL RETURN WITH t(ordinal, index) AS ( VALUES ( 0, 0 ) UNION ALL SELECT ordinal+1, COALESCE(NULLIF( -- find the next delimiter ',' LOCATE(',', string, index+1), 0), LENGTH(string)+1) FROM t -- to prevent a warning condition for infinite -- recursions, we add the explicit upper -- boundary for the "ordinal" values WHERE ordinal < 10000 AND -- terminate if there are no further delimiters -- remaining LOCATE(',', string, index+1) <> 0 ) SELECT ordinal, index FROM t UNION ALL -- add indicator for the end of the string SELECT MAX(ordinal)+1, LENGTH(string)+1 FROM t ;
Note that the function does not only return the index for each delimiter but also adds two rows, one for the index 0 (zero) indicating the beginning of the string, and another that indicates the end of the string. This will allow for an easy extraction of the first and last elements.
Listing 3 illustrates the usage and the results of the function for the string "abc, def, ghi, 123".
Listing 3. Sample output of the function elemIdx()
SELECT * FROM TABLE ( elemIdx('abc, def, ghi, 123') ) AS t ORDER BY 1; ORDINAL INDEX ----------- ----------- 0 0 1 4 2 9 3 14 4 19 5 record(s) selected.
The results show that the string is 19 characters long (index 4) and that a delimiting comma can be found at the positions 4, 9, and 14. As mentioned before, the first row only indicates the beginning of the string.
Extracting the substrings
Using the index information returned by the function
elemIdx(), I can now extract the elements from the string. To that end, the indexes are determined and then used as parameters for the DB2 built-in function SUBSTR. Listing 4 shows how the function could look.
Listing 4. Function to return the single elements
CREATE FUNCTION elements ( string CLOB(64K) ) RETURNS TABLE ( elements CLOB(64K) ) LANGUAGE SQL DETERMINISTIC NO EXTERNAL ACTION CONTAINS SQL RETURN WITH t(ordinal, index) AS ( SELECT ordinal, index FROM TABLE ( elemIdx(string) ) AS x ) SELECT SUBSTR(string, t1.index+1, t2.index - t1.index - 1) -- the join below makes sure that we have the lower and -- upper index where we can find each of the ',' delimiters -- that are separating the elements. (For this, we exploit -- the additional indexes pointing to the beginning and end -- of the string.) FROM t AS t1 JOIN t AS t2 ON ( t2.ordinal = t1.ordinal+1 ) ;
The join condition ensures that we combine the correct upper and lower bounds for each element. The lower bound then determines where the element starts, and the upper bound where it ends. The SUBSTR() function expects the length of the string to be extracted, and the length is calculated from the upper bound and the lower bound using simple subtraction. The same string as in Listing 3 is now used again. Listing 5 shows a sample query where we extract the single elements from the input string. The second query illustrates the use of the DB2 built-in functions RTRIM() and LTRIM() to get rid of the leading and trailing whitespaces. That way, the later processing will not consider two elements to be different, purely based on the whitespaces.
Listing 5. Sample output of the function elements()
SELECT VARCHAR(elem, 20) FROM TABLE ( elements('abc, def, ghi, 123') ) AS t(elem); 1 -------------------- abc def ghi 123 4 record(s) selected. SELECT VARCHAR(LTRIM(RTRIM(elem)), 20), LENGTH(LTRIM(RTRIM(elem))), LENGTH(elem) FROM TABLE ( elements('abc, def , ghi, 123') ) AS t(elem); 1 2 3 -------------------- ----------- ----------- abc 3 3 def 3 5 ghi 3 4 123 3 4 4 record(s) selected.
As you can see, all of the elements are returned in separate rows. Any spaces in each element are preserved.
To close this article, I present a few sample queries that show how the results from the function
elements() can be included in your queries to actually perform the required analysis of the strings and, thus achieve the original goal.
The examples will all use the table and data shown in Listing 6. After you have created the two functions described above, you can run the queries and will receive the results presented in this section.
Listing 6. Creating the table and inserting example data
CREATE TABLE strings ( id INTEGER NOT NULL PRIMARY KEY, str VARCHAR(128) NOT NULL ); INSERT INTO strings VALUES ( 1, 'abc, def, ghi, 123' ), ( 2, '123,456789,abc,123' ), ( 3, 'a,b,c,a,b,c,a' ), ( 4, 'string' ); SELECT * FROM strings; ID STR ----------- ------------------------------------------------- 1 abc, def, ghi, 123 2 123,456789,abc,123 3 a,b,c,a,b,c,a 4 string 4 record(s) selected.
Counting how many distinct elements there are in a string can be accomplished with the query shown in Listing 3. The first query uses a string consisting of four elements where two are identical. In other words, there are three distinct elements. The second query in Listing 7 shows the results for the same query run against the table strings.
Listing 7. Counting distinct elements in a string
SELECT COUNT(DISTINCT VARCHAR(RTRIM(LTRIM(elem)), 1000)) FROM TABLE ( elements('abc, def,abc ,ghi') ) AS x(elem); 1 ----------- 3 1 record(s) selected. SELECT id, COUNT(DISTINCT VARCHAR(RTRIM(LTRIM(elem)), 20)) FROM strings, TABLE ( elements(str) ) AS t(elem) GROUP BY id; ID 2 ----------- ----------- 1 4 2 3 3 3 4 1 4 record(s) selected.
In another situation, you might want to know how often the element '123' appears in the strings. Listing 8 shows how such a query could be formulated.
Listing 8. Counting the occurrences of a given string
SELECT id, ( SELECT COUNT(*) FROM TABLE ( elements(str) ) AS x(elem) WHERE VARCHAR(RTRIM(LTRIM(elem)), 100) = '123' ) FROM strings; ID 2 ----------- ----------- 1 1 2 2 3 0 4 0 4 record(s) selected.
As is apparent in the queries, I usually have a join between the table containing the strings to be parsed, and the table function
elements(). The processing applied by DB2 for those table functions needs some more elaboration as they work differently from normal joins based on columns from two base tables or views. During the join processing, DB2 will do the following:
- Fetch one row from the table named strings
- Give the STR values from that row to the table function
- All rows returned by elements() for this single input parameter are now related to the row fetched in step (1) and not to any other row
- Go to step 1 and fetch the next row until all rows are processed
This article showed how strings that consist of several elements separated by a delimiter can be parsed and processed using only SQL constructs. To that end, we defined two functions,
elements(), which will calculate the positions of the delimiters in the string and, based on that information, extract the actual elements as substrings. Several examples are shown to illustrate the results and how they can be used in more complex SQL statements.
This article contains sample code. IBM grants you ("Licensee") a non-exclusive, royalty free, license to use this sample code. However, the sample code is provided as-is and without any warranties, whether EXPRESS OR IMPLIED, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT. IBM AND ITS LICENSORS SHALL NOT BE LIABLE FOR ANY DAMAGES SUFFERED BY LICENSEE THAT RESULT FROM YOUR USE OF THE SOFTWARE. IN NO EVENT WILL IBM OR ITS LICENSORS BE LIABLE FOR ANY LOST REVENUE, PROFIT OR DATA, OR FOR DIRECT, INDIRECT, SPECIAL, CONSEQUENTIAL, INCIDENTAL OR PUNITIVE DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF THE USE OF OR INABILITY TO USE SOFTWARE, EVEN IF IBM HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
1All SQL statements shown in this article use a semicolon as statement terminator. If you execute the statement from the DB2 command line, then you have to start the command line using
db2 -t. When using the DB2 Command Center, you can adjust the statement terminator in the menu Tools -> Tools Settings -> Use statement termination character. Alternatively, you can change the terminator for statements to be executed.