DB2 10.5 for Linux, UNIX, and Windows

Regular expressions

A regular expression is a sequence of characters that acts as a pattern for matching and manipulating strings.

Regular expressions are used in the following XQuery functions: fn:matches, fn:replace, and fn:tokenize. DB2® XQuery regular expression support is based on the XML schema regular expression support as defined in the W3C Recommendation XML Schema Part 2: Datatypes Second Edition with extensions as defined by W3C Recommendation XQuery 1.0 and XPath 2.0 Functions and Operators.

Syntax


RegularExpression

                     .--------------------------.   
    (1)              V                          |   
>>-------| Branch |----+----------------------+-+--------------><
                       '-pipeChar--| Branch |-'     

Branch

   .----------------------------------.   
   V                                  |   
|----+------------------------------+-+-------------------------|
     '-| Atom |-+-----------------+-'     
                '--| Quantifier |-'       

Atom

|--+-normalCharacter-------------+------------------------------|
   +-| CharClassExpression |-----+   
   +-| CharClassEscape |---------+   
   +-^---------------------------+   
   +-$---------------------------+   
   '-(--| RegularExpression |--)-'   

Quantifier

|--+-*-------------------------+--+---+-------------------------|
   +-+-------------------------+  '-?-'   
   +-?-------------------------+          
   '-{--+-min-------------+--}-'          
        '-min--,--+-----+-'               
                  '-max-'                 

CharClassExpression

|--[--| CharGroup |--]------------------------------------------|

CharGroup

          .----------------------------------------------.   
          V                                              |   
|--+---+----+-XMLCharIncludeDash-----------------------+-+------|
   '-^-'    +-+-XMLChar----+--dashChar--+-XMLChar----+-+     
            | '-charEscape-'            '-charEscape-' |     
            '-| CharClassEscape |----------------------'     

CharClassEscape

|--+-.----------------+-----------------------------------------|
   +-charEscape-------+   
   +-multiCharEscape--+   
   +-\nonZeroDigit----+   
   +-\p{IsblockName}--+   
   +-\P{IsblockName}--+   
   +-\p{charProperty}-+   
   '-\P{charProperty}-'

Notes:

1. The syntax diagrams represent the content of a string literal. A regular expression cannot include whitespace characters, except where a whitespace character is itself a pattern character to be matched. Spaces and gaps between syntax elements in the diagrams do not indicate that any form of whitespace is allowed.

RegularExpression

A regular expression contains one or more branches. Branches are separated by pipes (|), indicating that each branch is an alternative pattern.

pipeChar: A pipe character (|) separates alternative branches in a regular expression.
Branch: A branch consists of zero or more atoms, with each atom allowing an optional quantifier.
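
As a rough illustration of alternative branches, the following sketch uses Python's re module as a stand-in for the XQuery functions. The helper name matches_whole is hypothetical, and Python's engine is not the DB2 engine. Only the alternation behavior described above is being shown.

```python
import re

def matches_whole(pattern, text):
    """Return True if the whole of text matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# Two branches separated by a pipe: either alternative can match.
print(matches_whole("cat|dog", "cat"))   # True
print(matches_whole("cat|dog", "dog"))   # True
print(matches_whole("cat|dog", "cow"))   # False

# A branch can consist of zero atoms, so "a(b|)c" also accepts "ac".
print(matches_whole("a(b|)c", "abc"))    # True
print(matches_whole("a(b|)c", "ac"))     # True
```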

Atom

An atom is either a normal character, a character class expression, a character class escape, or a parenthesized regular expression.

normalCharacter: Any valid XML character that is not one of the metacharacters that are listed in Table 1.
^: When used at the beginning of a branch, the caret (^) indicates that the pattern must match from the beginning of the string.
$: When used at the end of a branch, the dollar sign ($) indicates that the match must end at the end of the string.
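
The effect of the two anchors can be sketched with Python's re module, used here only as a stand-in for the XQuery functions. The helper name found is hypothetical, and the sample strings avoid newline characters because Python treats a trailing newline specially.

```python
import re

def found(pattern, text):
    """Return True if pattern matches somewhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

print(found("def", "abcdefghi"))    # True: unanchored, matches in the middle
print(found("^def", "abcdefghi"))   # False: the match must start the string
print(found("^abc", "abcdefghi"))   # True
print(found("ghi$", "abcdefghi"))   # True: the match ends the string
print(found("abc$", "abcdefghi"))   # False
```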

Quantifier

The quantifier specifies the repetition of an atom in a regular expression. By default, a quantifier matches as much of the target string as possible, using what is referred to as a greedy algorithm. For example, the regular expression 'A.*A' matches the entire string 'ABACADA' because the substring between the required outer 'A' characters satisfies the requirement for any character any number of times.

The default greedy algorithm can be changed by specifying the question mark (?) character after the quantifier. The question mark specifies that the pattern matching uses a reluctant algorithm, which scans the target string from left to right and matches the shortest substring that satisfies the regular expression. For example, the regular expression 'A.*?A' matches the substrings 'ABA' and 'ADA' instead of matching the entire string 'ABACADA'. Characters of a substring that matches a regular expression by using the reluctant algorithm are not considered for further matches, which is why 'ACA' is not considered a match in the previous example. The reluctant algorithm is most useful with the fn:replace function because it processes matches and replacements from left to right.

For example, if you use the greedy algorithm in the function fn:replace("nonsensical","n(.*)s","mus") to replace the string of characters starting with "n" and ending with "s" with the string "mus", the returned value is 'musical'. The original string included substrings "nons" and "ns", which also matched the pattern scanning left to right for the next match, but the greedy algorithm did not operate on these matches because it found a longer enclosing match.

The result is different if you use the reluctant algorithm on the same string in the function fn:replace("nonsensical","n(.*?)s","mus"). The returned value is "musemusical". In this case, two replacements occurred within the string. The first match replaced "nons" with "mus", and the second match replaced "ns" with "mus".

As another example, if you use the greedy algorithm to replace the character A that encloses any number of characters with the character X that encloses the same characters in the function fn:replace("AbrAcAdAbrA","A(.*)A","X$1X"), the returned value is "XbrAcAdAbrX". The original string included substrings "AbrA" and "AdA", which also matched the pattern when scanning left to right for the next match, but the greedy algorithm did not operate on these matches because it found a longer enclosing match.

The result is different if you use the reluctant algorithm on the same string in the function fn:replace("AbrAcAdAbrA","A(.*?)A","X$1X"). The returned value is "XbrXcXdXbrA". In this case, two replacements occurred within the string: the first on "AbrA", and the second on "AdA". The final "A" in the string was not replaced because the reluctant algorithm used all of the preceding "A" characters for other matches within the string. Other substrings that start and end with the character "A", such as "AcA", "AcAdA", "AdAbrA", and "AbrA", within the original string are not considered because the reluctant algorithm considers the characters to be already used after they participate in a match to the pattern.
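
The greedy and reluctant examples above can be reproduced outside the database with Python's re module, which is assumed here to be a reasonable stand-in for this subset of the syntax. It is not the DB2 engine. One spelling difference applies: Python writes the replacement group reference as \1 where fn:replace uses $1.

```python
import re

# Greedy versus reluctant matching of the sample string from the text.
print(re.findall(r"A.*A", "ABACADA"))     # ['ABACADA']
print(re.findall(r"A.*?A", "ABACADA"))    # ['ABA', 'ADA']

# Counterparts of the fn:replace calls shown above.
print(re.sub(r"n(.*)s", "mus", "nonsensical"))       # musical
print(re.sub(r"n(.*?)s", "mus", "nonsensical"))      # musemusical
print(re.sub(r"A(.*)A", r"X\1X", "AbrAcAdAbrA"))     # XbrAcAdAbrX
print(re.sub(r"A(.*?)A", r"X\1X", "AbrAcAdAbrA"))    # XbrXcXdXbrA
```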

*

Matches the atom zero or more times. Equivalent to the quantifier {0, }.

+

Matches the atom one or more times. Equivalent to the quantifier {1, }.

?

Matches the atom zero or one times. Equivalent to the quantifier {0, 1}. When following another quantifier, indicates use of the reluctant algorithm instead of the greedy algorithm.

min

Matches the atom at least min number of times. min must be a non-negative integer.

{min} matches the atom exactly min times.
{min, } matches the atom at least min times.

max

Matches the atom not more than max number of times. max must be a non-negative integer greater than or equal to min.

{0, max} matches the atom not more than max times.
{0, 0} matches only an empty string.
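
The quantifier equivalences can be checked with a small sketch in Python's re module, used as a stand-in for the XQuery functions. The helpers same_language and accepted are hypothetical names, and the comparison covers only the listed sample strings. The brace forms are written without spaces, because the regular expression string itself cannot contain incidental whitespace.

```python
import re

samples = ["", "a", "aa", "aaa", "aaaa", "b"]

def same_language(p1, p2):
    """True if both patterns accept exactly the same sample strings."""
    return all(
        (re.fullmatch(p1, s) is None) == (re.fullmatch(p2, s) is None)
        for s in samples
    )

def accepted(pattern):
    """List the sample strings that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

# Shorthand quantifiers and their brace forms.
print(same_language("a*", "a{0,}"))    # True
print(same_language("a+", "a{1,}"))    # True
print(same_language("a?", "a{0,1}"))   # True

# Exact, open-ended, and bounded repetition.
print(accepted("a{2}"))      # ['aa']
print(accepted("a{2,}"))     # ['aa', 'aaa', 'aaaa']
print(accepted("a{0,2}"))    # ['', 'a', 'aa']
print(accepted("a{0,0}"))    # ['']
```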

CharGroup

^

Indicates the complement of the set of characters that are defined by the rest of the CharGroup.

dashChar

The dash character (-) separates two characters that define the outer characters in a range of characters. A character range of the form s-e is the set of UCS2 code points that are greater than or equal to s and less than or equal to e such that:

s is not the backslash character (\)
If s is the first character in a CharGroup, it is not the caret character (^)
e is not the backslash character (\) or the opening bracket character ([)
The code point of e is greater than the code point of s

XMLCharIncludeDash

A single character from the set of valid XML characters, excluding the backslash (\) and brackets ([]), but including the dash (-). The dash is valid as a character only at the beginning or the end of a CharGroup. The caret (^) at the beginning of a CharGroup indicates the complement of the group. Anywhere else in the group, the caret just matches the caret character. XMLCharIncludeDash can include any character that is matched by the regular expression [^\#x5B#x5D].

XMLChar

A single character from the set of valid XML characters, excluding the backslash (\), brackets ([]), and the dash (-). The dash is valid as a character only at the beginning or the end of a CharGroup. The caret (^) at the beginning of a CharGroup indicates the complement of the group. Anywhere else in the group, the caret just matches the caret character. XMLChar can include any character that is matched by the regular expression [^\#x2D#x5B#x5D].
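
The CharGroup rules for ranges, the complement caret, and the position of a literal dash can be sketched with Python's re module, assumed here to treat these simple groups the same way. The helper name accepts is hypothetical.

```python
import re

def accepts(pattern, ch):
    """Return True if the single character ch matches the character group."""
    return re.fullmatch(pattern, ch) is not None

print(accepts("[a-f]", "c"))      # True: inside the range a through f
print(accepts("[a-f]", "g"))      # False: outside the range
print(accepts("[^a-f]", "g"))     # True: leading caret complements the group
print(accepts("[-af]", "-"))      # True: a dash at the beginning is literal
print(accepts("[af-]", "-"))      # True: a dash at the end is literal
print(accepts("[a^f]", "^"))      # True: a caret that is not first is literal
print(accepts("[a-f0-9]", "7"))   # True: a group can hold several ranges
```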

charEscape

A backslash followed by a single metacharacter, newline character, return character, or tab character. You must escape the characters that are in Table 1 in a regular expression to match them.

Table 1. Valid metacharacter escapes
Character escape	Character represented	Description
`\n`	#x0A	Newline
`\r`	#x0D	Return
`\t`	#x09	Tab
`\\`	\	Backslash
`\|`	|	Pipe
`\.`	.	Period
`\-`	-	Dash
`\^`	^	Caret
`\?`	?	Question mark
`\$`	$	Dollar sign
`\*`	*	Asterisk
`\+`	+	Plus sign
`\{`	{	Opening curly brace
`\}`	}	Closing curly brace
`\(`	(	Opening parenthesis
`\)`	)	Closing parenthesis
`\[`	[	Opening bracket
`\]`	]	Closing bracket
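
When arbitrary text must be matched literally, each character in Table 1 needs its escape. The following Python sketch shows one way to build such a pattern. The function escape_literal is a hypothetical helper, not part of DB2 or XQuery, and the final checks use Python's re module, which is assumed to accept the same escapes.

```python
import re

# Metacharacters from Table 1 that are escaped with a leading backslash.
METACHARACTERS = set("\\|.-^?$*+{}()[]")
# Whitespace characters from Table 1 and their escape spellings.
WHITESPACE_ESCAPES = {"\n": "\\n", "\r": "\\r", "\t": "\\t"}

def escape_literal(text):
    """Return a pattern that matches text literally (hypothetical helper)."""
    parts = []
    for ch in text:
        if ch in METACHARACTERS:
            parts.append("\\" + ch)
        elif ch in WHITESPACE_ESCAPES:
            parts.append(WHITESPACE_ESCAPES[ch])
        else:
            parts.append(ch)
    return "".join(parts)

print(escape_literal("a.b"))      # a\.b
print(escape_literal("(x|y)"))    # \(x\|y\)

pattern = escape_literal("1+1=2?")
print(re.fullmatch(pattern, "1+1=2?") is not None)   # True
print(re.fullmatch(pattern, "11=2") is not None)     # False
```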

CharClassEscape

.

The period character (.) matches all characters except newline and return characters. The period character is equivalent to the expression [^\n\r].
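
A short Python sketch of this equivalence follows. It spells out the class [^\n\r] explicitly, because Python's own period excludes only the newline character, which is one place where the stand-in engine differs from the behavior described here.

```python
import re

# The expression that the text gives as the equivalent of the period.
PERIOD_EQUIVALENT = r"[^\n\r]"

print(re.fullmatch(PERIOD_EQUIVALENT, "x") is not None)    # True
print(re.fullmatch(PERIOD_EQUIVALENT, "\n") is not None)   # False
print(re.fullmatch(PERIOD_EQUIVALENT, "\r") is not None)   # False

# Python's built-in period, by contrast, accepts the return character.
print(re.fullmatch(r".", "\r") is not None)                # True
```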

\nonZeroDigit

Specifies a back reference that matches the string that was matched by a subexpression, which is surrounded by parentheses, in the nonZeroDigit position in the regular expression. nonZeroDigit must be between 1 and 9. The first 9 subexpressions can be referenced.

Note: For future upward compatibility, if a back reference is followed by a digit character, enclose the back reference in parentheses. For example, a back reference to the first subexpression that is followed by the digit 3 should be expressed as (\1)3 instead of \13 even though both currently produce the same result.
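
Back references can be sketched with Python's re module, which is assumed to handle \1 the same way for these simple cases. The helper name matches_whole is hypothetical.

```python
import re

def matches_whole(pattern, text):
    """Return True if the whole of text matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# \1 must repeat exactly what the first parenthesized subexpression matched.
print(matches_whole(r"(a|b)\1", "aa"))       # True
print(matches_whole(r"(a|b)\1", "bb"))       # True
print(matches_whole(r"(a|b)\1", "ab"))       # False

# Parenthesized back reference followed by the digit 3, as the note recommends.
print(matches_whole(r"(a|b)(\1)3", "aa3"))   # True
print(matches_whole(r"(a|b)(\1)3", "ab3"))   # False
```

Python's engine reads the unparenthesized form \13 as a reference to group 13 and rejects the pattern when no such group exists, which illustrates why the parenthesized spelling is the more portable choice.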

\P{IsblockName}

Specifies the complement of a range of Unicode code points. The range is identified by blockName, as listed in XML Schema Part 2: Datatypes Second Edition.

\p{IsblockName}

Specifies a character in a specific range of Unicode code points. The range is identified by blockName, as listed in XML Schema Part 2: Datatypes Second Edition.
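
A block is a contiguous range of code points, so membership reduces to a range check. The following Python sketch assumes a small hand-copied subset of block ranges. The complete list is in XML Schema Part 2: Datatypes Second Edition, and the names BLOCKS and in_block are hypothetical.

```python
# Illustrative subset only: block name -> (first code point, last code point).
BLOCKS = {
    "BasicLatin": (0x0000, 0x007F),
    "Greek": (0x0370, 0x03FF),
    "Cyrillic": (0x0400, 0x04FF),
}

def in_block(ch, block_name, complement=False):
    """Sketch of \\p{IsblockName}, or \\P{IsblockName} when complement is True."""
    low, high = BLOCKS[block_name]
    inside = low <= ord(ch) <= high
    return inside != complement

print(in_block("A", "BasicLatin"))              # True
print(in_block("\u03b1", "Greek"))              # True: Greek small letter alpha
print(in_block("A", "Greek"))                   # False
print(in_block("A", "Greek", complement=True))  # True
```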

charEscape

A backslash followed by a single metacharacter, newline character, return character, or tab character. You must escape the characters that are in Table 1 in a regular expression to match them.

multiCharEscape

A backslash followed by a character that identifies a commonly used set of characters. The multi-character escapes are listed in Table 2.

Table 2. Multi-character escapes
Multi-character escape	Equivalent regular expression	Description
`\s`	`[#x20\t\n\r]`	Space, tab, newline, or return character.
`\S`	`[^\s]`	Any character except a space, tab, newline, or return character.
`\i`	none	The set of characters allowed as the first character in an XML name.
`\I`	`[^\i]`	Not in the set of characters allowed as the first character in an XML name.
`\c`	none	The set of characters allowed in an XML name.
`\C`	`[^\c]`	Not in the set of characters allowed in an XML name.
`\d`	`\p{Nd}`	A decimal digit.
`\D`	`[^\d]`	Not a decimal digit.
`\w`	`[#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}]`	A word character, which includes the following `charProperty` categories: letters, marks, symbols, and numbers.
`\W`	`[^\w]`	A non-word character, which includes the following `charProperty` categories: punctuation, separators, and other.
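
The definitions in Table 2 can be sketched as single-character predicates in Python. The function names are hypothetical, and the sketch follows the table's definitions instead of Python's own \s, \d, and \w, which cover slightly different sets (for example, Python's \w accepts the underscore). The \i and \c escapes are omitted because the table gives no equivalent expression for them.

```python
import unicodedata

def is_space(ch):
    """\\s per Table 2: space, tab, newline, or return."""
    return ch in " \t\n\r"

def is_digit(ch):
    """\\d per Table 2: a character in category Nd."""
    return unicodedata.category(ch) == "Nd"

def is_word(ch):
    """\\w per Table 2: any character not in categories P, Z, or C."""
    return unicodedata.category(ch)[0] not in "PZC"

print(is_space("\t"), is_space("\f"))        # True False
print(is_digit("7"), is_digit("\u0663"))     # True True (Arabic-Indic digit three)
print(is_word("a"), is_word("+"))            # True True (letter, math symbol)
print(is_word("_"), is_word(" "))            # False False (punctuation, separator)
```

The complemented escapes \S, \D, and \W are the negations of these predicates.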

\p{charProperty}

Specifies a character in a category. The categories are listed in Table 3.

\P{charProperty}

Specifies the complement of a character category. The categories are listed in Table 3.

Table 3. Supported values of charProperty
`charProperty`	Category	Description
L	Letters	All letters
Lu	Letters	Uppercase
Ll	Letters	Lowercase
Lt	Letters	Title case
Lm	Letters	Modifier
Lo	Letters	Other
M	Marks	All marks
Mn	Marks	Nonspacing
Mc	Marks	Spacing combining
Me	Marks	Enclosing
N	Numbers	All numbers
Nd	Numbers	Decimal digit
Nl	Numbers	Letter
No	Numbers	Other
P	Punctuation	All punctuation
Pc	Punctuation	Connector
Pd	Punctuation	Dash
Ps	Punctuation	Open
Pe	Punctuation	Close
Pi	Punctuation	Initial quotation mark (can behave like Ps or Pe depending on usage)
Pf	Punctuation	Final quotation mark (can behave like Ps or Pe depending on usage)
Po	Punctuation	Other
Z	Separators	All separators
Zs	Separators	Space
Zl	Separators	Line
Zp	Separators	Paragraph
S	Symbols	All symbols
Sm	Symbols	Math
Sc	Symbols	Currency
Sk	Symbols	Modifier
So	Symbols	Other
C	Other	All others
Cc	Other	Control
Cf	Other	Format
Co	Other	Private use
Cn	Other	Not assigned
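
Each one-letter value in Table 3 covers all of the two-letter values that begin with that letter. The following Python sketch illustrates the relationship by using the standard unicodedata module, which reports a two-letter category for every character. The function has_property is a hypothetical helper, and Python's re module itself does not accept the \p{...} syntax.

```python
import unicodedata

def has_property(ch, char_property, complement=False):
    # Sketch of \p{charProperty}; complement=True sketches \P{charProperty}.
    category = unicodedata.category(ch)   # two-letter code such as "Lu"
    inside = category.startswith(char_property)
    return inside != complement

print(has_property("A", "Lu"))                    # True: uppercase letter
print(has_property("A", "L"))                     # True: L covers Lu
print(has_property("A", "Ll"))                    # False
print(has_property("5", "Nd"))                    # True: decimal digit
print(has_property("$", "Sc"))                    # True: currency symbol
print(has_property(" ", "Zs"))                    # True: space separator
print(has_property("a", "Lu", complement=True))   # True: not uppercase
```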

Note: Regular expressions are matched using a binary comparison. The default collation is not used.