Regular expressions
Regular expressions are used in the following XQuery functions: fn:matches, fn:replace, and fn:tokenize. DB2® XQuery regular expression support is based on the XML schema regular expression support as defined in the W3C Recommendation XML Schema Part 2: Datatypes Second Edition with extensions as defined by W3C Recommendation XQuery 1.0 and XPath 2.0 Functions and Operators.
Syntax
RegularExpression

                     .--------------------------.
   (1)               V                          |
>>-------| Branch |----+----------------------+-+--------------><
                       '-pipeChar--| Branch |-'

Branch

   .----------------------------------.
   V                                  |
|----+------------------------------+-+-------------------------|
     '-| Atom |-+-----------------+-'
                '--| Quantifier |-'

Atom

|--+-normalCharacter-------------+------------------------------|
   +-| CharClassExpression |-----+
   +-| CharClassEscape |---------+
   +-^---------------------------+
   +-$---------------------------+
   '-(--| RegularExpression |--)-'

Quantifier

|--+-*-------------------------+--+---+-------------------------|
   +-+-------------------------+  '-?-'
   +-?-------------------------+
   '-{--+-min-------------+--}-'
        '-min--,--+-----+-'
                  '-max-'

CharClassExpression

|--[--| CharGroup |--]------------------------------------------|

CharGroup

          .----------------------------------------------.
          V                                              |
|--+---+----+-XMLCharIncludeDash-----------------------+-+------|
   '-^-'    +-+-XMLChar----+--dashChar--+-XMLChar----+-+
            | '-charEscape-'            '-charEscape-' |
            '-| CharClassEscape |----------------------'

CharClassEscape

|--+-.----------------+-----------------------------------------|
   +-charEscape-------+
   +-multiCharEscape--+
   +-\nonZeroDigit----+
   +-\p{IsblockName}--+
   +-\P{IsblockName}--+
   +-\p{charProperty}-+
   '-\P{charProperty}-'
Notes:
- (1) The syntax diagram describes the content of a string literal. The literal cannot include whitespace characters, except where a whitespace character is itself intended as a pattern character. Do not read the spaces or line breaks between syntax elements in the diagram as allowing any form of whitespace.
RegularExpression
A regular expression contains one or more branches. Branches are separated by pipes (|), indicating that each branch is an alternative pattern.
- pipeChar
- A pipe character (|) separates alternative branches in a regular expression.
- Branch
- A branch consists of zero or more atoms, with each atom allowing an optional quantifier.
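As a rough illustration of how branches act as alternatives, the following sketch uses Python's re module as a stand-in for an XQuery processor. This is an assumption made only so that the example can be run: Python's regular expression dialect is not the XML Schema dialect that DB2 XQuery uses, although alternation behaves the same way for a pattern this simple. The helper name has_match and the sample strings are hypothetical.

```python
import re

# Two branches separated by the pipe character: either alternative may match.
PATTERN = "cat|dog"

def has_match(text):
    # re.search looks for the pattern anywhere in the string, which is
    # comparable to how fn:matches treats a pattern without ^ or $.
    return re.search(PATTERN, text) is not None

print(has_match("hotdog stand"))   # True: the second branch matches "dog"
print(has_match("concatenate"))    # True: the first branch matches "cat"
print(has_match("bird"))           # False: neither branch matches
```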
Atom
An atom is a normal character, a character class expression, a character class escape, a caret (^), a dollar sign ($), or a parenthesized regular expression.
- normalCharacter
- Any valid XML character that is not one of the metacharacters that are listed in Table 1.
- ^
- When used at the beginning of a branch, the caret (^) indicates that the pattern must match from the beginning of the string.
- $
- When used at the end of a branch, the dollar sign ($) indicates that the pattern must match up to the end of the string.
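The effect of the two anchors can be sketched as follows. Python's re module is used here only as a runnable stand-in (an assumption, because its dialect differs from the XML Schema dialect), and the sample strings contain no newline characters so that the anchors behave the same way in both dialects. The helper name has_match is hypothetical.

```python
import re

def has_match(pattern, text):
    # Unanchored search, comparable to fn:matches.
    return re.search(pattern, text) is not None

# ^ ties the match to the beginning of the string.
print(has_match("^abc", "abcdef"))   # True
print(has_match("^abc", "xabc"))     # False

# $ ties the match to the end of the string.
print(has_match("abc$", "xabc"))     # True
print(has_match("abc$", "abcx"))     # False

# Using both anchors requires the whole string to match.
print(has_match("^abc$", "abc"))     # True
print(has_match("^abc$", "abcabc"))  # False
```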
Quantifier
The quantifier specifies the repetition of an atom in a regular expression. By default, a quantifier matches as much of the target string as possible, using what is referred to as a greedy algorithm. For example, the regular expression 'A.*A' matches the entire string 'ABACADA' because the substring between the required outer 'A' characters satisfies the requirement for any character any number of times.
The default greedy algorithm can be changed by specifying the question mark (?) character after the quantifier. The question mark specifies that the pattern matching uses a reluctant algorithm, which scans the target string from left to right and matches the shortest substring that satisfies the regular expression. For example, the regular expression 'A.*?A' matches the substrings 'ABA' and 'ADA' instead of matching the entire string 'ABACADA'. Characters of a substring that matches a regular expression by using the reluctant algorithm are not considered for further matches. This is why 'ACA' is not considered a match in the previous example. The reluctant algorithm is most useful with the fn:replace function because it processes matches and replacements from left to right.
For example, if you use the greedy algorithm in the function fn:replace("nonsensical","n(.*)s","mus") to replace the string of characters starting with "n" and ending with "s" with the string "mus", the returned value is "musical". The original string included the substrings "nons" and "ns", which also matched the pattern when scanning left to right for the next match, but the greedy algorithm did not operate on these matches because it found a longer enclosing match.
The result is different if you use the reluctant algorithm on the same string in the function fn:replace("nonsensical","n(.*?)s","mus"). The returned value is "musemusical". In this case, two replacements occurred within the string. The first match replaced "nons" with "mus", and the second match replaced "ns" with "mus".
As another example, if you use the greedy algorithm to replace the character A that encloses any number of characters with the character X that encloses the same characters in the function fn:replace("AbrAcAdAbrA","A(.*)A","X$1X"), the returned value is "XbrAcAdAbrX". The original string included substrings "AbrA" and "AdA", which also matched the pattern when scanning left to right for the next match, but the greedy algorithm did not operate on these matches because it found a longer enclosing match.
The result is different if you use the reluctant algorithm on the same string in the function fn:replace("AbrAcAdAbrA","A(.*?)A","X$1X"). The returned value is "XbrXcXdXbrA". In this case, two replacements occurred within the string: the first on "AbrA", and the second on "AdA". The final "A" in the string was not replaced because the reluctant algorithm used all of the preceding "A" characters for other matches within the string. Other substrings that start and end with the character "A", such as "AcA", "AcAdA", "AdAbrA", and the trailing "AbrA", are not considered because the reluctant algorithm treats characters as already used after they participate in a match to the pattern.
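The four fn:replace results above can be reproduced with a short sketch. It uses Python's re.sub as a stand-in for fn:replace, which is an assumption made for the sake of a runnable example: the two functions belong to different regular expression dialects, but greedy and reluctant quantifiers behave alike for these patterns. Note that Python writes the captured subexpression as \1 in the replacement string where fn:replace uses $1. The variable names are hypothetical.

```python
import re

# Greedy: ".*" consumes as much as possible, so one long match is replaced.
greedy_word = re.sub(r"n(.*)s", "mus", "nonsensical")
# Reluctant: ".*?" stops at the first "s", so two shorter matches are replaced.
reluctant_word = re.sub(r"n(.*?)s", "mus", "nonsensical")

# Same comparison with the matched text carried into the replacement.
greedy_group = re.sub(r"A(.*)A", r"X\1X", "AbrAcAdAbrA")
reluctant_group = re.sub(r"A(.*?)A", r"X\1X", "AbrAcAdAbrA")

print(greedy_word)       # musical
print(reluctant_word)    # musemusical
print(greedy_group)      # XbrAcAdAbrX
print(reluctant_group)   # XbrXcXdXbrA

# Characters used by one reluctant match are not reused by the next one,
# which is why 'ACA' does not appear in the second list.
print(re.findall(r"A.*A", "ABACADA"))    # ['ABACADA']
print(re.findall(r"A.*?A", "ABACADA"))   # ['ABA', 'ADA']
```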
- *
- Matches the atom zero or more times. Equivalent to the quantifier {0,}.
- +
- Matches the atom one or more times. Equivalent to the quantifier {1,}.
- ?
- Matches the atom zero or one time. Equivalent to the quantifier {0,1}. When it follows another quantifier, it indicates use of the reluctant algorithm instead of the greedy algorithm.
- min
- Matches the atom at least min number of times. min must be a non-negative integer.
- {min} matches the atom exactly min times.
- {min,} matches the atom at least min times.
- max
- Matches the atom not more than max number of times. max must be a non-negative integer that is greater than or equal to min.
- {0,max} matches the atom not more than max times.
- {0,0} matches only an empty string.
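The quantifier forms listed above can be compared side by side with a small sketch. Python's re module stands in for the XQuery functions (an assumption; the dialects differ, but these brace quantifiers are written the same way in both). The helper name accepted and the sample list are hypothetical.

```python
import re

SAMPLES = ["", "a", "aa", "aaa", "aaaa"]

def accepted(pattern):
    # Return the samples that the pattern matches in full.
    return [s for s in SAMPLES if re.fullmatch(pattern, s) is not None]

print(accepted("a{2}"))     # ['aa']
print(accepted("a{2,}"))    # ['aa', 'aaa', 'aaaa']
print(accepted("a{0,2}"))   # ['', 'a', 'aa']
print(accepted("a{0,0}"))   # ['']

# The shorthand quantifiers agree with their brace forms on these samples.
print(accepted("a*") == accepted("a{0,}"))    # True
print(accepted("a+") == accepted("a{1,}"))    # True
print(accepted("a?") == accepted("a{0,1}"))   # True
```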
CharGroup
- ^
- Indicates the complement of the set of characters that are defined by the rest of the CharGroup.
- dashChar
- The dash character (-) separates two characters that define the outer characters in a range of characters. A character range of the form s-e is the set of UCS2 code points that are greater than or equal to s and less than or equal to e such that:
- s is not the backslash character (\)
- If s is the first character in a CharGroup, it is not the caret character (^)
- e is not the backslash character (\) or the opening bracket character ([)
- The code point of e is greater than the code point of s
- XMLCharIncludeDash
- A single character from the set of valid XML characters, excluding the backslash (\) and brackets ([]), but including the dash (-). The dash is valid as a character only at the beginning or the end of a CharGroup. The caret (^) at the beginning of a CharGroup indicates the complement of the group. Anywhere else in the group, the caret just matches the caret character. XMLCharIncludeDash can include any character that is matched by the regular expression [^\#x5B#x5D].
- XMLChar
- A single character from the set of valid XML characters, excluding the backslash (\), brackets ([]), and the dash (-). The dash is valid as a character only at the beginning or the end of a CharGroup. The caret (^) at the beginning of a CharGroup indicates the complement of the group. Anywhere else in the group, the caret just matches the caret character. XMLChar can include any character that is matched by the regular expression [^\#x2D#x5B#x5D].
- charEscape
- A backslash followed by a single metacharacter, newline character, return character, or tab character. To match a character that is listed in Table 1, you must escape it in the regular expression.
Table 1. Valid metacharacter escapes
Character escape | Character represented | Description |
---|---|---|
`\n` | #x0A | Newline |
`\r` | #x0D | Return |
`\t` | #x09 | Tab |
`\\` | `\` | Backslash |
`\\|` | `\|` | Pipe |
`\.` | `.` | Period |
`\-` | `-` | Dash |
`\^` | `^` | Caret |
`\?` | `?` | Question mark |
`\$` | `$` | Dollar sign |
`\*` | `*` | Asterisk |
`\+` | `+` | Plus sign |
`\{` | `{` | Opening curly brace |
`\}` | `}` | Closing curly brace |
`\(` | `(` | Opening parenthesis |
`\)` | `)` | Closing parenthesis |
`\[` | `[` | Opening bracket |
`\]` | `]` | Closing bracket |
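The character group rules and the escapes in Table 1 can be tried out with the following sketch. It relies on Python's re module as a stand-in (an assumption; only constructs that Python and the XML Schema dialect treat alike are used). The helper name full_match and the sample strings are hypothetical.

```python
import re

def full_match(pattern, text):
    # True only if the pattern matches the entire string.
    return re.fullmatch(pattern, text) is not None

# A range s-e covers every code point from s through e.
print(full_match("[a-f]", "c"))    # True
print(full_match("[a-f]", "g"))    # False

# A leading caret complements the group.
print(full_match("[^a-f]", "g"))   # True
# A caret anywhere else is an ordinary character.
print(full_match("[a^]", "^"))     # True

# A dash at the beginning or end of the group is an ordinary character.
print(full_match("[-ab]", "-"))    # True
print(full_match("[ab-]", "-"))    # True

# Escaped metacharacters match themselves literally.
print(full_match(r"\$9\.99", "$9.99"))   # True
print(full_match(r"\$9\.99", "$9x99"))   # False
# An unescaped period matches any character, so this pattern is looser.
print(full_match(r"\$9.99", "$9x99"))    # True
```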
CharClassEscape
- .
- The period character (.) matches all characters except newline and return characters. The period character is equivalent to the expression [^\n\r].
- \nonZeroDigit
- Specifies a back reference that matches the string that was matched by a subexpression, which is surrounded by parentheses, in the nonZeroDigit position in the regular expression. nonZeroDigit must be between 1 and 9. The first 9 subexpressions can be referenced.
Note: For future upward compatibility, if a back reference is followed by a digit character, enclose the back reference in parentheses. For example, a back reference to the first subexpression that is followed by the digit 3 should be expressed as (\1)3 instead of \13 even though both currently produce the same result.
- \P{IsblockName}
- Specifies the complement of a range of Unicode code points. The range is identified by blockName, as listed in XML Schema Part 2: Datatypes Second Edition.
- \p{IsblockName}
- Specifies a character in a specific range of Unicode code points. The range is identified by blockName, as listed in XML Schema Part 2: Datatypes Second Edition.
- charEscape
- A backslash followed by a single metacharacter, newline character, return character, or tab character. To match a character that is listed in Table 1, you must escape it in the regular expression.
- multiCharEscape
- A backslash followed by a character that identifies a commonly used set of characters. The multi-character escapes are listed in Table 2.
Table 2. Multi-character escapes
Multi-character escape | Equivalent regular expression | Description |
---|---|---|
\s | [#x20\t\n\r] | Space, tab, newline, or return character. |
\S | [^\s] | Any character except a space, tab, newline, or return character. |
\i | none | The set of characters allowed as the first character in an XML name. |
\I | [^\i] | Not in the set of characters allowed as the first character in an XML name. |
\c | none | The set of characters allowed in an XML name. |
\C | [^\c] | Not in the set of characters allowed in an XML name. |
\d | \p{Nd} | A decimal digit. |
\D | [^\d] | Not a decimal digit. |
\w | [#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}] | A word character, which includes the following charProperty categories: letters, marks, symbols, and numbers. |
\W | [^\w] | A non-word character, which includes the following charProperty categories: punctuation, separators, and other. |
- \p{charProperty}
- Specifies a character in a category. The categories are listed in Table 3.
- \P{charProperty}
- Specifies the complement of a character category. The categories are listed in Table 3.
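Back references and a few of the multi-character escapes described above can be sketched as follows, again with Python's re module as a stand-in (an assumption). The dialects differ in places: Python does not support \i, \c, or \p{...}, its \w also accepts the underscore, and it reads \13 as a reference to a thirteenth subexpression, so only the parenthesized form (\1)3 is shown. The sample strings are hypothetical.

```python
import re

# \1 must repeat exactly what the first parenthesized subexpression matched.
doubled = re.search(r"([a-z])\1", "letter")
print(doubled.group(0))   # tt

# A back reference followed by a digit, written in the parenthesized form.
print(re.fullmatch(r"(a)(\1)3", "aa3") is not None)   # True
print(re.fullmatch(r"(a)(\1)3", "ab3") is not None)   # False

# \d matches decimal digits.
print(re.findall(r"\d+", "order 66, item 7"))   # ['66', '7']

# Splitting on \s+ is comparable to passing that pattern to fn:tokenize.
print(re.split(r"\s+", "a b\tc\nd"))            # ['a', 'b', 'c', 'd']
```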
Table 3. charProperty categories
charProperty | Category | Description |
---|---|---|
L | Letters | All letters |
Lu | Letters | Uppercase |
Ll | Letters | Lowercase |
Lt | Letters | Title case |
Lm | Letters | Modifier |
Lo | Letters | Other |
M | Marks | All marks |
Mn | Marks | Nonspacing |
Mc | Marks | Spacing combining |
Me | Marks | Enclosing |
N | Numbers | All numbers |
Nd | Numbers | Decimal digit |
Nl | Numbers | Letter |
No | Numbers | Other |
P | Punctuation | All punctuation |
Pc | Punctuation | Connector |
Pd | Punctuation | Dash |
Ps | Punctuation | Open |
Pe | Punctuation | Close |
Pi | Punctuation | Initial quotation mark (can behave like Ps or Pe depending on usage) |
Pf | Punctuation | Final quotation mark (can behave like Ps or Pe depending on usage) |
Po | Punctuation | Other |
Z | Separators | All separators |
Zs | Separators | Space |
Zl | Separators | Line |
Zp | Separators | Paragraph |
S | Symbols | All symbols |
Sm | Symbols | Math |
Sc | Symbols | Currency |
Sk | Symbols | Modifier |
So | Symbols | Other |
C | Other | All others |
Cc | Other | Control |
Cf | Other | Format |
Co | Other | Private use |
Cn | Other | Not assigned |
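To get a feel for which characters fall into the categories in Table 3, the following sketch looks up Unicode general categories with Python's unicodedata module. This is only an approximation of \p{charProperty}: Python's re module has no \p escape, so has_property is a hypothetical helper written for this example, and the Unicode version bundled with Python might differ from the one that DB2 uses.

```python
import unicodedata

def has_property(ch, char_property):
    # Hypothetical helper that mimics \p{charProperty} for a single character.
    # A one-letter property such as "L" covers every subcategory ("Lu", "Ll", ...).
    return unicodedata.category(ch).startswith(char_property)

# Print the two-letter category of a few sample characters.
for ch in ["A", "a", "5", " ", "-", "(", ")", "+", "$", "_"]:
    print(repr(ch), unicodedata.category(ch))

print(has_property("A", "Lu"))   # True
print(has_property("A", "L"))    # True
print(has_property("5", "L"))    # False: "5" is in category Nd
```

The complement form \P{charProperty} corresponds to negating the result of the helper.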