Regular expressions (often referred to simply as "regex") can be much more complex than
expressions that use the wildcard characters which were discussed in the previous section.
Unlike wildcards, regular expressions will match character sequences containing the pattern
that they specify regardless of where that pattern appears in a word. As explained later in
this section, you can use the anchor symbols '^' (beginning of word) and '$'
(end of word) to restrict where in a word a regular expression will be matched, or to restrict
that match to entire words by specifying both anchor symbols.
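The effect of the anchor symbols can be sketched in Python. This is only an illustration: it uses Python's `re` module as a stand-in for Watson Explorer Engine, and the helper name `matching_words` and the whitespace-based word splitting are assumptions made for this example, not part of the product.

```python
import re

def matching_words(pattern, text):
    """Return the words in text in which the pattern matches.

    Assumption: words are split on whitespace to imitate term-oriented
    matching; the engine's own tokenizer may behave differently.
    """
    regex = re.compile(pattern)
    return [word for word in text.split() if regex.search(word)]

words = "arm army farm farmer alarm"

unanchored = matching_words("arm", words)    # 'arm' may appear anywhere in the word
at_start = matching_words("^arm", words)     # 'arm' must begin the word
at_end = matching_words("arm$", words)       # 'arm' must end the word
whole_word = matching_words("^arm$", words)  # 'arm' must be the entire word

for label, result in [("unanchored", unanchored), ("^arm", at_start),
                      ("arm$", at_end), ("^arm$", whole_word)]:
    print(label, result)
```

Without anchors every word in the sample matches, while specifying both anchors leaves only the word 'arm' itself.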
Regular expressions assign special meaning to various characters, which are often referred to
as metacharacters:
- period, dot, or full-stop (.) - matches any single-width ASCII character in an
expression, with the exception of line break characters. To match multi-byte characters with
a single period, you must use Perl-compatible regular expressions, as discussed in Perl Compatible Regular Expression Syntax.
Because Watson Explorer Engine's regular expression support
is term-oriented, by default the '.' also does not match the space or tab characters, which
are word-breaking characters. For example, the regular expression 'f.rm' will match any
words containing character sequences such as 'farm', 'firm', and 'form', including
'farmer', 'firmament', and 'conform' - any word that contains a sequence of characters
consisting of an 'f', followed by any other character, followed by the characters
'rm'.
Tip: The '.' symbol is the equivalent of the '?' character in a wildcard
expression. The '.*' sequence is the equivalent of the '*' in a wildcard
expression.
- asterisk or star (*) - matches the preceding token zero or more times. For
example, the regular expression 'to*' would match any word containing the letter 't',
such as 'it', 'to', and 'too', because the preceding token is the single
character 'o', which can appear zero times in a matching expression. The regular expression
'f[aio]*t' would match the words 'fat', 'fit', 'fait', 'fiat', and 'foot', as well as words
that contain 'ft', such as 'after', because the preceding token is the character class
consisting of any of 'a', 'i', or 'o', which can appear any number of times, including zero.
- plus sign (+) - matches the preceding token one or more times. In contrast to the
example given in the previous bullet, the regular expression 'to+' would only match
words containing the character sequences 'to' and 'too', because the preceding token is the
single character 'o', which must appear at least once in a matching expression. The regular
expression 'f[aio]+t' would match words containing the character sequences 'fit',
'fat', 'fait', 'fiat', and 'foot' because the preceding token is the character class
consisting of any of 'a', 'i', or 'o', and at least one character from that character class
must be present to match the regular expression.
- question mark (?) - identifies the preceding character as being optional. For
example, the regular expression 'too?' would match words containing the character
sequences 'to' and 'too'.
- vertical bar or pipe (|) - separates alternatives, one of which must be matched, much
like a logical OR statement. For example, the regular expression 'fa|i|ot' matches
words containing any of the character sequences 'fa', 'i', or 'ot', and therefore matches
words such as 'fat', 'fit', and 'lot'. It does not mean an 'f', followed by one of 'a', 'i',
or 'o', followed by a 't', because the '|' symbol applies to everything on either side of it.
For this reason, any portion of a regular expression that uses the '|' symbol is often
enclosed in parentheses to make clear the tokens to which the '|' applies. (See the next
bullet for an example.)
- open and close round bracket or parenthesis ('(' and ')') - groups
multiple tokens together to disambiguate or simplify references to them. For example, the
regular expression 'f(a|i|o)t' matches words containing the character sequences
'fat', 'fit', or 'fot', but not the word 'fa', because matching sequences must now consist
of three characters where the middle character has been restricted to being one of the
letters 'a', 'i', or 'o'.
- open square bracket ([) and close square bracket (]) - enclose specific
characters or a range of characters to be matched. The characters enclosed inside square
brackets are known as a character class. For example, the regular expression
'f[io]rm' will match words containing the character sequences 'firm' and 'form',
but will not match words that contain only other sequences beginning with 'f' and ending
with 'rm', such as 'farm'. A character class only matches a single character unless it is
followed by a metacharacter, such as '*' or '+', that changes how many times it can be matched.
- caret (^) - the caret has two different meanings in a regular expression,
depending on where it appears:
- As the first character in a character class, a caret negates the characters in that
character class. For example, the regular expression 'f[^io]rm' will match any
word containing a sequence of characters beginning with 'f' and ending with 'rm', except
where either 'i' or 'o' is the second character. It will therefore match words
containing the character sequence 'farm', but not words containing the sequences 'firm'
or 'form'.
- As the first character in a regular expression, a caret identifies the beginning of a
term. In this context, the caret is often referred to as an anchor
character.
- dollar sign ($) - as the last character in a regular expression, a dollar sign
identifies the end of a term. In this context, the dollar sign is often referred to as an
anchor character.
Note: Anchor characters are very important if you
want to restrict regular expression matches to entire words. For example, the regular
expression 'f[aio]rm' will match words containing any of the strings 'farm',
'firm', and 'form', including words such as 'farmer', 'infirm', 'former', and 'conform',
while the regular expression '^f[aio]rm' will only match the words 'farmer' and
'former' from these examples, and the regular expression '^f[aio]rm$' will only
match the words 'farm', 'firm', and 'form'.
- backslash (\) - causes the metacharacter that follows it to be treated as a literal
character in a regular expression. For example, in the regular expression 'Comin?' the '?'
makes the 'n' optional, so the expression will match words such as 'Comic', 'Coming', and
'Comint' as well as the question 'Comin?'. The regular expression
'Comin\?' will only match the question 'Comin?'.
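The metacharacters in the preceding list can be tried out with a short Python sketch. It relies on Python's `re` module rather than Watson Explorer Engine, so it only approximates the engine's term-oriented behavior by testing one word at a time. The names `word_matches`, `EXAMPLES`, and `check_examples` are hypothetical helpers introduced for this illustration.

```python
import re

def word_matches(pattern, word):
    """Return True if the pattern matches anywhere inside the single word."""
    return re.search(pattern, word) is not None

# pattern -> (words expected to match, words expected not to match)
EXAMPLES = {
    "f.rm": (["farm", "conform"], ["frm"]),
    "to*": (["it", "to", "too"], ["of"]),
    "to+": (["to", "too"], ["it"]),
    "too?": (["to", "too"], ["it"]),
    "f[aio]+t": (["fat", "fiat", "foot"], ["ft"]),
    "fa|i|ot": (["fa", "fit", "lot"], ["feet"]),
    "f(a|i|o)t": (["fat", "fit"], ["fa", "feet"]),
    "f[^io]rm": (["farm"], ["firm", "form"]),
    "^f[aio]rm$": (["farm", "firm", "form"], ["farmer", "infirm"]),
    r"Comin\?": (["Comin?"], ["Coming"]),
}

def check_examples(examples):
    """Return the (pattern, word) pairs that did not behave as expected."""
    surprises = []
    for pattern, (should_match, should_not_match) in examples.items():
        for word in should_match:
            if not word_matches(pattern, word):
                surprises.append((pattern, word))
        for word in should_not_match:
            if word_matches(pattern, word):
                surprises.append((pattern, word))
    return surprises

surprises = check_examples(EXAMPLES)
print(surprises)  # an empty list means every example behaved as expected
```

Comparing the 'fa|i|ot' and 'f(a|i|o)t' entries shows why parentheses matter: without them the word 'lot' matches through the 'ot' alternative, while the grouped form requires all three characters in sequence.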
Note: When configuring an HTTP Referrer Whitelist, use only the regular expression syntax discussed in the
preceding list. Regular expression syntax in general also supports a number of special character
sequences to match non-printable characters, special character classes such as digits and
alphabetic characters, and so on. Discussing complete regular expression syntax is outside the
scope of the Watson Explorer Engine documentation. For a complete
discussion of regular expressions, see the Regular Expressions Information site.