POSIX Regular Expression Syntax and Examples

Regular expressions (often referred to simply as "regex") can be much more complex than expressions that use the wildcard characters which were discussed in the previous section. Unlike wildcards, regular expressions will match character sequences containing the patterns that they specify regardless of where that pattern appears in a word. As explained later in this section, you can use the anchor symbols '^' (beginning of word) and '$' (end of word) to restrict where in a word a regular expression will be matched, or to restrict that match to entire words by specifying both anchor symbols.
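
The effect of the anchor symbols can be sketched in a few lines of Python. This is only an illustration, not the Watson Explorer Engine implementation: it assumes Python's re module, which is not a POSIX engine but behaves the same way for these basic patterns, and it treats each string in the list as one term. The helper name matching_words is hypothetical.

```python
import re

def matching_words(pattern, candidates):
    """Return the candidates in which `pattern` matches anywhere."""
    regex = re.compile(pattern)
    return [word for word in candidates if regex.search(word)]

# Each string stands in for a single term (an assumption of this sketch).
words = ["farm", "farmer", "infirm", "conform", "former", "form"]

# No anchors: the pattern may match anywhere inside a word.
unanchored = matching_words(r"f.rm", words)

# Leading '^': the match must start at the beginning of the word.
starts_with = matching_words(r"^f.rm", words)

# Both '^' and '$': the pattern must account for the entire word.
whole_word = matching_words(r"^f.rm$", words)

print("unanchored: ", unanchored)
print("starts with:", starts_with)
print("whole word: ", whole_word)
```

With no anchors every word in the list matches, the leading '^' drops 'infirm' and 'conform', and adding '$' leaves only 'farm' and 'form'.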

Regular expressions assign special meaning to various characters, which are often referred to as metacharacters:

period, dot, or full-stop (.) - matches any single-width ASCII character in an expression, with the exception of line break characters. To match multi-byte characters with a single period, you must use Perl-compatible regular expressions, as discussed in Perl Compatible Regular Expression Syntax.
Because Watson Explorer Engine's regular expression support is term-oriented, the '.' also does not match the space or tab characters by default, because these are word-breaking characters. For example, the regular expression 'f.rm' will match any words containing character sequences such as 'farm', 'firm', and 'form', including 'farmer', 'firmament', and 'conform' - any word that contains a sequence of characters consisting of an 'f', followed by any other character, followed by the characters 'rm'.

Tip: The '.' symbol is the equivalent of the '?' character in a wildcard expression. The '.*' sequence is the equivalent of the '*' in a wildcard expression.
asterisk or star (*) - matches the preceding token zero or more times. For example, the regular expression 'to*' would match any word containing the letter 't', including 'it', 'to', and 'too', because the preceding token is the single character 'o', which can appear zero times in a matching expression. The regular expression 'f[aio]*t' would match the words 'fat', 'fit', 'fait', 'fiat', and 'foot' because the preceding token is the character class consisting of any of 'a', 'i', or 'o'. Because that class can also appear zero times, the same expression matches any word containing the sequence 'ft'.
plus sign (+) - matches the preceding token one or more times. In contrast to the example given in the previous bullet, the regular expression 'to+' would only match words containing the character sequences 'to' and 'too', because the preceding token is the single character 'o', which must appear at least once in a matching expression. The regular expression 'f[aio]+t' would match words containing the character sequences 'fit', 'fat', 'fait', 'fiat', and 'foot' because the preceding token is the character class consisting of any of 'a', 'i', or 'o', and at least one character from that character set must be present to match the regular expression.
question mark (?) - identifies the preceding character as being optional. For example, the regular expression 'too?' would match words containing the character sequences 'to' and 'too'.
vertical bar or pipe (|) - separates alternatives, one of which must be matched, much like a logical OR statement. For example, the regular expression 'fa|i|ot' matches words containing any of the character sequences 'fa', 'i', or 'ot', such as 'fat', 'fit', and 'hot', because the '|' symbol applies to the entire sequence on each side of it rather than to the single adjacent characters. It does not mean 'f', followed by 'a', 'i', or 'o', followed by 't'. For that reason, any portion of a regular expression that uses the '|' symbol is often enclosed in parentheses to make explicit the tokens to which the '|' applies. (See the next bullet for an example.)
open and close round bracket or parenthesis ('(' and ')') - groups multiple tokens together to disambiguate or simplify references to them. For example, the regular expression 'f(a|i|o)t' matches words containing the character sequences 'fat' or 'fit' but not the word 'fa', because matching sequences must now consist of three characters where the middle character has been restricted to being one of the letters 'a or i or o'.
open square bracket ([) and close square bracket (]) - enclose specific characters or a range of characters to be matched. The characters enclosed inside square brackets are known as a character class. For example, the regular expression 'f[io]rm' will match words containing the character sequences 'firm' and 'form', but will not match words that contain only other sequences beginning with 'f' and ending with 'rm', such as 'farm'. A character class only matches a single character unless it is followed by another character that has special meaning in a regular expression.
caret (^) - the caret has two different meanings in a regular expression, depending on where it appears:
- As the first character in a character class, a caret negates the characters in that character class. For example, the regular expression 'f[^io]rm' will match any word containing a sequence of characters beginning with 'f' and ending with 'rm', except where either 'i' or 'o' is the second character. It will therefore match words containing the character sequence 'farm', but not words containing the sequences 'firm' or 'form'.
- As the first character in a regular expression, a caret identifies the beginning of a term. In this context, the caret is often referred to as an anchor character.
dollar sign ($) - as the last character in a regular expression, a dollar sign identifies the end of a term. In this context, the dollar sign is often referred to as an anchor character.
Note: Anchor characters are very important if you want to restrict regular expression matches to entire words. For example, the regular expression 'f[aio]rm' will match words containing any of the strings 'farm', 'firm', and 'form', including words such as 'farmer', 'infirm', 'former', and 'conform', while the regular expression '^f[aio]rm' will only match the words 'farmer' and 'former' from these examples, and the regular expression '^f[aio]rm$' will only match the words 'farm', 'firm', and 'form'.
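backslash (\) - used to invoke the actual character value for a metacharacter in a regular expression. For example, in the regular expression 'Comin?' the question mark makes the preceding 'n' optional, so the expression matches any word containing the sequence 'Comi', including the words 'Coming' and 'Comint' and the question 'Comin?'. The regular expression 'Comin\?' will only match the question 'Comin?', because the escaped question mark is treated as a literal character.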
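
The metacharacters in the preceding list can be tried out with a short Python sketch. As an assumption of this example, it uses Python's re module rather than Watson Explorer Engine or a POSIX library; the two agree on the basic constructs shown here, but results in the engine may differ. Each string in a candidate list stands in for a single term, most patterns are anchored with '^' and '$' so that only whole words match, and the helper name matching_words is hypothetical.

```python
import re

def matching_words(pattern, candidates):
    """Return the candidates in which `pattern` matches anywhere."""
    regex = re.compile(pattern)
    return [word for word in candidates if regex.search(word)]

# (pattern, candidate terms) pairs, one per construct in the list above.
cases = [
    (r"^to*$", ["t", "to", "too", "it"]),            # '*': zero or more 'o'
    (r"^to+$", ["t", "to", "too", "it"]),            # '+': one or more 'o'
    (r"^too?$", ["t", "to", "too", "it"]),           # '?': final 'o' optional
    (r"^f[aio]*t$", ["ft", "fat", "fit", "fait", "foot", "fut"]),
    (r"^f[aio]+t$", ["ft", "fat", "fit", "fait", "foot", "fut"]),
    (r"^f[^io]rm$", ["farm", "firm", "form"]),       # negated character class
    (r"fa|i|ot", ["fa", "fat", "fit", "hot", "fun"]),      # 'fa' or 'i' or 'ot'
    (r"^f(a|i|o)t$", ["fa", "fat", "fit", "hot", "fun"]),  # grouped alternation
    (r"Comin?", ["Coming", "Comic", "Comin?"]),      # 'n' is optional
    (r"Comin\?", ["Coming", "Comic", "Comin?"]),     # literal question mark
]

results = {pattern: matching_words(pattern, terms) for pattern, terms in cases}

for pattern, matched in results.items():
    print(f"{pattern:<14} {matched}")
```

Comparing the 'fa|i|ot' and '^f(a|i|o)t$' rows shows why alternation is usually wrapped in parentheses: without them, 'hot' and 'fa' also match.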

Note that when configuring an HTTP Referrer Whitelist, use only the regular expression syntax discussed in the preceding list. Regular expression syntax also supports a number of special character sequences to match non-printable characters, special character classes such as digits and alphabetic characters, and so on, but discussing complete regular expression syntax is outside the scope of the Watson Explorer Engine documentation. For a complete discussion of regular expressions, see the Regular Expressions Information site.