IBM Streams 4.2

Operator MatchRegex

The MatchRegex operator matches a regular expression pattern over the sequence of input tuples to detect composite events.

A sequence of tuples flows into the MatchRegex operator; each tuple is called a simple event. The operator uses regular expressions and predicates to detect a pattern across a sequence of simple events. When the pattern is detected, the operator emits a tuple to signify that the pattern matched across one or more tuples; this output tuple is referred to as a complex (composite) event.

If a pattern involves two or more regular expressions in sequence, the operator detects partial matches when the first event occurs. If the subsequent tuples fit the pattern, the match is completed. A completed match occurs when each individual predicate is matched against an individual input tuple, and the entire regular expression matches a sequence of consecutive input tuples.

Standard regular expression semantics apply for the patterns that the MatchRegex operator uses. Matching is contiguous and non-greedy, and completed matches are non-overlapping. For more information, see the pattern parameter description.

Matching is partition-isolated if the partitionBy parameter is specified. There is no interference between matches on different partitions.

The MatchRegex operator allows aggregated assignments to output attributes. For example:

output
  Matches : symbol=symbol, seqNum=First(seqNum), 
  count=Count(), maxPrice=Max(price);

Aggregators are custom output functions. Other SPL operators, such as Aggregate or XMLParse, have similar output functions. Unlike those operators, the aggregators for the MatchRegex operator can be used not only in the output assignments, but also in the predicates parameter. Aggregators are operator-specific intrinsic functions in the sense that the code generator for the operator replaces them with incremental aggregations. The current value of the aggregation is updated one tuple at a time, which saves space because only the summary information is kept in memory. This behavior also saves time because the output can be computed quickly when a complete match is detected.

An aggregator in a predicate refers to the tuples in the partial match so far. That is to say, the match might not be complete yet, since it has not yet reached the end of the regular expression. An aggregator in an output assignment refers to the tuples in the completed match, for which an output tuple is being generated. For example: output Matches : seqNum=First(seqNum), maxPrice=Max(price).

The aggregators include or exclude the current tuple, depending on the circumstances. In output assignments, all aggregators include the current tuple because it is part of the match. In predicates, the aggregators Count, Delta, and IsMatch include the current tuple, because these functions are commonly used to achieve window-like behavior. In predicates, all other aggregators (such as Min, Max, or Sum) exclude the current tuple because these functions are commonly used to find changes in behavior, starting at the latest tuple.
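The include/exclude rule can be illustrated with a minimal sketch. The stream names, schema, file name, and the predicate name high are hypothetical; only the MatchRegex parameters and aggregators come from this page. In the predicate, Max(price) covers only the earlier tuples of the partial match, so the comparison asks whether the current tuple sets a new high. In the output clause, Max(price) and Count() cover the whole completed match, including its last tuple.

```spl
use com.ibm.streams.cep::MatchRegex;

composite NewHighs {
  graph
    // Hypothetical input: one CSV line per quote (seqNum, price).
    stream<int32 seqNum, float64 price> Quotes = FileSource() {
      param
        file   : "quotes.csv";
        format : csv;
    }

    stream<int32 seqNum, int32 count, float64 peak> Highs = MatchRegex(Quotes) {
      param
        // The leading wildcard supplies a first tuple, so the aggregator
        // in "high" has a value by the time it is evaluated.
        pattern    : ". high high high";
        // In a predicate, Max excludes the current tuple: the current
        // price is compared with the maximum of the earlier tuples.
        predicates : { high = price > Max(price) };
      output
        // In output assignments, aggregators include the current tuple.
        Highs : seqNum = First(seqNum), count = Count(), peak = Max(price);
    }
}
```

Under these assumptions, count would be 4 for every match (the wildcard tuple plus three new highs), and peak would equal the price of the final tuple, because each high tuple exceeds everything before it.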

Output assignments are optional: the MatchRegex operator performs auto-assignment if an assignment is missing. The auto-assignment uses the current input tuple. Furthermore, if the forwardUnmatched parameter is set to true and the tuple did not match the pattern, the aggregations return values as if the entire pattern were "." and matched the current input tuple, except that IsMatch() returns false.

Consistent Region Behavior
  • The operator can participate in a consistent region.
  • The operator cannot be at the start of a consistent region.
Examples

Summary

Ports
This operator has 1 input port and 1 output port.
Windowing
This operator does not accept any windowing configurations.
Parameters
This operator supports 5 parameters.

Required: pattern, predicates

Optional: flushOnPunct, forwardUnmatched, partitionBy

Metrics
This operator does not report any metrics.

Properties

Implementation
C++
Threading
Always - Operator always provides a single threaded execution context.

Input Ports

Ports (0)

The MatchRegex operator is configurable with a single input port. The input port is non-mutating and its punctuation mode is Oblivious.

Output Ports

Assignments
This operator allows any SPL expression of the correct type to be assigned to output attributes. Attributes not assigned in the output clause are automatically assigned from the attributes of the input port that have the same name and type. If there is no such input attribute, an error is reported at compile time.
Output Functions
PredicateFunctions
<numeric T> public T AsIs(T v)

Returns the argument.

<numeric T> public T Average(T v)

Returns the arithmetic mean, which is the average value of the specified attribute for tuples in the match.

<any T> public list<T> Collect(T v)

Returns a list of the specified attribute values for all tuples in the match.

public int32 Count()

Returns the number of tuples in the match. That is to say, the number of simple events.

<any T> public T Any(T v)

Returns the value of the attribute of one of the tuples in the match.

public float64 Delta(timestamp v)

Returns the difference between the values of the specified timestamp attribute in the last and first tuples in the match.

<any T> public T First(T v)

Returns the value of the input attribute that corresponds to the first tuple in the match.

<any T> public T Last(T v)

Returns the value of the input attribute that corresponds to the last tuple in the match.

<ordered T> public T Max(T v)

Returns the largest input attribute value for tuples in the match.

<ordered T> public T Min(T v)

Returns the smallest input attribute value for tuples in the match.

<numeric T> public T Sum(T v)

Returns the sum of the specified attribute values for tuples in the match.

public boolean IsMatch()

Returns true if the tuple is the result of a matched pattern. That is to say, the match is complete.

Ports (0)

The MatchRegex operator is configurable with one output port. The output port is mutating and its punctuation mode is Generating. Each time the operator detects a match, it submits an output tuple, followed by a window punctuation. In addition, if the optional forwardUnmatched parameter is true, the operator also forwards input tuples when there is no match.

Parameters

This operator supports 5 parameters.

Required: pattern, predicates

Optional: flushOnPunct, forwardUnmatched, partitionBy

flushOnPunct

This optional parameter tells the operator to discard all partial matches and restart from the beginning when the operator receives a window punctuation. If not specified, the default is false.

forwardUnmatched

This optional parameter tells the operator to generate an output tuple for all input tuples, even if the pattern is not matched. If omitted, the parameter defaults to false, in which case the operator generates an output tuple only when it detects a match. When forwardUnmatched is true, the operator generates exactly one output tuple per input tuple. This behavior is useful when the downstream operators need access to the entire stream in the correct order. Matches are marked either by a flag in the tuple, or by punctuation after each match. For example, this parameter value can be used for cascaded pattern matching, where each pattern sets a classification flag in each tuple where it detects a match.

Tip: When forwardUnmatched is true, you can use the IsMatch() function in the output clause to determine if a match occurred.
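A minimal sketch of this tip follows. The stream names, schema, file name, and the matched attribute are hypothetical. Every input tuple is forwarded, and the matched flag records whether that tuple completed a match; the remaining output attributes are left to auto-assignment from the current input tuple.

```spl
use com.ibm.streams.cep::MatchRegex;

composite FlagMatches {
  graph
    // Hypothetical input: one CSV line per quote (symbol, seqNum, price).
    stream<rstring symbol, int32 seqNum, float64 price> Quotes = FileSource() {
      param
        file   : "quotes.csv";
        format : csv;
    }

    // Same attributes as the input plus a classification flag.
    stream<rstring symbol, int32 seqNum, float64 price, boolean matched> Flagged = MatchRegex(Quotes) {
      param
        forwardUnmatched : true;
        pattern          : ". rise+ drop";
        predicates       : { rise = price > Last(price), drop = price < Last(price) };
      output
        // symbol, seqNum, and price are auto-assigned from the input tuple.
        Flagged : matched = IsMatch();
    }
}
```

A downstream operator could then filter on matched, or feed Flagged into a second MatchRegex invocation for cascaded pattern matching.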

partitionBy

This optional parameter specifies one or more attributes from the input tuples as a key. These expressions are used to partition the input tuples into substreams. The pattern is applied to each substream independently. For example, if the key is symbol, then the operator matches tuples with symbol=="IBM" separately from tuples with symbol=="MSFT". These partitioning semantics are similar to the semantics used in other SPL operators, such as Aggregate.

Note: The MatchRegex operator does not support partition eviction, so the state for a partition is kept once its key has been seen. In most cases partition eviction is unnecessary, because the set of distinct key values is limited.
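A partitioned invocation might look like the following sketch. The stream names, schema, and file name are hypothetical. Each symbol is matched independently, so a rise for one symbol is never combined with a drop for another.

```spl
use com.ibm.streams.cep::MatchRegex;

composite PartitionedMatch {
  graph
    // Hypothetical input: one CSV line per quote (symbol, seqNum, price).
    stream<rstring symbol, int32 seqNum, float64 price> Quotes = FileSource() {
      param
        file   : "quotes.csv";
        format : csv;
    }

    stream<rstring symbol, int32 seqNum, int32 count, float64 maxPrice> Matches = MatchRegex(Quotes) {
      param
        // One independent matcher per distinct value of symbol.
        partitionBy : symbol;
        pattern     : ". rise+ drop";
        predicates  : { rise = price > Last(price), drop = price < Last(price) };
      output
        Matches : symbol = symbol, seqNum = First(seqNum),
                  count = Count(), maxPrice = Max(price);
    }
}
```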

pattern

This mandatory parameter value contains a regular expression of predicates that specify the match condition. For example, ". rise+ drop+ rise+ drop* deep". The regular expression works on the granularity of input tuples. The basic components of the regular expression are predicate identifiers. Each individual predicate is matched against an individual input tuple, and the entire regular expression matches a sequence of consecutive input tuples.

The regular expression uses the following syntax rules:
  • id (Identifier): Identifies a predicate.
  • . (Wildcard): Represents the true predicate. The wildcard matches any input event.
  • re1 re2 (Concatenation): Concatenates two regular expressions.
  • re1|re2 (Disjunction): Specifies re1 or re2.
  • re* (Kleene star): Specifies that there are zero or more repetitions of the regular expression re.
  • re+ (Kleene cross): Specifies that there are one or more repetitions of the regular expression re.
  • re? (Optional): Specifies that there is zero or one occurrence of the regular expression re.
  • (re) (Grouping): Parentheses override operator precedence.
  • (Empty): The empty regular expression matches a zero-length sequence of events. For example, the pattern (a|)b is equivalent to the pattern a?b.

In a regular expression, the operator precedence, from highest to lowest, is as follows: grouping; postfix operators (*, +, ?); concatenation; disjunction. The pattern must not start with a predicate that uses an aggregator. The reason for this restriction is that the values of the aggregator are not valid until the pattern matches the first input tuple.
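The precedence rules and the leading-aggregator restriction can be seen together in a short sketch. The stream names, schema, file name, and the predicate name flat are hypothetical. The parentheses are needed here: without them, "rise|flat+ drop" would be read as "rise | (flat+ drop)", because postfix operators bind tighter than concatenation, which binds tighter than disjunction.

```spl
use com.ibm.streams.cep::MatchRegex;

composite GroupedPattern {
  graph
    // Hypothetical input: one CSV line per quote (seqNum, price).
    stream<int32 seqNum, float64 price> Quotes = FileSource() {
      param
        file   : "quotes.csv";
        format : csv;
    }

    stream<int32 seqNum, int32 count> Plateaus = MatchRegex(Quotes) {
      param
        // The leading "." keeps the pattern from starting with a
        // predicate that uses an aggregator (all three use Last).
        pattern    : ". (rise|flat)+ drop";
        predicates : { rise = price > Last(price),
                       flat = price == Last(price),
                       drop = price < Last(price) };
      output
        Plateaus : seqNum = First(seqNum), count = Count();
    }
}
```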

Matching is non-greedy. For example, the pattern a+ matches the first simple event that satisfies a. This property is also known as right-minimality. If matching were greedy, the MatchRegex operator would have to wait until the next tuple failed the predicate a before it could report the match for a+. You can emulate greedy matching by explicitly adding another predicate at the end of the regular expression as a stop predicate.
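A stop predicate could be sketched as follows. The stream names, schema, file name, and the predicate name stop are hypothetical. The match completes only when a tuple ends the run of rises, so the reported match covers the whole run, plus the leading wildcard tuple and the stop tuple itself.

```spl
use com.ibm.streams.cep::MatchRegex;

composite GreedyRise {
  graph
    // Hypothetical input: one CSV line per quote (seqNum, price).
    stream<int32 seqNum, float64 price> Quotes = FileSource() {
      param
        file   : "quotes.csv";
        format : csv;
    }

    stream<int32 seqNum, int32 count> Runs = MatchRegex(Quotes) {
      param
        // "stop" is the negation of "rise", so rise+ is extended until
        // a tuple that fails to rise is seen.
        pattern    : ". rise+ stop";
        predicates : { rise = price > Last(price),
                       stop = price <= Last(price) };
      output
        // Count() includes the wildcard tuple and the stop tuple.
        Runs : seqNum = First(seqNum), count = Count();
    }
}
```

The cost of this approach is latency: the run is not reported until the stop tuple arrives.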

Matching is contiguous or partition-contiguous. If you do not specify the partitionBy parameter, matching is fully contiguous and does not skip any events. If you specify the partitionBy parameter, matching is contiguous only regarding events for the same partition; events for other partitions are skipped.

Completed matches are non-overlapping in their partition. When the operator detects a completed match, it reports the longest completed match by sending it on the output stream. The operator then discards all partial matches for that partition. This property is also known as left-maximality. This behavior is consistent with the behavior of regular expressions over strings in most programming languages.

predicates

This mandatory parameter defines the predicates that are used in the regular expression. Predicates are defined by using tuple notation. Each predicate has a name and a boolean expression. For example:

param
	pattern: ". rise+ drop+ rise+ drop* deep";
	predicates: {
	rise= price>First(price) && price>=Last(price),
	drop= price>=First(price) && price<Last(price),
	deep= price<First(price) && price<Last(price)};

Predicates can refer to attributes of the input tuple (for example, price), and can use aggregations (for example, First(price)). Predicates can also call normal SPL functions (for example, disjointTags(tweet1, tweet2)). Predicates must be side-effect free.

Code Templates

MatchRegex
stream<${schema}> ${outputStream} = MatchRegex(${inputStream}) {
    param
        pattern : "${pattern}";
        predicates : { ${predicates} };
    output
        ${outputStream} : ${outputAttribute} = ${value};
}