Operator Parse


The Parse operator is similar to the FileSource, TCPSource, and UDPSource operators, in that it transforms input data in a raw form into well-structured SPL tuples. The difference is that unlike source adapters, the Parse operator is not tied to a particular external resource. Instead, it can be used inside an SPL application on data that came from any external source.

Because the Parse operator accepts data in many formats (such as line or bin), the data is passed in as a blob attribute. The Parse operator generates the SPL tuples that correspond to the input format.
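As a minimal sketch of this pattern, the following application reads a file as raw blocks and lets Parse turn the bytes into typed tuples. The file names, stream names, and output schema are assumptions chosen for illustration only.

```spl
composite ParseBasic {
    graph
        // Read the file as raw blocks; each tuple carries one blob attribute.
        stream<blob b> Raw = FileSource() {
            param
                file      : "in.csv";   // hypothetical input file
                format    : block;
                blockSize : 1024u;
        }

        // Interpret the raw bytes as CSV and emit one typed tuple per record.
        stream<rstring name, int32 age> Parsed = Parse(Raw) {
            param
                format : csv;
        }

        () as Out = FileSink(Parsed) {
            param
                file : "out.csv";       // hypothetical output file
        }
}
```

Because Parse only sees the blob, the FileSource here could be replaced by any other operator that produces a blob attribute.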

Checkpointed data

When the Parse operator is checkpointed in a consistent region, any partially parsed input data and logic state variables (if present) are saved in the checkpoint. When the Parse operator is checkpointed in an autonomous region, only logic state variables (if present) are saved in the checkpoint.

Behavior in a consistent region

The Parse operator can be used in a consistent region, but not as a start operator. When a region is drained, the Parse operator reads as much of its input as it can produce output tuples from, but there might be some residual data that is not sufficient to produce an output tuple. This residual data, if any, is stored in the checkpoint. On reset, the Parse operator clears any input data it has, reads the residual data from the checkpoint, and adds that as the start of its read buffer. Logic state variables (if present) are also automatically checkpointed and reset.
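The following sketch places Parse inside a consistent region that is started by an upstream source. It assumes that FileSource is an acceptable start operator with a periodic trigger and that the application runs in distributed mode; the file names, stream names, and the 30-second period are illustrative.

```spl
composite ParseInConsistentRegion {
    graph
        // An application with a consistent region needs a JobControlPlane operator.
        () as JCP = JobControlPlane() {}

        // The source starts the region; Parse itself cannot be the start operator.
        @consistent(trigger = periodic, period = 30.0)
        stream<blob b> Raw = FileSource() {
            param
                file      : "in.csv";   // hypothetical input file
                format    : block;
                blockSize : 1024u;
        }

        // Parse is reachable from the start operator, so it joins the region.
        // On drain, residual bytes that do not yet form a tuple go into the checkpoint.
        stream<rstring name, int32 age> Parsed = Parse(Raw) {
            param
                format : csv;
        }

        () as Out = FileSink(Parsed) {
            param
                file : "out.csv";       // hypothetical output file
        }
}
```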

Checkpointing behavior in an autonomous region

When the Parse operator is in an autonomous region and configured with a config checkpoint : periodic(T) clause, a background thread in the SPL runtime checkpoints the operator every T seconds. This periodic checkpointing is asynchronous to tuple processing. Upon restart, the operator restores its internal state to its initial state, and restores logic state variables (if present) from the last checkpoint.

When the Parse operator is in an autonomous region and configured with a config checkpoint : operatorDriven clause, no checkpoint is taken at run time. Upon restart, the operator restores to its initial state.

Such checkpointing behavior is subject to change in the future.
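A sketch of the periodic configuration in an autonomous region follows. The logic state variable nBlocks, the 10-second period, and all stream and file names are assumptions for illustration; only the state variable would be restored from the last checkpoint after a restart.

```spl
composite ParsePeriodicCheckpoint {
    graph
        stream<blob b> Raw = FileSource() {
            param
                file      : "in.csv";   // hypothetical input file
                format    : block;
                blockSize : 1024u;
        }

        stream<rstring name, int32 age> Parsed = Parse(Raw) {
            logic
                // Hypothetical logic state: counts the input blocks seen so far.
                state   : mutable int64 nBlocks = 0l;
                onTuple Raw : nBlocks++;
            param
                format : csv;
            config
                // Checkpoint every 10 seconds, asynchronously to tuple processing.
                checkpoint  : periodic(10.0);
                restartable : true;
        }

        () as Out = FileSink(Parsed) {
            param
                file : "out.csv";       // hypothetical output file
        }
}
```

Replacing periodic(10.0) with operatorDriven would, per the description above, result in no checkpoint being taken at run time.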

Exceptions

If there are errors while extracting tuples from the input data, the Parse operator generates a tracing message or throws an exception. You can use the parsing parameter to control this behavior.
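The sketch below sets the parsing parameter explicitly. It assumes that the permissive value behaves as documented for the FileSource operator, so that malformed records are reported through tracing rather than causing an exception; names and schema are illustrative.

```spl
composite ParsePermissive {
    graph
        stream<blob b> Raw = FileSource() {
            param
                file      : "in.csv";   // hypothetical input file
                format    : block;
                blockSize : 1024u;
        }

        // Tolerate records that do not match the output schema.
        stream<rstring name, int32 age> Parsed = Parse(Raw) {
            param
                format  : csv;
                parsing : permissive;
        }

        () as Out = FileSink(Parsed) {
            param
                file : "out.csv";       // hypothetical output file
        }
}
```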

Examples

Summary

Ports
This operator has 1 input port and 1 output port.
Windowing
This operator does not accept any windowing configurations.
Parameters
This operator supports 11 parameters.

Optional: blockSize, defaultTuple, eolMarker, format, hasDelayField, hasHeaderLine, ignoreExtraCSVValues, parseInput, parsing, readPunctuations, separator

Metrics
This operator reports 1 metric.

Properties

Implementation
C++
Threading
Always - Operator always provides a single-threaded execution context.

Input Ports

Ports (0)

The Parse operator is configurable with a single input port, which ingests tuples that contain data to be parsed into tuples.

Output Ports

Assignments
Assignments made to output attributes must not reference input stream attributes.
Output Functions
OutputFunctions
int64 TupleNumber()

Returns the tuple number of the generated tuple.

<any T> T AsIs(T)

Returns the input value unchanged.
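As an illustration of the output functions, the following sketch fills one output attribute with TupleNumber() while the remaining attributes are parsed from the input. The attribute name seq and the other names are assumptions; note that the assignment does not reference any input attribute.

```spl
composite ParseWithTupleNumber {
    graph
        stream<blob b> Raw = FileSource() {
            param
                file      : "in.csv";   // hypothetical input file
                format    : block;
                blockSize : 1024u;
        }

        // name and age come from the CSV data; seq is set in the output clause.
        stream<rstring name, int32 age, int64 seq> Parsed = Parse(Raw) {
            param
                format : csv;
            output
                Parsed : seq = TupleNumber();
        }

        () as Out = FileSink(Parsed) {
            param
                file : "out.csv";       // hypothetical output file
        }
}
```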

Ports (0)

The Parse operator is configurable with a single output port, which produces tuples that are parsed from the input data.

If the format parameter value is bin and the readPunctuations parameter value is true, then window punctuations and final punctuations are generated based on the input data in the blob. Otherwise, a window punctuation and a final punctuation are generated when a final punctuation is received.

The output stream from the Parse operator must meet all the requirements of the first output stream of the FileSource operator, with respect to the format parameter. For example, if the format is block, then the output stream must have exactly one attribute of type blob that is not set in an output clause.

Parameters

blockSize

Specifies the block size for the block format. For more information, see the blockSize parameter in the spl.adapter::FileSource operator.

defaultTuple

Specifies the default tuple to use for missing fields. For more information, see the defaultTuple parameter in the spl.adapter::FileSource operator.

eolMarker

Specifies the end of line marker. For more information, see the eolMarker parameter in the spl.adapter::FileSource operator.

format

Specifies the format of the data. For more information, see the format parameter in the spl.adapter::FileSource operator.

hasDelayField

Specifies whether the format contains inter-arrival delays as the first field. For more information, see the hasDelayField parameter in the spl.adapter::FileSource operator.

hasHeaderLine

Specifies whether to ignore the first line or lines of the input in CSV format. For more information, see the hasHeaderLine parameter in the spl.adapter::FileSource operator.

ignoreExtraCSVValues

Specifies whether to skip any extra fields before end of line when reading in CSV format. For more information, see the ignoreExtraCSVValues parameter in the spl.adapter::FileSource operator.

parseInput

Specifies which input attribute is parsed.

If this parameter is not specified, the input stream must contain only one attribute of type blob.

Note: Because the attribute named by this parameter must be of type blob, the data is binary in the sense that it is written as a sequence of bytes to a blob data type. However, that binary data can represent different formats, such as txt, line, or bin.
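The following sketch selects the attribute to parse when the input stream carries more than a single blob. The Functor step, the attribute names origin and payload, and the file name are assumptions for illustration, and the example assumes that parseInput takes the input attribute directly.

```spl
composite ParseSelectedAttribute {
    graph
        stream<blob b> Blocks = FileSource() {
            param
                file      : "in.txt";   // hypothetical input file
                format    : block;
                blockSize : 512u;
        }

        // Add a second attribute so the input no longer consists of one blob only.
        stream<rstring origin, blob payload> Tagged = Functor(Blocks) {
            output
                Tagged : origin = "in.txt", payload = b;
        }

        // Name the blob attribute that holds the data to be parsed.
        stream<rstring text> Lines = Parse(Tagged) {
            param
                format     : line;
                parseInput : payload;
        }

        () as Out = FileSink(Lines) {
            param
                file : "out.txt";       // hypothetical output file
        }
}
```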

parsing

Specifies the parsing mode. For more information, see the parsing parameter in the spl.adapter::FileSource operator.

readPunctuations

Specifies whether to read punctuations from bin format input. For more information, see the readPunctuations parameter in the spl.adapter::FileSource operator.

separator

Specifies the separator character for the csv format. For more information, see the separator parameter in the spl.adapter::FileSource operator.

Code Templates

Parse

stream<${streamType}> ${streamName} = Parse(${inputSchema}) {
    param
        format : "${format}";
}

Metrics

nInvalidTuples - Counter

The number of tuples that failed to read correctly in csv or txt format.

Libraries

spl-std-tk-lib
Library Name: streams-stdtk-runtime
Include Path: ../../../impl/include