MapR Multitopic Streams Consumer
The MapR Multitopic Streams Consumer origin reads data from multiple topics in a MapR Streams cluster. The origin can use multiple threads to enable parallel processing of data.
When you configure a MapR Multitopic Streams Consumer, you configure the consumer group name, the topics to process, and the number of threads to use.
You can configure the origin to produce a single record when a message includes multiple objects, and you can add additional MapR Streams and supported Kafka configuration properties as needed.
When processing Avro data, you can configure the MapR Multitopic Streams Consumer to work with the Confluent Schema Registry. The Confluent Schema Registry is a distributed storage layer for Avro schemas that uses Kafka as its underlying storage mechanism.
MapR Multitopic Streams Consumer includes record header attributes that enable you to use information about the record in pipeline processing.
Before you use any MapR stage in a pipeline, you must perform additional steps to enable Data Collector to process MapR data. For more information, see MapR Prerequisites in the Data Collector documentation.
Initial and Subsequent Offsets
When you start a pipeline for the first time, the MapR Multitopic Streams Consumer becomes a new consumer group for each specified topic.
By default, the origin reads only incoming data, processing data from all partitions and ignoring any existing data in the topic. After the origin passes data to destinations, it saves the offset with MapR Streams. When you stop and restart the pipeline, processing continues based on the offset.
Processing All Unread Data
By default, the MapR Multitopic Streams Consumer origin reads only incoming data. However, you can configure the origin to read all unread data in a topic.
To read all unread data, add the auto.offset.reset MapR Streams configuration property to the origin:
- On the Connection tab, click the Add icon to add a new MapR Streams configuration property. You can use simple or bulk edit mode to add configuration properties.
- For the property name, enter auto.offset.reset.
- Set the value for the auto.offset.reset property to earliest.
For more information about auto.offset.reset, see the MapR Streams documentation.
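As a quick reference, the steps above amount to adding the following name/value pair to the origin's MapR Streams configuration properties (shown here in properties-file notation for illustration only; in the stage you enter the name and value in separate fields):

```
auto.offset.reset=earliest
```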
Multithreaded Processing
The MapR Multitopic Streams Consumer origin performs parallel processing and enables the creation of a multithreaded pipeline. The origin uses multiple concurrent threads based on the Number of Threads property, and MapR Streams distributes partitions equally among all the consumers in a group.
When performing multithreaded processing, the MapR Multitopic Streams Consumer origin checks the list of topics to process and creates the specified number of threads. Each thread connects to MapR Streams and creates a batch of data from a partition assigned by MapR Streams. Then, it passes the batch to an available pipeline runner.
A pipeline runner is a sourceless pipeline instance - an instance of the pipeline that includes all of the processors, executors, and destinations in the pipeline and handles all pipeline processing after the origin. Each pipeline runner processes one batch at a time, just like a pipeline that runs on a single thread. When the flow of data slows, the pipeline runners wait idly until they are needed, generating an empty batch at regular intervals. You can configure the Runner Idle Time pipeline property to specify the interval or to opt out of empty batch generation.
Multithreaded pipelines preserve the order of records within each batch, just like a single-threaded pipeline. But since batches are processed by different pipeline runners, the order that batches are written to destinations is not ensured.
For example, say you set the Number of Threads property to 5. When you start the pipeline, the origin creates five threads, and Data Collector creates a matching number of pipeline runners. The threads are assigned to different partitions as defined by MapR Streams. Upon receiving data, the origin passes a batch to each of the pipeline runners for processing.
At any given moment, the five pipeline runners can each process a batch, so this multithreaded pipeline processes up to five batches at a time. When incoming data slows, the pipeline runners sit idle, available for use as soon as the data flow increases.
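The threading model described above can be sketched in a few lines of Python. This is an illustrative approximation only, not the Data Collector implementation: the thread count, batch contents, and function names are all hypothetical, and a real origin would poll MapR Streams rather than fabricate batches.

```python
import queue
import threading

NUM_THREADS = 5  # corresponds to the Number of Threads property

def run_pipeline(runner_id, batch, results, lock):
    # A pipeline runner processes one batch at a time; record order
    # within the batch is preserved.
    with lock:  # lock protects the shared results list
        results.append((runner_id, batch))

def consume(partition, runners, results, lock):
    # Each thread builds a batch from the partition assigned to it,
    # then hands the batch to whichever pipeline runner is available.
    batch = [f"p{partition}-msg{i}" for i in range(3)]
    runner_id = runners.get()              # wait for an available runner
    try:
        run_pipeline(runner_id, batch, results, lock)
    finally:
        runners.put(runner_id)             # runner becomes available again

# Pool of pipeline runners, one per thread.
runners = queue.Queue()
for i in range(NUM_THREADS):
    runners.put(i)

results, lock = [], threading.Lock()
threads = [threading.Thread(target=consume, args=(p, runners, results, lock))
           for p in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All five batches are processed, but the order in which batches
# complete (and reach destinations) is not guaranteed.
print(len(results))
```

Note that each batch keeps its internal record order, while the order of the `(runner, batch)` tuples in `results` varies from run to run, mirroring the behavior described for multithreaded pipelines.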
For more information about multithreaded pipelines, see Multithreaded Pipeline Overview. For more information about MapR Streams, see the MapR Streams documentation.
Additional Properties
You can add custom configuration properties to the MapR Multitopic Streams Consumer. You can use any MapR or Kafka property supported by MapR Streams. For more information, see the MapR Streams documentation.
When you add a configuration property, enter the exact property name and the value. The MapR Multitopic Streams Consumer does not validate the property names or values.
If custom configurations conflict with other stage properties, the stage generates an error unless you select the Override Stage Configurations check box. With the check box selected, the custom configurations override other stage properties. For information about the necessary properties, see the MapR documentation.
The origin provides values for the following MapR Streams configuration properties and ignores any user-defined values for them:
- auto.commit.interval.ms
- enable.auto.commit
- group.id
Record Header Attributes
The MapR Multitopic Streams Consumer origin creates record header attributes that include information about the origin of the record, such as the topic, partition, and offset where the record originated. When the origin processes Avro data, it includes the Avro schema in an avroSchema record header attribute.
You can use the record:attribute or record:attributeOrDefault functions to access the information in the attributes. For more information about working with record header attributes, see Working with Header Attributes.
The origin creates the following record header attributes:
- avroSchema - When processing Avro data, provides the Avro schema.
- offset - The offset where the record originated.
- partition - The partition where the record originated.
- topic - The topic where the record originated.
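For example, the following expressions use the record:attribute and record:attributeOrDefault functions named above to return the originating topic and, when present, the Avro schema (the 'none' default value is illustrative):

```
${record:attribute('topic')}
${record:attributeOrDefault('avroSchema', 'none')}
```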
Data Formats
The MapR Multitopic Streams Consumer origin processes data differently based on the data format. MapR Multitopic Streams Consumer can process the following types of data:
- Avro - Generates a record for every message. Includes precision and scale field attributes for each Decimal field.
- Binary - Generates a record with a single byte array field at the root of the record.
- Datagram - Generates a record for every message. The origin can process collectd messages, NetFlow 5 and NetFlow 9 messages, and syslog messages.
- Delimited - Generates a record for each delimited line.
- JSON - Generates a record for each JSON object. You can process JSON files that include multiple JSON objects or a single JSON array.
- Log - Generates a record for every log line.
- Protobuf - Generates a record for every protobuf message. By default, the origin assumes messages contain multiple protobuf messages.
- SDC Record - Generates a record for every record. Use to process records generated by a Data Collector pipeline using the SDC Record data format.
- Text - Generates a record for each line of text or for each section of text based on a custom delimiter.
- XML - Generates records based on a user-defined delimiter element. Use an XML element directly under the root element or define a simplified XPath expression. If you do not define a delimiter element, the origin treats the XML file as a single record.