Azure IoT/Event Hub Consumer
The Azure IoT/Event Hub Consumer origin reads data from Microsoft Azure Event Hub. The origin can use multiple threads to enable parallel processing of data from a single Azure event hub. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.
Before you use the Azure IoT/Event Hub Consumer origin, make sure you have the required Microsoft Azure storage account and container.
When you configure the Azure IoT/Event Hub Consumer, you specify the Microsoft Azure namespace and event hub names. You also define the shared access policy name and connection string key. You specify the consumer group to use and an event processor prefix that the origin uses when communicating with Azure Event Hub.
You also configure storage account details, such as the storage account name and key, and you specify the number of threads to use during processing.
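The origin handles all of this wiring internally, but seeing the same pieces assembled with Azure's own client library can clarify what each property is for. The following is a minimal sketch using the Azure Event Hubs Java SDK with its blob checkpoint store, not Data Collector code; every connection string and name is a placeholder that you would replace with your own values.

```java
import com.azure.messaging.eventhubs.EventProcessorClient;
import com.azure.messaging.eventhubs.EventProcessorClientBuilder;
import com.azure.messaging.eventhubs.checkpointstore.blob.BlobCheckpointStore;
import com.azure.storage.blob.BlobContainerAsyncClient;
import com.azure.storage.blob.BlobContainerClientBuilder;

public class EventHubConsumerSketch {
    public static void main(String[] args) {
        // Placeholder values: namespace, event hub, shared access policy name,
        // and connection string key map to the origin's connection properties.
        String eventHubConnStr = "Endpoint=sb://<namespace>.servicebus.windows.net/;"
                + "SharedAccessKeyName=<policy-name>;SharedAccessKey=<connection-string-key>";
        String storageConnStr = "DefaultEndpointsProtocol=https;"
                + "AccountName=<storage-account-name>;AccountKey=<storage-account-key>";

        // Offsets (checkpoints) live in a storage account container, which is
        // why the origin requires the storage account and container properties.
        BlobContainerAsyncClient checkpointContainer = new BlobContainerClientBuilder()
                .connectionString(storageConnStr)
                .containerName("<container-for-this-pipeline>")
                .buildAsyncClient();

        EventProcessorClient processor = new EventProcessorClientBuilder()
                .connectionString(eventHubConnStr, "<event-hub-name>")
                .consumerGroup("$Default")  // the consumer group to use
                .checkpointStore(new BlobCheckpointStore(checkpointContainer))
                .processEvent(ctx -> System.out.println(ctx.getEventData().getBodyAsString()))
                .processError(ctx -> System.err.println("Error: " + ctx.getThrowable()))
                .buildEventProcessorClient();

        processor.start();  // begins reading events and checkpointing offsets
    }
}
```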
Storage Account and Container Prerequisite
Before you use the Azure IoT/Event Hub Consumer origin, you need a Microsoft Azure storage account and at least one container.
The origin stores offsets in a storage account container, so to ensure the integrity of offset information, you must use a different container for each pipeline that includes an Azure IoT/Event Hub Consumer origin.
For example, say you use the Azure IoT/Event Hub Consumer as the origin for an IoT pipeline and a Transactions pipeline. To keep the offset data for these pipelines separate, you need to use two different storage account containers. They can be in the same storage account or in different storage accounts. When you configure the origins, you specify the storage account and container to use.
- Log in to the Microsoft Azure portal: https://portal.azure.com
- In the Navigation panel, click Storage Accounts.
- Select the storage account to use.
If you need to create a storage account, click the Add icon. Enter a name for the storage account, and enter or select a resource group name. You can use the defaults for all other properties.
- In the storage account view, click + Container to create a container.
- Enter a container name, and click OK. Tip: Use a name that clearly identifies the pipeline that will use the container.
If these steps are no longer accurate, see the Microsoft Azure Event Hub documentation.
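If you prefer scripting the prerequisite to clicking through the portal, you can create the container with the Azure Storage Java SDK instead. A minimal sketch, assuming a placeholder connection string and container name:

```java
import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.BlobServiceClient;
import com.azure.storage.blob.BlobServiceClientBuilder;

public class CreatePipelineContainer {
    public static void main(String[] args) {
        // Placeholder connection string: use your storage account name and key.
        String storageConnStr = "DefaultEndpointsProtocol=https;"
                + "AccountName=<storage-account-name>;AccountKey=<storage-account-key>";

        BlobServiceClient service = new BlobServiceClientBuilder()
                .connectionString(storageConnStr)
                .buildClient();

        // One container per pipeline keeps offset information separate.
        // createBlobContainer throws if the container already exists.
        BlobContainerClient container = service.createBlobContainer("iot-pipeline-offsets");
        System.out.println("Created container: " + container.getBlobContainerName());
    }
}
```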
Resetting the Origin in Event Hub
You cannot use Data Collector to reset the origin for Azure IoT/Event Hub Consumer pipelines because the offset is stored in the Azure storage account container rather than by Data Collector. To reset the origin, delete the offset information stored in Azure:
- In the Microsoft Azure portal, navigate to the storage account.
- To delete the offset information stored for the pipeline, delete the container that the pipeline uses.
This can take some time. Allow the portal to complete the removal of the container before continuing.
- To enable the pipeline to store new offset information when you restart the pipeline, create a new container with the same name. Or, use a different name and update the Container Name property in the pipeline.
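The same reset can be scripted with the Azure Storage Java SDK. A minimal sketch, again with placeholder connection details; because Azure deletes containers asynchronously, recreating a container under the same name fails until the deletion finishes, so the sketch retries:

```java
import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.BlobServiceClient;
import com.azure.storage.blob.BlobServiceClientBuilder;
import com.azure.storage.blob.models.BlobErrorCode;
import com.azure.storage.blob.models.BlobStorageException;

public class ResetPipelineOffsets {
    public static void main(String[] args) throws InterruptedException {
        // Placeholder connection string: use your storage account name and key.
        String storageConnStr = "DefaultEndpointsProtocol=https;"
                + "AccountName=<storage-account-name>;AccountKey=<storage-account-key>";

        BlobServiceClient service = new BlobServiceClientBuilder()
                .connectionString(storageConnStr)
                .buildClient();

        BlobContainerClient container = service.getBlobContainerClient("iot-pipeline-offsets");
        container.delete();  // removes the pipeline's stored offsets

        // Azure removes containers asynchronously, so retry creation until
        // the name is released (or simply create a container with a new name
        // and update the pipeline's Container Name property).
        while (true) {
            try {
                container.create();
                break;
            } catch (BlobStorageException e) {
                if (!BlobErrorCode.CONTAINER_BEING_DELETED.equals(e.getErrorCode())) {
                    throw e;  // a different failure: surface it
                }
                Thread.sleep(5_000);
            }
        }
    }
}
```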
Multithreaded Processing
The Azure IoT/Event Hub Consumer origin performs parallel processing and enables the creation of a multithreaded pipeline.
The Azure IoT/Event Hub Consumer origin uses multiple concurrent threads to read from an event hub based on the Max Threads property. When you start the pipeline, the origin creates the number of threads specified in the Max Threads property. Each thread connects to the origin system, creates a batch of data, and passes the batch to an available pipeline runner.
A pipeline runner is a sourceless pipeline instance - an instance of the pipeline that includes all of the processors, executors, and destinations in the pipeline and handles all pipeline processing after the origin. Each pipeline runner processes one batch at a time, just like a pipeline that runs on a single thread. When the flow of data slows, the pipeline runners wait idly until they are needed, generating an empty batch at regular intervals. You can configure the Runner Idle Time pipeline property to specify the interval or to opt out of empty batch generation.
Multithreaded pipelines preserve the order of records within each batch, just like a single-threaded pipeline. But since batches are processed by different pipeline runners, the order that batches are written to destinations is not ensured.
For example, say you set the Max Threads property to 5. When you start the pipeline, the origin creates five threads, and Data Collector creates a matching number of pipeline runners. Upon receiving data, the origin passes a batch to each of the pipeline runners for processing.
Each pipeline runner performs the processing associated with the rest of the pipeline. After a batch is written to pipeline destinations, the pipeline runner becomes available for another batch of data. Each batch is processed and written as quickly as possible, independent of the other batches processed by other pipeline runners, so batches may be written in a different order than they were read.
At any given moment, the five pipeline runners can each process a batch, so this multithreaded pipeline processes up to five batches at a time. When incoming data slows, the pipeline runners sit idle, available for use as soon as the data flow increases.
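The runner model behaves much like handing batches to a fixed-size worker pool. The following simplified sketch, which is plain Java rather than Data Collector code, shows why completion order can differ from read order when five workers process batches that take varying amounts of time:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class RunnerOrderSketch {
    public static void main(String[] args) {
        int maxThreads = 5;  // analogous to the Max Threads property
        ExecutorService runners = Executors.newFixedThreadPool(maxThreads);

        // Ten "batches" arrive in read order 0..9.
        for (int batch = 0; batch < 10; batch++) {
            final int id = batch;
            runners.submit(() -> {
                try {
                    // Simulate variable per-batch processing time.
                    Thread.sleep((long) (Math.random() * 100));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                // Records stay ordered within a batch, but batches complete
                // (are "written") in a nondeterministic order across runners.
                System.out.println("wrote batch " + id);
            });
        }
        runners.shutdown();
    }
}
```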
For more information about multithreaded pipelines, see Multithreaded Pipeline Overview.
Data Formats
- Binary - Generates a record with a single byte array field at the root of the record.
- JSON - Generates a record for each JSON object. You can process JSON files that include multiple JSON objects or a single JSON array, as illustrated in the sketch after this list.
- SDC Record - Generates a record for every SDC record in the data. Use to process records generated by a Data Collector pipeline using the SDC Record data format.
- Text - Generates a record for each line of text, or for each section of text based on a custom delimiter.
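To make the JSON behavior concrete, here is a minimal sketch, using Jackson rather than Data Collector code, that turns a payload containing multiple root-level JSON objects into one record per object, the same way the origin does with the JSON data format:

```java
import com.fasterxml.jackson.databind.MappingIterator;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Map;

public class MultiObjectJsonSketch {
    public static void main(String[] args) throws Exception {
        // Two root-level JSON objects in a single payload; the origin would
        // generate one record for each object.
        String payload = "{\"id\":1,\"device\":\"a\"} {\"id\":2,\"device\":\"b\"}";

        ObjectMapper mapper = new ObjectMapper();
        MappingIterator<Map<String, Object>> records =
                mapper.readerFor(Map.class).readValues(payload);

        while (records.hasNext()) {
            System.out.println("record: " + records.next());
        }
    }
}
```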