UDP Multithreaded Source

The UDP Multithreaded Source source reads messages from one or more UDP ports. The source can create multiple worker threads to enable parallel processing in a multithreaded flow. For information about supported versions, see Supported systems and versions.

UDP Multithreaded Source generates a record for every message. UDP Multithreaded Source can process collectd messages, NetFlow 5 and NetFlow 9 messages, and the following types of syslog messages:

RFC 5424
RFC 3164
Non-standard common messages, such as RFC 3339 dates with no version digit

When processing NetFlow messages, the stage generates different records based on the NetFlow version. When processing NetFlow 9, the records are generated based on the NetFlow 9 configuration properties. For more information, see NetFlow data processing.

The source can also read binary or character-based raw data.

When you configure UDP Multithreaded Source, you specify the ports to use and the batch size and wait time. You specify the number of worker threads to use in multithreaded processing and you can specify the packet queue size. When epoll is available on the Data Collector machine, you can also specify the number of receiver threads to use to increase the throughput of packets to the flow.

You specify the data format for the data, then configure any related properties.

When a flow stops, the source notes where it stops reading. When the flow starts again, the source continues processing from where it stopped by default. You can reset the offset to process all requested data.
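To check that a flow receives data, you can send a test message to one of the configured ports. The following Python sketch builds an RFC 5424 syslog message and sends it as a single UDP datagram. It is an illustration only: the function names, the loopback address, and port 9995 are assumptions, not values defined by the source.

```python
import socket
from datetime import datetime, timezone


def format_rfc5424(facility, severity, hostname, app_name, message, timestamp=None):
    """Build an RFC 5424 syslog line; unused header fields are sent as '-'."""
    pri = facility * 8 + severity  # PRI value as defined by RFC 5424
    ts = (timestamp or datetime.now(timezone.utc)).isoformat()
    # Layout: <PRI>VERSION TIMESTAMP HOSTNAME APP-NAME PROCID MSGID STRUCTURED-DATA MSG
    return f"<{pri}>1 {ts} {hostname} {app_name} - - - {message}"


def send_udp(payload, host="127.0.0.1", port=9995):
    """Send one datagram. Host and port are assumed values for a local test."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("utf-8"), (host, port))


# Example call (not executed here):
# send_udp(format_rfc5424(1, 6, "host1", "demo", "hello from a test sender"))
```

Because UDP provides no delivery confirmation, a successful send does not show that the source received the message. Confirm receipt in the flow itself.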

Processing raw data

Use the Raw/Separated Data data format to enable the UDP Multithreaded Source source to generate records from binary or character-based raw data.

When processing raw data, the source can generate a record for each UDP packet that it receives. Or, if you specify a separator character, then the source can generate multiple records from each UDP packet.

When generating multiple records, you specify the multiple value behavior: one record with only the first value, one record with all values as a list, or multiple records with one record for each value.

You can optionally specify an output field to use for the data. When not specified, the source writes the raw data to the root field.

You might use the Raw/Separated Data data format to write raw data to a field that you later process using the Data Parser processor. This allows you to retain the raw data for another use.
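The record generation rules above can be sketched in Python as follows. This is a simplified illustration, not the source's implementation: the function name and the behavior labels `first`, `list`, and `split` are hypothetical stand-ins for the three multiple value behaviors.

```python
def records_from_packet(packet: bytes, separator: bytes = b"", behavior: str = "split"):
    """Return the list of record values produced from one UDP packet.

    behavior is a hypothetical label for the multiple value behavior:
    "first", "list", or "split".
    """
    if not separator:
        return [packet]        # no separator: one record per packet
    values = packet.split(separator)
    if behavior == "first":
        return [values[0]]     # one record with only the first value
    if behavior == "list":
        return [values]        # one record holding all values as a list
    if behavior == "split":
        return values          # one record for each value
    raise ValueError(f"unknown behavior: {behavior}")
```

The sketch does not model the output field. It only shows how many records each behavior yields for a packet.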

Receiver and worker threads

The UDP Multithreaded Source source uses both of the following types of threads:

Receiver threads

Used to pass data from the operating system socket to the source's packet queue. By default, the source uses a single receiver thread.

You can configure the source to use multiple receiver threads when Data Collector runs on a machine enabled for epoll. Epoll requires native libraries and is only available when Data Collector runs on recent versions of 64-bit Linux.

When you enable multiple receiver threads, you increase the rate at which data can be passed to the source, at the cost of the additional overhead of thread management.

To use additional receiver threads, select the Use Native Transports (epoll) property, and then configure Number of Receiver Threads.

Worker threads

Used to perform multithreaded flow processing. By default, the source uses a single thread for flow processing. You can increase the number of threads to use to perform parallel processing of larger volumes of data. For more information, see Multithreaded flows.

To use additional worker threads for parallel processing, increase the Number of Worker Threads property.

Packet queue

The UDP Multithreaded Source source uses a packet queue to hold incoming data in memory until the data can be incorporated in a batch and passed through the flow. When the packet queue is full, incoming packets are dropped. The number of packets that are dropped is noted in stage metrics.

When you configure the source, you can specify the maximum number of packets to allow in the queue. The default is 200,000. Because the packet queue uses Data Collector heap memory, when increasing the size of the queue, you should consider increasing the Data Collector heap size as well.
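The drop-when-full behavior can be illustrated with a bounded queue. The following Python sketch is a minimal model under stated assumptions, not the source's code: the class and method names are hypothetical, and only the default capacity of 200,000 comes from this section.

```python
import queue
import threading


class PacketQueue:
    """Bounded in-memory queue that drops packets when full (illustrative)."""

    def __init__(self, max_size=200_000):
        self._queue = queue.Queue(maxsize=max_size)
        self._lock = threading.Lock()  # protects the two counters
        self.dropped = 0               # packets rejected because the queue was full
        self.queued = 0                # packets accepted into the queue so far

    def offer(self, packet):
        """Add a packet without blocking; return False if it was dropped."""
        try:
            self._queue.put_nowait(packet)
        except queue.Full:
            with self._lock:
                self.dropped += 1
            return False
        with self._lock:
            self.queued += 1
        return True

    def take(self, timeout=None):
        """Remove and return the oldest packet, waiting up to timeout seconds."""
        return self._queue.get(timeout=timeout)

    def size(self):
        return self._queue.qsize()
```

In this model, a larger `max_size` absorbs longer bursts but holds more packets in memory, which reflects the heap consideration described above.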

Multithreaded flows

The UDP Multithreaded Source source performs parallel processing and enables the creation of a multithreaded flow.

When you enable multithreaded processing, the UDP Multithreaded Source source uses multiple concurrent threads for flow processing based on the Number of Worker Threads property. When you start the flow, the source creates the number of threads specified in the property.

As packets arrive from the specified UDP ports, they enter the packet queue. There is a single instance of the packet queue per flow. All receiver threads (which can be more than one, when using epoll) place packets onto the queue. At the same time, each worker thread removes packets from the queue, parses them according to the specified data format, and processes the rest of the flow using a flow runner.

A flow runner is a sourceless flow instance - an instance of the flow that includes all of the processors, executors, and targets in the flow and handles all flow processing after the source. Each flow runner processes one batch at a time, just like a flow that runs on a single thread. When the flow of data slows, the flow runners wait idly until they are needed, generating an empty batch at regular intervals. You can configure the Runner Idle Time flow property to specify the interval or to opt out of empty batch generation.

Multithreaded flows preserve the order of records within each batch, just like a single-threaded flow. But since batches are processed by different flow runners, the order that batches are written to targets is not ensured.

For example, say you enable multithreaded processing and set the Number of Worker Threads property to 5. When you start the flow, the source creates five threads, and Data Collector creates a matching number of flow runners. The source adds incoming data to the packet queue, creates batches of data from the queue and then passes the batches to the flow runners for processing.

Each flow runner performs the processing associated with the rest of the flow. After a batch is written to flow targets, the flow runner becomes available for another batch of data. Each batch is processed and written as quickly as possible, independently of the batches processed by other flow runners, so batches may be written in a different order than they were read.

At any given moment, the five flow runners can each process a batch, so this multithreaded flow processes up to five batches at a time. When incoming data slows, the flow runners sit idle, available for use as soon as the data flow increases.
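The interaction between the packet queue, worker threads, and batches can be sketched in Python. This is a simplified model under stated assumptions, not how Data Collector implements flow runners: the function names are hypothetical, `process_batch` stands in for the rest of the flow, and the sketch skips empty batches instead of generating them at intervals.

```python
import queue
import threading
import time


def read_batch(packets, max_batch_size, wait_ms):
    """Collect up to max_batch_size packets, waiting at most wait_ms in total."""
    batch = []
    deadline = time.monotonic() + wait_ms / 1000.0
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(packets.get(timeout=remaining))
        except queue.Empty:
            break
    return batch  # may be partial or empty when the wait time expires


def run_workers(packets, process_batch, num_workers, max_batch_size, wait_ms, stop):
    """Start num_workers threads; each one plays the role of a flow runner."""

    def worker():
        while not stop.is_set():
            batch = read_batch(packets, max_batch_size, wait_ms)
            if batch:
                process_batch(batch)  # order across workers is not guaranteed

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    return threads
```

Each worker keeps the order of packets within its own batch, but nothing coordinates the workers with each other, which is why the order of batches at the targets is not ensured.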

For more information about multithreaded flows, see Multithreaded flow overview.

Metrics for performance tuning

The UDP Multithreaded Source source provides packet queue metrics that you can use to tune flow performance.

The source provides the following packet queue metrics:

Dropped Packets - The number of packets that were dropped because the packet queue was full.
Queue Size - The current size of the packet queue.
Queued Packets - The total number of packets that have passed through the packet queue for processing.

These metrics can help you determine how to improve flow performance. For example, if you have a high volume of dropped packets and the queue size seems to be maxed out as you monitor the flow, you might increase the number of worker threads for the flow to allow for greater throughput. Or, if you have relatively high bursts of data volume and find packets getting dropped during those bursts, consider increasing the packet queue size to better accommodate them.

If the queue size is not maxed out, but the number of queued packets does not seem as high as you expect, you might be dropping packets on the operating system side. When epoll is available - that is, when Data Collector runs on recent versions of 64-bit Linux - increasing the number of receiver threads can increase the volume of packets that are passed to the source.
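The tuning guidance above can be summarized as a small decision function. The following Python sketch is illustrative only: the function, its parameters, and the 0.9 threshold for treating the queue as full are assumptions, and the returned strings simply restate the suggestions in this section.

```python
def suggest_tuning(dropped, queue_size, max_queue_size, queued, expected_queued,
                   bursty=False, full_ratio=0.9):
    """Map packet queue metrics to the tuning hints described in this section.

    full_ratio, bursty, and expected_queued are hypothetical inputs,
    not stage properties or metrics.
    """
    queue_full = queue_size >= full_ratio * max_queue_size
    if dropped > 0 and queue_full:
        if bursty:
            return "increase Packet Queue Size"
        return "increase Number of Worker Threads"
    if not queue_full and queued < expected_queued:
        return "possible OS-side drops: with epoll, increase Number of Receiver Threads"
    return "no change suggested"
```

Treat the result as a starting point for investigation. The metrics alone do not show why packets are lost.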

Configuring a UDP Multithreaded Source

About this task

Configure a UDP Multithreaded Source source to use multiple worker threads to process messages from one or more UDP ports.

Procedure

In the Properties panel, on the General tab, configure the following properties:

General Property	Description
Name	Stage name.
Description	Optional description.
On Record Error	Error record handling for the stage: Discard - Discards the record. Send to Error - Sends the record to the flow for error handling. Stop Flow - Stops the flow.

On the UDP tab, configure the following properties:

UDP Property	Description
Port	Port to listen to for data. Using simple or bulk edit mode, click the Add icon to list additional ports. To listen to a port below 1024, Data Collector must be run by a user with root privileges. Otherwise, the operating system does not allow Data Collector to bind to the port. Note: No other flows or processes can already be bound to the listening port. The listening port can be used only by a single flow.
Data Format	Data format passed by UDP: collectd, NetFlow, syslog, or Raw/Separated Data.
Use Native Transports (epoll)	Specifies whether to use multiple receiver threads for each port. Using multiple receiver threads can improve performance. Multiple receiver threads require epoll, which is available only when Data Collector runs on recent versions of 64-bit Linux.
Number of Receiver Threads	Number of receiver threads to use for each port. For example, if you configure two threads per port and configure the source to use three ports, the source uses a total of six threads. Use to increase the number of threads passing data to the source when epoll is available on the Data Collector machine. Default is 1.
Max Batch Size (messages)	Maximum number of messages to include in a batch and pass through the flow at one time. Honors values up to the Data Collector maximum batch size. Default is 1000, which is also the default Data Collector maximum batch size.
Batch Wait Time (ms)	Milliseconds to wait before sending a partial or empty batch.
Packet Queue Size	The maximum number of packets to hold in the packet queue for processing.
Number of Worker Threads	The number of threads that the source uses to perform flow processing.

On the syslog tab, define the character set for the data.

On the collectd tab, define the following collectd properties:

collectd Property	Description
Convert Hi-Res Time & Interval	Converts the collectd high resolution time format interval and timestamp to UNIX time, in milliseconds.
Exclude Interval	Excludes the interval field from output record.
Auth File	Path to an optional authentication file. Use an authentication file to accept signed and encrypted data.
TypesDB File Path	Path to a user-provided types.db file. Overrides the default types.db file.
Charset	Character set of the data.

For raw data, on the Raw/Separated Data tab, define the following properties:

Raw/Separated Data Property	Description
Raw Data Mode	Type of raw data to process: binary or string data.
Output Field Path	Optional output field for the raw data. When not used, the source writes the raw data to the root field.
Multiple Values Behavior	Action to take when the data separator produces multiple values from a UDP packet: First Value Only - Returns one record with the first value. All Values as a List - Returns one record with all values in a List. Split into Multiple Records - Returns multiple records, one record for each value.
Data Separator	Optional data separator used to split a UDP packet into multiple values. Specify byte literals using Java Unicode syntax, \u<character code>. For example, the default line feed character is expressed as follows: `\u000A`.
Charset	Charset used by string data.
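The separator syntax can be illustrated with a short Python sketch that converts a literal such as `\u000A` into the corresponding bytes. The function name is hypothetical, and the sketch assumes that each escape names a single byte (a code point from 0000 to 00FF). How the source treats larger code points is not covered here.

```python
import re

_UNICODE_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_separator(literal: str) -> bytes:
    """Turn a literal such as '\\u000A' into the byte or bytes it names."""

    def to_char(match):
        return chr(int(match.group(1), 16))

    text = _UNICODE_ESCAPE.sub(to_char, literal)
    # latin-1 maps code points 0-255 one-to-one onto single bytes;
    # larger code points raise UnicodeEncodeError in this sketch.
    return text.encode("latin-1")
```

For example, `decode_separator(r"\u000A")` returns the line feed byte, and `decode_separator(r"\u000D\u000A")` returns a carriage return followed by a line feed.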

For NetFlow 9 data, on the NetFlow 9 tab, configure the following properties:

When processing earlier versions of NetFlow data, these properties are ignored.

NetFlow 9 Property	Description
Record Generation Mode	Determines the type of values to include in the record. Select one of the following options: Raw Only, Interpreted Only, or Both Raw and Interpreted.
Max Templates in Cache	The maximum number of templates to store in the template cache. For more information about templates, see Caching NetFlow 9 templates. Default is -1 for an unlimited cache size.
Template Cache Timeout (ms)	The maximum number of milliseconds to cache an idle template. Templates unused for more than the specified time are evicted from the cache. For more information about templates, see Caching NetFlow 9 templates. Default is -1 for caching templates indefinitely.
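
The effect of the two cache properties can be sketched in Python. This is a minimal model, not the source's cache: the class name and the template key are hypothetical, and evicting the least recently used template when the cache is full is an assumption, since the eviction order is not specified above. A value of -1 disables the corresponding limit, as in the property defaults.

```python
import time
from collections import OrderedDict


class TemplateCache:
    """Cache with an optional size limit and idle timeout; -1 disables a limit."""

    def __init__(self, max_templates=-1, timeout_ms=-1, clock=time.monotonic):
        self._entries = OrderedDict()  # key -> (template, last_used_seconds)
        self._max = max_templates
        self._timeout_ms = timeout_ms
        self._clock = clock

    def _evict_idle(self, now):
        if self._timeout_ms < 0:
            return
        limit = self._timeout_ms / 1000.0
        # Entries are ordered by last use, so idle ones are at the front.
        while self._entries:
            key, (_, last_used) = next(iter(self._entries.items()))
            if now - last_used <= limit:
                break
            del self._entries[key]

    def put(self, key, template):
        now = self._clock()
        self._evict_idle(now)
        self._entries[key] = (template, now)
        self._entries.move_to_end(key)
        if self._max >= 0:
            while len(self._entries) > self._max:
                self._entries.popitem(last=False)  # assumed policy: drop least recently used

    def get(self, key):
        now = self._clock()
        self._evict_idle(now)
        if key not in self._entries:
            return None
        template, _ = self._entries[key]
        self._entries[key] = (template, now)  # a lookup counts as a use
        self._entries.move_to_end(key)
        return template
```

With both properties left at -1, this model never evicts a template, so the cache grows with the number of distinct templates received.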