TCP Server
The TCP Server origin listens at the specified ports, establishes TCP sessions with clients that initiate TCP connections, and then processes the incoming data. For information about supported versions, see Supported Systems and Versions.
The origin can operate in different modes that determine the messages it can process: NetFlow messages, syslog messages, or the supported Data Collector data formats passed as records separated by specified characters, as character-based data with length prefixes, or in Flume events as Avro messages.
The TCP Server can process data from multiple clients simultaneously, creating separate batches for each client and sending acknowledgements to the originating client after parsing each record or committing each batch. You can configure the origin to use multiple threads to improve performance when processing large volumes of data. On 64-bit Linux systems, you can also enable native epoll transports to further improve performance.
When you configure the TCP Server origin, you specify the ports to use and the TCP mode that indicates the type of data the origin will receive. Then you configure mode-related properties, such as the characters that separate records.
You can optionally configure the acknowledgements that you want to send and the amount of time that the origin waits to receive data before closing the connection. You can also configure SSL/TLS properties, including default transport protocols and cipher suites.
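For example, a client pushing newline-separated JSON records to the origin might look like the following minimal Python sketch. This assumes an origin configured in separated records mode on localhost port 9999 with a newline record separator; the host, port, and separator are illustrative assumptions, not defaults.

import socket

# Connect to the TCP Server origin and push two records. The origin
# parses a record each time it encounters the configured separator.
with socket.create_connection(("localhost", 9999)) as conn:
    conn.sendall(b'{"id": 1, "name": "first"}\n')
    conn.sendall(b'{"id": 2, "name": "second"}\n')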
Multithreaded Processing
The TCP Server origin performs parallel processing and enables the creation of a multithreaded pipeline.
When you enable multithreaded processing, the TCP Server origin uses multiple concurrent threads based on the Number of Receiver Threads property. When you start the pipeline, the origin creates the number of threads specified in the property.
As clients initiate TCP connections, the origin establishes TCP sessions and waits for data. Upon filling a batch, the origin passes the batch to an available pipeline runner.
A pipeline runner is a sourceless pipeline instance - an instance of the pipeline that includes all of the processors, executors, and destinations in the pipeline and handles all pipeline processing after the origin. Each pipeline runner processes one batch at a time, just like a pipeline that runs on a single thread. When the flow of data slows, the pipeline runners wait idly until they are needed, generating an empty batch at regular intervals. You can configure the Runner Idle Time pipeline property to specify the interval or to opt out of empty batch generation.
Multithreaded pipelines preserve the order of records within each batch, just like a single-threaded pipeline. But because batches are processed by different pipeline runners, the order in which batches are written to destinations is not guaranteed.
For example, say you enable multithreaded processing and set the Number of Receiver Threads property to 5. When you start the pipeline, the origin creates five threads, and Data Collector creates a matching number of pipeline runners. Upon receiving data, the origin passes a batch to each of the pipeline runners for processing.
Each pipeline runner performs the processing associated with the rest of the pipeline. After a batch is written to pipeline destinations, the pipeline runner becomes available for another batch of data. Each batch is processed and written as quickly as possible, independently of the batches handled by other pipeline runners, so batches may be written in a different order than they were read.
At any given moment, the five pipeline runners can each process a batch, so this multithreaded pipeline processes up to five batches at a time. When incoming data slows, the pipeline runners sit idle, available for use as soon as the data flow increases.
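To see this from the client side, several clients can hold sessions open concurrently while the origin builds batches per client and hands each batch to whichever pipeline runner is free. A sketch under the same assumptions as above (hypothetical localhost:9999 endpoint, newline separators), with five clients to match the five receiver threads in the example:

import socket
import threading

def send_records(client_id: int) -> None:
    # Each client gets its own TCP session, so the origin creates
    # separate batches for its records.
    with socket.create_connection(("localhost", 9999)) as conn:
        for n in range(100):
            record = f'{{"client": {client_id}, "seq": {n}}}\n'
            conn.sendall(record.encode())

# Five concurrent clients, one per receiver thread in the example.
threads = [threading.Thread(target=send_records, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()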
For more information about multithreaded pipelines, see Multithreaded Pipeline Overview.
Closing Connections for Invalid Data
When the TCP Server origin receives invalid data, it closes the connection to the TCP client that sent the data. It also passes the data to the pipeline for error handling.
For example, when you configure the origin, you specify the maximum record size. When a TCP client sends a message that translates to larger than the maximum record size, the origin disconnects from the client and passes the message to the pipeline for error handling.
Similarly, say the TCP Server origin is configured to process XML data. If the origin receives an invalid XML document, it disconnects from the sending client and passes the data to the pipeline for error handling.
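From the client's side, the disconnect surfaces as a closed or reset socket on the next read or write. A sketch of how a client might observe this, under the same hypothetical localhost:9999 assumptions as the earlier examples:

import socket

with socket.create_connection(("localhost", 9999)) as conn:
    try:
        # Send a message larger than the origin's configured maximum record size.
        conn.sendall(b"x" * 10_000_000 + b"\n")
        if conn.recv(1024) == b"":
            # An empty read means the origin closed the session.
            print("origin closed the connection")
    except (ConnectionResetError, BrokenPipeError):
        print("origin dropped the connection")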
Sending Acknowledgements
You can configure the TCP Server origin to send acknowledgements (acks) to the originating client. The acknowledgement message can be a simple text message, such as "Ack". Or, you can use the expression language to include additional information in the message.
- record processed acknowledgement
- When you configure a record processed acknowledgement, the origin sends an ack each time it parses a record from the incoming data, as shown in the sketch after this list.
- batch completed acknowledgement
- When you configure a batch completed acknowledgement, the origin sends an ack after the pipeline completes processing the batch, that is, after the batch is committed to all destinations.
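For example, a client that sends one separated record and reads the acknowledgement back on the same connection might look like the following sketch, assuming a hypothetical origin on localhost port 9999, a newline record separator, and a record processed acknowledgement configured with the text "Ack":

import socket

# Minimal sketch: send one newline-separated record, then read the ack
# that the origin sends back after parsing it.
with socket.create_connection(("localhost", 9999)) as conn:
    conn.sendall(b'{"id": 42}\n')
    ack = conn.recv(1024)
    print(ack.decode())  # e.g. "Ack"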
Using Expressions in Messages
You can use the Data Collector expression language to create custom acknowledgement messages. You might use expressions to include information about the Data Collector, the pipeline, the record, or the batch in the message.
For example, the following record processed acknowledgement message includes the record ID, the Data Collector host name, and the pipeline title:
${record:value('/id')} was processed by Data Collector: ${sdc:hostname()},
pipeline: ${pipeline:title()}.
The following batch completed acknowledgement message includes the pipeline title, the last record in the batch, and the number of messages in the batch:
Pipeline: ${pipeline:title()} committed a batch whose last record was
${record:value('/transactionID')} and included ${batchSize} messages.
You can set the time zone to use for datetime values returned by expressions. By default, the origin uses UTC.
TCP Modes
The TCP Server origin can operate in the following modes:
- NetFlow messages
- The TCP Server origin can process NetFlow 5 and NetFlow 9 messages. When processing NetFlow messages, the stage generates different records based on the NetFlow version. When processing NetFlow 9, the records are generated based on the NetFlow 9 configuration properties. For more information, see NetFlow Data Processing.
- syslog messages
- The TCP Server origin processes syslog messages in accordance with RFC 6587, except the origin does not support method changes.
- Separated records
- The TCP Server origin can process the supported Data Collector data formats when the data is separated by the specified record separator characters.
- Character data with length prefix
- The TCP Server origin can process the supported Data Collector data formats when passed as character-based data with a length prefix. For an example of the framing, see the sketch after this list.
- Flume Avro IPC Server
- The TCP Server origin can process the supported Data Collector data formats when passed in Flume events as Avro messages from a Flume Avro sink. Use the TCP Server origin instead of the HTTP Server origin to more efficiently process Flume events.
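For the character data with length prefix mode, the following framing sketch assumes RFC 6587-style octet counting, where each message is prefixed with its byte length in decimal followed by a space; verify the exact framing against your origin configuration. The endpoint is again a hypothetical localhost:9999.

import socket

def frame(payload: str) -> bytes:
    # Assumed octet-counting framing: decimal byte count, a space,
    # then the payload itself.
    data = payload.encode("utf-8")
    return str(len(data)).encode("ascii") + b" " + data

with socket.create_connection(("localhost", 9999)) as conn:
    conn.sendall(frame('{"id": 1}'))  # wire bytes: 9 {"id": 1}
    conn.sendall(frame('{"id": 2}'))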
Data Formats
In the separated records or character data with length prefix TCP modes, the TCP Server origin processes data differently based on the data format.
- Avro
- Generates a record for every message. Includes a precision and scale field attribute for each Decimal field.
- Binary
- Generates a record with a single byte array field at the root of the record.
- Delimited
- Generates a record for each delimited line.
- JSON
- Generates a record for each JSON object. You can process JSON files that include multiple JSON objects or a single JSON array.
- Log
- Generates a record for every log line.
- Protobuf
- Generates a record for every protobuf message. By default, the origin assumes messages contain multiple protobuf messages.
- SDC Record
- Generates a record for every record. Use to process records generated by a Data Collector pipeline using the SDC Record data format.
- Text
- Generates a record for each line of text or for each section of text based on a custom delimiter.
- XML
- Generates records based on a user-defined delimiter element. Use an XML element directly under the root element or define a simplified XPath expression. If you do not define a delimiter element, the origin treats the XML file as a single record.