Sources

A source stage represents the source data for the flow. You can use a single source stage in a flow.

To help create or test flows, you can use development sources.

You can use the following sources:
  • Amazon S3 - Reads objects from Amazon S3. Creates multiple threads to enable parallel processing in a multithreaded flow.
  • Amazon SQS Consumer - Reads data from queues in Amazon Simple Queue Services (SQS). Creates multiple threads to enable parallel processing in a multithreaded flow.
  • Aurora PostgreSQL CDC Client - Reads Amazon Aurora PostgreSQL WAL data to generate change data capture records.
  • Azure Blob Storage - Reads data from Microsoft Azure Blob Storage. Creates multiple threads to enable parallel processing in a multithreaded flow.
  • Azure Data Lake Storage Gen2 - Reads data from Microsoft Azure Data Lake Storage Gen2. Creates multiple threads to enable parallel processing in a multithreaded flow. Use this source for new development.
  • Azure Data Lake Storage Gen2 (Legacy) - Reads data from Microsoft Azure Data Lake Storage Gen2. Creates multiple threads to enable parallel processing in a multithreaded flow. Do not use this source for new development.
  • Azure IoT/Event Hub Consumer - Reads data from Microsoft Azure Event Hub. Creates multiple threads to enable parallel processing in a multithreaded flow.
  • CoAP Server - Listens on a CoAP endpoint and processes the contents of all authorized CoAP requests. Creates multiple threads to enable parallel processing in a multithreaded flow.
  • Couchbase - Reads JSON data from Couchbase Server. Creates multiple threads to enable parallel processing in a multithreaded flow.
  • Directory - Reads fully-written files from a directory. Creates multiple threads to enable parallel processing in a multithreaded flow.
  • Elasticsearch - Reads data from an Elasticsearch cluster. Creates multiple threads to enable parallel processing in a multithreaded flow.
  • File Tail - Reads lines of data from an active file after reading related archived files in the directory.
  • Google BigQuery - Executes a query job and reads the result from Google BigQuery.
  • Google Cloud Storage - Reads fully written objects from Google Cloud Storage.
  • Google Pub/Sub Subscriber - Consumes messages from a Google Pub/Sub subscription. Creates multiple threads to enable parallel processing in a multithreaded flow.
  • Groovy Scripting - Runs a Groovy script to create Data Collector records. Can create multiple threads to enable parallel processing in a multithreaded flow.
  • HTTP Client - Reads data from a streaming HTTP resource URL.
  • HTTP Server - Listens on an HTTP endpoint and processes the contents of all authorized HTTP POST and PUT requests. Creates multiple threads to enable parallel processing in a multithreaded flow.
  • IBM Db2 - Reads data from an IBM Db2 database. Creates multiple threads to enable parallel processing in a multithreaded flow.
  • JavaScript Scripting - Runs a JavaScript script to create Data Collector records. Can create multiple threads to enable parallel processing in a multithreaded flow.
  • JDBC Multitable Consumer - Reads database data from multiple tables through a JDBC connection. Creates multiple threads to enable parallel processing in a multithreaded flow.
  • JDBC Query Consumer - Reads database data using a user-defined SQL query through a JDBC connection.
  • Jira - Reads data from a Jira instance
  • JMS Consumer - Reads messages from JMS.
  • Jython Scripting - Runs a Jython script to create Data Collector records. Can create multiple threads to enable parallel processing in a multithreaded flow.
  • Kafka Multitopic Consumer - Reads messages from multiple Kafka topics. Creates multiple threads to enable parallel processing in a multithreaded flow.
  • Kinesis Consumer - Reads data from Kinesis Streams, DynamoDB, and CloudWatch. Creates multiple threads to enable parallel processing in a multithreaded flow.
  • MongoDB Atlas - Reads documents from MongoDB Atlas or MongoDB Enterprise Server.
  • MongoDB Atlas CDC - Reads changes from a MongoDB Change Stream or Oplog.
  • MQTT Subscriber - Subscribes to a topic on an MQTT broker to read messages from the broker.
  • MySQL Binary Log - Reads MySQL binary logs to generate change data capture records.
  • OPC UA Client - Reads data from a OPC UA server.
  • Oracle Bulkload - Reads data from multiple Oracle database tables, then stops the flow. Creates multiple threads to enable parallel processing in a multithreaded flow.
  • Oracle CDC - Processes changed data from Oracle using LogMiner. Use this source for new development.
  • Oracle CDC Client - Processes changed data from Oracle using LogMiner. This is the older Oracle source. Use the Oracle CDC source for new development.
  • Oracle Multitable Consumer - Reads data from multiple Oracle database tables.
  • Oracle XStream - Processes changed data from Oracle using XStream.
  • PostgreSQL CDC Client - Reads PostgreSQL WAL data to generate change data capture records.
  • Pulsar Consumer - Reads messages from Apache Pulsar topics. Creates multiple threads to enable parallel processing in a multithreaded flow.
  • Pulsar Consumer (Legacy) - Reads messages from Apache Pulsar topics.
  • RabbitMQ Consumer - Reads messages from RabbitMQ.
  • Redis Consumer - Reads messages from Redis.
  • REST Service - Listens on an HTTP endpoint, parses the contents of all authorized requests, and sends responses back to the originating REST API. Creates multiple threads to enable parallel processing in a multithreaded flow. Use only in microservice flows.
  • Salesforce - Reads data from Salesforce using the SOAP or Bulk API.
  • Salesforce Bulk API 2.0 - Reads data from Salesforce using Salesforce Bulk API 2.0. Creates multiple threads to enable parallel processing in a multithreaded flow.
  • SAP HANA Query Consumer - Reads data from an SAP HANA database using a user-defined SQL query.
  • SFTP/FTP/FTPS Client - Reads files from an SFTP, FTP, or FTPS server.
  • Snowflake Bulk - Reads data from Snowflake tables, then stops the flow. Creates multiple threads to enable parallel processing in a multithreaded flow.
  • SQL Server CDC Client - Reads data from Microsoft SQL Server CDC tables.
  • SQL Server Change Tracking - Reads data from Microsoft SQL Server change tracking tables and generates the latest version of each record.
  • TCP Server - Listens at the specified ports and processes incoming data over TCP/IP connections. Creates multiple threads to enable parallel processing in a multithreaded flow.

Development sources

To help create or test flows, you can use the following development sources:
  • Dev Data Generator
  • Dev Random Source
  • Dev Raw Data Source

For more information, see Development stages.

Comparing Azure storage sources

We have several Azure storage sources, make sure to use the best one for your needs. Here's a quick breakdown of some key differences:

Source Description
Azure Blob Storage
  • Accesses data using the Microsoft Azure Blob Storage API.
  • Connects to an Azure Blob Storage account using the following format for the Fully Qualified Domain Name (FQDN) of the account:

    <storage account name>.blob.core.windows.net

  • Supports the following Azure authentication methods:
    • OAuth with Service Principal
    • Azure Managed Identity
    • Shared Key
    • SAS Token
  • Processes all data formats, except for Datagram.
  • When archiving successfully processed objects, can copy or move the objects to another container or file system.
  • Can include Azure Blob Storage system-defined and custom metadata in record header attributes. Can also include user-defined metadata.
Azure Data Lake Storage Gen2
  • Accesses data using the Microsoft Azure Data Lake Storage Gen2 API.
  • Connects to an Azure Data Lake Storage Gen2 account using the following format for the Fully Qualified Domain Name (FQDN) of the account:

    <storage account name>.dfs.core.windows.net

  • Supports the following Azure authentication methods:
    • OAuth with Service Principal
    • Azure Managed Identity
    • Shared Key
  • Processes all data formats, except for Binary and Datagram.
  • When archiving successfully processed objects, can only move the objects to the same container or file system.
  • Can include Azure Data Lake Storage Gen2 system-defined and custom metadata in record header attributes.
Azure Data Lake Storage Gen2 (Legacy)
  • Accesses data using the Hadoop FileSystem interface.
  • For all new development, use one of the other Azure storage sources which provide better performance.

Comparing HTTP sources

We have several HTTP sources, make sure to use the best one for your needs. Here's a quick breakdown of some key differences:
Source Description
HTTP Client
  • Initiates HTTP requests for an external system.
  • Processes data synchronously.
  • Processes JSON, text, and XML data.
  • Can process a range of HTTP requests.
  • Can be used in a flow with processors.

HTTP Server
  • Listens for incoming HTTP requests and processes them while the sender waits for confirmation.
  • Processes data synchronously.
  • Creates multithreaded flows, thus suitable for high throughput of incoming data.
  • Processes virtually all data formats. Processes HTTP POST and PUT requests.
  • Can be used in a flow with processors.
Web Client
  • Initiates HTTP requests for an external system.

  • Processes data synchronously.

  • Processes virtually all data formats.

  • Can be configured to process different data formats for request data and response data.

  • Can be configured with per-timeout actions.

  • Can process a range of HTTP requests.

  • Can be used in a flow with processors.

Comparing UDP Source sources

The UDP Source and UDP Multithreaded Source sources are very similar. The main differentiator is that the UDP Multithreaded Source can use multiple threads to process data within the flow.

The UDP Multithreaded Source has a processing queue that aids multithreaded processing. But use of this queue can slow processing under certain circumstances.

The following table describes some cases when you might want to use each source:
Source Ideally Used When
UDP Multithreaded Source
  • Epoll support enables the use of multiple receiver threads to pass data to the flow.
  • Complex flow requires longer processing time.

or

  • Lack of epoll support allows only a single receiver thread to pass data to the flow.
  • High volumes of data.
UDP Source
  • Epoll support enables the use of multiple receiver threads to pass data to the flow.
  • Relatively simple flow enables speedy Data Collector processing.

Comparing WebSocket sources

We have two WebSocket sources, make sure to use the best one for your needs. Here's a quick breakdown of some key differences:

Source Description
WebSocket Client
  • Initiates a connection to a WebSocket server endpoint and then waits for the WebSocket server to push data.
WebSocket Server
  • Listens for incoming WebSocket requests and processes them while the sender waits for confirmation.
  • Creates multithreaded flows, thus suitable for high throughput of incoming data.

Batch size and wait time

For source stages, the batch size determines the maximum number of records sent through the flow at one time. The batch wait time determines the time that the source waits for data before sending a batch. At the end of the wait time, it sends the batch regardless of how many records the batch contains.

For example, a File Tail source is configured for a batch size of 20 records and a batch wait time of 240 seconds. When data arrives quickly, File Tail fills a batch with 20 records and sends it through the flow immediately, creating a new batch and sending it again as soon as it is full. As incoming data slows, a remaining batch contains a few records, gaining an extra record periodically. 240 seconds after creating the batch, File Tail sends the partially-full batch through the flow. It immediately creates a new batch and starts a new countdown.

Configure the batch wait time based on your processing needs. You might reduce the batch wait time to ensure all data is processed within a specified time frame or to make regular contact with flow targets. Use the default or increase the wait time if you prefer not to process partial or empty batches.

Maximum record size

Most data formats have a property that limits the maximum size of the record that a source can parse. For example, the delimited data format has a Max Record Length property, the JSON data format has Max Object Length, and the text data format has Max Line Length.

When the source processes data that is larger than the specified length, the behavior differs based on the source and the data format. For example, with some data formats, oversized records are handled based on the record error handling configured for the source. While in other data formats, the source might truncate the data. For details on how a source handles size overruns for each data format, see the "Data Formats" section of the source documentation.

When available, the maximum record size properties are limited by the Data Collector parser buffer size, which is 1048576 bytes by default. So, when raising the maximum record size property in the source does not change the source's behavior, you might need to increase the Data Collector parser buffer size by configuring the parser.limit property in the Data Collector configuration properties.

Note that most of the maximum record size properties are specified in characters, while the Data Collector limit is defined in bytes.