SFTP/FTP/FTPS Client
The SFTP/FTP/FTPS Client origin reads files from a server using the Secure File Transfer Protocol (SFTP), File Transfer Protocol (FTP), or FTP Secure (FTPS) protocol. For information about supported versions, see Supported Systems and Versions.
The origin uses the file name as the offset and does not reprocess a file with a changed timestamp. As a result, the files to be processed must be fully written. The origin does not support reading data from an active file that is still being written to.
When you configure the SFTP/FTP/FTPS Client origin, you specify the protocol to use and the URL where the files reside on the remote server. You can also use a connection to configure the origin. You can specify whether to process files in subdirectories, a file name pattern, and the first file to process. You can use glob patterns or regular expressions to define the file name pattern that you want to use.
When needed, you can connect to the server through an HTTP or SOCKS proxy. You can also specify a file processing delay.
If the server requires authentication, configure the credentials for the protocol you are using. For the SFTP protocol, the origin can require that the server be listed in a known hosts file. For the FTPS protocol, the origin can authenticate with the server using a client certificate and can authenticate the certificate from the FTPS server.
You can configure the origin to download files to an archive directory if the origin encounters errors while reading the files.
The origin can generate events for an event stream. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
After processing a file, the origin can keep, archive, or delete the file.
File Name Pattern and Mode
Use a file name pattern to define the files that the SFTP/FTP/FTPS Client origin processes. You can use either a glob pattern or a regular expression to define the file name pattern.
The SFTP/FTP/FTPS Client origin processes files at the specified path based on the file name pattern mode and file name pattern. When processing subdirectories, the origin uses the same pattern to locate file names in the subdirectories. The origin does not use the pattern to locate subdirectories.
When you specify a glob pattern, you can use UNIX-style wildcards, such as * or ?. For example, the pattern ??a represents three-character file names that end with a. The pattern *.txt represents file names of one or more characters ending with .txt.
In a glob pattern, you cannot use a tilde (~) or slash (/). You cannot use a period (.) at the beginning of the pattern. The origin treats a period as a literal in other spots in the pattern.
The origin processes files in order based on the specified read order.
For more information about glob syntax, see the Oracle Java documentation. For more information about regular expressions, see Regular Expressions Overview.
Default is *, which processes all files.
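The two glob examples above can be checked with a minimal sketch. It uses Python's `fnmatch` module as a stand-in, which is an assumption: the origin relies on Java glob syntax, and the two implementations may differ on edge cases. The file names in `names` are made up for illustration.

```python
from fnmatch import fnmatchcase


def matches(pattern: str, file_name: str) -> bool:
    """Return True when file_name matches the glob pattern (case-sensitive)."""
    return fnmatchcase(file_name, pattern)


# Hypothetical file names, used only to exercise the patterns.
names = ["aba", "abc", "notes.txt", "notes.csv", "a"]

# ??a: exactly three characters, the last one being "a".
three_char_ending_in_a = [n for n in names if matches("??a", n)]

# *.txt: any name ending with ".txt".
text_files = [n for n in names if matches("*.txt", n)]

print(three_char_ending_in_a)  # ['aba']
print(text_files)              # ['notes.txt']
```

Trying a pattern against a handful of sample names this way is a quick check before configuring the origin, but the result should still be confirmed against the actual server.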
Read Order
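The origin reads files in ascending order based on the last-modified timestamp. Some copy commands preserve the existing timestamp of a file, such as cp -p. Preserving the existing timestamp can be problematic in some cases, such as moving files across time zones.
When ordering based on timestamp, any files with the same timestamp are read in lexicographically ascending order based on the file names.
For example, with the log*.json file name pattern, the origin reads the following files in the following order: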
File Name | Last Modified Timestamp |
---|---|
log-1.json | APR 24 2016 14:03:35 |
log-0054.json | APR 24 2016 14:05:03 |
log-0055.json | APR 24 2016 14:45:11 |
log-2.json | APR 24 2016 14:45:11 |
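The ordering rule can be reproduced with a short sketch: sort by last-modified timestamp, then break ties by file name. The file names and timestamps come from the table above. Everything else (the tuple layout, the variable names) is an illustrative assumption, not the origin's internal implementation.

```python
from datetime import datetime

# (file name, last-modified timestamp), deliberately listed out of order.
files = [
    ("log-2.json", datetime(2016, 4, 24, 14, 45, 11)),
    ("log-0055.json", datetime(2016, 4, 24, 14, 45, 11)),
    ("log-1.json", datetime(2016, 4, 24, 14, 3, 35)),
    ("log-0054.json", datetime(2016, 4, 24, 14, 5, 3)),
]

# Primary key: timestamp. Secondary key: file name, compared lexicographically.
ordered = sorted(files, key=lambda entry: (entry[1], entry[0]))
read_order = [name for name, _ in ordered]

print(read_order)
# ['log-1.json', 'log-0054.json', 'log-0055.json', 'log-2.json']
```

Note that log-0055.json sorts before log-2.json because the comparison is on characters ("0" precedes "2"), not on the numeric value in the name.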
First File for Processing
Configure a first file for processing when you want the SFTP/FTP/FTPS Client origin to ignore one or more existing files in the directory.
When you define a first file to process, the origin starts processing with the specified file and continues processing files in the expected read order: files that match the file name pattern in ascending order based on the last-modified timestamp.
When you do not specify a first file, the origin processes the files in the directory that match the file name pattern, starting with the earliest file and continuing in ascending order.
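As a rough sketch of this behavior, the function below sorts files in the read order and then skips everything that precedes the first file. The function name, the file names, and the integer timestamps are hypothetical. Raising an error when the first file is missing is an assumption of the sketch, since the text above does not describe that case.

```python
def files_to_process(files, first_file=None):
    """Return file names in read order, optionally starting at first_file.

    files is a list of (name, last_modified) tuples.
    """
    ordered = sorted(files, key=lambda entry: (entry[1], entry[0]))
    names = [name for name, _ in ordered]
    if first_file is None:
        return names
    # Assumption: a first file that is not in the directory is an error.
    start = names.index(first_file)  # raises ValueError if absent
    return names[start:]


# Hypothetical directory listing with epoch-second timestamps.
listing = [("c.csv", 300), ("a.csv", 100), ("b.csv", 200)]

print(files_to_process(listing))           # ['a.csv', 'b.csv', 'c.csv']
print(files_to_process(listing, "b.csv"))  # ['b.csv', 'c.csv']
```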
Credentials
The SFTP/FTP/FTPS Client origin can use several methods to authenticate with the remote server. From the Credentials tab, configure the authentication required by the remote server.
Authentication options differ for each protocol:
- For all protocols, select an authentication method to log in to the remote server. Choose the method based on the protocol and remote server requirements:
  - None - The stage does not authenticate with the server.
  - Password - The stage authenticates with the server using a user name and password. You must specify the user name and password.
  - Private key - The stage authenticates using a private key. Use only with the SFTP protocol. You must specify the private key, either in a local file or in plain text.
- For the SFTP protocol, the stage can require that the server be listed in a known hosts file. You must specify the path to the known hosts file that contains the host keys for the approved SFTP servers.
- For the FTPS protocol, the stage can use certificates to authenticate with the server. You must specify the keystore file and password. You can also configure the stage to authenticate the server by specifying a truststore provider. For more information about keystores and truststores, see Keystore and Truststore Configuration.
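The rules in the list above can be summarized as a small validation sketch. This is not the stage's own validation logic. The function name and every settings key (`username`, `private_key`, `require_known_host`, `known_hosts_file`, `use_client_certificate`, `keystore_file`, `keystore_password`) are hypothetical names chosen for illustration.

```python
def credential_problems(protocol, auth_method, settings):
    """Return a list of problems; an empty list means the combination looks valid."""
    problems = []
    if auth_method not in ("none", "password", "private_key"):
        problems.append(f"unknown authentication method: {auth_method}")
    if auth_method == "password":
        # Password authentication needs both a user name and a password.
        for field in ("username", "password"):
            if not settings.get(field):
                problems.append(f"{field} is required for password authentication")
    if auth_method == "private_key":
        # Private key authentication applies to SFTP only.
        if protocol != "SFTP":
            problems.append("private key authentication is only for SFTP")
        if not settings.get("private_key"):
            problems.append("private_key is required for private key authentication")
    if protocol == "SFTP" and settings.get("require_known_host"):
        if not settings.get("known_hosts_file"):
            problems.append("known_hosts_file is required when the server must be a known host")
    if protocol == "FTPS" and settings.get("use_client_certificate"):
        # Client certificates need a keystore file and its password.
        for field in ("keystore_file", "keystore_password"):
            if not settings.get(field):
                problems.append(f"{field} is required for FTPS client certificates")
    return problems


print(credential_problems("FTP", "private_key", {"private_key": "id_rsa"}))
# ['private key authentication is only for SFTP']
```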
Record Header Attributes
The SFTP/FTP/FTPS Client origin creates record header attributes that include information about the originating file for the record. When the origin processes Avro data, it includes the Avro schema in an avroSchema record header attribute.
You can use the record:attribute or record:attributeOrDefault functions to access the information in the attributes. For more information about working with record header attributes, see Working with Header Attributes.
- avroSchema - When processing Avro data, provides the Avro schema.
- filename - Provides the name of the file where the record originated.
- file - Provides the file path and file name where the record originated.
- mtime - Provides the last-modified time for the file.
- remoteUri - Provides the resource URL used by the stage.
Event Generation
The SFTP/FTP/FTPS Client origin can generate events that you can use in an event stream. When you enable event generation, the origin generates event records each time the origin starts or completes reading a file. It can also generate events when it completes processing all available data and the configured batch wait time has elapsed.
You can use SFTP/FTP/FTPS Client events in several ways. For example:
- With the SFTP/FTP/FTPS Client executor to move a file after processing it.
  For an example, see Managing Output Files.
- With the Pipeline Finisher executor to stop the pipeline and transition the pipeline to a Finished state when the origin completes processing available data.
  When you restart a pipeline stopped by the Pipeline Finisher executor, the origin continues processing from the last-saved offset unless you reset the origin.
  For an example, see Stopping a Pipeline After Processing All Available Data.
- With the Email executor to send a custom email after receiving an event.
  For an example, see Sending Email During Pipeline Processing.
- With a destination to store event information.
  For an example, see Preserving an Audit Trail of Events.
For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Event Records
Record Header Attribute | Description |
---|---|
sdc.event.type | Event type. Uses one of the following types: new-file, finished-file, no-more-data. |
sdc.event.version | Integer that indicates the version of the event record type. |
sdc.event.creation_timestamp | Epoch timestamp when the stage created the event. |
The SFTP/FTP/FTPS Client origin can generate the following types of event records:
- new-file
- The SFTP/FTP/FTPS Client origin generates a new-file event record when it starts processing a new file.
- finished-file
- The SFTP/FTP/FTPS Client origin generates a finished-file event record when it finishes processing a file.
- no-more-data
- The SFTP/FTP/FTPS Client origin generates a no-more-data event record when the origin completes processing all available records and the number of seconds configured for Batch Wait Time elapses without any new files appearing to be processed.
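To illustrate how an event stream might branch on the event type, here is a minimal sketch that reads the `sdc.event.type` header attribute and picks a follow-up action. The attribute names and the three event types come from this section. The action labels, the function name, and the sample header values are hypothetical, and in a real pipeline this routing would be done with pipeline stages rather than Python.

```python
# Hypothetical mapping from event type to a follow-up action.
EVENT_ACTIONS = {
    "new-file": "store event for audit trail",
    "finished-file": "move processed file",
    "no-more-data": "stop pipeline",
}


def action_for_event(header):
    """Pick an action based on the sdc.event.type record header attribute."""
    event_type = header["sdc.event.type"]
    if event_type not in EVENT_ACTIONS:
        raise ValueError(f"unexpected event type: {event_type}")
    return EVENT_ACTIONS[event_type]


# Made-up header values for illustration only.
sample_header = {
    "sdc.event.type": "no-more-data",
    "sdc.event.version": 1,
    "sdc.event.creation_timestamp": 1461508511000,
}

print(action_for_event(sample_header))  # stop pipeline
```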
Post Processing
After processing files in data formats other than whole file, the SFTP/FTP/FTPS Client origin can keep, archive, or delete the files.
When archiving files, the origin moves them to a directory of the form <archive>/<source>, where:
- <archive> is the archive directory specified on the Post Processing tab. You can specify an absolute path to the archive directory or a path relative to the home directory of the user that the origin logs in as.
- <source> is included when processing subdirectories. The origin creates a source directory that matches each subdirectory processed.
For example, suppose you have files in the /home/data/orders directory on a remote host. You configure the origin to read files from the /home/data directory and its subdirectories. You also configure the origin to archive processed files to the /home/archive directory. After processing the files, the origin moves the files to the /home/archive/orders directory.
Note that your choice to specify the archive directory relative to the user's home directory is independent of your choice to specify the original location of the files relative to the user's home directory.
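The archive location in the example above can be derived with a short sketch. It assumes absolute POSIX-style paths and a hypothetical function name and file name (`2016-04.csv`). It only mirrors the example, not the origin's actual implementation.

```python
import posixpath


def archive_directory(read_dir, file_path, archive_dir):
    """Return the directory that a processed file is archived to.

    The subdirectory of the file relative to read_dir is recreated
    under archive_dir.
    """
    source_subdir = posixpath.relpath(posixpath.dirname(file_path), read_dir)
    if source_subdir == ".":
        # The file sits directly in the directory that is read.
        return archive_dir
    return posixpath.join(archive_dir, source_subdir)


print(archive_directory("/home/data", "/home/data/orders/2016-04.csv", "/home/archive"))
# /home/archive/orders
```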
Data Formats
The SFTP/FTP/FTPS Client origin processes data differently based on the data format. SFTP/FTP/FTPS Client processes the following types of data:
- Avro
- Generates a record for every Avro record. Includes a precision and scale field attribute for each Decimal field.
- Delimited
- Generates a record for each delimited line.
- Excel
- Generates a record for every row in the file. Can process .xls or .xlsx files.
You can configure the origin to read from all sheets in a workbook or from particular sheets in a workbook. You can specify whether files include a header row and whether to ignore the header row. You can also configure the origin to skip cells that do not have a corresponding header value. A header row must be the first row of a file. Vertical header columns are not recognized.
The origin cannot process Excel files with large numbers of rows. You can save such files as CSV files in Excel, and then use the origin to process with the delimited data format.
- JSON
- Generates a record for each JSON object. You can process JSON files that include multiple JSON objects or a single JSON array.
- Log
- Generates a record for every log line.
- Protobuf
- Generates a record for every protobuf message.
- SDC Record
- Generates a record for every record. Use to process records generated by a Data Collector pipeline using the SDC Record data format.
- Text
- Generates a record for each line of text or for each section of text based on a custom delimiter.
- Whole File
- Streams whole files from the origin system to the destination system. You can specify a transfer rate or use all available resources to perform the transfer.
- XML
- Generates records based on a user-defined delimiter element. Use an XML element directly under the root element or define a simplified XPath expression. If you do not define a delimiter element, the origin treats the XML file as a single record.
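As one concrete illustration of the formats above, the JSON case (multiple JSON objects, or a single JSON array) can be sketched with Python's standard `json` module. The function name is hypothetical. Treating each element of a top-level array as one record is an assumption of this sketch, and it does not describe how the origin parses files internally.

```python
import json


def iter_json_records(text):
    """Yield one record per JSON object found in text.

    Handles several concatenated or newline-separated objects, and
    (by assumption) yields each element of a top-level array separately.
    """
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while pos < end:
        # Skip whitespace between values; raw_decode does not accept it.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        value, pos = decoder.raw_decode(text, pos)
        if isinstance(value, list):
            yield from value
        else:
            yield value


multiple_objects = '{"id": 1}\n{"id": 2}\n'
single_array = '[{"id": 1}, {"id": 2}]'

print(list(iter_json_records(multiple_objects)))  # [{'id': 1}, {'id': 2}]
print(list(iter_json_records(single_array)))      # [{'id': 1}, {'id': 2}]
```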
Configuring an SFTP/FTP/FTPS Client Origin
Configure an SFTP/FTP/FTPS Client origin to read files from an SFTP, FTP, or FTPS server.