Properties reference: File connector
This topic lists all properties that you can set to configure the stage.
Connection
For more information, see the Defining a connection topic.
- File system
- Select the file system to read files from or write files to.
- Type: selection
- Default: Local
- Values:
- Local
- WebHDFS
- HttpFS
- NativeHDFS
- Use custom URL
- Select Yes to use a custom URL instead of one that is generated based on the Use SSL (HTTPS), Host, and Port properties.
- Type: boolean
- Default: false
- Use SSL (HTTPS)
- Select Yes to use Secure Sockets Layer (HTTPS)
- Type: boolean
- SSL Truststore certificates
- SSL truststore (X.509) certificates in the PEM format.
- Type: string
- Use Kerberos
- Select Yes to use Kerberos
- Type: boolean
- Use keytab
- Select Yes to use keytab instead of password for Kerberos login
- Type: boolean
- Host
- Enter the name of the host.
- Type: string
- Port
- Enter the port to connect to.
- Type: integer
- Service principal
- Enter the service principal for the host. Use this property when the realm of the host differs from the realm of the user. The service principal for the web server has the format <B>HTTP/FQDN@REALM</B>.
- Type: string
- User name
- Enter the name of the user to connect as.
- Type: string
- Password
- Enter the password for the specified user.
- Type: protected string
- Keytab
- Enter the fully qualified path of the keytab for the specified user.
- Type: string
- Custom URL
- Enter the base URL (http or https) for the WebHDFS or gateway server
- Type: string
- Use Proxy
- Select Yes to use a proxy server.
- Type: boolean
- Default: false
- Proxy host
- Enter the host name of the proxy server.
- Type: string
- Proxy port
- Enter the port for the proxy server.
- Type: integer
- Proxy user name
- Enter the name of the user to connect to the proxy as.
- Type: string
- Proxy password
- Enter the password for the specified user
- Type: protected string
- HDFS HA connection options
- Use this property when <B>HDFS high availability</B> is enabled on the Hadoop cluster and the standby namenode and nameservice details need to be configured. HDFS high availability is supported only for the <B>NativeHDFS</B> and <B>WebHDFS</B> file system modes.
- Type: boolean
- Default: false
- Nameservice ID
- Use this property to specify the HDFS HA nameservice ID
- Type: string
- Default: hadoop.ha.nameservice
- Standby namenode(s)
- Use this property to specify a list of the standby namenode(s) along with their port details, separated by semicolons. Separate each host and port with a colon, as shown in the example: <B>HOST1:PORT1;HOST2:PORT2</B>
- Type: string
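The <B>HOST1:PORT1;HOST2:PORT2</B> value format above can be sketched as follows. This is an illustrative helper, not part of the connector; the host names are made up.

```python
def parse_standby_namenodes(value: str) -> list[tuple[str, int]]:
    """Split a semicolon-separated list of HOST:PORT entries."""
    nodes = []
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing semicolon
        host, _, port = entry.rpartition(":")
        nodes.append((host, int(port)))
    return nodes

print(parse_standby_namenodes("nn2.example.com:50070;nn3.example.com:50070"))
# → [('nn2.example.com', 50070), ('nn3.example.com', 50070)]
```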
- Standby namenode service principal
- Use this property to specify the service principal for the host on which the standby namenode is configured. The service principal is required when the realm of the standby namenode host differs from the realm of the user principal that accesses HDFS. The service principal for the web server has the format <B>HTTP/FQDN@REALM</B>.
- Type: string
- Advanced configuration options
- Use the properties in this category to set any advanced configuration options that need to be set for any of the supported file system modes.
- Type: category
- HDFS client parameters
- Use this property to specify HDFS native client configuration parameters as key=value pairs. When jobs run in the Hadoop cluster, these parameters are typically set in core-site.xml or hdfs-site.xml. When you use the NativeHDFS file system mode, you can pass additional configuration parameters to the HDFS client through this property. Separate the parameters with semicolons, as shown in the example: hadoop.rpc.protection=privacy;hadoop.tmp.dir=logdir;
- Type: string
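The key=value list format described above can be illustrated with a small sketch. The parameter names come from the documented example; the helper itself is hypothetical.

```python
def parse_hdfs_client_params(value: str) -> dict[str, str]:
    """Parse 'key=value' pairs separated by semicolons into a dict."""
    params = {}
    for pair in value.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # the documented example ends with a semicolon
        key, _, val = pair.partition("=")
        params[key.strip()] = val.strip()
    return params

print(parse_hdfs_client_params(
    "hadoop.rpc.protection=privacy;hadoop.tmp.dir=logdir;"))
# → {'hadoop.rpc.protection': 'privacy', 'hadoop.tmp.dir': 'logdir'}
```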
Usage
- Is reject link
- Select <B>Yes</B> if the link is configured to carry rejected data from the source stage.
- Type: boolean
- Default: false
- Write mode
- Select Write single file to write a file per node, select Write multiple files to write multiple files per node (based on size and/or key value), or select Delete to delete files.
- Type: selection
- Default: Write single file
- Values:
- Write single file
- Write multiple files
- Delete
- Read mode
- Select <B>Read single file</B> to read from a single file or <B>Read multiple files</B> to read from the files that match a specified file prefix.
- Type: selection
- Default: Read multiple files
- Values:
- Read single file
- Read multiple files
- Exclude files
- Specify a comma-separated list of file prefixes to exclude from the files that are read. If a prefix includes a comma, escape the comma by using a backslash (\).
- Type: string
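The comma-escaping rule above can be sketched as follows. The prefixes are illustrative, and this helper only demonstrates one plausible way to honor the backslash escape.

```python
import re

def split_exclude_prefixes(value: str) -> list[str]:
    """Split on commas that are not preceded by a backslash,
    then unescape '\\,' back to a literal comma."""
    parts = re.split(r"(?<!\\),", value)
    return [p.replace("\\,", ",") for p in parts]

print(split_exclude_prefixes(r"tmp_,backup\,old,log_"))
# → ['tmp_', 'backup,old', 'log_']
```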
- Wave handling properties
- Use the properties in this category to define how data is handled when it is streamed as waves (that is, in batches) from the upstream stage. Typically, these properties are required for source stages that are configured to send data in waves.
- Type: category
- Append unique identifier
- Use this property to specify whether a unique identifier is appended to the file name. When this property is set to Yes, the unique identifier is appended to the file name, and a new file is written for every wave of data that is streamed into the stage. When this property is set to No, the file is overwritten on every wave.
- Type: boolean
- Default: false
- File size threshold
- Specify the threshold for the file size in megabytes. Processing nodes start a new file each time the size exceeds the specified threshold and the wave boundary is reached. Because the file is written only on the wave boundary, the threshold is a soft limit; the actual size of the file can exceed the threshold, depending on the size of the wave.
- Type: integer
- Default: 1
- If file exists
- Specify what the connector does when it tries to write a file that already exists. Select <B>Overwrite file</B> to overwrite the existing file, <B>Do not overwrite file</B> to leave the existing file unchanged and stop the job, or <B>Fail</B> to stop the job with an error message.
- Type: selection
- Default: Overwrite file
- Values:
- Overwrite file
- Do not overwrite file
- Fail
- Split file on key changes
- Select Yes to create a new file when the key column value changes. The data must be sorted and partitioned for this option to work properly.
- Type: boolean
- Default: false
- Key column
- Specify the key column to use for splitting files. If not specified, the connector will use the first key column on the link.
- Type: string
- Case sensitive
- Select Yes to make the key value case sensitive.
- Type: boolean
- Default: false
- Use key value in file name
- Select Yes to use the key value in the generated file name.
- Type: boolean
- Default: false
- Exclude partition string
- Select Yes to exclude the partition string each processing node appends to the file name.
- Type: boolean
- Default: false
- Maximum file size
- Specify the maximum file size in megabytes. Processing nodes will start a new file each time the size exceeds this value.
- Type: integer
- Default: 0
- Force sequential
- Select Yes to run the connector sequentially on one node.
- Type: boolean
- Default: false
- Reject mode
- Specify what the connector does when a record that contains invalid data is found in the source file. Select <B>Continue</B> to read the rest of the file, <B>Fail</B> to stop the job with an error message, or <B>Reject</B> to send the rejected data to a reject link.
- Type: selection
- Default: Continue
- Values:
- Continue
- Fail
- Reject
- Cleanup on failure
- If a job fails, select whether the connector deletes the file or files that have been created.
- Type: boolean
- Default: true
- File name column
- Specify the name of the column to write the source file name to.
- Type: string
- File format
- Specify the format of the files to read or write. The implicit file format specifies that the input to the file connector is in binary or string format without a delimiter.
- Type: selection
- Default: Delimited
- Values:
- Delimited
- Comma-separated value (CSV)
- Avro
- Implicit
- orc
- parquet
- sequencefile
- Avro format properties
- Avro configuration properties
- Type: category
- Output as JSON
- Specify whether each row in the Avro file is exported as JSON to a string column.
- Type: boolean
- Default: false
- Avro format properties
- Avro configuration properties
- Type: category
- Input as JSON
- Specify whether each row in the Avro file is imported from a JSON string.
- Type: boolean
- Default: false
- Avro schema file
- Specify the fully qualified path for a JSON file that defines the schema for the Avro file.
- Type: string
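The schema file is a JSON document in the Avro schema format. The sketch below writes a minimal example to a temporary location; the record and field names are made up for illustration.

```python
import json
import os
import tempfile

# Illustrative Avro schema; the record and field names are hypothetical.
schema = {
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        # A union with "null" makes the field nullable.
        {"name": "balance", "type": ["null", "double"], "default": None},
    ],
}

path = os.path.join(tempfile.gettempdir(), "customer.avsc")
with open(path, "w") as f:
    json.dump(schema, f, indent=2)
print(path)  # pass this fully qualified path as the Avro schema file
```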
- Avro compression codec
- Specify the compression algorithm that will be used to compress the data.
- Type: selection
- Default: None
- Values:
- None
- Deflate
- Snappy
- Bzip2
- Array keys
- If the file format is Avro in a target stage, normalization is controlled through array keys. Specify <B>ITERATE()</B> in the description of the corresponding array field in the column definitions on the input tab of the file connector.
- Type: string
- ORC Settings
- ORC configuration properties
- Type: category
- Stripe Size
- Specify the stripe size for the ORC file.
- Type: integer
- Default: 100000
- Buffer Size
- Specify the buffer size for the ORC file.
- Type: integer
- Default: 10000
- Compression Kind
- Specify the compression mechanism.
- Type: selection
- Default: SNAPPY
- Values:
- NONE
- ZLIB
- SNAPPY
- Parquet settings
- Parquet configuration properties
- Type: category
- Block size
- Specify the Parquet block size.
- Type: integer
- Default: 10000000
- Page size
- Specify the Parquet page size.
- Type: integer
- Default: 10000
- Compression type
- Specify the compression mechanism.
- Type: selection
- Default: SNAPPY
- Values:
- NONE
- SNAPPY
- GZIP
- LZO
- Delimited format properties
- Specify the file syntax for delimited files.
- Type: category
- Record limit
- Specify the maximum number of records to read from the file per node. If a value is not specified for this property, the entire file is read.
- Type: integer
- Encoding
- Specify the encoding of the files to read or write, for example, UTF-8.
- Type: string
- Include byte order mark
- Specify whether to include a byte order mark in the file when the file encoding is a Unicode encoding such as UTF-8, UTF-16, or UTF-32.
- Type: boolean
- Default: false
- Record definition
- Select whether the record definition is provided to the connector from the source file, a delimited string, a file that contains a delimited string, or a schema file. When runtime column propagation is enabled, this metadata provides the column definitions. If a schema file is provided, the schema file overrides the values of formatting properties in the stage and the column definitions that are specified on the Columns page of the output link.
- Type: selection
- Default: None
- Values:
- None
- File header
- Delimited string
- Delimited string in a file
- Schema file
- Infer schema
- Definition source
- If the record definition is a delimited string, enter a delimited string that specifies the names and data types of the fields. Use the format name:data_type, and separate each field with the delimiter that is specified in the <B>Field delimiter</B> property. If the record definition is in a delimited string file or an Osh schema file, specify the full path of the file.
- Type: string
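The name:data_type format can be sketched as follows. The field names and data type names are illustrative, and the helper assumes the default comma field delimiter.

```python
def parse_definition_source(defn: str, field_delim: str = ",") -> list[tuple[str, str]]:
    """Parse a 'name:data_type' list separated by the field delimiter."""
    fields = []
    for item in defn.split(field_delim):
        name, _, dtype = item.strip().partition(":")
        fields.append((name, dtype))
    return fields

print(parse_definition_source("id:int32,name:string,amount:decimal"))
# → [('id', 'int32'), ('name', 'string'), ('amount', 'decimal')]
```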
- First row is header
- Select <B>Yes</B> if the first row of the file contains field headers and is not part of the data. If you select <B>Yes</B>, when the connector writes data, the field names will be the first row of the output. If runtime column propagation is enabled, metadata can be obtained from the first row of the file.
- Type: boolean
- Default: false
- Include data types
- Select <B>Yes</B> to append the data type to each field name that the connector writes in the first row of the output.
- Type: boolean
- Default: false
- Field delimiter
- Specify a string or one of the following values: <NL>, <CR>, <LF>, <TAB>. The string can include Unicode escape strings in the form \uNNNN where NNNN is the Unicode character code.
- Type: string
- Default: ,
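One plausible way to interpret a delimiter value as described above is sketched below. The mapping of the named tokens to control characters is an assumption for illustration (for example, <NL> is assumed to mean a newline).

```python
# Assumed mapping of the named delimiter tokens to control characters.
NAMED = {"<NL>": "\n", "<CR>": "\r", "<LF>": "\n", "<TAB>": "\t"}

def resolve_delimiter(value: str) -> str:
    """Resolve a delimiter property value to the actual character(s)."""
    if value in NAMED:
        return NAMED[value]
    # Decode Unicode escapes such as \u007C (the '|' character).
    return value.encode("ascii").decode("unicode_escape")

print(repr(resolve_delimiter("<TAB>")))    # '\t'
print(repr(resolve_delimiter(r"\u007C")))  # '|'
```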
- Row delimiter
- Specify a string or one of the following values: <NL>, <CR>, <LF>, <TAB>. The string can include Unicode escape strings in the form \uNNNN where NNNN is the Unicode character code.
- Type: string
- Default: <NL>
- Escape character
- Specify the character to use to escape field and row delimiters. If an escape character exists in the data, the escape character is also escaped. Because escape characters require additional processing, do not specify a value for this property if you do not need to include escape characters in the data.
- Type: string
- Quotation mark
- Type: selection
- Default: None
- Values:
- None
- Double
- Single
- Null value
- Specify the character or string that represents null values in the data. For a source stage, input data that has the value that you specify is set to null on the output link. For a target stage, in the output file that is written to the file system, null values are represented by the value that is specified for this property. To specify that an empty string represents a null value, specify "" (two double quotation marks).
- Type: string
- Field formats
- Type: category
- Decimal format
- Specify a string that defines the format for fields that have the Decimal or Numeric data type.
- Type: string
- Date format
- Specify a string that defines the format for fields that have the Date data type.
- Type: string
- Time format
- Specify a string that defines the format for fields that have the Time data type.
- Type: string
- Timestamp format
- Specify a string that defines the format for fields that have the Timestamp data type.
- Type: string
- Implicit format properties
- Specify the file syntax for implicit files.
- Type: category
- Data format
- Specify the type of implicit file
- Type: selection
- Default: Binary
- Values:
- Binary
- Record limit
- Specify the maximum number of records to read from the file per node. If a value is not specified for this property, the entire file is read.
- Type: integer
- Encoding
- Specify the encoding of the files to read or write, for example, UTF-8.
- Type: string
- Include byte order mark
- Specify whether to include a byte order mark in the file when the file encoding is a Unicode encoding such as UTF-8, UTF-16, or UTF-32.
- Type: boolean
- Default: false
- Record definition
- Select whether the record definition is provided to the connector from the source file, a delimited string, a file that contains a delimited string, or a schema file. When runtime column propagation is enabled, this metadata provides the column definitions. If a schema file is provided, the schema file overrides the values of formatting properties in the stage and the column definitions that are specified on the Columns page of the output link.
- Type: selection
- Default: None
- Values:
- None
- File header
- Delimited string
- Delimited string in a file
- Schema file
- Definition source
- Enter a delimited string that specifies the name, data type, and length of each field. Use the format name:data_type[length], and separate each field with the delimiter that is specified in the <B>Field delimiter</B> property. If the record definition is in a delimited string file or an Osh schema file, specify the full path of the file.
- Type: string
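The name:data_type[length] format can be sketched as follows. The field and type names are illustrative; the helper assumes the default comma delimiter.

```python
import re

# One plausible grammar for the 'name:data_type[length]' field definition.
FIELD_RE = re.compile(r"(\w+):(\w+)\[(\d+)\]")

def parse_implicit_fields(defn: str, delim: str = ",") -> list[tuple[str, str, int]]:
    """Parse fields of the form name:data_type[length]."""
    fields = []
    for item in defn.split(delim):
        m = FIELD_RE.fullmatch(item.strip())
        if not m:
            raise ValueError(f"bad field definition: {item!r}")
        name, dtype, length = m.groups()
        fields.append((name, dtype, int(length)))
    return fields

print(parse_implicit_fields("id:int32[4],name:string[20]"))
# → [('id', 'int32', 4), ('name', 'string', 20)]
```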
- First row is header
- Select <B>Yes</B> if the first row of the file contains field headers and is not part of the data. If you select <B>Yes</B>, when the connector writes data, the field names will be the first row of the output. If runtime column propagation is enabled, metadata can be obtained from the first row of the file.
- Type: boolean
- Default: false
- Include data types
- Select <B>Yes</B> to append the data type to each field name that the connector writes in the first row of the output.
- Type: boolean
- Default: false
- Trace file
- Specify the full path to a file to contain trace information from the parser for delimited files. Because writing to a trace file requires additional processing, specify a value for this property only during job development.
- Type: string
- Create or Use existing hive table
- Select <B>Yes</B> to create or use an existing Hive table after data has been loaded to HDFS.
- Type: boolean
- Default: false
- Use staging table
- Select <B>Yes</B> to use a staging table. This option is enabled only when <B>File format</B> is <B>Delimited</B>.
- Type: boolean
- Default: false
- Target table properties
- Type: category
- Table format
- Use this property to set the format of the target table
- Type: selection
- Default: parquet
- Values:
- orc
- parquet
- ORC compression
- Use this property to set the compression type for the target table when the table format is ORC
- Type: selection
- Default: ZLIB
- Values:
- NONE
- ZLIB
- SNAPPY
- Parquet compression
- Use this property to set the compression type for the target table when the table format is Parquet
- Type: selection
- Default: SNAPPY
- Values:
- NONE
- SNAPPY
- GZIP
- LZO
- Stripe size
- Specify the stripe size for the target table.
- Type: integer
- Default: 64
- Table type
- Use this property to set the type of the target table.
- Type: selection
- Default: External
- Values:
- External
- Internal
- Location
- Use this property to set the location of the HDFS files serving as storage for the Hive table
- Type: string
- Drop staging table
- Use this property to drop the staging table. By default, the staging table is dropped after the target table is created. If you do not want the staging table to be removed, set this property to No.
- Type: boolean
- Default: true
- Load into existing table
- Use the properties in this category when you load data into an existing Hive table. The table can be partitioned or non-partitioned.
- Type: category
- Maximum number of dynamic partitions
- Use this property to set the maximum number of dynamic partitions to be created while loading into a partitioned table.
- Type: integer
- Default: 1000
- Enable SSL
- Select <B>Yes</B> if SSL is enabled on the Hive server.
- Type: boolean
- Default: false
- SSL Truststore certificates
- SSL truststore (X.509) certificates in the PEM format.
- Type: string
- Use Kerberos
- Select Yes to use Kerberos
- Type: boolean
- Default: false
- Use keytab
- Select Yes to use keytab instead of password for Kerberos login
- Type: boolean
- Hive host
- Enter the name of the host. If not specified, the value specified in <B>Host name</B> will be used.
- Type: string
- Hive port
- Enter the port for Hive.
- Type: integer
- Hive user name
- Enter the name of the user to connect to Hive as.
- Type: string
- Hive password
- Enter the password for the specified user.
- Type: protected string
- Hive keytab
- Enter the fully qualified path of the keytab for the specified user.
- Type: string
- Hive service principal
- Use this property to specify the service principal for Hive. The service principal for the Hive service has the format <B>hive/FQDN@REALM</B>.
- Type: string
- Table
- Enter the name of the table to create.
- Type: string
- Hive table type
- Specify Hive table type, as external (default) or internal.
- Type: selection
- Default: External
- Values:
- External
- Internal
- Create schema
- Specify Yes to create the schema indicated in the fully qualified table name if it does not already exist. If Yes is specified and the table name does not contain a schema, the job will fail. If Yes is specified and the schema already exists, the job will not fail.
- Type: boolean
- Default: false
- Drop existing table
- Specify <B>Yes</B> to drop the Hive table if it already exists, or <B>No</B> to append to the existing Hive table.
- Type: boolean
- Default: true
- Additional driver attributes
- Specify any additional driver-specific connection attributes. Enter the attributes in the name=value format; if multiple attributes need to be specified, separate them with semicolons. For information about the supported driver-specific attributes, see the Progress DataDirect driver documentation.
- Type: string
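The name=value attribute list can be built as sketched below. The attribute names shown are examples only; consult the Progress DataDirect driver documentation for the names your driver actually supports.

```python
def format_driver_attributes(attrs: dict[str, str]) -> str:
    """Join name=value pairs with semicolons, as the property expects."""
    return ";".join(f"{name}={value}" for name, value in attrs.items())

# Hypothetical attribute names, for illustration only.
print(format_driver_attributes(
    {"EncryptionMethod": "SSL", "ValidateServerCertificate": "0"}))
# → EncryptionMethod=SSL;ValidateServerCertificate=0
```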