File encoding

You can specify the encoding of files that are read from or written to Amazon S3.

Methods for specifying the file encoding

You can specify the file encoding in any of the following ways, which are listed in order of precedence:
  1. As a value for the Encoding property in the stage editor.
  2. As a value for the charset attribute in a .osh schema file. You can use this method only if runtime column propagation is enabled and the connector uses metadata from a .osh schema file.
  3. As a value for the APT_IMPEXP_CHARSET environment variable.
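
The precedence order above can be pictured as a "first configured value wins" lookup. The following Java sketch is only an illustration, not the connector's implementation: the class and method names are hypothetical, it assumes that the first item in the list has the highest precedence, and it assumes that a missing or blank value means "not configured".

```java
import java.util.Optional;
import java.util.stream.Stream;

// Hypothetical helper; not part of the connector.
class EncodingPrecedence {

    // Returns the first value that is set, checking the sources in list order:
    // stage property, then .osh schema charset, then APT_IMPEXP_CHARSET.
    // Assumption: null or blank means that the source is not configured.
    static Optional<String> resolve(String stageProperty, String schemaCharset, String envCharset) {
        return Stream.of(stageProperty, schemaCharset, envCharset)
                .filter(v -> v != null && !v.trim().isEmpty())
                .findFirst();
    }

    public static void main(String[] args) {
        // The environment variable name comes from the list above.
        String env = System.getenv("APT_IMPEXP_CHARSET");
        // No stage property here, so the schema value is chosen over the environment variable.
        System.out.println(resolve(null, "UTF-16BE", env).orElse("(none)"));
    }
}
```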

The character set that you specify for the file encoding must be supported by the Java Virtual Machine (JVM).
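
One way to check a character set name before you use it is to ask a JVM directly. The following minimal Java sketch uses the standard java.nio.charset.Charset.isSupported method. The class name is hypothetical, and the result reflects the JVM that runs the check, so it is meaningful only if that is the same JVM that the connector uses.

```java
import java.nio.charset.Charset;

// Hypothetical helper; not part of the connector.
class CharsetCheck {

    // Returns true if this JVM can resolve the given charset name.
    // Charset.isSupported throws IllegalArgumentException (or its subclass
    // IllegalCharsetNameException) for null or syntactically illegal names,
    // so those cases are reported as unsupported.
    static boolean isSupportedByJvm(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] {"UTF-8", "UTF-16BE", "no-such-charset-xyz"}) {
            System.out.println(name + " supported: " + isSupportedByJvm(name));
        }
    }
}
```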

Byte order marks

The Amazon S3 connector can match byte order marks (BOMs) in files to the file encoding that you specify. If the BOM in the file indicates a different endian format than the file encoding, or the file encoding does not include an endian format, the endian format from the BOM is used. For example, if the encoding is specified as UTF-16 and the BOM indicates that the file is UTF-16 big endian, the connector uses UTF-16BE.
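
The rule described above can be sketched in Java as follows. This is an illustration of the described behavior under stated assumptions, not the connector's code: the class and method names are hypothetical, and the sketch handles only the UTF-16 family (the BOM bytes FE FF for big endian and FF FE for little endian).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of the connector.
class BomResolver {

    // Returns the charset to decode with, given the first bytes of the file
    // and the configured encoding. For UTF-16 encodings, a BOM overrides the
    // configured endian format or supplies it when none was configured.
    static Charset resolve(byte[] head, Charset configured) {
        boolean utf16Family = configured.name().startsWith("UTF-16");
        if (utf16Family && head.length >= 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                return StandardCharsets.UTF_16BE; // big-endian BOM
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                return StandardCharsets.UTF_16LE; // little-endian BOM
            }
        }
        return configured; // no BOM found, or not a UTF-16 encoding
    }

    public static void main(String[] args) {
        byte[] bigEndianFile = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};
        // Configured as UTF-16 with no endian format; the BOM decides.
        System.out.println(resolve(bigEndianFile, StandardCharsets.UTF_16));
    }
}
```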

When the Amazon S3 connector reads a file in parallel across multiple nodes, only the first node reads the BOM and adjusts the encoding automatically. As a result, if you configure the Amazon S3 Connector stage to read a file in parallel, ensure that you include the endian format in the encoding that you specify. For example, specify UTF-16BE instead of UTF-16.
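
The following Java sketch illustrates why the endian format matters for a node that never sees the BOM. It uses plain Java decoding, not the connector, and the class and method names are hypothetical. It relies on the documented behavior of the Java UTF-16 charset, which defaults to big endian when the input has no BOM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demonstration; not part of the connector.
class ChunkDecoding {

    // Simulates the portion of a little-endian UTF-16 file that a second
    // node would read: it starts mid-file, so it contains no BOM.
    static byte[] secondChunk() {
        byte[] body = "node1node2".getBytes(StandardCharsets.UTF_16LE); // 20 bytes
        return Arrays.copyOfRange(body, 10, 20); // the bytes for "node2"
    }

    static String decode(byte[] chunk, Charset charset) {
        return new String(chunk, charset);
    }

    public static void main(String[] args) {
        byte[] chunk = secondChunk();
        // An explicit endian format decodes the chunk correctly.
        System.out.println("node2".equals(decode(chunk, StandardCharsets.UTF_16LE)));
        // Plain UTF-16 finds no BOM, assumes big endian, and misreads the text.
        System.out.println("node2".equals(decode(chunk, StandardCharsets.UTF_16)));
    }
}
```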

When you use the Amazon S3 connector to write data, you can configure the connector to include byte order marks in files by setting the Include byte order mark property to Yes.
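
To show what including a byte order mark means at the byte level, the following Java sketch prepends the standard BOM bytes before the encoded text. It is an illustration only: the class and method names are hypothetical, and the set of encodings for which the connector actually writes a BOM is an assumption here, not something this section states.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of the connector.
class BomWriter {

    // Standard BOM byte sequences for a few Unicode encodings.
    // Other charsets get no BOM. Plain UTF-16 is left out on purpose because
    // the Java UTF-16 encoder already emits a big-endian BOM itself.
    static byte[] bomFor(Charset charset) {
        if (charset.equals(StandardCharsets.UTF_8)) {
            return new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
        }
        if (charset.equals(StandardCharsets.UTF_16BE)) {
            return new byte[] {(byte) 0xFE, (byte) 0xFF};
        }
        if (charset.equals(StandardCharsets.UTF_16LE)) {
            return new byte[] {(byte) 0xFF, (byte) 0xFE};
        }
        return new byte[0];
    }

    // Encodes the text, optionally preceded by the BOM for the charset.
    static byte[] encode(String text, Charset charset, boolean includeBom) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (includeBom) {
            byte[] bom = bomFor(charset);
            out.write(bom, 0, bom.length);
        }
        byte[] data = text.getBytes(charset);
        out.write(data, 0, data.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = encode("A", StandardCharsets.UTF_16LE, true);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim());
    }
}
```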