IBM Streams 4.2.1

Operator HDFS2FileSink


The HDFS2FileSink operator writes files to a Hadoop Distributed File System.

The HDFS2FileSink operator is similar to the FileSink operator. This operator writes tuples that arrive on its input port to the output file that is named by the file parameter. You can optionally control when the operator closes the current output file and creates a new file for writing: based on the size of the file in bytes, the number of tuples that are written to the file, the number of seconds that the file is open for writing, or the receipt of a punctuation marker.
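As a sketch, the rotation behavior might be expressed in SPL as follows. The composite name, stream name, file pattern, and literal suffixes are illustrative assumptions, not part of this reference:

    use com.ibm.streamsx.hdfs::HDFS2FileSink;

    composite WriteLines {
        graph
            // Beacon generates sample lines for illustration
            stream<rstring line> Lines = Beacon() {
                param
                    iterations : 100000u;
                output
                    Lines : line = "sample record";
            }

            // Close the current file and open a new one every 50,000 tuples;
            // %FILENUM is replaced by the file number (0, 1, 2, ...)
            () as LineSink = HDFS2FileSink(Lines) {
                param
                    file          : "records%FILENUM.txt";
                    tuplesPerFile : 50000l;
            }
    }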

Behavior in a consistent region

The HDFS2FileSink operator can participate in a consistent region, but it cannot be at the start of a consistent region. The operator guarantees that tuples are written to a file in HDFS at least once; however, duplicate tuples can be written to the file if an application failure occurs.

For the operator to support consistent region, the Hadoop Distributed File System must be configured with file append enabled. For information about how to properly enable this feature, refer to the documentation of your Hadoop distribution.

On drain, the operator flushes its internal buffer to the file. On checkpoint, the operator stores the current file name, file size, tuple count, and file number to the checkpoint. On reset, the operator closes the current file and reopens the file that is recorded in the checkpoint; file state, such as the file size and tuple count, is restored from the checkpoint. The file is opened in append mode, and data is written to the end of the file.

Exceptions

The HDFS2FileSink operator terminates in the following cases:
  • The operator cannot connect to HDFS.
  • The file cannot be written.

Summary

Ports
This operator has 1 input port and 1 output port.
Windowing
This operator does not accept any windowing configurations.
Parameters
This operator supports 15 parameters.

Optional: authKeytab, authPrincipal, bytesPerFile, closeOnPunct, configPath, credFile, encoding, file, fileAttributeName, hdfsUri, hdfsUser, tempFile, timeFormat, timePerFile, tuplesPerFile

Metrics
This operator does not report any metrics.


Implementation
Java

Input Ports

Ports (0)

The HDFS2FileSink operator has one input port. The operator writes the contents of the input stream to the file that you specified. The input port is non-mutating, and its punctuation mode is Oblivious. The HDFS2FileSink operator supports writing data into HDFS in two formats. For line format, the schema of the input port is tuple<rstring line> or tuple<ustring line>, which specifies a single rstring or ustring attribute that represents a line to be written to the file. For binary format, the schema of the input port is tuple<blob data>, which specifies a block of data to be written to the file.
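For instance, a binary-format sink might be declared as in this sketch, where Blocks is assumed to be an upstream stream of type tuple<blob data> and the file name is illustrative:

    // Binary format: the input stream carries a single blob attribute named "data"
    () as BinarySink = HDFS2FileSink(Blocks) {
        param
            file : "blocks%FILENUM.bin";
    }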


Output Ports

Assignments
Java operators do not support output assignments.
Ports (0)

The HDFS2FileSink operator is configurable with an optional output port. The output port is non-mutating, and its punctuation mode is Free. The schema of the output port is tuple<rstring fileName, uint64 fileSize>, which specifies the name and size of the files that are written to HDFS.
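A sketch of an invocation that uses the optional output port follows; the stream names, file pattern, and the downstream Custom operator are illustrative:

    // The optional output port emits one tuple per closed file
    (stream<rstring fileName, uint64 fileSize> ClosedFiles) as Sink = HDFS2FileSink(Lines) {
        param
            file : "out%FILENUM.txt";
    }

    // For example, log the name and size of each closed file
    () as Log = Custom(ClosedFiles) {
        logic
            onTuple ClosedFiles :
                printStringLn(fileName + ": " + (rstring) fileSize + " bytes");
    }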


Parameters

Optional: authKeytab, authPrincipal, bytesPerFile, closeOnPunct, configPath, credFile, encoding, file, fileAttributeName, hdfsUri, hdfsUser, tempFile, timeFormat, timePerFile, tuplesPerFile

authKeytab

This parameter specifies the file that contains the encrypted keys for the user that is specified by the authPrincipal parameter. The operator uses this keytab file to authenticate the user. The keytab file is generated by the administrator. You must specify this parameter to use Kerberos authentication.

authPrincipal

This parameter specifies the Kerberos principal that you use for authentication. This value is set to the principal that is created for the InfoSphere Streams instance owner. You must specify this parameter if you want to use Kerberos authentication.
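Combined, the two Kerberos parameters might be used as in this sketch; the principal, keytab path, stream name, and file name are placeholders:

    () as SecureSink = HDFS2FileSink(Lines) {
        param
            file          : "secure%FILENUM.txt";
            authPrincipal : "streamsadmin@EXAMPLE.COM";  // placeholder principal
            authKeytab    : "etc/streamsadmin.keytab";   // placeholder keytab file
    }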

bytesPerFile

This parameter specifies the approximate size of the output file, in bytes. When the file size exceeds the specified number of bytes, the current output file is closed and a new file is opened. The bytesPerFile, timePerFile, and tuplesPerFile parameters are mutually exclusive; you can specify only one of these parameters at a time.

closeOnPunct

This parameter specifies whether the operator closes the current output file and creates a new file when a punctuation marker is received. The default value is false.

configPath

This parameter specifies the absolute path to the configuration directory that contains the core-site.xml file. If this parameter is not specified, the operator searches the default location for the core-site.xml.

credFile

This parameter specifies the file that contains the login credentials. These credentials are used when you connect to GPFS remotely by using the webhdfs://hdfshost:hdfsport schema. The credentials file must contain information on how to authenticate with IBM InfoSphere BigInsights when using the webhdfs schema. For example, the file must contain the user name and password for an IBM InfoSphere BigInsights user.

encoding

This parameter specifies the character set encoding that is used in the output file.

file
This parameter specifies the name of the file that the operator writes to. The file parameter can optionally contain the following variables, which the operator evaluates at runtime to generate the file name:
  • %HOST The host that is running the processing element (PE) of this operator.
  • %FILENUM The file number, which starts at 0 and counts up as a new file is created for writing.
  • %PROCID The process ID of the processing element.
  • %PEID The processing element ID.
  • %PELAUNCHNUM The PE launch count.
  • %TIME The time when the file is created. If the timeFormat parameter is not specified, the default time format is yyyyMMdd_HHmmss.

For example, if you specify a file parameter of myFile%FILENUM%TIME.txt, and the first three files are created in the afternoon on November 30, 2014, the file names are myFile020141130_132443.txt, myFile120141130_132443.txt, and myFile220141130_132443.txt.

Important: If the %FILENUM specification is not included, the file is overwritten every time a new file is created.

fileAttributeName

If set, this parameter specifies the name of an input attribute that contains the file name. The operator closes the current file when the value of this attribute changes. If the file name contains substitution variables, the check for a change happens before substitution, and the file name is generated from the substitution values of the first tuple.

hdfsUri
This parameter specifies the uniform resource identifier (URI) that you can use to connect to the HDFS file system. The URI has the following format:
  • To access HDFS locally or remotely, use hdfs://hdfshost:hdfsport.
  • To access GPFS locally, use gpfs:///.
  • To access GPFS remotely, use webhdfs://hdfshost:webhdfsport.

If this parameter is not specified, the operator expects that the HDFS URI is specified as the fs.defaultFS or fs.default.name property in the core-site.xml HDFS configuration file. The operator expects the core-site.xml file to be located in $HADOOP_HOME/../hadoop-conf or $HADOOP_HOME/etc/hadoop.

You can use the hdfsUri parameter to override the value that is specified for the fs.defaultFS or fs.default.name option in the core-site.xml configuration file.

hdfsUser

This parameter specifies the user ID to use when you connect to the HDFS file system. If this parameter is not specified, the operator uses the instance owner ID to connect to the HDFS file system.

When you use Kerberos authentication, the operator authenticates with the Hadoop file system as the instance owner by using the values that are specified in the authPrincipal and authKeytab parameters. After successful authentication, the operator uses the user ID that is specified in the hdfsUser parameter to perform all other operations on the file system.

NOTE: When you use Kerberos authentication, the InfoSphere Streams instance owner must have super user privileges on HDFS or GPFS to perform operations as the user that is specified by the hdfsUser parameter.

tempFile
This parameter specifies the name of the temporary file that the operator writes to. When the file is closed, it is renamed to the final file name that is defined by the file parameter or the fileAttributeName parameter. The tempFile parameter can optionally contain the following variables, which the operator evaluates at runtime to generate the file name:
  • %HOST The host that is running the processing element (PE) of this operator.
  • %PROCID The process ID of the processing element.
  • %PEID The processing element ID.
  • %PELAUNCHNUM The PE launch count.
  • %TIME The time when the file is created. If the timeFormat parameter is not specified, the default time format is yyyyMMdd_HHmmss.
Important: This parameter must not be used in a consistent region.
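For example, writing to a temporary name and renaming on close might look like this sketch; the stream and file names are illustrative, and this pattern is not valid in a consistent region:

    () as Sink = HDFS2FileSink(Lines) {
        param
            // Written as "out.txt.tmp" while open,
            // renamed to "out%FILENUM.txt" when the file is closed
            tempFile : "out.txt.tmp";
            file     : "out%FILENUM.txt";
    }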
timeFormat

This parameter specifies the time format to use when the file parameter value contains %TIME. The parameter value must contain conversion specifications that are supported by the java.text.SimpleDateFormat class. The default format is yyyyMMdd_HHmmss.

timePerFile

This parameter specifies the approximate time, in seconds, after which the current output file is closed and a new file is opened for writing. The bytesPerFile, timePerFile, and tuplesPerFile parameters are mutually exclusive; you can specify only one of these parameters.

tuplesPerFile

This parameter specifies the maximum number of tuples that can be received for each output file. When the specified number of tuples are received, the current output file is closed and a new file is opened for writing. The bytesPerFile, timePerFile, and tuplesPerFile parameters are mutually exclusive; you can specify only one of these parameters at a time.


Code Templates

HDFS2FileSink
() as ${operatorName} = HDFS2FileSink(${inputStream}) {
    param
        file: "$(unknown)";
}

HDFS2FileSink with hdfsUser and hdfsUri
() as ${operatorName} = HDFS2FileSink(${inputStream}) {
    param
        file: "$(unknown)";
        hdfsUser: "${hdfsUser}";
        hdfsUri: "${hdfsUri}";
}

Libraries

Java operator class library
Library Path: ../../impl/java/bin, ../../impl/lib/BigData.jar
apache library
Library Path: @HADOOP_HOME@/../hadoop-conf, @HADOOP_HOME@/etc/hadoop, @HADOOP_HOME@/conf, @HADOOP_HOME@/share/hadoop/hdfs/*, @HADOOP_HOME@/share/hadoop/common/*, @HADOOP_HOME@/share/hadoop/common/lib/*, @HADOOP_HOME@/lib/*, @HADOOP_HOME@/client/*, @HADOOP_HOME@/*, @HADOOP_HOME@/../hadoop-hdfs/*