Operator HDFS2FileSource
The HDFS2FileSource operator reads files from a Hadoop Distributed File System (HDFS).
The operator opens a file on HDFS and sends out its contents in tuple format on its output port.
If the optional input port is not specified, the operator reads the HDFS file that is specified in the file parameter and provides the file contents on the output port. If the optional input port is configured, the operator reads the files that are named by the attribute in the tuples that arrive on its input port and emits a punctuation marker after each file.
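The input-port mode described above can be sketched as follows. This is a minimal illustration, not a definitive configuration: the composite name, stream names, file path, and host name are placeholders, and a Beacon operator from the standard toolkit stands in for whatever upstream operator supplies the file names.

```
use com.ibm.streamsx.hdfs::HDFS2FileSource;

composite ReadNamedFiles {
    graph
        // Emit a single file name. The path is a placeholder.
        stream<rstring fileName> FileNames = Beacon() {
            param
                iterations : 1u;
            output
                FileNames : fileName = "/user/streams/input/events.txt";
        }

        // The file parameter is omitted because the input port supplies the names.
        // The host and port in hdfsUri are placeholders.
        stream<rstring line> Lines = HDFS2FileSource(FileNames) {
            param
                hdfsUri : "hdfs://namenode.example.com:8020";
        }
}
```

Each tuple on FileNames causes one file to be read, and a punctuation marker follows the last line of that file on Lines.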
Behavior in a consistent region
The HDFS2FileSource operator can participate in a consistent region. The operator can be at the start of a consistent region if there is no input port.
The operator supports periodic and operator-driven consistent region policies. If the consistent region policy is operator-driven, the operator initiates a drain after a file is fully read. If the policy is periodic, the operator respects the period setting and establishes consistent states accordingly, so multiple consistent states can be established before a file is fully read.
At a checkpoint, the operator saves the current file name and file cursor location. If the operator does not have an input port, after an application failure the operator resets the file cursor to the checkpointed location and starts replaying tuples from that location. If the operator has an input port and is in a consistent region, the operator relies on its upstream operators to properly replay the file names so that it can re-read the files from the beginning.
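As a rough sketch of the operator at the start of an operator-driven consistent region, the invocation might look like the following. The composite name, stream name, and file path are placeholders. The JobControlPlane operator is included on the assumption that the Streams runtime requires one in any application that uses consistent regions; check the product documentation for your release.

```
use com.ibm.streamsx.hdfs::HDFS2FileSource;

composite ReadInConsistentRegion {
    graph
        // Assumption: applications with a consistent region need a JobControlPlane.
        () as JCP = spl.control::JobControlPlane() {}

        // Operator-driven: a drain is initiated after the file is fully read.
        // For the periodic policy, the annotation would instead be
        // @consistent(trigger = periodic, period = 5.0)
        @consistent(trigger = operatorDriven)
        stream<rstring line> Lines = HDFS2FileSource() {
            param
                file : "/user/streams/input/events.txt";
        }
}
```

No input port is configured here, so the operator can act as the start of the region and resume from the checkpointed cursor location after a failure.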
Exceptions
- The operator cannot connect to HDFS.
- The file cannot be opened.
- The file does not exist.
- The file becomes unreadable.
- A tuple cannot be created from the file contents (such as a problem with the file format).
Summary
- Ports
- This operator has 1 input port and 1 output port.
- Windowing
- This operator does not accept any windowing configurations.
- Parameters
- This operator supports 10 parameters.
Optional: authKeytab, authPrincipal, blockSize, configPath, credFile, encoding, file, hdfsUri, hdfsUser, initDelay
- Metrics
- This operator reports 1 metric.
Properties
- Implementation
- Java
Input Ports
- Ports (0)
-
The HDFS2FileSource operator has one optional input port. If an input port is specified, the operator expects an input tuple with a single attribute of type rstring. The input tuples contain the file names that the operator opens for reading. The input port is non-mutating.
- Properties
-
- Optional: true
- ControlPort: false
- WindowingMode: NonWindowed
- WindowPunctuationInputMode: Oblivious
Output Ports
- Assignments
- Java operators do not support output assignments.
- Ports (0)
-
The HDFS2FileSource operator has one output port. The tuples on the output port contain the data that is read from the files. The operator supports two modes of reading. To read a file line-by-line, the expected output schema of the output port is tuple<rstring line> or tuple<ustring line>. To read a file as binary, the expected output schema of the output port is tuple<blob data>. Use the blockSize parameter to control how much data to retrieve on each read. The operator includes a punctuation marker at the conclusion of each file. The output port is mutating.
- Properties
-
- Optional: false
- WindowPunctuationOutputMode: Generating
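The two reading modes differ only in the output schema. A minimal sketch of binary mode, assuming a placeholder composite name, stream name, and file path, and a blockSize of 8192 chosen purely for illustration:

```
use com.ibm.streamsx.hdfs::HDFS2FileSource;

composite ReadBinary {
    graph
        // A tuple<blob data> schema selects binary mode.
        // Each tuple carries at most blockSize bytes of the file.
        stream<blob data> Chunks = HDFS2FileSource() {
            param
                file : "/user/streams/input/archive.bin";
                blockSize : 8192;
        }
}
```

Changing the schema to stream<rstring line> would switch the same invocation to line-by-line reading, in which case blockSize does not apply.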
Parameters
- authKeytab
This parameter specifies the file that contains the encrypted keys for the user that is specified by the authPrincipal parameter. The operator uses this keytab file to authenticate the user. The keytab file is generated by the administrator. You must specify this parameter to use Kerberos authentication.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- authPrincipal
This parameter specifies the Kerberos principal that you use for authentication. This value is set to the principal that is created for the InfoSphere Streams instance owner. You must specify this parameter if you want to use Kerberos authentication.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- blockSize
This parameter specifies the maximum number of bytes to read at one time when reading a file in binary mode (that is, into a blob); it is therefore the maximum size of the blobs on the output stream. The parameter is optional and defaults to 4096.
- Properties
-
- Type: int32
- Cardinality: 1
- Optional: true
- configPath
This parameter specifies the absolute path to the directory that contains the core-site.xml file, which is an HDFS configuration file. If this parameter is not specified, the operator searches the default location for the core-site.xml file.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- credFile
This parameter specifies a file that contains login credentials. The credentials are used to connect to GPFS remotely by using the webhdfs://hdfshost:webhdfsport scheme. The credentials file must contain information about how to authenticate with IBM InfoSphere BigInsights when using the webhdfs scheme. For example, the file must contain the user name and password for an IBM InfoSphere BigInsights user.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- encoding
This parameter specifies the encoding to use when reading files. The default value is UTF-8.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- file
This parameter specifies the name of the file that the operator opens and reads. This parameter must be specified when the optional input port is not configured. If the optional input port is used and the file name is specified, the operator generates an error.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- hdfsUri
This parameter specifies the uniform resource identifier (URI) that the operator uses to connect to the HDFS file system. The URI has one of the following formats:
- To access HDFS locally or remotely, use hdfs://hdfshost:hdfsport
- To access GPFS locally, use gpfs:///
- To access GPFS remotely, use webhdfs://hdfshost:webhdfsport
If this parameter is not specified, the operator expects that the HDFS URI is specified as the fs.defaultFS or fs.default.name property in the core-site.xml HDFS configuration file. The operator expects the core-site.xml file to be in $HADOOP_HOME/../hadoop-conf or $HADOOP_HOME/etc/hadoop.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- hdfsUser
This parameter specifies the user ID to use when you connect to the HDFS file system. If this parameter is not specified, the operator uses the instance owner to connect to the HDFS file system.
When you use Kerberos authentication, the operator authenticates with the Hadoop file system as the instance owner by using the values that are specified in the authPrincipal and authKeytab parameters. After successful authentication, the operator uses the user ID that is specified by the hdfsUser parameter to perform all other operations on the file system.
NOTE: When using Kerberos authentication, the InfoSphere Streams instance owner must have super user privileges on HDFS or GPFS to perform operations as the user that is specified by the hdfsUser parameter.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- initDelay
This parameter specifies the time to wait in seconds before the operator reads the first file. The default value is 0.
- Properties
-
- Type: float64
- Cardinality: 1
- Optional: true
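To show how the authentication parameters combine, here is a hedged sketch of a Kerberos-enabled invocation. Every value is a placeholder: the principal, keytab path, user ID, host name, and file path are hypothetical and must be replaced with values from your own environment.

```
use com.ibm.streamsx.hdfs::HDFS2FileSource;

composite ReadWithKerberos {
    graph
        stream<rstring line> Lines = HDFS2FileSource() {
            param
                file : "/user/analyst/input/events.txt";
                hdfsUri : "hdfs://namenode.example.com:8020";
                // Authenticate as the instance owner (placeholder principal and keytab).
                authPrincipal : "streamsadmin@EXAMPLE.COM";
                authKeytab : "/home/streamsadmin/streamsadmin.keytab";
                // After authentication, file system operations run as this user.
                hdfsUser : "analyst";
                // Wait 5 seconds before reading the first file.
                initDelay : 5.0;
        }
}
```

As the hdfsUser description notes, this arrangement only works when the instance owner has superuser privileges on the file system.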
Code Templates
- HDFS2FileSource
-
stream<${streamType}> ${streamName} = HDFS2FileSource() {
    param
        file : "${filename}";
}
- HDFS2FileSource with hdfsUser and hdfsUri
-
stream<${streamType}> ${streamName} = HDFS2FileSource() {
    param
        file : "${filename}";
        hdfsUser : "${hdfsUser}";
        hdfsUri : "${hdfsUri}";
}
Metrics
- nFilesOpened - Counter
-
The number of files that are opened by the operator for reading data.
Libraries
- Java operator class library
- Library Path: ../../impl/java/bin, ../../impl/lib/BigData.jar
- apache library
- Library Path: @HADOOP_HOME@/../hadoop-conf, @HADOOP_HOME@/etc/hadoop, @HADOOP_HOME@/conf, @HADOOP_HOME@/share/hadoop/hdfs/*, @HADOOP_HOME@/share/hadoop/common/*, @HADOOP_HOME@/share/hadoop/common/lib/*, @HADOOP_HOME@/lib/*, @HADOOP_HOME@/client/*, @HADOOP_HOME@/*, @HADOOP_HOME@/../hadoop-hdfs/*