Big Data File stage: Options category

In the Options category of the output link properties for the Big Data File stage, you can specify whether the first line of the file contains column names and whether to partition the imported data set. You can also specify how the stage handles rejected records and missing files, whether it reports progress, how many rows it reads, and other read options.

First Line is Column Names

Specifies that the first line of the file contains column names. This property is false by default.

Missing file mode

Specifies the action to take if one of your File properties specifies a file that does not exist. Choose Error to stop the job, OK to skip the file, or Depends. With Depends, the action is Error unless the file has a node name prefix of *:, in which case the action is OK. The default is Depends.
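
The decision described above can be sketched in Python. This is only an illustration of the documented rule, not the stage's implementation; the function name resolve_missing_file_action and the lowercase return values are hypothetical.

```python
def resolve_missing_file_action(mode: str, file_spec: str) -> str:
    """Return "error" or "ok" for a file that does not exist.

    mode is one of Error, OK, or Depends (case-insensitive).
    file_spec is the file name as given in the File property.
    """
    normalized = mode.lower()
    if normalized in ("error", "ok"):
        return normalized
    if normalized == "depends":
        # Depends: OK only when the file has the node name prefix "*:".
        return "ok" if file_spec.startswith("*:") else "error"
    raise ValueError(f"unknown missing file mode: {mode!r}")
```

For example, resolve_missing_file_action("Depends", "*:/data/in.txt") yields "ok", while the same call without the *: prefix yields "error".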

Keep file partitions

Set this to True to partition the imported data set according to the organization of the input file(s). So, for example, if you are reading three files you will have three partitions. Defaults to False.

Reject mode

Allows you to specify behavior if a read record does not match the expected schema. Choose from Continue to continue operation and discard any rejected rows, Fail to cease reading if any rows are rejected, or Save to send rejected rows down a reject link. Defaults to Continue.
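
As a rough illustration of the three reject modes, the following Python sketch filters a list of records. It is not how the stage is implemented; the function read_with_reject_mode and the matches_schema callback are hypothetical names introduced for this example.

```python
def read_with_reject_mode(records, matches_schema, mode="continue"):
    """Split records into (accepted, rejected) according to the reject mode.

    continue: discard records that do not match the schema.
    fail:     stop reading at the first record that does not match.
    save:     collect non-matching records, as a reject link would.
    """
    mode = mode.lower()
    if mode not in ("continue", "fail", "save"):
        raise ValueError(f"unknown reject mode: {mode!r}")
    accepted, rejected = [], []
    for record in records:
        if matches_schema(record):
            accepted.append(record)
        elif mode == "fail":
            raise ValueError(f"record does not match schema: {record!r}")
        elif mode == "save":
            rejected.append(record)
        # mode == "continue": the record is silently dropped.
    return accepted, rejected
```

With records ["1", "x", "2"] and str.isdigit as the schema check, Continue returns only the two numeric records, Save also returns "x" in the rejected list, and Fail raises an error when it reaches "x".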

Report progress

Choose Yes or No to enable or disable reporting. By default the stage displays a progress report at each 10% interval when it can ascertain file size. Reporting occurs only if the file is greater than 100 KB, records are fixed length, and there is no filter on the file.

File name column

This is an optional property. It adds an extra column of type VarChar to the output of the stage, containing the pathname of the file the record is read from. If runtime column propagation is not in use, or is turned off at some point, you should also add this column manually to the Columns definitions so that it is not dropped.

Read first rows

Specify a number n so that the stage reads only the first n rows from the file.

Row number column

This is an optional property. It adds an extra column of type unsigned BigInt to the output of the stage, containing the row number. You must also add the column to the Columns tab, unless runtime column propagation is enabled.

Number of readers per node

This is an optional property that applies only to files containing fixed-length records. It is mutually exclusive with the Read from multiple nodes property. It specifies the number of instances of the file read operator on a processing node. The default is one operator per node per input data file. If numReaders is greater than one, each instance of the file read operator reads a contiguous range of records from the input file. The starting record location in the file for each operator, or seek location, is determined by the data file size, the record length, and the number of instances of the operator, as specified by numReaders.

The resulting data set contains one partition per instance of the file read operator, as determined by numReaders.

This provides a way of partitioning the data contained in a single file. Each node reads a single file, but the file can be divided according to the number of readers per node, and written to separate partitions. This method can result in better I/O performance on an SMP system.

Figure: Multiple readers on one node being used to partition a sequential file.
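
The seek-location arithmetic for multiple readers can be sketched as follows. The text states only that the seek location depends on the file size, the record length, and numReaders; the even split used here, with any remainder given to the first readers, is an assumption, and the function name reader_ranges is hypothetical.

```python
def reader_ranges(file_size: int, record_length: int, num_readers: int):
    """Return one (seek_offset_in_bytes, record_count) pair per reader.

    Each reader gets a contiguous range of fixed-length records.
    Assumption: records are split as evenly as possible across readers.
    """
    if record_length <= 0 or num_readers <= 0:
        raise ValueError("record_length and num_readers must be positive")
    total_records = file_size // record_length
    base, extra = divmod(total_records, num_readers)
    ranges = []
    start_record = 0
    for reader in range(num_readers):
        # The first `extra` readers take one additional record each.
        count = base + (1 if reader < extra else 0)
        ranges.append((start_record * record_length, count))
        start_record += count
    return ranges
```

For a 1000-byte file with 10-byte records and three readers, this gives [(0, 34), (340, 33), (670, 33)], that is, three partitions that together cover all 100 records.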

Read from multiple nodes

This is an optional property that applies only to files containing fixed-length records. It is mutually exclusive with the Number of readers per node property. Set this to Yes to allow individual files to be read by several nodes. This can improve performance on a cluster system.

InfoSphere® DataStage® knows the number of nodes available. Using the fixed-length record size and the actual size of the file to be read, it allocates the reader on each node a separate region within the file to process. The regions are of roughly equal size.

Figure: Multiple nodes being used to partition a sequential file.

Schema file

This is an optional property. By default the stage uses the column definitions defined on the Columns and Format tabs as a schema for reading the file. You can, however, specify a file containing a schema instead. If you have defined columns on the Columns tab, ensure that they match the schema file. Type in a pathname or browse for a schema file.

Strip BOM

Set this property to TRUE to drop the UTF-16 endianness byte order mark (BOM) when reading data. By default, this property is set to FALSE.
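
To show what dropping the byte order mark means at the byte level, here is a minimal Python sketch. It illustrates the general technique only and is not the stage's code; the function name strip_utf16_bom is hypothetical.

```python
import codecs


def strip_utf16_bom(data: bytes) -> bytes:
    """Remove a leading UTF-16 byte order mark, if one is present.

    The UTF-16 BOM is FF FE (little-endian) or FE FF (big-endian).
    Data without a BOM is returned unchanged.
    """
    for bom in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        if data.startswith(bom):
            return data[len(bom):]
    return data
```

For example, the little-endian bytes FF FE 68 00 69 00 ("hi" preceded by a BOM) become 68 00 69 00 after stripping.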