Sequential file in DataStage
Use a sequential file to read data from or write data to one or more flat files.
The sequential file can have a single input link or a single output link, plus a single reject link. It executes in parallel mode when reading multiple files but executes sequentially when reading only one file. By default, a complete file is read by a single node (although each node might read more than one file). For fixed-width files, however, you can configure the connector to behave differently:
- You can specify that single files can be read by multiple nodes. This can improve performance on cluster systems.
- You can specify that a number of readers run on a single node. This means, for example, that a single file can be partitioned as it is read (even though the stage is constrained to running sequentially on the conductor node).
The sequential file executes in parallel if writing to multiple files, but executes sequentially if writing to a single file. Each node writes to a single file, but a node can write more than one file.
After you upload sequential files as assets, IBM® DataStage® uses IBM Cloud Object Storage to store the files. All of the files that the Sequential File stage reads should be present in DataStage/files/<path_given_in_stage_properties>, unless the path starts with cos://, in which case the files should be in the top-level COS bucket (<path_given_in_stage_properties>). You can inspect the files outside of the DataStage job in the same locations. If you use the Asset Browser, IBM DataStage uses the IBM COS Connector; it does not use the sequential file.
Stage tab
- Execution Mode. The connector can execute in parallel mode or sequential mode. In parallel mode, the contents of the data set are processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In sequential mode, the entire contents of the data set are processed by the conductor node.
- Combinability mode. This is Auto by default, which allows IBM DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.
- Preserve partitioning. You can select Propagate, Set, or Clear. If you select Set, file read operations request that the next stage preserves the partitioning as is. Propagate takes the setting of the flag from the previous stage.
Input tab
The Input tab allows you to specify details about how the sequential file writes data to one or more flat files. The sequential file can have only one input link, but this can write to multiple files. You can specify sequential or parquet file format under File format.
The Target category defines the file, the file update mode, and the write method. The Target category parameters are:
- File
- This property defines the flat file that the incoming data will be written to. You can type in a path name. You can specify multiple files by repeating the File property. Do this by selecting the Properties item at the top of the tree, and clicking on File in the Available properties to add box. Do this for each extra file you want to specify.
- File update mode
- This property defines how the specified file or files are updated. The same method applies to all files being written to. Choose from Append to append to existing files, Overwrite to overwrite existing files, or Create to create a new file. If you specify the Create property for a file that already exists, you will get an error at runtime. By default, this property is set to Overwrite.
- Write method
- This property specifies whether output is written to specified files or generated files. By default, this property is set to Specific files.
The property Generate Multiple Files enables you to specify additional properties for the generated files. Under the Generate Multiple Files property, the Exclude Partition String property affects the naming of generated files, and the Force Sequential property forces the files to be processed sequentially.
Also under the Generate Multiple Files property, is the Key property. Use the Key property to specify a key column. You can use the key column to trigger the creation of a new file when the key value changes. If a key is not specified, file size triggers the creation of a new file. (Maximum file size is specified in the Options category.)
You can use the defaults for the Key property. If you choose to change the default setting and set Use Value in Filename to Yes, the stage applies the key column value to the file name. The use of the Use Value in Filename option does not affect the behavior of other key options such as Sort Order or Nulls Position.
- Cleanup on failure
- This is set to True by default and specifies that the stage will delete any partially written files if the stage fails for any reason. Set this to False to specify that partially written files should be left.
- First line is column names
- Specifies that the first line of the file contains column names. This property is false by default.
- Reject mode
- This specifies what happens to any data records that are not written to a file for some reason. Choose from Continue to continue operation and discard any rejected rows, Fail to cease writing if any rows are rejected, or Save to send rejected rows down a reject link. Continue is set by default.
- Schema file
- This is an optional property. By default the stage uses the column definitions defined on the Columns and Format tabs as a schema for writing to the file. You can, however, specify a file containing a schema instead (note, however, that if you have defined columns on the Columns tab, you should ensure these match the schema file). Type in a path name or browse for a schema file.
- Max file size
- This is an optional property. It specifies the maximum size for a target file in megabytes (MB). When the specified maximum size is reached, another target file is generated. Alternatively, you can specify a key column in the Properties category to trigger generation of a new file. This property is available when the stage property Write Method is set to Generate Multiple Files. If a maximum file size is not specified, the file size is unlimited.
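As a rough illustration of the Key and Max file size triggers described above, here is a Python sketch. The function name, row layout, and size threshold are invented for this example; DataStage's actual file-generation logic is internal to the stage.

```python
# Sketch: start a new target file when the key column changes, or when the
# current file would exceed a maximum size. All names here are illustrative.

def split_output(rows, key_index, max_bytes):
    files = []              # each entry is a list of lines for one target file
    current_key = object()  # sentinel: no key seen yet
    size = 0
    for row in rows:
        line = ",".join(row) + "\n"
        new_key = row[key_index] != current_key
        too_big = files and size + len(line) > max_bytes
        if new_key or too_big:      # either condition opens a new file
            files.append([])
            size = 0
        current_key = row[key_index]
        files[-1].append(line)
        size += len(line)
    return files

chunks = split_output(
    [("east", "1"), ("east", "2"), ("west", "3")],
    key_index=0, max_bytes=1024)
```

With this toy input, the key change from "east" to "west" opens a second file, so `chunks` holds two groups of lines.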
Output tab
The Output tab allows you to specify details about how the sequential file reads data from one or more flat files. You can specify sequential or parquet file format under File format.
The Output name drop-down list allows you to choose whether you are looking at details of the main output link (the stream link) or the reject link. The Advanced tab allows you to change the default buffering settings for the output link.
- Specific file
- fileName
- The name of the file. You can type in a path name.
File names that include special characters, such as colons (:), ampersands (&), or white spaces, can cause warnings, errors, or other issues.
- File pattern
- Enter a prefix of the file names to be selected.
- Keep the file partitions
- Set this to True to partition the imported data set according to the organization of the input files. For example, if you are reading three files, you will have three partitions. Defaults to False.
- Missing file mode
- Specifies the action to take if one of your File properties has specified a file that does not exist. Choose from Error to stop the job, OK to skip the file, or Depends, which means the default is Error, unless the file has a node name prefix of *: in which case it is OK. The default is Depends.
- Reject mode
- Allows you to specify behavior if a read record does not match the expected schema. Choose from Continue to continue operation and discard any rejected rows, Fail to cease reading if any rows are rejected, or Save to send rejected rows down a reject link. Defaults to Continue.
- Report progress
- Choose Yes or No to enable or disable reporting. By default the stage displays a progress report at each 10% interval when it can ascertain file size. Reporting occurs only if the file is greater than 100 KB, records are fixed length, and there is no filter on the file.
- First line is column names
- Specifies that the first line of the file contains column names. This property is false by default.
Format
The Format section on the output tab allows you to supply information about the format of the flat file or files that you are reading.
If you do not alter any of the Format settings, the stage produces a file of the following format:
- The file comprises variable-length columns enclosed in double quotes.
- All columns are delimited by a comma, except for the final column in a row.
- Rows are delimited by a UNIX newline.
Record level
These properties define details about how data records are formatted in the flat file. Where you can enter a character, this can usually be an ASCII character or a multi-byte Unicode character (if you have NLS enabled). The available properties are:
- Fill char. Does not apply to output links.
- Final delimiter string. Specify the string written after the last column of a record in place of the column delimiter. Enter one or more characters; this precedes the record delimiter if one is used. Mutually exclusive with Final delimiter, which is the default. For example, if you set Delimiter to comma and Final delimiter string to `, ` (comma space - you do not need to enter the quotation marks), all fields are delimited by a comma, except the final field, which is delimited by a comma followed by an ASCII space character. DataStage skips the specified delimiter string when reading the file.
- Final delimiter. Specify the single character written after the last column of a record in place of the field delimiter. Type a character or select one of whitespace, end, none, null, tab, or comma. DataStage skips the specified delimiter when reading the file. The options are:
- whitespace. The last column of each record will not include any trailing white spaces found at the end of the record.
- end. The last column of each record does not include the field delimiter. This is the default setting.
- none. The last column of each record does not have a delimiter, used for fixed-width fields.
- null. The last column of each record is delimited by the ASCII null character.
- comma. The last column of each record is delimited by the ASCII comma character.
- tab. The last column of each record is delimited by the ASCII tab character.
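To make the comma-plus-final-delimiter-string layout above concrete, here is a small Python sketch; `format_record` is an invented helper, not part of DataStage.

```python
# Sketch: a comma Delimiter with a Final delimiter string of ", " (comma
# space). Every field but the last is followed by a comma; the last field
# is followed by the final delimiter string, then the record delimiter.

def format_record(fields, delimiter=",", final_delimiter_string=", ",
                  record_delimiter="\n"):
    return delimiter.join(fields) + final_delimiter_string + record_delimiter

rec = format_record(["a", "b", "c"])
```

This produces the record `a,b,c, ` followed by a UNIX newline, matching the Final delimiter string example above.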
- Intact. The intact property specifies an identifier of a partial schema. A partial schema specifies that only the column(s) named in the schema can be modified by the stage. All other columns in the row are passed through unmodified. The file containing the partial schema is specified in the Schema file property on the Output tab. This property has a dependent property:
- Check intact. Select this to force validation of the partial schema as the file or files are imported. Note that this can degrade performance.
- Record delimiter string. Specify the string at the end of each record. Enter one or more characters. This is mutually exclusive with Record delimiter, which is the default, and record type and record prefix.
- Record delimiter. Specify the single character at the end of each record. Type a character or select one of the following:
- UNIX Newline (the default)
- null
To specify a DOS newline, set the Record delimiter string property to "\r\n".
Record delimiter is mutually exclusive with Record delimiter string, Record prefix, and Record type.
- Record length. Select Fixed where fixed-length fields are being read; DataStage calculates the appropriate length for the record. Alternatively, specify the length of fixed records as a number of bytes. This is not used by default (default files are comma-delimited).
- Record Prefix. Specifies that a variable-length record is prefixed by a 1-, 2-, or 4-byte length prefix. It is set to 1 by default. This is mutually exclusive with Record delimiter, which is the default, and record delimiter string and record type.
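The length-prefix layout that Record prefix describes can be sketched in Python as follows; `read_prefixed_records` and the sample bytes are invented for illustration, and big-endian prefixes are an assumption of the sketch.

```python
# Sketch: records preceded by a 2-byte length prefix. The prefix is read
# to find the record boundary but is not emitted as part of the data.

def read_prefixed_records(data, prefix_bytes=2, byteorder="big"):
    records = []
    pos = 0
    while pos < len(data):
        length = int.from_bytes(data[pos:pos + prefix_bytes], byteorder)
        pos += prefix_bytes              # skip the length prefix
        records.append(data[pos:pos + length])
        pos += length
    return records

raw = b"\x00\x05hello\x00\x02hi"         # two prefixed records
recs = read_prefixed_records(raw)
```

Here the prefixes 0x0005 and 0x0002 delimit the records `hello` and `hi`; no delimiter characters appear in the data at all.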
- Record type. Specifies that data consists of variable-length blocked records (varying) or implicit records (implicit). If you choose the implicit property, data is written as a stream with no explicit record boundaries. The end of the record is inferred when all of the columns defined by the schema have been parsed. The varying property allows you to specify one of the following IBM blocked or spanned formats: V, VB, VS, VBS, or VR. Data is imported by using one of these formats. These formats are not available for export. This property is mutually exclusive with Record length, Record delimiter, Record delimiter string, and Record prefix, and by default is not used.
Field Defaults
Defines default properties for columns read from the file or files. These are applied to all columns, but can be overridden for individual columns from the Columns tab using the Edit Column Metadata dialog box. A common reason to override a property for an individual column occurs when reading comma-separated values (CSV) files. CSV files often enclose fields in quotes when the fields might contain a special character, such as the field delimiter. In this case, the Quote property for the columns in question should be overridden.
Where you can enter a character, this can usually be an ASCII character or a multi-byte Unicode character (if you have NLS enabled). The available properties are:
- Actual field length. Specifies the actual number of bytes to skip if the field's length equals the setting of the null field length property.
- Delimiter. Specifies the trailing delimiter of all fields in the record. Type an ASCII character or select one of whitespace, end, none, null, comma, or tab. DataStage skips the delimiter when reading. The options are:
- whitespace. Whitespace characters at the end of a column are ignored, that is, are not treated as part of the column.
- end. The end of a field is taken as the delimiter, that is, there is no separate delimiter. This is not the same as a setting of none, which is used for fields with fixed-width columns.
- none. No delimiter (used for fixed-width).
- null. ASCII Null character is used.
- comma. ASCII comma character is used.
- tab. ASCII tab character is used.
- Delimiter string. Specify the string at the end of each field. Enter one or more characters. This is mutually exclusive with Delimiter, which is the default. For example, specifying `, ` (comma space - you do not need to enter the quotation marks) specifies that each field is delimited by `, ` unless overridden for individual fields. DataStage skips the delimiter string when reading.
- Null field length. The length in bytes of a variable-length field that contains a null. When a variable-length field is read, a length of null field length in the source field indicates that it contains a null. This property is mutually exclusive with null field value.
- Null field value. Specifies the value given to a null field if the source is set to null. Can be a number, string, or C-type literal escape character. For example, you can represent a byte value by \ooo, where each o is an octal digit 0 - 7 and the first o is < 4, or by \xhh, where each h is a hexadecimal digit 0 - F. You must use this form to encode non-printable byte values.
This property is mutually exclusive with Null field length and Actual length. For a fixed width data representation, you can use Pad char (from the general section of Type defaults) to specify a repeated trailing character if the value you specify is shorter than the fixed width of the field.
You can specify a list of null values that a column could contain that represent null. To do this you specify a separator character in the dependent Null field value separator property, and then use this separator to delimit the null values in the Null field value property. For example, if you set Null field value separator to contain the slash character (/), then you could specify NULL/null/NUL/nul to specify that any of these strings could represent a null value in this column.
- Null field value separator. This is a dependent property of Null field value. You can specify a separator that can be used in the Null field value property to specify a range of values that could represent the null value. You can specify a number, string, or C-type literal escape character (as for Null field value) as a separator, but a single character such as a comma (,) or slash (/) is the best choice. Specify a separator only if you specify multiple values in Null field value; specifying a separator without using it causes a runtime error.
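The NULL/null/NUL/nul example above can be sketched in Python as follows; `build_null_set` and `decode_field` are invented helpers that only illustrate the separator semantics.

```python
# Sketch: a Null field value of "NULL/null/NUL/nul" with a Null field
# value separator of "/". Any listed string is treated as a null.

def build_null_set(null_field_value, separator=None):
    if separator:
        return set(null_field_value.split(separator))
    return {null_field_value}

def decode_field(text, null_values):
    # A matching representation becomes a real null (None here).
    return None if text in null_values else text

nulls = build_null_set("NULL/null/NUL/nul", separator="/")
row = [decode_field(f, nulls) for f in ["abc", "NUL", "nul"]]
```

Both `NUL` and `nul` decode to a null, while ordinary values pass through unchanged.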
- Prefix bytes. You can use this option with variable-length fields. Variable-length fields can be either delimited by a character or preceded by a 1-, 2-, or 4-byte prefix containing the field length. DataStage reads the length prefix but does not include the prefix as a separate field in the data set it reads from the file. This property is mutually exclusive with the Delimiter, Quote, and Final Delimiter properties, which are used by default.
- Print field. This property is intended for use when debugging jobs. Set it to have DataStage produce a message for every field it reads. The message has the format:
Importing N: D
where:
- N is the field name.
- D is the imported data of the field. Non-printable characters contained in D are prefixed with an escape character and written as C string literals; if the field contains binary data, it is output in octal format.
- Quote. Specifies that variable-length fields are enclosed in single quotes, double quotes, or another character or pair of characters. Choose Single or Double, or enter a character. This is set to double quotes by default. When reading, DataStage ignores the leading quote character and reads all bytes up to but not including the trailing quote character.
- Vector prefix. For fields that are variable-length vectors, specifies that a 1-, 2-, or 4-byte prefix contains the number of elements in the vector. You can override this default prefix for individual vectors. Variable-length vectors must use either a prefix on the vector or a link to another field in order to specify the number of elements in the vector. If the variable-length vector has a prefix, you use this property to indicate the prefix length. DataStage reads the length prefix but does not include it as a separate field in the data set. By default, the prefix length is assumed to be one byte.
Type Defaults
These are properties that apply to all columns of a specific data type unless specifically overridden at the column level. They are divided into a number of subgroups according to data type.
General
- Byte order. Specifies how multiple-byte data types (except string and raw data types) are ordered. Choose from:
- little-endian. The high byte is on the right.
- big-endian. The high byte is on the left.
- native-endian. As defined by the native format of the machine. This is the default.
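The difference between the two fixed orderings can be seen with Python's struct module, used here purely for illustration.

```python
# Sketch: the same 32-bit integer laid out in little-endian and
# big-endian byte order.
import struct

value = 0x01020304
little = struct.pack("<I", value)   # least significant byte first
big = struct.pack(">I", value)      # most significant byte first
```

The two packings contain the same four bytes in reverse order, which is why reading data with the wrong byte order setting silently produces wrong values rather than errors.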
- Data Format. Specifies the data representation format of a field. Applies to fields of all data types except string, ustring, and raw, and to record, subrec, or tagged fields containing at least one field that is neither string nor raw. Choose from:
- binary
- text (the default)
A setting of binary has different meanings when applied to different data types:
- For decimals, binary means packed.
- For other numerical data types, binary means "not text".
- For dates, binary is equivalent to specifying the julian property for the date field.
- For time, binary is equivalent to midnight_seconds.
- For timestamp, binary specifies that the first integer contains a Julian day count for the date portion of the timestamp and the second integer specifies the time portion of the timestamp as the number of seconds from midnight. A binary timestamp specifies that two 32-bit integers are written.
By default data is formatted as text, as follows:
- For the date data type, text specifies that the data read contains a text-based date in the form %yyyy-%mm-%dd or in the default date format if you have defined a new one on an NLS system.
- For the decimal data type: a field represents a decimal in a string format with a leading space or '-' followed by decimal digits with an embedded decimal point if the scale is not zero. The destination string format is: [+ | -]ddd.[ddd] and any precision and scale arguments are ignored.
- For numeric fields (int8, int16, int32, uint8, uint16, uint32, sfloat, and dfloat): DataStage assumes that numeric fields are represented as text.
- For the time data type: text specifies that the field represents time in the text-based form %hh:%nn:%ss or in the default date format if you have defined a new one on an NLS system.
- For the timestamp data type: text specifies a text-based timestamp in the form %yyyy-%mm-%dd %hh:%nn:%ss or in the default date format if you have defined a new one on an NLS system.
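The two 32-bit integers of a binary timestamp described above can be sketched in Python. The offset 1721425 maps Python's date ordinals (1 = 0001-01-01) to Julian day numbers, and the little-endian packing here is an assumption of the sketch, not a statement about DataStage's on-disk order.

```python
# Sketch: encode a timestamp as a Julian day count plus seconds from
# midnight, packed as two 32-bit integers.
import struct
from datetime import datetime

def binary_timestamp(ts):
    julian_day = ts.toordinal() + 1721425      # ordinal -> Julian day number
    seconds = ts.hour * 3600 + ts.minute * 60 + ts.second
    return struct.pack("<ii", julian_day, seconds)

packed = binary_timestamp(datetime(2000, 1, 1, 0, 1, 30))
day, secs = struct.unpack("<ii", packed)
```

For 2000-01-01 the Julian day count is 2451545, and 00:01:30 is 90 seconds from midnight.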
- Field max width. The maximum number of bytes in a column represented as a string. Enter a number. This is useful where you are storing numbers as text. If you are using a fixed-width character set, you can calculate the length exactly. If you are using variable-length character set, calculate an adequate maximum width for your fields. Applies to fields of all data types except date, time, timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type.
- Field width. The number of bytes in a field represented as a string. Enter a number. This is useful where you are storing numbers as text. If you are using a fixed-width character set, you can calculate the number of bytes exactly. If you are using a variable-length encoding, base your calculation on the width and frequency of your variable-width characters. Applies to fields of all data types except date, time, timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type.
If you specify neither field width nor field max width, numeric fields written as text have the following number of bytes as their maximum width:
- 8-bit signed or unsigned integers: 4 bytes
- 16-bit signed or unsigned integers: 6 bytes
- 32-bit signed or unsigned integers: 11 bytes
- 64-bit signed or unsigned integers: 21 bytes
- single-precision float: 14 bytes (sign, digit, decimal point, 7 fraction, "E", sign, 2 exponent)
- double-precision float: 24 bytes (sign, digit, decimal point, 16 fraction, "E", sign, 3 exponent)
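These integer widths can be checked against the widest decimal rendering of each signed type, which is its most negative value; note that the listed 64-bit figure of 21 bytes leaves one byte beyond the 20-character minimum shown below.

```python
# Verify the stated maximum text widths for signed integers written as text.
w8 = len(str(-2**7))     # "-128"                 -> 4 bytes
w16 = len(str(-2**15))   # "-32768"               -> 6 bytes
w32 = len(str(-2**31))   # "-2147483648"          -> 11 bytes
w64 = len(str(-2**63))   # "-9223372036854775808" -> 20 bytes
```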
- Pad char. This property is ignored for output links.
- Character set. Specifies the character set. Choose from ASCII or EBCDIC. The default is ASCII. Applies to all data types except raw and ustring and record, subrec, or tagged containing no fields other than raw or ustring.
String
- Export EBCDIC as ASCII. Not relevant for output links.
- Import ASCII as EBCDIC. Select this to specify that ASCII characters are read as EBCDIC characters.
Decimal
- Allow all zeros. Specifies whether to treat a packed decimal column containing all zeros (which is normally illegal) as a valid representation of zero. Select Yes or No. The default is No.
- Decimal separator. Specify the ASCII character that acts as the decimal separator (period by default).
- Packed. Select an option to specify what the decimal columns contain. Choose from:
- Yes to specify that the decimal fields contain data in packed decimal format (the default). This has the following sub-properties:
Check. Select Yes to verify that data is packed, or No to not verify.
Signed. Select Yes to use the existing sign when reading decimal fields. Select No to write a positive sign (0xf) regardless of the fields' actual sign value.
- No (separate) to specify that they contain unpacked decimal with a separate sign byte. This has the following sub-property:
Sign Position. Choose leading or trailing as appropriate.
- No (zoned) to specify that they contain an unpacked decimal in either ASCII or EBCDIC text. This has the following sub-property:
Sign Position. Choose leading or trailing as appropriate.
- No (overpunch) to specify that the field has a leading or end byte that contains a character which specifies both the numeric value of that byte and whether the number as a whole is negatively or positively signed. This has the following sub-property:
Sign Position. Choose leading or trailing as appropriate.
- Precision. Specifies the precision of a packed decimal. Enter a number.
- Rounding. Specifies how to round the source field to fit into the destination decimal when reading a source field to a decimal. Choose from:
- up (ceiling). Truncate source column towards positive infinity. This mode corresponds to the IEEE 754 Round Up mode. For example, 1.4 becomes 2, -1.6 becomes -1.
- down (floor). Truncate source column towards negative infinity. This mode corresponds to the IEEE 754 Round Down mode. For example, 1.6 becomes 1, -1.4 becomes -2.
- nearest value. Round the source column towards the nearest representable value. This mode corresponds to the COBOL ROUNDED mode. For example, 1.4 becomes 1, 1.5 becomes 2, -1.4 becomes -1, -1.5 becomes -2.
- truncate towards zero. This is the default. Discard fractional digits to the right of the right-most fractional digit supported by the destination, regardless of sign. For example, if the destination is an integer, all fractional digits are truncated. If the destination is another decimal with a smaller scale, truncate to the scale size of the destination decimal. This mode corresponds to the COBOL INTEGER-PART function. Using this method 1.6 becomes 1, -1.6 becomes -1.
- Scale. Specifies the scale of a source packed decimal.
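The four rounding behaviors described above can be sketched in Python for an integer destination; `round_decimal` and its mode names are invented for this illustration.

```python
# Sketch of the four Rounding modes, applied when reducing a value to an
# integer destination.
import math

def round_decimal(x, mode):
    if mode == "ceiling":        # toward positive infinity
        return math.ceil(x)
    if mode == "floor":          # toward negative infinity
        return math.floor(x)
    if mode == "nearest":        # half rounds away from zero (COBOL ROUNDED)
        return int(math.floor(abs(x) + 0.5)) * (1 if x >= 0 else -1)
    if mode == "truncate":       # toward zero (the default)
        return math.trunc(x)
    raise ValueError(mode)
```

Running the examples from the text: 1.4 rounds up to 2 under ceiling, 1.6 rounds down to 1 under floor, -1.5 becomes -2 under nearest, and -1.6 truncates to -1.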
Numeric
- C_format. Perform non-default conversion of data from string data to integer or floating-point data. This property specifies a C-language format string used for reading integer or floating-point strings. This is passed to sscanf(). For example, specifying a C-format of %x and a field width of 8 ensures that a 32-bit integer is formatted as an 8-byte hexadecimal string.
- In_format. Format string used for conversion of data from string to integer or floating-point data. This is passed to sscanf(). By default, DataStage invokes the C sscanf() function to convert a numeric field formatted as a string to either integer or floating-point data. If this function does not output data in a satisfactory format, you can specify the in_format property to pass formatting arguments to sscanf().
- Out_format. This property is not relevant for output links.
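To illustrate what a C-format of %x does on read: sscanf("%x") parses a hexadecimal string into an integer. Python has no sscanf, so `int(s, 16)` stands in for it in this sketch.

```python
# Sketch: the read-side effect of a %x C-format. An 8-character hex
# string field is converted to its 32-bit integer value.

text_field = "0000ffff"          # hex digits stored as text in the file
value = int(text_field, 16)      # what sscanf("%x") would produce
```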
Date
- Days since. Dates are written as a signed integer containing the number of days since the specified date. Enter a date in the form %yyyy-%mm-%dd or in the default date format if you have defined a new one on an NLS system.
- Format string. The string format of a date. By default this is %yyyy-%mm-%dd.
- Is Julian. Select this to specify that dates are written as a numeric value containing the Julian day. A Julian day specifies the date as the number of days from 4713 BCE January 1, 12:00 hours (noon) GMT.
Time
- Format string. Specifies the format of columns representing time as a string. By default this is %hh:%nn:%ss.
- Is midnight seconds. Select this to specify that times are written as a binary 32-bit integer containing the number of seconds elapsed from the previous midnight.
Timestamp
- Format string. Specifies the format of a column representing a timestamp as a string. The format combines the format for date strings and time strings.
Reading an entire file as a single column of output
You can set the formatting properties of the Sequential file connector to output the contents of an entire file in the form of one column of data. For example, instead of the Sequential file connector parsing out the columns and rows of a CSV file, it reads the entirety of the file as one long VARCHAR string. This feature is useful if you want to feed an entire XML file as a single record to the Hierarchical stage for processing.
- Open the properties panel for the Sequential file connector that is in your DataStage flow.
- Click the Output tab, then open the Format section of the properties.
- Select the checkbox for Read entire file as one column, then save your changes.
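The effect of the option can be sketched in Python: instead of parsing the file into rows and columns, the whole body becomes one string value. The sample XML content below is invented for illustration.

```python
# Sketch: "Read entire file as one column" yields the file body as a
# single string, newlines and all, rather than parsed records.
import io

fake_file = io.StringIO("<doc><a>1</a>\n<b>2</b></doc>\n")
single_column_value = fake_file.read()   # one long string
```

That single string is what you would pass as one record to a downstream stage such as the Hierarchical stage.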