transform: syntax and options
The syntax and options of the transform operator.
Terms in italic typeface are option strings you supply. When your option string contains a space or a tab character, you must enclose it in single quotes.
transform
-fileset fileset_description
-table -key field [-ci | -cs]
[-key field [-ci | -cs] ...]
[-allow_dups]
[-save fileset_descriptor]
[-diskpool pool]
[-schema schema | -schemafile schema_file]
[-argvalue job_parameter_name=job_parameter_value ...]
[-collation_sequence locale | collation_file_pathname | OFF]
[-expression expression_string | -expressionfile expressionfile_path ]
[-maxrejectlogs integer]
[-sort {-input | -output [port]} -key field_name sort_key_suboptions ...]
[-part {-input | -output [port]} -key field_name part_key_suboptions ...]
[-flag {compile | run | compileAndRun} [ flag_compilation_options ]]
[-inputschema schema | -inputschemafile schema_file ]
[-outputschema schema | -outputschemafile schema_file ]
[-reject [-rejectinfo reject_info_column_name_string]]
[-oldnullhandling]
[-abortonnull]
Where:
sort_key_suboptions are:
[-ci | -cs] [-asc | -desc] [-nulls {first | last}] [-param params ]
part_key_suboptions are:
[-ci | -cs] [-param params ]
flag_compilation_options are:
[-dir dir_name_for_compilation ] [-name library_path_name ]
[-optimize | -debug] [-verbose] [-compiler cpath ]
[-staticobj absolute_path_name ] [-sharedobj absolute_path_name ] [-t options ]
[-compileopt options] [-linker lpath] [-linkopt options]
The -table and -fileset options allow you to use conditional lookups.
The following option values can contain multi-byte Unicode characters:
- the field names given to the -inputschema and -outputschema options and the ustring values
- -inputschemafile and -outputschemafile files
- -expression option string and the -expressionfile option filepath
- -sort and -part key-field names
- -compiler, -linker, and -dir pathnames
- -name file name
- -staticobj and -sharedobj pathnames
- -compileopt and -linkopt pathnames
Options and their use:
-abortonnull
Syntax: -abortonnull
Specify this option to have a job stopped when an unhandled null is encountered. You can then locate the field and record that contained the null in the job log. If you specify this option together with the -oldnullhandling option, then any nulls that occur in input fields used in output field derivations that are not explicitly handled by the expression cause the job to stop. If you specify the -abortonnull option without specifying the -oldnullhandling option, then only operations such as attempting to set a non-nullable field to null cause the job to stop.
-argvalue
Syntax: -argvalue job_parameter_name=job_parameter_value
This option is similar to the -params top-level osh option, but the initialized variables apply to a transform operator rather than to an entire job. The global variable given by job_parameter_name is initialized with the value given by job_parameter_value.
In your osh script, you reference the value with [& job_parameter_name ]; each occurrence of [& job_parameter_name ] is replaced by job_parameter_value.
-collation_sequence
Syntax: -collation_sequence locale | collation_file_pathname | OFF
This option determines how your string data is sorted. You can:
- Specify a predefined IBM® ICU locale
- Write your own collation sequence using ICU syntax, and supply its collation_file_pathname
- Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence.
By default, InfoSphere® DataStage® sorts strings using byte-wise comparisons.
For more information, reference this IBM ICU site:
http://oss.software.ibm.com/icu/userguide/Collate_Intro.htm
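The three forms of -collation_sequence described above can be sketched as follows. This is only an illustration: the script builds the option strings and prints them without invoking osh, and the expression file name, the fr_FR locale choice, and the collation file path are hypothetical.

```shell
#!/bin/sh
# Sketch only: prints candidate osh command lines; osh itself is not invoked.
# names.trx and /etc/etl/custom_rules.txt are hypothetical names.

base="transform -expressionfile names.trx"

# 1. A predefined ICU locale (fr_FR chosen as an example).
by_locale="$base -collation_sequence fr_FR"

# 2. A custom collation sequence written in ICU syntax, supplied by pathname.
by_file="$base -collation_sequence /etc/etl/custom_rules.txt"

# 3. OFF: compare strings by Unicode code-point value, ignoring locale rules.
code_point="$base -collation_sequence OFF"

for cmd in "$by_locale" "$by_file" "$code_point"; do
    printf 'osh "%s"\n' "$cmd"
done
```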
-expression
Syntax: -expression expression_string
This option lets you specify expressions written in the Transformation Language. The expression string might contain multi-byte Unicode characters.
Unless you choose the -flag option with run, you must use either the -expression or -expressionfile option.
The -expression and -expressionfile options are mutually exclusive.
-expressionfile
Syntax: -expressionfile expression_file
This option lets you specify expressions written in the Transformation Language. The expression must reside in an expression_file, where expression_file is the path and name of the file; the pathname might include multi-byte Unicode characters. You can use an absolute path; otherwise the file is located relative to the current UNIX directory. Unless you choose the -flag option with run, you must choose either the -expression or -expressionfile option.
The -expressionfile and -expression options are mutually exclusive.
-flag
Syntax: -flag {compile | run | compileAndRun} suboptions
compile: This option indicates that you wish to check the Transformation Language expression for correctness, and compile it. An appropriate version of a C++ compiler must be installed on your computer. Field information used in the expression must be known at compile time; therefore, input and output schema must be specified.
run: This option indicates that you wish to use a pre-compiled version of the Transformation Language code. You do not need to specify input and output schemas or an expression because these elements have been supplied at compile time. However, you must add the directory containing the pre-compiled library to your library search path. This is not done by the transform operator. You must also use the -name suboption to provide the name of the library where the pre-compiled code resides.
compileAndRun: This option indicates that you wish to compile and run the Transformation Language expression. This is the default value. An appropriate version of a C++ compiler must be installed on your computer.
You can supply schema information in the following ways:
- You can omit all schema specifications. The transform operator then uses the up-stream operator's output schema as its input schema, and the schema for each output data set contains all the fields from the input record plus any new fields you create for a data set.
- You can omit the input data set schema, but specify schemas for all output data sets or for selected data sets. The transform operator then uses the up-stream operator's output schema as its input schema. Any output schemas specified on the command line are used unchanged, and output data sets without schemas contain all the fields from the input record plus any new fields you create for a data set.
- You can specify an input schema, but omit all output schemas or omit some output schemas. The transform operator then uses the input schema as specified. Any output schemas specified on the command line are used unchanged, and output data sets without schemas contain all the fields from the input record plus any new fields you create for a data set.
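For the compile case, where both schemas must be supplied, a command line might be assembled as in the following sketch. It only writes a small Transformation Language file and prints the resulting osh command; osh is not invoked. The field names, schemas, expression file, and library name total_lib are all hypothetical.

```shell
#!/bin/sh
# Sketch only: builds and prints an osh command line; osh itself is not invoked.
# All field names, schemas, and file names are hypothetical.

workdir=$(mktemp -d)
expr_file="$workdir/total.trx"

# A small Transformation Language program with one input and one output.
cat > "$expr_file" <<'EOF'
inputname 0 in0;
outputname 0 out0;
mainloop
{
    out0.total = in0.qty * in0.price;
    writerecord 0;
}
EOF

in_schema='record(qty:int32; price:dfloat;)'
out_schema='record(qty:int32; price:dfloat; total:dfloat;)'

# The schemas contain spaces, so each is enclosed in single quotes
# inside the osh command string.
compile_cmd="transform -inputschema '$in_schema' -outputschema '$out_schema' -expressionfile $expr_file -flag compile -name total_lib"

printf 'osh "%s"\n' "$compile_cmd"
```

With compileAndRun, the same command could drop either schema option, in which case the rules listed above decide which schema the operator uses.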
The -flag option has the following suboptions:
-dir dir_name lets you specify a compilation directory. By default, compilation occurs in the TMPDIR directory or, if this environment variable does not point to an existing directory, in the /tmp directory. Whether or not you specify it, you must make sure the compilation directory is in the library search path.
-name file_name lets you specify the name of the file containing the compiled code. If you use the -dir dir_name suboption, this file is in the dir_name directory.
The following examples show how to use the -dir and -name suboptions in an osh command line:
For development:
osh "transform -inputschema schema -outputschema schema -expression expression -flag compile -dir dir_name -name file_name"
For your production machine:
osh "... | transform -flag run -name file_name | ..."
The library file must be copied to the production machine.
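The compile-then-run workflow, including the library search path step that the transform operator does not perform for you, might look like the following sketch. It only prints the command lines and does not invoke osh. The directory, library name, schema and data set file names are hypothetical, and LD_LIBRARY_PATH is an assumption about the platform (AIX uses LIBPATH and HP-UX uses SHLIB_PATH instead).

```shell
#!/bin/sh
# Sketch only: prints the two osh command lines; osh itself is not invoked.
# lib_dir, lib_name, and all file names are hypothetical.

lib_dir=/opt/etl/transform_libs
lib_name=total_lib

# Development: compile the expression into lib_dir under the name lib_name.
dev_cmd="transform -inputschemafile in.schema -outputschemafile out.schema -expressionfile total.trx -flag compile -dir $lib_dir -name $lib_name"

# Production: make the directory holding the copied library searchable.
# LD_LIBRARY_PATH is assumed here; the variable name differs by platform.
LD_LIBRARY_PATH="$lib_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export LD_LIBRARY_PATH

# With -flag run, no schemas or expression are given; only the library name.
prod_cmd="transform -flag run -name $lib_name < in.ds > out.ds"

printf 'osh "%s"\n' "$dev_cmd"
printf 'osh "%s"\n' "$prod_cmd"
```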
-flag compile and -flag compileAndRun have these additional suboptions:
-optimize specifies the optimize mode for compilation.
-debug specifies the debug mode for compilation.
-verbose causes verbose messages to be output during compilation.
-compiler cpath lets you specify the compiler path when the compiler is not in the default directory. The default compiler path for each operating system is:
Solaris: /opt/SUNPRO6/SUNWspro/bin/CC
AIX®: /usr/vacpp/bin/xlC_r
Tru64: /bin/cxx
HP-UX: /opt/aCC/bin/aCC
-staticobj absolute_path_name
-sharedobj absolute_path_name
These two suboptions specify the location of your static and dynamic-linking C-object libraries. The file suffix can be omitted. See External global C-function support for details.
-compileopt options lets you specify additional compiler options. These options are compiler-dependent. Pathnames might contain multi-byte Unicode characters.
-linker lpath lets you specify the linker path when the linker is not in the default directory. The default linker path of each operating system is the same as the default compiler path listed above.
-linkopt options lets you specify link options to the compiler. Pathnames might contain multi-byte Unicode characters.
-inputschema
Syntax: -inputschema schema
Use this option to specify an input schema. The schema might contain multi-byte Unicode characters. An error occurs if an expression refers to an input field not in the input schema.
The -inputschema and the -inputschemafile options are mutually exclusive.
The -inputschema option is not required when you specify compileAndRun or run for the -flag option; however, when you specify compile for the -flag option, you must include either the -inputschema or the -inputschemafile option. See the -flag option description for information on the compile value.
-inputschemafile
Syntax: -inputschemafile schema_file
Use this option to specify an input schema. An error occurs if an expression refers to an input field not in the input schema. To use this option, the input schema must reside in a schema_file, where schema_file is the path and name of the file; the pathname might contain multi-byte Unicode characters. You can use an absolute path; otherwise the file is located relative to the current UNIX directory.
The -inputschemafile and the -inputschema options are mutually exclusive.
The -inputschemafile option is not required when you specify compileAndRun or run for the -flag option; however, when you specify compile for the -flag option, you must include either the -inputschema or the -inputschemafile option. See the -flag option description for information on the compile value.
-maxrejectlogs
Syntax: -maxrejectlogs integer
An information log is generated every time a record is written to the reject output data set. Use this option to specify the maximum number of reject logs that the transform operator generates. The default is 50. When you specify -1 for this option, an unlimited number of information logs are generated.
-oldnullhandling
Syntax: -oldnullhandling
Use this option to reinstate old-style null handling. This setting means that, when you use an input field in the derivation expression of an output field, you have to explicitly handle any nulls that occur in the input data. If you do not specify such handling, a null causes the record to be dropped or rejected. If you do not specify the -oldnullhandling option, then a null in the input field used in the derivation causes a null to be output.
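The interaction between -oldnullhandling and -abortonnull can be summarized as a small decision function. This is only a restatement of the two option descriptions as a sketch; the function name null_outcome and its 0/1 arguments are hypothetical and not part of osh.

```shell
#!/bin/sh
# Sketch only: restates the documented outcome for an unhandled null in an
# input field that is used in an output field derivation.
# $1 = 1 if -oldnullhandling is specified, $2 = 1 if -abortonnull is specified.
null_outcome() {
    old=$1
    abort=$2
    if [ "$old" = 1 ] && [ "$abort" = 1 ]; then
        echo "job stops"
    elif [ "$old" = 1 ]; then
        echo "record dropped or rejected"
    elif [ "$abort" = 1 ]; then
        # The derivation yields a null; the job stops only for operations
        # such as setting a non-nullable field to null.
        echo "null output; job stops only for operations such as nulling a non-nullable field"
    else
        echo "null output"
    fi
}

null_outcome 0 0
null_outcome 1 0
null_outcome 1 1
null_outcome 0 1
```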
-outputschema
Syntax: -outputschema schema
Use this option to specify an output schema. An error occurs if an expression refers to an output field not in the output schema.
The -outputschema and -outputschemafile options are mutually exclusive.
The -outputschema option is not required when you specify compileAndRun or run for the -flag option; however, when you specify compile for the -flag option, you must include either the -outputschema or the -outputschemafile option. See the -flag option description for information on the compile value. For multiple output data sets, repeat the -outputschema or -outputschemafile option to specify the schema for all output data sets.
-outputschemafile
Syntax: -outputschemafile schema_file
Use this option to specify an output schema. An error occurs if an expression refers to an output field not in the output schema. To use this option, the output schema must reside in a schema_file, where schema_file is the path and name of the file. You can use an absolute path; otherwise the file is located relative to the current UNIX directory.
The -outputschemafile and the -outputschema options are mutually exclusive.
The -outputschemafile option is not required when you specify compileAndRun or run for the -flag option; however, when you specify compile for the -flag option, you must include either the -outputschema or the -outputschemafile option. See the -flag option description for information on the compile value. For multiple output data sets, repeat the -outputschema or -outputschemafile option to specify the schema for all output data sets.
-part
Syntax: -part {-input | -output [port]} -key field_name [-ci | -cs] [-param params]
You can use this option 0 or more times. It indicates that the data is hash partitioned. The required field_name is the name of a partitioning key.
Exactly one of the suboptions -input and -output [port] must be present. These suboptions determine whether partitioning occurs on the input data or the output data. The default for port is 0. If port is specified, it must be an integer that represents an output data set where the data is partitioned.
The suboptions to the -key option are -ci for case-insensitive partitioning, or -cs for case-sensitive partitioning. The default is case-sensitive. Use the -param suboption to specify any property=value pairs. Separate the pairs by commas (,).
-reject
Syntax: -reject [-rejectinfo reject_info_column_name_string]
This option is optional. You can use it only once.
When a null field is used in an expression, this option specifies that the input record containing the field is not dropped, but is sent to the output reject data set.
The -rejectinfo suboption specifies the column name for the reject information.
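A command line that combines -reject, -rejectinfo, and -maxrejectlogs might be assembled as in the following sketch, which prints the command without invoking osh. The column name reject_reason and all file names are hypothetical, and writing the reject data set as the last output redirection is an assumption rather than something this section states.

```shell
#!/bin/sh
# Sketch only: prints an osh command line; osh itself is not invoked.
# reject_reason, orders.trx, and the .ds file names are hypothetical.

# Send records with unhandled nulls to a reject data set, record the reason
# in a column named reject_reason, and cap reject log messages at 10.
# (-maxrejectlogs -1 would allow an unlimited number; the default is 50.)
reject_opts="-reject -rejectinfo reject_reason -maxrejectlogs 10"

# Assumption: the reject data set is the last output of the operator.
cmd="transform -expressionfile orders.trx $reject_opts < orders.ds > good.ds > rejected.ds"

printf 'osh "%s"\n' "$cmd"
```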
-sort
Syntax: -sort {-input | -output [port]} -key field_name [-ci | -cs] [-asc | -desc] [-nulls {first | last}] [-param params]
You can use this option 0 or more times. It indicates that the data is sorted for each partition. The required field_name is the name of a sorting key.
Exactly one of the suboptions -input and -output[ port ] must be present. These suboptions determine whether sorting occurs on the input data or the output data. The default for port is 0. If port is specified, it must be an integer that represents the output data set where the data is sorted.
You can specify -ci for a case-insensitive sort, or -cs for a case-sensitive sort. The default is case-sensitive.
You can specify -asc for an ascending order sort or -desc for a descending order sort. The default is ascending.
You can specify -nulls {first | last} to determine where null values should sort. The default is that nulls sort first.
You can use -param params to specify any property = value pairs. Separate the pairs by commas (,).
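The -part and -sort options can be combined on one command line, as in the following sketch. It only assembles and prints the option string; osh is not invoked, and the field names and file names are hypothetical.

```shell
#!/bin/sh
# Sketch only: assembles and prints an osh command line; osh is not invoked.
# cust_id, order_date, orders.trx, and the .ds file names are hypothetical.

# Hash partitioning applies to the input data, keyed on cust_id, ignoring case.
part_opts="-part -input -key cust_id -ci"

# Sorting applies to the input data, keyed on order_date in descending order,
# with nulls placed last instead of the default (first).
sort_opts="-sort -input -key order_date -desc -nulls last"

cmd="transform -expressionfile orders.trx $part_opts $sort_opts < orders.ds > out.ds"

printf 'osh "%s"\n' "$cmd"
```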
-table
Syntax: -table -key field [-ci | -cs] [-key field [-ci | -cs] ...] [-allow_dups] [-save fileset_descriptor] [-diskpool pool] [-schema schema | -schemafile schema_file]
Specifies the beginning of a list of key fields and other specifications for a lookup table. The first occurrence of -table marks the beginning of the key field list for lookup table1; the next occurrence of -table marks the beginning of the key fields for lookup table2, and so on. For example:
lookup -table -key field -table -key field
The -key option specifies the name of a lookup key field. The -key option must be repeated if there are multiple key fields. You must specify at least one key for each table. You cannot use a vector, subrecord, or tagged aggregate field as a lookup key.
The -ci suboption specifies that the string comparison of lookup key values is to be case insensitive; the -cs option specifies case-sensitive comparison, which is the default.
In create-only mode, the -allow_dups option causes the operator to save multiple copies of duplicate records in the lookup table without issuing a warning. Two lookup records are duplicates when all lookup key fields have the same value in the two records. If you do not specify this option, InfoSphere DataStage issues a warning message when it encounters duplicate records and discards all but the first of the matching records.
In normal lookup mode, only one lookup table (specified by either -table or -fileset) can have been created with -allow_dups set.
The -save option lets you specify the name of a fileset to write this lookup table to; if -save is omitted, tables are written as scratch files and deleted at the end of the lookup. In create-only mode, -save is, of course, required.
The -diskpool option lets you specify a disk pool in which to create lookup tables. By default, the operator looks first for a "lookup" disk pool, then uses the default pool (""). Use this option to specify a different disk pool to use.
The -schema suboption specifies the schema that interprets the contents of the string or raw fields by converting them to another data type. The -schemafile suboption specifies the name of a file containing the schema that interprets the content of the string or raw fields by converting them to another data type. You can specify either -schema or -schemafile, but not both. One of them is required when you specify compile for the -flag option; neither is required for compileAndRun or run.
-fileset
Syntax: [-fileset fileset_descriptor ...]
Specify the name of a fileset containing one or more lookup tables to be matched.
In lookup mode, you must specify either the -fileset option, or a table specification, or both, in order to designate the lookup table(s) to be matched against. There can be zero or more occurrences of the -fileset option. It cannot be specified in create-only mode.
Warning: The fileset already contains key specifications. When you follow -fileset fileset_descriptor with key_specifications, the keys specified do not apply to the fileset; rather, they apply to the first lookup table. For example, lookup -fileset file -key field is the same as:
lookup -fileset file -table -key field