Defining build stages

You define a Build stage to provide a custom operator that can be executed from a parallel job stage.

The stage will be available to all jobs in the project in which the stage was defined. You can make it available to other projects using the InfoSphere® DataStage® Export facilities. The stage is automatically added to the job palette.

When defining a Build stage you provide the following information:

Description of the data that will be input to the stage.
Whether records are transferred from input to output. A transfer copies the input record to the output buffer. If you specify auto transfer, the operator transfers the input record to the output record immediately after execution of the per record code. The code can still access data in the output buffer until it is actually written.
Any definitions and header file information that need to be included.
Code that is executed at the beginning of the stage (before any records are processed).
Code that is executed at the end of the stage (after all records have been processed).
Code that is executed every time the stage processes a record.
Compilation and build details for actually building the stage.
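
The order in which the three code sections run can be sketched in plain C++. This is only an illustrative model, not code that InfoSphere DataStage generates: the names Record, BuildLogic, runStage, and traceTotals are hypothetical, and a real Build stage works on records described by table definitions rather than on a C++ struct.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a record flowing through the stage.
struct Record {
    std::string name;
    int amount;
};

// Hypothetical model of the three code sections of a Build stage.
struct BuildLogic {
    std::function<void()> preLoop;                 // runs once, before any record
    std::function<void(const Record&)> perRecord;  // runs once for every record
    std::function<void()> postLoop;                // runs once, after all records
};

// Drives the sections in the order described in the list above.
void runStage(const std::vector<Record>& input, const BuildLogic& logic) {
    logic.preLoop();
    for (const Record& r : input) {
        logic.perRecord(r);
    }
    logic.postLoop();
}

// Example logic: totals the amounts and records the order of the calls.
std::vector<std::string> traceTotals(const std::vector<Record>& input, int& total) {
    std::vector<std::string> trace;
    BuildLogic logic;
    logic.preLoop = [&] { total = 0; trace.push_back("pre-loop"); };
    logic.perRecord = [&](const Record& r) {
        total += r.amount;
        trace.push_back("per-record");
    };
    logic.postLoop = [&] { trace.push_back("post-loop"); };
    runStage(input, logic);
    return trace;
}
```

For two input records, traceTotals reports one pre-loop call, two per-record calls, and one post-loop call, in that order.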

Note that the custom operator that your build stage executes must have at least one input data set and one output data set.

The code for the Build stage is specified in C++. A number of macros are available to make coding simpler (see "Build Stage Macros"). A number of header files containing many useful functions are also available (see "Header files").

When you have specified the information, and request that the stage is generated, InfoSphere DataStage generates a number of files and then compiles these to build an operator which the stage executes. The generated files include:

Header files (ending in .h)
Source files (ending in .c)
Object files (ending in .so)

The following shows a build stage in diagrammatic form:

Do one of the following:
1. Choose File > New from the Designer menu. The New dialog box appears.
2. Open the Stage Type folder and select the Parallel Build Stage Type icon.
3. Click OK. The Stage Type dialog box appears, with the General page on top.
  Or:
1. Select a folder in the repository tree.
2. Choose New > Other > Parallel Stage > Custom from the shortcut menu. The Stage Type dialog box appears, with the General page on top.
Fill in the fields on the General page as follows:
- Stage type name. This is the name that the stage will be known by to InfoSphere DataStage. Avoid using the same name as existing stages.
- Class Name. The name of the C++ class. By default this takes the name of the stage type.
- Parallel Stage type. This indicates the type of new parallel job stage you are defining (Custom, Build, or Wrapped). You cannot change this setting.
- Execution mode. Choose the default execution mode. This is the mode that will appear in the Advanced tab on the stage editor. You can override this mode for individual instances of the stage as required, unless you select Parallel only or Sequential only. See InfoSphere DataStage Parallel Job Developer Guide for a description of the execution mode.
- Preserve Partitioning. This shows the default setting of the Preserve Partitioning flag, which you cannot change in a Build stage. This is the setting that will appear in the Advanced tab on the stage editor. You can override this setting for individual instances of the stage as required. See InfoSphere DataStage Parallel Job Developer Guide for a description of the preserve partitioning flag.
- Partitioning. This shows the default partitioning method, which you cannot change in a Build stage. This is the method that will appear in the Inputs Page Partitioning tab of the stage editor. You can override this method for individual instances of the stage as required. See InfoSphere DataStage Parallel Job Developer Guide for a description of the partitioning methods.
- Collecting. This shows the default collection method, which you cannot change in a Build stage. This is the method that will appear in the Inputs Page Partitioning tab of the stage editor. You can override this method for individual instances of the stage as required. See InfoSphere DataStage Parallel Job Developer Guide for a description of the collection methods.
- Operator. The name of the operator that your code is defining and which will be executed by the InfoSphere DataStage stage. By default this takes the name of the stage type.
- Short Description. Optionally enter a short description of the stage.
- Long Description. Optionally enter a long description of the stage.
Go to the Creator page and optionally specify information about the stage you are creating. We recommend that you assign a release number to the stage so you can keep track of any subsequent changes.
You can specify that the actual stage will use a custom GUI by entering the ProgID for a custom GUI in the Custom GUI Prog ID field.

You can also specify that the stage has its own icon. You need to supply a 16 x 16 pixel bitmap and a 32 x 32 pixel bitmap to be displayed in various places in the InfoSphere DataStage user interface. Click the 16 x 16 Bitmap button and browse for the smaller bitmap file. Click the 32 x 32 Bitmap button and browse for the larger bitmap file. Note that bitmaps with 32-bit color are not supported. Click the Reset Bitmap Info button to revert to using the default InfoSphere DataStage icon for this stage.
Go to the Properties page. This allows you to specify the options that the Build stage requires as properties that appear in the Stage Properties tab. For custom stages the Properties tab always appears under the Stage page.
Fill in the fields as follows:
- Property name. The name of the property. This will be passed to the operator you are defining as an option, prefixed with '-' and followed by the value selected in the Properties tab of the stage editor.
- Data type. The data type of the property. Choose from:
  Boolean
  Float
  Integer
  String
  Pathname
  List
  Input Column
  Output Column

  If you choose Input Column or Output Column, when the stage is included in a job a drop-down list will offer a choice of the defined input or output columns.

  If you choose List you should open the Extended Properties dialog box from the grid shortcut menu to specify what appears in the list.
- Prompt. The name of the property that will be displayed on the Properties tab of the stage editor.
- Default Value. The value the option will take if no other is specified.
- Required. Set this to True if the property is mandatory.
- Conversion. Specifies the type of property as follows:
  -Name. The name of the property will be passed to the operator as the option value. This will normally be a hidden property, that is, not visible in the stage editor.
  
  -Name Value. The name of the property will be passed to the operator as the option name, and any value specified in the stage editor is passed as the value.
  
  -Value. The value for the property specified in the stage editor is passed to the operator as the option name. Typically used to group operator options that are mutually exclusive.
  
  Value only. The value for the property specified in the stage editor is passed as it is.
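
One way to read the four conversion types is sketched below in plain C++. This is an interpretation, not the product's implementation: it assumes that the leading '-' in a conversion type name means the generated token is prefixed with '-', and the names Conversion and toOperatorArgs are hypothetical.

```cpp
#include <string>
#include <vector>

// Hypothetical names for the four conversion types listed above.
enum class Conversion { Name, NameValue, Value, ValueOnly };

// Returns the tokens assumed to be passed to the operator for one property.
// Assumption: a '-' in the conversion type name means a '-' prefix on the token.
std::vector<std::string> toOperatorArgs(Conversion c,
                                        const std::string& name,
                                        const std::string& value) {
    switch (c) {
        case Conversion::Name:       // -Name: only the property name is passed
            return {"-" + name};
        case Conversion::NameValue:  // -Name Value: name as option, then its value
            return {"-" + name, value};
        case Conversion::Value:      // -Value: the chosen value becomes the option name
            return {"-" + value};
        case Conversion::ValueOnly:  // Value only: the value is passed as it is
            return {value};
    }
    return {};
}
```

Under these assumptions, a property named mode with the value fast yields the tokens -mode and fast for -Name Value, and the single token -fast for -Value.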
If you want to specify a list property, or otherwise control how properties are handled by your stage, choose Extended Properties from the Properties grid shortcut menu to open the Extended Properties dialog box.
The settings you use depend on the type of property you are specifying:
- Specify a category to have the property appear under this category in the stage editor. By default all properties appear in the Options category.
- If you are specifying a List property, specify the possible values for list members in the List Value field.
- If the property is to be a dependent of another property, select the parent property in the Parents field.
- Specify an expression in the Template field to have the actual value of the property generated at compile time. It is usually based on values in other properties and columns.
- Specify an expression in the Conditions field to indicate that the property is only valid if the conditions are met. The specification of this property is a bar '|' separated list of conditions that are AND'ed together. For example, if the specification was a=b|c!=d, then this property would only be valid (and therefore only available in the GUI) when property a is equal to b, and property c is not equal to d.
  Click OK when you are happy with the extended properties.
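
The Conditions rule can be illustrated with a small plain C++ sketch. It is a model of the description above, not the product's parser: the function names are hypothetical, the right-hand side of each condition is treated as a literal value, and a property that has not been set is treated as an empty string.

```cpp
#include <map>
#include <string>

using Properties = std::map<std::string, std::string>;

// Evaluates a single condition of the form "name=value" or "name!=value".
bool conditionHolds(const std::string& cond, const Properties& props) {
    const std::size_t ne = cond.find("!=");
    const bool negate = (ne != std::string::npos);
    const std::size_t pos = negate ? ne : cond.find('=');
    if (pos == std::string::npos) {
        return false;  // malformed condition: treated as not met
    }
    const std::string name = cond.substr(0, pos);
    const std::string expected = cond.substr(pos + (negate ? 2 : 1));
    const auto it = props.find(name);
    const std::string actual = (it == props.end()) ? "" : it->second;
    return negate ? (actual != expected) : (actual == expected);
}

// Splits the specification on '|' and ANDs the conditions together.
bool propertyIsValid(const std::string& spec, const Properties& props) {
    if (spec.empty()) {
        return true;  // no conditions: the property is always valid
    }
    std::size_t start = 0;
    while (true) {
        const std::size_t bar = spec.find('|', start);
        const std::size_t len =
            (bar == std::string::npos) ? std::string::npos : bar - start;
        if (!conditionHolds(spec.substr(start, len), props)) {
            return false;
        }
        if (bar == std::string::npos) {
            return true;
        }
        start = bar + 1;
    }
}
```

With the specification a=b|c!=d, the property is valid when a is b and c is x, and not valid when c is d or when a is anything other than b.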
Click on the Build page. The tabs here allow you to define the actual operation that the stage will perform.
The Interfaces tab enables you to specify details about inputs to and outputs from the stage, and about automatic transfer of records from input to output. You specify port details, a port being where a link connects to the stage. You need a port for each possible input link to the stage, and a port for each possible output link from the stage.

You provide the following information on the Input sub-tab:
- Port Name. Optional name for the port. The default names for the ports are in0, in1, in2, and so on. You can refer to them in the code using either the default name or the name you have specified.
- Alias. Where the port name contains non-ASCII characters, you can give it an alias in this column (this is only available where NLS is enabled).
- AutoRead. This defaults to True which means the stage will automatically read records from the port. Otherwise you explicitly control read operations in the code.
- Table Name. Specify a table definition in the InfoSphere DataStage Repository which describes the metadata for the port. You can browse for a table definition by choosing Select Table from the menu that appears when you click the browse button. You can also view the schema corresponding to this table definition by choosing View Schema from the same menu. You do not have to supply a Table Name. If any of the columns in your table definition have names that contain non-ASCII characters, you should choose Column Aliases from the menu. The Build Column Aliases dialog box appears. This lists the columns that require an alias and lets you specify one.
- RCP. Choose True if runtime column propagation is allowed for inputs to this port. Defaults to False. You do not need to set this if you are using the automatic transfer facility.
  You provide the following information on the Output sub-tab:
- Port Name. Optional name for the port. The default names for the ports are out0, out1, out2, and so on. You can refer to them in the code using either the default name or the name you have specified.
- Alias. Where the port name contains non-ASCII characters, you can give it an alias in this column.
- AutoWrite. This defaults to True which means the stage will automatically write records to the port. Otherwise you explicitly control write operations in the code. Once records are written, the code can no longer access them.
- Table Name. Specify a table definition in the InfoSphere DataStage Repository which describes the metadata for the port. You can browse for a table definition. You do not have to supply a Table Name. A shortcut menu accessed from the browse button offers a choice of Clear Table Name, Select Table, Create Table, View Schema, and Column Aliases. The use of these is as described for the Input sub-tab.
- RCP. Choose True if runtime column propagation is allowed for outputs from this port. Defaults to False. You do not need to set this if you are using the automatic transfer facility.
  The Transfer sub-tab allows you to connect an input buffer to an output buffer such that records will be automatically transferred from input to output. You can also disable automatic transfer, in which case you have to explicitly transfer data in the code. Transferred data sits in an output buffer and can still be accessed and altered by the code until it is actually written to the port.
  
  You provide the following information on the Transfer sub-tab:
- Input. Select the input port to connect to the buffer from the drop-down list. If you have specified an alias, this will be displayed here.
- Output. From the drop-down list, select the output port to which the input records are transferred. If you have specified an alias, this will be displayed here.
- Auto Transfer. This defaults to False, which means that you have to include code which manages the transfer. Set to True to have the transfer carried out automatically.
- Separate. This is False by default, which means this transfer will be combined with other transfers to the same port. Set to True to specify that the transfer should be separate from other transfers.
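
The effect of Auto Transfer can be modeled in plain C++ as follows. This is a simplified sketch, not the InfoSphere DataStage API: the types InRecord and OutRecord and the function runStage are hypothetical, the per-record code is hard-wired to double one column, and reading and writing are assumed to be automatic (AutoRead and AutoWrite both True).

```cpp
#include <string>
#include <vector>

// Hypothetical input record.
struct InRecord {
    std::string id;
    int qty;
};

// Hypothetical output buffer: the transferred input columns plus one
// column that the per-record code sets itself.
struct OutRecord {
    InRecord transferred;
    int doubledQty;
};

// Models one input port connected to one output port.
std::vector<OutRecord> runStage(const std::vector<InRecord>& input,
                                bool autoTransfer) {
    std::vector<OutRecord> written;
    for (const InRecord& rec : input) {   // automatic read of each record
        OutRecord buffer{};               // empty output buffer
        buffer.doubledQty = rec.qty * 2;  // per-record code
        if (autoTransfer) {
            // Transfer happens immediately after the per-record code:
            // the input record is copied to the output buffer.
            buffer.transferred = rec;
        }
        written.push_back(buffer);        // automatic write to the port
    }
    return written;
}
```

With autoTransfer set to false and no explicit transfer in the code, the transferred part of each output record stays empty, which is why the text above says that the code must then manage the transfer itself.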
  The Logic tab is where you specify the actual code that the stage executes.
  
  The Definitions sub-tab allows you to specify variables, include header files, and otherwise initialize the stage before processing any records.
  
  The Pre-Loop sub-tab allows you to specify code which is executed at the beginning of the stage, before any records are processed.
  
  The Per-Record sub-tab allows you to specify the code which is executed once for every record processed.
  
  The Post-Loop sub-tab allows you to specify code that is executed after all the records have been processed.
  
  You can type straight into these pages or cut and paste from another editor. The shortcut menu on the Pre-Loop, Per-Record, and Post-Loop pages gives access to the macros that are available for use in the code.
  
  The Advanced tab allows you to specify details about how the stage is compiled and built. Fill in the page as follows:
- Compile and Link Flags. Allows you to specify flags that are passed to the C++ compiler.
- Verbose. Select this check box to specify that the compile and build is done in verbose mode.
- Debug. Select this check box to specify that the compile and build is done in debug mode. Otherwise, it is done in optimize mode.
- Suppress Compile. Select this check box to generate files without compiling, and without deleting the generated files. This option is useful for fault finding.
- Base File Name. The base filename for generated files. All generated files will have this name followed by the appropriate suffix. This defaults to the name specified under Operator on the General page.
- Source Directory. The directory where generated .c files are placed. This defaults to the buildop folder in the current project directory. You can also set it using the DS_OPERATOR_BUILDOP_DIR environment variable in the InfoSphere DataStage Administrator.
- Header Directory. The directory where generated .h files are placed. This defaults to the buildop folder in the current project directory. You can also set it using the DS_OPERATOR_BUILDOP_DIR environment variable in the InfoSphere DataStage Administrator.
- Object Directory. The directory where generated .so files are placed. This defaults to the buildop folder in the current project directory. You can also set it using the DS_OPERATOR_BUILDOP_DIR environment variable in the InfoSphere DataStage Administrator.
- Wrapper directory. The directory where generated .op files are placed. This defaults to the buildop folder in the current project directory. You can also set it using the DS_OPERATOR_BUILDOP_DIR environment variable in the InfoSphere DataStage Administrator.
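
The relationship between the base file name, the four directories, and the generated files can be sketched in plain C++. The sketch assumes that each file is named by appending the suffix directly to the base file name and that '/' is the path separator. The BuildLayout type, the generatedFiles function, and the example path are hypothetical.

```cpp
#include <string>
#include <vector>

// Hypothetical holder for the values entered on the Advanced tab.
struct BuildLayout {
    std::string baseName;    // Base File Name
    std::string sourceDir;   // Source Directory  (.c files)
    std::string headerDir;   // Header Directory  (.h files)
    std::string objectDir;   // Object Directory  (.so files)
    std::string wrapperDir;  // Wrapper directory (.op files)
};

// Returns the assumed paths of the generated files, in the order
// source, header, object, wrapper.
std::vector<std::string> generatedFiles(const BuildLayout& layout) {
    return {
        layout.sourceDir + "/" + layout.baseName + ".c",
        layout.headerDir + "/" + layout.baseName + ".h",
        layout.objectDir + "/" + layout.baseName + ".so",
        layout.wrapperDir + "/" + layout.baseName + ".op",
    };
}
```

For example, with the base file name myop and every directory set to /proj/buildop, the sketch yields /proj/buildop/myop.c, /proj/buildop/myop.h, /proj/buildop/myop.so, and /proj/buildop/myop.op.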
When you have filled in the details in all the pages, click Generate to generate the stage. A window appears showing you the result of the build.