
DataStage Parallel framework changes may require DataStage job modifications

Troubleshooting


Problem

This technote documents changes made in the DataStage parallel framework which may require job modifications when DataStage jobs are upgraded from earlier releases.

Resolving The Problem

IBM tries to avoid making code changes that require customers to modify their existing DataStage jobs. However, sometimes it is necessary to make such changes in order to introduce new features or to fix errors. This technote documents areas where changes have taken place in DataStage releases which may require customers to make changes to jobs that were created in earlier versions.


Partitioning and Sort Insertion

Information Server releases affected: 8.0.1 Fix Pack 1 and higher, 8.1 GA and higher, 8.5 GA

For any stage that requires data to be hash partitioned and sorted (such as Join, Merge, Difference, Compare, ChangeCapture, ChangeApply), the parallel framework automatically inserts a hash partitioner and a sort on each input link to ensure that input data is partitioned and sorted properly. Prior to Information Server 8.0.1 Fix Pack 1, if the Preserve Partitioning flag was set on the input link, the parallel framework would not automatically insert the partitioner or sort. Skipping the re-partition or re-sort could produce incorrect results because the input data might have been partitioned and sorted on different keys from those required by the stage.

To avoid this problem, the parallel framework was changed in Information Server 8.0.1 Fix Pack 1 so that a hash partitioner and sort are automatically inserted even in the presence of a (framework-inserted) Preserve Partitioning flag, but not in the case of user-specified partitioning, as the latter takes higher precedence. However, the problem can still occur if the user-specified partitioning and sort keys do not match those required by the stage. The following example scenarios may experience problems as a result of this change:

  • A Join stage has two keys "A" and "B". The user explicitly specifies a hash partitioning method and inserts a sort stage on the producing side of the primary link. The hash key is "A", and the sort keys are "A" and "B". Input data of the reference link has been partitioned or sorted upstream or in another job. The partitioning method on both the primary link and the reference link of Join is set to Auto. When the parallel framework analyzes partitioning and sort requirements at job startup time, it inserts hash and tsort stages on the reference link using the same two keys as specified by Join, and keeps what the user has defined on the primary link. This can cause data to be distributed to the wrong partitions.
  • A Join stage has one key. The user explicitly specifies a hash partitioning method and inserts a sort stage upstream of the primary link. The hash and sort key is "A" with the case-sensitive property. Input data of the reference link is not pre-partitioned and pre-sorted. The partitioning method on both the primary link and the reference link of Join is set to Auto. When the parallel framework analyzes partitioning and sort requirements at job startup time, it inserts a hash partitioner on both links and the hash key does not have the case-sensitive property. The framework also inserts a tsort on the reference link, but not on the primary link because data has already been sorted. This can break the sort order of input data on the primary link.
  • A sequential stage or a parallel stage running in sequential mode will produce this warning message if its producing stage is hash partitioned: "Sequential operator cannot preserve the partitioning of the parallel data set on input port 0."

These issues can be worked around by adding the environment variables APT_NO_PART_INSERTION=True and APT_NO_SORT_INSERTION=True as job parameters and then modifying the job to ensure that the partitioning and sorting requirements are met by explicit insertion.
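To illustrate why both links of a Join must be hash partitioned on the same keys, here is a minimal Python sketch (illustrative only, not DataStage or parallel framework code; the hash_partition helper and the column names are hypothetical). The primary link is partitioned on key "A" only while the reference link is partitioned on keys "A" and "B", so records with equal join keys can land in different partitions and a partition-local join never sees them together:

# Minimal sketch (not DataStage): why both inputs of a join must be
# hash partitioned on the same keys.

def hash_partition(rows, keys, num_partitions):
    """Assign each row to a partition based on a hash of the given keys."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        h = hash(tuple(row[k] for k in keys))
        partitions[h % num_partitions].append(row)
    return partitions

primary   = [{"A": 1, "B": 10, "val": "p1"}, {"A": 2, "B": 20, "val": "p2"}]
reference = [{"A": 1, "B": 10, "ref": "r1"}, {"A": 2, "B": 20, "ref": "r2"}]

# Primary link: user-defined partitioning on key "A" only.
# Reference link: framework-inserted partitioning on both join keys "A" and "B".
p_parts = hash_partition(primary,   ["A"],      num_partitions=4)
r_parts = hash_partition(reference, ["A", "B"], num_partitions=4)

# A partition-local join only sees rows that landed in the same partition,
# so rows with equal keys may never meet and are silently lost from the join.
for i, (p, r) in enumerate(zip(p_parts, r_parts)):
    matched = [(x, y) for x in p for y in r
               if x["A"] == y["A"] and x["B"] == y["B"]]
    print(f"partition {i}: {len(matched)} matched row(s)")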


Default Decimal Separator

Information Server releases affected: 8.0.1 Fix Pack 1 and higher, 8.1 Fix Pack 1 and higher, 8.5 GA

Prior to Information Server Version 8.0.1 Fix Pack 1, the default decimal separator specified via Job Properties->Defaults was not recognized by the APT_Decimal class in the parallel framework. This caused problems for the DB2 API stage, where decimal values using a comma as the decimal separator could not be processed correctly. This issue was fixed in release 8.0.1 Fix Pack 1, as APAR JR31597. The default decimal separator can be specified via a job parameter (e.g. #SEPARATOR#). However, if the job parameter does not contain any value, '#' will be taken as the decimal separator. This can cause the following error if the actual decimal separator is not '#':

"Fatal Error: APT_Decimal::assignFromString: invalid format for the source string."

If you encounter this problem after upgrading, make sure the job parameter representing the default decimal separator contains the actual decimal separator character used by the input data. If changing the job parameter is not an option, you can set the environment variable APT_FORCE_DECIMAL_SEPARATOR. The value of APT_FORCE_DECIMAL_SEPARATOR overrides the value set for the "Decimal separator" property. If more than one character is set for this environment variable, the decimal separator defaults to the dot character, '.'.
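The failure mode can be pictured with a short Python sketch (conceptual only, not the APT_Decimal implementation; the parse_decimal helper is hypothetical): parsing succeeds only when the separator the parser expects matches the separator used by the data, and a multi-character separator falls back to the dot, as described above.

from decimal import Decimal, InvalidOperation

def parse_decimal(text, separator="."):
    """Parse a decimal string, normalizing the configured separator to '.'."""
    if len(separator) != 1:
        separator = "."              # fall back to the dot character
    try:
        return Decimal(text.replace(separator, "."))
    except InvalidOperation as exc:
        raise ValueError(f"invalid format for the source string: {text!r}") from exc

print(parse_decimal("1234,56", separator=","))   # Decimal('1234.56')

try:
    parse_decimal("1234,56", separator="#")      # separator does not match the data
except ValueError as err:
    print(err)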


Embedded Nulls in Unicode Strings

Information Server releases affected: 8.1 Fix Pack 1 and higher, 8.5 GA

Prior to Information Server 8.1 Fix Pack 1, nulls embedded in Unicode strings were not treated as data; instead they were treated as string terminators, which caused data after the first null to be truncated. The issue was fixed in 8.1 Fix Pack 1, as APAR JR33408, for Unicode strings that are converted to or from UTF-8 strings. As a result of this change, you may observe a change in job behavior where a bounded-length string is padded with trailing nulls. These extra nulls can change the comparison result of two string fields, generate duplicate records, make data conversion fail, and so on, depending on the job logic. To solve this problem, modify the job to set APT_STRING_PADCHAR=0x20 and call Trim() in the Transformer stage if needed.
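The effect of the pad character on comparisons can be shown with a small Python sketch (illustrative only; DataStage bounded-length strings are not Python strings). Two logically equal values compare unequal once one of them carries trailing nulls, and trimming the pad characters restores the match, which is what setting APT_STRING_PADCHAR=0x20 together with Trim() achieves:

PAD_NUL   = "\x00"   # behavior after the fix: trailing nulls are kept as data
PAD_SPACE = " "      # APT_STRING_PADCHAR=0x20: pad with spaces instead

def pad(value, length, pad_char):
    """Pad a value out to the bounded string length."""
    return value + pad_char * (length - len(value))

a = pad("ABC", 6, PAD_NUL)
b = "ABC   "                                      # same logical value, space padded

print(a == b)                                     # False: the trailing nulls differ
print(a.rstrip(PAD_NUL) == b.rstrip(PAD_SPACE))   # True once both sides are trimmed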


Null Handling at column level

Information Server releases affected: 8.1 GA and higher, 8.5 GA

In parallel jobs, nullability is checked at runtime. It is possible for a user to define a column as nullable in the DataStage Designer while at runtime the column is actually mapped as non-nullable (to match the actual database table, for example). Prior to 8.1 GA the parallel framework issued a warning for this mismatch, but the job could then crash with a segmentation violation. The warning was changed to a fatal error in 8.1 GA, as ECASE 124987, to prevent the job from aborting with SIGSEGV. After this change, jobs that used to run with this warning present will abort with a fatal error. For example, this problem is often seen in the Lookup stage. To solve the problem, modify the job to make sure the nullability of each input field of the Lookup stage matches the nullability of the same output field of the stage upstream of the Lookup.
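The check that was hardened can be sketched in a few lines of Python (conceptual only, not the parallel framework; the schema layout and column names are hypothetical): a null value arriving in a field that is mapped as non-nullable at runtime now raises a fatal error instead of a warning.

# Conceptual sketch (not the parallel framework): a runtime nullability check.
schema = {"cust_id": {"nullable": False}, "region": {"nullable": True}}

def check_record(record, schema):
    for col, props in schema.items():
        if record.get(col) is None and not props["nullable"]:
            # Before 8.1 GA this mismatch produced a warning (and could later
            # crash with SIGSEGV); from 8.1 GA it is treated as a fatal error.
            raise RuntimeError(f"null value in non-nullable column '{col}'")

check_record({"cust_id": 42, "region": None}, schema)        # passes

try:
    check_record({"cust_id": None, "region": "EU"}, schema)  # fatal error
except RuntimeError as err:
    print(err)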


Transformer Stage: Run-Time Column Propagation (RCP)

DataStage releases affected: 7.5 and higher

Information Server releases affected: 8.0 GA and higher, 8.1 GA and higher, 8.5 GA

When RCP was enabled in any DataStage 7.x release prior to 7.5, an input field "A" that was mapped to an output field "B" resulted in both "A" and "B" being present in the output record. Starting with DataStage 7.5, "A" is simply renamed to "B", so that only "B" appears in the output. To improve transform performance, a straight assignment such as "B=A" is now treated as renaming "A" to "B"; prior to the change, the same assignment was treated as creating an additional field by copying "A" to "B". With this change in place, the user needs to explicitly specify both "A" and "B" in the output schema in order to prevent "A" from being renamed to "B" and to create a new field "B". Refer to the following Transformer stage screen-shot, which shows how to ensure that both values are propagated to the output link.
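The same rename-versus-copy behavior can also be pictured with a short Python sketch (illustrative only, not Transformer internals; the record layout is hypothetical): a rename removes "A" from the output, whereas listing both columns produces a copy so that "A" and "B" are both propagated.

# Conceptual sketch (not Transformer internals): rename vs. copy semantics
# for the assignment B=A when run-time column propagation is enabled.
record = {"A": 7, "C": "other"}

# DataStage 7.5 and later: a straight assignment is treated as a rename,
# so "A" no longer appears in the output unless it is listed explicitly.
renamed = {("B" if k == "A" else k): v for k, v in record.items()}
print(renamed)                      # {'B': 7, 'C': 'other'}

# Listing both "A" and "B" in the output schema forces a copy instead,
# matching the pre-7.5 behavior where both columns are propagated.
copied = dict(record)
copied["B"] = copied["A"]
print(copied)                       # {'A': 7, 'C': 'other', 'B': 7}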





Transformer Stage: Decimal Assignment

Information Server releases affected: 8.0 GA and higher, 8.1 GA and higher, 8.5 GA

Prior to Information Server 8.0 GA, the parallel framework issued a warning if the target decimal had a smaller precision and scale than the source decimal. The warning was changed to an error in Information Server 8.0 GA; as a result, the input record is dropped if a reject link is not present. This behavior change was necessary to catch the error earlier and avoid data corruption. The user should modify the job to make sure the target decimal is large enough to hold the decimal value. Alternatively, the user can add a reject link to prevent records from being dropped.

Important: This change in behavior does not apply to any Linux platforms (Red Hat, SUSE, or zLinux). The parallel framework does not enable exception handling on Linux platforms, so the behavior remains the same as it was prior to 8.0 GA.
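The precision and scale problem can be reproduced conceptually with Python's decimal module (a sketch only, not APT_Decimal; the DECIMAL(6,2) target is an assumed example): a source value that needs more digits than the target allows cannot be assigned losslessly.

# Conceptual sketch using Python's decimal module (not APT_Decimal).
from decimal import Decimal, Context, DecimalException

source = Decimal("12345.678")            # needs precision 8, scale 3

# Target declared as DECIMAL(6,2): not big enough to hold the source value.
target_ctx = Context(prec=6)
try:
    print(target_ctx.quantize(source, Decimal("0.01")))
except DecimalException:
    # From 8.0 GA this situation is an error; without a reject link the
    # input record is dropped instead of being silently truncated.
    print("value does not fit the target precision/scale; record would be dropped")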




Transformer Stage: Data Conversion

Information Server releases affected: 8.0 GA and higher, 8.1 GA and higher, 8.5 GA

Prior to Information Server 8.0 GA, an invalid data conversion in the transformer would result in the following behavior:
  1. A warning message was issued to the DataStage job log.
  2. A default value was assigned to the destination field according to its data type.
  3. The record was written to the output link.
  4. If a reject link was present, nothing was sent to the reject link.

The behavior changed in the 8.0 GA release when a reject link is present: instead of being written to the output link with a default value, the record is written to the reject link. This may lead to data loss if the job expects those records to be passed through to the output. To restore the original behavior of passing the records through, modify the job to remove the reject link.

Note: An environment variable was added along with this change to provide the capability of aborting the job. To use this option, ensure that there is no reject link and then set the environment variable APT_TRANSFORM_ABORT_ON_CONVERSION_ERROR=True. The job will then abort when an invalid data conversion occurs.
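The routing decision can be summarized with a small Python sketch (conceptual only, not Transformer internals; the handle_conversion_error helper is hypothetical): a conversion error either assigns the default value and continues, goes to the reject link if one is present, or aborts the job when the environment variable is set.

# Conceptual sketch (not Transformer internals) of how a conversion error is
# routed in 8.0 GA and later, depending on the job design and environment.
def handle_conversion_error(record, default_value,
                            has_reject_link=False, abort_on_error=False):
    if abort_on_error and not has_reject_link:
        # APT_TRANSFORM_ABORT_ON_CONVERSION_ERROR=True and no reject link:
        # the job aborts on the invalid conversion.
        raise RuntimeError("aborting job: invalid data conversion")
    if has_reject_link:
        # 8.0 GA and later: the record goes to the reject link, not the output.
        return ("reject", record)
    # No reject link: the pre-8.0 pass-through behavior, a default value is
    # assigned and the record continues down the output link.
    return ("output", default_value)

print(handle_conversion_error({"qty": "abc"}, default_value=0))
print(handle_conversion_error({"qty": "abc"}, default_value=0, has_reject_link=True))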





Surrogate Key Generator

Information Server releases affected: 8.0.1 Fix Pack 1 and higher, 8.1 Fix Pack 1 and higher, 8.5 GA

The Surrogate Key stage reserves keys in blocks. Prior to Information Server 8.1 Fix Pack 1, if only one record was generated (suppose its value was 5, because an initial value was set), the surrogate key generator would use values of 6 and greater as the available keys for incoming records. The surrogate key generator was changed in 8.1 Fix Pack 1, as APAR JR29667. With this change, DataStage now considers values 1 to 4, as well as any value 6 and greater, as available keys. This behavior change may cause the SCD stage to produce incorrect results in the database or generate the wrong surrogate keys for new records of the dimension. If required, the job can be modified to revert to the old behavior (start generating keys from the highest key value last used) by setting the option 'Generate Key From Last Highest Value' to Yes. This approach, however, may result in gaps in the used keys. It is recommended that the user understand how the key file is initialized and decide whether it is necessary to modify the job based on business logic.
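The difference between the two strategies can be sketched in Python (conceptual only, not the Surrogate Key Generator implementation; the next_keys helper is hypothetical): with value 5 already used, the new behavior fills the gap below it first, while 'Generate Key From Last Highest Value' = Yes continues upward from the highest key.

# Conceptual sketch of the two key-generation strategies when key 5 is used.
used_keys = {5}                      # one record generated with an initial value

def next_keys(count, used, from_last_highest=False):
    """Return 'count' new keys, either filling gaps or continuing upward."""
    keys, candidate = [], (max(used) + 1 if from_last_highest else 1)
    while len(keys) < count:
        if candidate not in used:
            keys.append(candidate)
            used.add(candidate)
        candidate += 1
    return keys

# 8.1 Fix Pack 1 and later: the gap below 5 is reused first.
print(next_keys(5, set(used_keys)))                          # [1, 2, 3, 4, 6]

# 'Generate Key From Last Highest Value' = Yes: old behavior, start above 5.
print(next_keys(5, set(used_keys), from_last_highest=True))  # [6, 7, 8, 9, 10]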



Sequential File Format on Windows

Information Server releases affected: (Windows Platforms) 8.1 GA and higher, 8.5 GA

Prior to Information Server 8.1 GA, the default format for sequential files was UNIX format, which requires a newline (LF) character as the record delimiter. The default format for the Sequential File stage was changed to Windows (CRLF) format in the Information Server 8.1 GA release. Due to this change, data files previously created in UNIX format will not import properly. To solve this issue, set the environment variable APT_USE_CRLF=FALSE at the DataStage project level or in the system environment variables (requires a Windows reboot).
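The record delimiter difference can be illustrated with a short Python sketch (conceptual only, not the Sequential File stage; the split_records helper is hypothetical): a file written earlier in UNIX (LF) format is not split into records when the reader expects Windows (CRLF) delimiters, which is the situation APT_USE_CRLF=FALSE corrects.

# Conceptual sketch: how the expected record delimiter changes the way the
# same bytes are split into records.
data = b"rec1\nrec2\nrec3\n"                 # file written in UNIX (LF) format

def split_records(raw, use_crlf=True):
    """Split raw bytes into records using CRLF or LF as the delimiter."""
    delim = b"\r\n" if use_crlf else b"\n"
    return [r for r in raw.split(delim) if r]

print(split_records(data, use_crlf=True))    # one mangled record, delimiter not found
print(split_records(data, use_crlf=False))   # [b'rec1', b'rec2', b'rec3']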

[{"Product":{"code":"SSVSEF","label":"IBM InfoSphere DataStage"},"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Component":"Not Applicable","Platform":[{"code":"PF002","label":"AIX"},{"code":"PF010","label":"HP-UX"},{"code":"PF016","label":"Linux"},{"code":"PF027","label":"Solaris"},{"code":"PF033","label":"Windows"}],"Version":"8.5;8.1;8.0.1","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}},{"Product":{"code":"SSZJPZ","label":"IBM InfoSphere Information Server"},"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Component":" ","Platform":[{"code":"","label":""}],"Version":"","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

Document Information

Modified date:
16 June 2018

UID

swg21459416