
IBM InfoSphere DataStage: Debugging Parallel Jobs

Product Documentation


Abstract

This document contains information about debugging DataStage parallel jobs using the Designer client.

Content

1 Introduction
2 General Limitations
3 User Interface Issues
4 Parallel Debugger Startup Time Issues
5 Parallel Debugger Runtime Issues
6 Parallel Debugger Server Configuration


************************************************************************************

1 Introduction

1.1 How do you debug parallel jobs?

You use the Debug toolbar options or the Debug menu to debug your parallel job designs. The interface is interactive and is integrated within the IBM® InfoSphere DataStage and QualityStage™ Designer. In the Debug window, you can inspect field data for records that meet specified criteria at any data link of a job executing on the parallel canvas, without modifying the job. In general, the Debug interface is similar to the one used to debug server jobs.

 

You debug a parallel job by setting one or more breakpoints on the links in the job, running the job in debug mode, and examining the column data when the job stops at the breakpoints. A breakpoint is a request to pause data flow at a specific link on the stage graph whenever a user-defined record criterion is satisfied. The criteria can be based on the record count or on a field value expression such as “DSLink1.Sales > 10000”. By default, no breakpoints are set, so jobs run to completion without pausing.
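For example, both kinds of criteria can be written informally as follows (illustrative notation only; in the Designer client you enter the criteria in the breakpoint editor dialog rather than typing these lines):

     break every 10 records                     # record count criterion
     break at “DSLink1.Sales > 10000”           # field value expression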

 

You can set breakpoints before or during job execution by right-clicking a data link between two stages, clicking “toggle breakpoint” (a red circular icon appears on the link), and then clicking “edit breakpoint” to specify the break criteria.

 

Jobs must be successfully compiled before they can be debugged.

 

The steps for debugging a job from Designer are covered in detail in the IBM InfoSphere DataStage and QualityStage Parallel Job Developer's Guide. For the most current information about running your jobs in debug mode, refer to the online version of the IBM Information Server information center at: http://publib.boulder.ibm.com/infocenter/iisinfsv/v8r7/topic/com.ibm.swg.im.iis.ds.parjob.dev.doc/topics/debuggingparalleljobs.html

 

1.2 Why is debugging on the parallel canvas different from debugging on the server canvas?

Stages in parallel jobs can run on multiple nodes (partition parallelism); within each node, stages generally execute simultaneously in assembly-line fashion (pipeline parallelism). The contrast with server canvas debugging is most obvious when a breakpoint has been reached on a stage configured for parallel execution: the breakpoint criteria might be satisfied on more than one node, and multiple tabs in the Debug Window display field data for the paused records on each affected node. (For example, in InfoSphere DataStage version 8.7, clicking the “continue” icon causes record processing to resume on all paused nodes.) Note that the record count criteria (for example, “every 10 records”) specified in the breakpoint editor dialog are applied independently on each node on which the stage is active.

Users of the server canvas debugger might also notice a difference in the timing of breakpoints when a similar job is debugged on the parallel canvas. The parallel engine sometimes inserts buffers and hidden sort operators between adjacent stages to prevent potential deadlocks and to satisfy the incoming record order prerequisites of downstream stages. As a result, a breakpoint on an upstream data link is often reached by many records in succession before a breakpoint on a downstream link is reached; this reflects the actual timing of record processing within the parallel engine as a result of automatic buffering and sort operator insertion.

2 General Limitations

When debugging on the parallel canvas, be aware of the following limitations.

2.1 Unsupported features

1) You cannot use the debug function on a job that was not started in debug mode.

2) Breakpoints cannot be set within containers (shared or local) on the parallel canvas.

3) The 'Step to Next Link' and 'Step to Next Row' features of the server debugger are not supported.

4) The “Break on warning” feature of the server canvas debugger is not supported.

5) Debug mode in parallel jobs does not support source-level debugging of transform expressions or custom operators. These stages support only the same record-level data link debugging that is provided for built-in stages.

2.2 Unsupported configurations

You cannot run in debug mode when a parallel job's environment defines either APT_EXECUTION_MODE or APT_SEQUENTIAL_MODE. (These variables generally should not be defined in a production environment anyway; they are intended primarily to assist source-level debugging, core file generation, and other specialized problem diagnosis.)
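For example, on a UNIX engine host you can quickly check whether either variable is defined in the environment from which jobs are started (a minimal sketch assuming a POSIX shell; the variables might also be defined at the project or job level in the clients, which this check does not cover):

    # list any definitions of the two unsupported variables
    env | grep -E 'APT_EXECUTION_MODE|APT_SEQUENTIAL_MODE'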

 

You cannot use the debug function in ITAG installations (multiple versions of Information Server or DataStage server on the same host).

3 User Interface Issues

3.1 Breakpoint icons

The presence of a breakpoint on a data link is indicated by a red circular icon. Sometimes the icon is obscured by other graphics on the link; as a workaround, move the adjacent stages farther apart on the Designer canvas so that the link is longer.

3.2 Parameter set changes

Additions or modifications to parameter set values made through the Debug->Job Parameters dialog are not remembered when the parallel job is run; the default values are reinstated.

This issue applies to debug mode on both the server and parallel canvases.  It affects only jobs that use parameter sets.

3.3 Job design modifications during a debugging session

Modifying the job design while a job is being debugged is not recommended, because you might see inconsistencies in the Designer client. Some parts of the Designer client, for example the job canvas and Debug Window, use the modified design, while other parts, for example the log view, continue to use the unmodified design.


Be aware that the job being debugged is always the version of the job that was produced when the job was last compiled.

If, during a debugging session, you need to change the job design, halt the current debugging session. Then change the job design as required. Recompile the job before starting a new debug session.

3.4 Debug view of missing columns

If a stage is configured with a schema that contains additional (nullable) columns that have no counterparts in the input data set, the record data returned in debug mode might be reordered so that the missing columns appear at the end. The columns in the Debug Window might therefore display in a different order from the configured schema for the stage link. In general, the parallel engine treats the collection of columns within a data set as an (unordered) set rather than an (ordered) list.

3.5 Display of binary field data

The values of binary fields are displayed in the Debug Window as a straight hex dump, for example “{ 68 69 09 09 09 }”, instead of the hybrid printable ASCII character/hex dump “{ 'h', 'i' ... }” provided by logged output from the Peek stage. This behavior is by design.

3.6 Tip to stop debugging session

When the interface is stopped at a breakpoint and you do not want to debug the job further, the menu item “Run to end” (available from a dropdown menu at the top left corner of the Debug Window) provides a quick way to resume processing with all breakpoints disabled. The parallel job then runs to completion (or aborts if a fatal error is encountered). Using “Run to end” also ensures that the debugging session ends cleanly, with an orderly teardown of parallel engine processes.

4 Parallel Debugger Startup Time Issues

4.1 TCP connections to debugger server

If a job that is started in debug mode fails to start or appears to hang, open the job log window, or view the job log from Director. If the error message (ID=IIS-DSTAGE-RUN-E-0523) “Debugger listener request on port <integer> failed...” is in the log, the DataStage server was unable to open a TCP network connection to the debugger server (fdbserve) running on the same host; this connection is required to support a debugging session. The InfoSphere Information Server administrator might need to edit the debugger server configuration file to ensure that the TCP port numbers allocated to fdbserve do not conflict with those used by other processes on the host; see section 6, “Parallel Debugger Server Configuration”, later in this document.
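For example, to check whether a particular port, such as the default client port 7101 (see section 6), is already in use on the host, a command like the following can help (a sketch assuming a UNIX host; on Windows, substitute findstr for grep):

    # show any TCP endpoints already using port 7101
    netstat -an | grep 7101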

4.2 Startup timeout

If the message (ID=IIS-DSTAGE-RUN-E-0522) “Parallel debugger failed to enter checked state” is in the log, check whether any other error messages indicate a pre-execution problem, such as an invalid parallel engine configuration, that caused the parallel engine to abort. If so, address the pre-execution problem first.

 

If no such pre-execution error message is in the log, check whether additional messages from the parallel engine, such as the IBM DSEE copyright notice (ID=IIS-DSEE-TFCN-00001) “main program: IBM WebSphere DataStage Enterprise Edition...”, appear after the “Parallel debugger failed to enter checked state” message. If they do, the DataStage server probably reached a timeout while waiting for the parallel engine (osh) to complete its initialization. This has been observed occasionally when testing jobs that use the XML Pack, but could also conceivably occur in jobs with stages, such as the Java Connector and JRule, that use Common Connector libraries.

 

The default timeout period is 120 seconds. If the difference in timestamps between the preceding (ID=IIS-DSTAGE-RUN-I-0121) “Parallel job initiated” message and the IBM DSEE copyright message approaches or exceeds this value, you might need to increase the timeout by defining the environment variable DS_PXDEBUGGER_TIMEOUT to a value greater than 120. For example:

 

    DS_PXDEBUGGER_TIMEOUT=200

 

Environment variables can be added to the current job in Designer via the UI command sequence Edit->Job Properties->Parameters->Add Environment Variable.  The job must be recompiled afterwards.

 

4.3 Suppression of certain environment variables and warnings

Certain environment variables that cause potentially large volumes of diagnostic information to be written to the Director log are automatically suppressed when a parallel job is run under the debugger. The following variables are suppressed:

    APT_API_STEP_VERBOSE

    APT_PM_SHOW_PIDS

    APT_SHOW_COMPONENT_CALLS

    CC_MSG_LEVEL

    OSH_DUMP

    OSH_ECHO

    OSH_EXPLAIN

    OSH_PRINT_SCHEMAS

    APT_COPY_TRANSFORM_OPERATOR

 

To obtain the reporting messages that are normally generated when any of the above variables are defined, run the same job without debugging it, and then inspect or save the Director log. Note that the last variable listed above (APT_COPY_TRANSFORM_OPERATOR) is not a reporting variable; it is disabled for a different reason (see section 5.2, “Replicating Transform Libraries”, later in this document).

 

Certain implicit field conversion warnings raised when the step is checked are also suppressed when the debugger is active.  The suppressed warnings have the general form: “When binding input (or output) interface field A to B: Implicit conversion from source type C to result type D”.  To see these warnings (in jobs where they would ordinarily be raised), run the job without debugging it. 

5 Parallel Debugger Runtime Issues

5.1 Breakpoint field expression evaluation

Debugger breakpoints can be set to trigger either at specified record counts (every N records) or when a field value expression is satisfied. Field value expressions used in breakpoints have some limitations and issues.

5.1.1 Subrecord fields in breakpoint value expressions are not handled correctly

Fields within subrecords are not handled correctly when used in a breakpoint expression:

 

     break at “DSLink1.aSubRec.aField = 101”    # doesn't work

 

However, the values within subrecord fields are displayed correctly within the Debug Window when stopped at a breakpoint that is not defined by a subrecord field value expression, e.g., at a breakpoint specified by record number.

5.1.2 Field vectors in breakpoint value expressions are not handled correctly

Vectors (arrays) of field values are not handled correctly when used in a breakpoint expression:

 

     break at “DSLink1.anArray[3] = 102”       # doesn't work

 

However, the values of field vectors are displayed correctly within the Debug Window when stopped at a breakpoint not defined by a field value vector expression, e.g., at a breakpoint specified by record number.

5.1.3 Implicit regular expressions generated for string equalities

A debugger breakpoint that specifies a field value equality expression on a string column is implicitly converted to a regular expression that matches the original string of characters, followed by an optional run of space characters (ASCII 0x20), followed by an optional run of nulls (ASCII 0x00). For example:

 

      break at “DSLink1.aStr = 'brown'”

 

will stop at records with any of the following field values:

 

      DSLink1.aStr = “brown”

      DSLink1.aStr = “brown   ”         # trailing spaces

      DSLink1.aStr = “brown\0\0\0”

      DSLink1.aStr = “brown   \0\0\0”

 

but not at

 

      DSLink1.aStr = “browning”         # no match – has non-pad characters

      DSLink1.aStr = “brown cow”        # no match – has non-pad characters

      DSLink1.aStr = “brown\t\t”        # no match – tabs not supported as padding

      DSLink1.aStr = “brown\0\0   ”     # no match – padding in reverse order

 

However, the implicit conversion to allow trailing pad characters is not applied when the comparison uses the inequality ('<>') operator.
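Conceptually, the converted equality behaves as if it matched a pattern of the following form (an illustrative sketch only; the actual regular expression syntax used internally is not exposed):

      brown[ ]*[\0]*        # “brown”, then an optional run of spaces, then an optional run of nulls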

 

The implicit conversion was introduced after usability testing showed that it was difficult for users to specify field value breakpoints on string columns with data values that contain trailing pad characters.  If the environment variable FDB_DISABLE_BREAKPOINT_AUTO_REGEX is defined (using the Edit->Job Properties->Parameters->Add Environment Variable dialog box in the Designer client) before compiling the job, the regular expression conversion is disabled for the debugging session.
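For example, adding the following variable to the job and recompiling disables the conversion (defining the variable is what matters; the specific value shown here is only an assumption):

    FDB_DISABLE_BREAKPOINT_AUTO_REGEX=1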

5.2 Replicating Transform Libraries

Automatic replication of user transform libraries to execution nodes in a massively parallel processing (MPP) configuration, triggered by defining the environment variable APT_COPY_TRANSFORM_OPERATOR, is not supported when the job is running in debug mode; the variable is ignored. If automatic deployment or update of user libraries to MPP execution nodes is required, perform it in a preliminary run of the job that is not in debug mode.

6 Parallel Debugger Server Configuration

The parallel debugger server (fdbserve) is a daemon that is started automatically by the DataStage server; it forks a separate worker process to handle each debugging session. A worker process maintains separate TCP connections with the DataStage server and with the osh conductor (the parallel engine), bridging communications between the two. On startup, the daemon reads its configuration file, stored at IBM/InformationServer/Server/PXEngine/etc/fdb.conf. This hierarchically structured file may be edited by the InfoSphere Information Server administrator using any text editor; however, make a backup of the original version first, because the server cannot start if the configuration file contains syntax or keyword errors.

 

The section (serverconf (ports...)) in fdb.conf contains several attributes (keyword/value pairs) that may be of interest to administrators.  The default values are listed below:

 

    protocolFamily="IPv4"

    client=7101

    engineMin=7102

    engineMax=7299

 

The value of protocolFamily specifies the network-layer (level 3) protocol to be used for TCP connections between fdbserve and the DataStage server on the client side, and between fdbserve and the osh conductor on the engine side. Note that these connections are local to the DataStage server host, that is, they are “localhost” connections. Valid values are “IPv4”, “IPv6”, and “AnyIP”. “AnyIP” means that both IPv4 and IPv6 connections are tried until one succeeds, whereas “IPv4” or “IPv6” causes only the indicated protocol to be tried. The administrator must ensure that the specified protocol is enabled for the host operating system.

 

The value of client specifies the TCP port number that the fdbserve daemon binds to and listens on for client-side connections from the DataStage server. The port number must be an integer between 1 and 65535, inclusive (in addition, it is recommended that the port be greater than 1023), and must not conflict with other TCP ports in use on the host. Because the TCP connections are local to the host, there is no requirement that the port be opened in the firewall.

 

The values of engineMin and engineMax define the lower and upper limits, respectively, of a range of TCP port numbers available to fdbserve worker processes, which bind to these ports and listen on them for TCP connections from osh. Both values must be integers between 1 and 65535, inclusive, and engineMax must be greater than engineMin. Each active fdbserve worker process allocates a separate TCP port (which is why the configuration must specify a range of integers); ports in the configured range are tried in sequence until one is successfully bound. The allocated port is passed to the osh conductor (by the DataStage server) as the value of the command line switch -fdbport. Because the TCP connections are local to the host, there is no requirement that the ports be opened in the firewall.
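For reference, the attributes described above might appear in fdb.conf in a form like the following sketch (the exact nesting and syntax can differ between releases, so compare against the shipped file, and keep a backup, before editing):

    (serverconf
        (ports
            protocolFamily="IPv4"
            client=7101
            engineMin=7102
            engineMax=7299
        )
    )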


[{"Product":{"code":"SSVSEF","label":"IBM InfoSphere DataStage"},"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Component":"--","Platform":[{"code":"PF002","label":"AIX"},{"code":"PF010","label":"HP-UX"},{"code":"PF016","label":"Linux"},{"code":"PF027","label":"Solaris"},{"code":"PF033","label":"Windows"}],"Version":"9.1;8.7;11.3","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

Document Information

Modified date:
17 June 2018

UID

swg27023374