APT_DUMP_SCORE report in InfoSphere DataStage parallel jobs
To analyze job performance and diagnose problems in your
jobs, you can review the report in the job log that is generated by
enabling the APT_DUMP_SCORE
environment variable.
The configuration file specifies the nature and amount of parallelism
for a job, and the specific resources that are used to run a job.
When a job is run, the data flow information in the compiled job is
combined with the information in the configuration file to produce
a detailed execution plan that is called the score. When the APT_DUMP_SCORE
environment variable is set, a text representation of the score (a report)
is written to the job log. The report shows the following information:
- Where and how data is partitioned
- Whether InfoSphere DataStage inserted extra operators in the flow
- The degree of parallelism each operator runs with, and on which nodes
- Information about where the data is buffered
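Because the report is plain text, its headline facts can be pulled out with a small script. The following Python sketch is an illustration only (it is not part of DataStage); the regular expressions simply match the summary lines that appear in the example report later in this article:

```python
import re

def summarize_score(report: str) -> dict:
    """Pull the headline counts out of an APT_DUMP_SCORE report."""
    datasets = re.search(r"This step has (\d+) datasets", report)
    operators = re.search(r"It has (\d+) operators", report)
    runs = re.search(r"It runs (\d+) processes on (\d+) nodes", report)
    return {
        "datasets": int(datasets.group(1)) if datasets else None,
        "operators": int(operators.group(1)) if operators else None,
        "processes": int(runs.group(1)) if runs else None,
        "nodes": int(runs.group(2)) if runs else None,
    }

# Summary lines as they appear in the report shown later in this article.
sample = (
    "main_program: This step has 10 datasets:\n"
    "It has 11 operators:\n"
    "It runs 35 processes on 4 nodes.\n"
)
summary = summarize_score(sample)
```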
To set the APT_DUMP_SCORE
environment variable,
open the Administrator client, and then click Parallel > Reporting.
You can set the APT_DUMP_SCORE
environment variable
to true for a job, a project, or the entire system. If you set it
to true for the entire system, all parallel jobs produce the report,
which you can use in your development and test environments.
If you set APT_DUMP_SCORE to true
and then run a job, you typically see the following text in the
job log.
main_program: This step has 10 datasets:
ds0: {op0[1p] (sequential PacifBaseMCES)
eOther(APT_ModulusPartitioner {key={ value=MBR_SYS_ID }
})<>eCollectAny
op1[4p] (parallel RemDups.IndvIDs_in_Sort)}
ds1: {op1[4p] (parallel RemDups.IndvIDs_in_Sort)
[pp] eSame=>eCollectAny
op2[4p] (parallel RemDups)}
ds2: {op2[4p] (parallel RemDups)
[pp] eSame=>eCollectAny
op6[4p] (parallel buffer (0))}
ds3: {op3[1p] (sequential PacifGalaxyMember)
eOther(APT_ModulusPartitioner {key={ value=MBR_SYS_ID }
})<>eCollectAny
op4[4p] (parallel IndvIdJoin.toIndvIdJoin_Sort)}
ds4: {op4[4p] (parallel IndvIdJoin.toIndvIdJoin_Sort)
eOther(APT_HashPartitioner { key={ value=MBR_SYS_ID }
})#>eCollectAny
op5[4p] (parallel inserted tsort operator {key={value=MBR_SYS_ID,
subArgs={asc}}}(0) in IndvIdJoin)}
ds5: {op5[4p] (parallel inserted tsort operator {key={value=MBR_SYS_ID,
subArgs={asc}}}(0) in IndvIdJoin)
[pp] eSame=>eCollectAny
op7[4p] (parallel APT_JoinSubOperatorNC in IndvIdJoin)}
ds6: {op6[4p] (parallel buffer(0))
[pp] eSame=>eCollectAny
op7[4p] (parallel APT_JoinSubOperatorNC in IndvIdJoin)}
ds7: {op7[4p] (parallel APT_JoinSubOperatorNC in IndvIdJoin)
[pp] eAny=>eCollectAny
op8[4p] (parallel
APT_TransformOperatorImplV22S14_ETLTek_HP37FMember_PMR64262_Test1_SplitTran2
in SplitTran2)}
ds8: {op8[4p] (parallel
APT_TransformOperatorImplV22S14_ETLTek_HP37FMember_PMR64262_Test1_SplitTran2
in SplitTran2)
eSame=>eCollectAny
op9[4p] (parallel buffer (1))}
ds9: {op9[4p] (parallel buffer(1))
>>eCollectOther(APT_SortedMergeCollector { key={ value=MBR_SYS_ID,
subArgs={ asc }
}
})
op10[1p] (sequential APT_RealFileExportOperator in
HP37_OvaWestmember_extract_dat)}
It has 11 operators:
op0[1p] {(sequential PacifBaseMCES)
on nodes (
node1[op0,p0]
)}
op1[4p] {(parallel RemDups.IndvIDs_in_Sort)
on nodes (
node1[op1,p0]
node2[op1, p1]
node3[op1, p2]
node4[op1, p3]
)}
op2[4p] {(parallel RemDups)
on nodes (
node1[op2,p0]
node2[op2,p1]
node3[op2,p2]
node4[op2,p3]
)}
op3[1p] {(sequential PacifGalaxyMember)
on nodes (
node2[op3,p0]
)}
op4[4p] {(parallel IndvIdJoin.toIndvIdJoin_Sort)
on nodes (
node1[op4,p0]
node2[op4,p1]
node3[op4,p2]
node4[op4,p3]
)}
op5[4p] {(parallel inserted tsort operator {key={value=MBR_SYS_ID,
subArgs={asc}}}(0) in IndvIdJoin)
on nodes (
node1[op5,p0]
node2[op5,p1]
node3[op5,p2]
node4[op5,p3]
)}
op6[4p] {(parallel buffer(0))
on nodes (
node1[op6,p0]
node2[op6,p1]
node3[op6,p2]
node4[op6,p3]
)}
op7[4p] {(parallel APT_JoinSubOperatorNC in IndvIdJoin)
on nodes (
node1[op7,p0]
node2[op7,p1]
node3[op7,p2]
node4[op7,p3]
)}
op8[4p] {(parallel
APT_TransformOperatorImplV22S14_ETLTek_HP37FMember_PMR64262_Test1_SplitTran2
in SplitTran2)
on nodes (
node1[op8,p0]
node2[op8,p1]
node3[op8,p2]
node4[op8,p3]
)}
op9[4p] {(parallel buffer(1))
on nodes (
node1[op9,p0]
node2[op9,p1]
node3[op9,p2]
node4[op9,p3]
)}
op10[1p] {(sequential APT_RealFileExportOperator in
HP37_OvaWestmember_extract_dat)
on nodes (
node2[op10,p0]
)}
It runs 35 processes on 4 nodes.
In a typical job flow, operators are endpoints and data sets are the links between the operators. An exception is when data sets are used to output a file.
Each link on the job design might write data to a temporary data set that is passed to the next operator. These temporary data sets are only placed in the scratch disk space when an imposed limit is reached. A limit can be imposed due to environmental settings or physical memory limitations.
The exact layout of operators, data sets, and buffering in the score depends on the following factors:
- The established configuration file for the job
- The node pool settings
- The operator configured settings
- The job flow environment variables, such as APT_DISABLE_COMBINATION, being set or not set
op0[1p] {(sequential PacifBaseMCES)
on nodes (
node1[op0,p0]
)}
op1[4p] {(parallel RemDups.IndvIDs_in_Sort)
on nodes (
node1[op1,p0]
node2[op1,p1]
node3[op1,p2]
node4[op1,p3]
)}
In the example above, the first operator is listed as PacifBaseMCES, which is the stage name in its entirety. The second operator, however, is listed as RemDups.IndvIDs_in_Sort. The stage name IndvIDs is renamed to indicate that a sort process, triggered by the Remove Duplicates stage (RemDups), occurred.
Listed below each operator name are the specific nodes that the operator is tagged to run on. In the example, node1 is used for the first operator, and node1, node2, node3, and node4 are used for the second operator. The names of the nodes are defined in the job configuration file.
ds0: {op0[1p] (sequential PacifBaseMCES)
eOther(APT_ModulusPartitioner { key={ value=MBR_SYS_ID }
})<>eCollectAny
op1[4p] (parallel RemDups.IndvIDs_in_Sort)}
ds1: {op1[4p] (parallel RemDups.IndvIDs_in_Sort)
[pp] eSame=>eCollectAny
op2[4p] (parallel RemDups)}
- sequential PacifBaseMCES is the source of the data set (operator 0). This stage specifies that the data sets must be read sequentially, or in a specific order, by the program. The stage also specifies that the job cannot be run in a parallel structure because the user specified that the files must be read sequentially.
- parallel RemDups.IndvIDs_in_Sort is the activity of the data set (operator 1). This stage specifies that the data sets can be read in a parallel structure, and therefore, the data sets can run on multiple nodes.
- parallel RemDups is the target of the data set (operator 2). The parallel RemDups operator is the final stage in which the data set is transformed before the job is completed and the data sets complete the job.
The source and target are usually operators, although you might see a specific file name that is provided, which indicates that the operator is referencing and reading from a physical data set.
The first data set, ds0, partitions the data
from the first operator (op0 running in 1 partition).
The data set uses the APT_ModulusPartitioner
class,
which is sometimes referred to as Advanced Parallel Technology modulus,
to partition the data set. The modulus partitioning is using the
key field MBR_SYS_ID
in this scenario. The partitioned
data is being sent to the second operator (op1 running
in 4 partitions), which means that the data is partitioned in 4 partitions
using the modulus method.
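Modulus partitioning itself is simple arithmetic: the numeric key value modulo the number of partitions selects the destination partition. The following Python sketch illustrates the idea; the key values are made up for illustration, and the real APT_ModulusPartitioner operates on the MBR_SYS_ID field inside the engine:

```python
def modulus_partition(key: int, partitions: int) -> int:
    # Modulus partitioning: the numeric key value modulo the
    # number of partitions picks the destination partition.
    return key % partitions

# Four partitions, as in op1[4p]; sample numeric key values.
ids = [100234, 100235, 100236, 100237, 100238]
placement = {i: modulus_partition(i, 4) for i in ids}
# Consecutive key values land in consecutive partitions: 2, 3, 0, 1, 2
```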
The second data set, ds1, reads from the second
operator (op1 running in 4 partitions). The second
data set uses the eSame
method to partition the data
and sends the data over to the third operator (op2 running
in 4 partitions). The value [pp] means preserved
partitioning. Preserved partitioning is an option that is set by default
when you define your jobs. If data must be repartitioned, the [pp] flag
is overridden and a warning message is triggered.
In this excerpt, the eOther and eCollectAny input
and target read methods are used. The second method indicates
the method that the receiving operator uses to collect the data.
- In this example, eOther is the originating or input method for op0. It is an indication that something other than the expected partitioning option is being imposed, and that you need to observe the string within the parentheses, which enclose APT_ModulusPartitioner. In this example, modulus partitioning is imposed.
- eCollectAny is the target read method. Any records that are fed to this data set are collected in a round robin manner. The round robin behavior is less significant than the behavior that occurs for the input partitioning method, which is eOther (APT_ModulusPartitioner) for ds0.
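The round robin collection can be pictured with a small sketch. This is an illustration only: it assumes each partition's records arrive in order and that the collector takes one record from each partition in turn, whereas a real eCollectAny reader takes records as they arrive:

```python
def round_robin_collect(partitions):
    """Take one record from each partition in turn until all are drained."""
    iterators = [iter(p) for p in partitions]
    collected = []
    while iterators:
        remaining = []
        for it in iterators:
            try:
                collected.append(next(it))
                remaining.append(it)
            except StopIteration:
                pass  # this partition is exhausted
        iterators = remaining
    return collected

# Records spread across 4 partitions, as in the op1[4p] example.
collected = round_robin_collect([[1, 5], [2, 6], [3], [4]])
```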
In the case of the APT_SortedMergeCollector
class, the eCollectOther method
indicates where actual partitioning occurs and is specified when you
are referencing a sequential flat file.
ds8: {op8[4p] (parallel
APT_TransformOperatorImplV22S14_ETLTek_HP37FMember_PMR64262_Test1_SplitTran2
in SplitTran2)
eSame=>eCollectAny
op9[4p] (parallel buffer(1))}
ds9: {op9[4p] (parallel buffer(1))
>>eCollectOther(APT_SortedMergeCollector { key={ value=MBR_SYS_ID,
subArgs={ asc }
}
})
op10[1p] (sequential APT_RealFileExportOperator in
HP37_OvaWestmember_extract_dat)}
The report uses symbols to represent
the partitioning method and read method. See table 1 for a description
of the symbols.
Symbol | Originating partitioning method | Target read method |
---|---|---|
-> | Sequential | Sequential |
<> | Sequential | Parallel |
=> | Parallel | Parallel (same) |
#> | Parallel | Parallel (not same) |
>> | Parallel | Sequential |
> | No source | No target |
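When you scan many reports, it can help to encode table 1 in a small lookup, for example inside a log-reading script. The following Python fragment simply restates the table:

```python
# Partitioning/read symbols from table 1:
# symbol -> (originating partitioning method, target read method)
SCORE_SYMBOLS = {
    "->": ("Sequential", "Sequential"),
    "<>": ("Sequential", "Parallel"),
    "=>": ("Parallel", "Parallel (same)"),
    "#>": ("Parallel", "Parallel (not same)"),
    ">>": ("Parallel", "Sequential"),
    ">": ("No source", "No target"),
}

def describe(symbol: str) -> str:
    origin, target = SCORE_SYMBOLS[symbol]
    return f"originating: {origin}; target read: {target}"
```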
In the example above, the op0 operator runs first in sequential mode on node node1, and sends data to the ds0 data set. The ds0 data set is partitioned by the modulus partitioning method, and the data moves from sequential to parallel (4 ways). Then the data is sent to the op1 operator that is running in parallel mode on node1, node2, node3, and node4. The op1 operator then handles the collected data, and sends the results to the ds1 data set. The ds1 data set simply provides data to the op2 operator in the same partitioning order as it was for the op1 operator.
op5[4p] {(parallel inserted tsort operator {key={value=MBR_SYS_ID,
subArgs={asc}}}(0) in IndvIdJoin)
on nodes (
node1[op5,p0]
node2[op5,p1]
node3[op5,p2]
node4[op5,p3]
)}
ds4: {op4[4p] (parallel IndvIdJoin.toIndvIdJoin_Sort)
eOther(APT_HashPartitioner { key={ value=MBR_SYS_ID }
})#>eCollectAny
op5[4p] (parallel inserted tsort operator {key={value=MBR_SYS_ID,
subArgs={asc}}}(0) in IndvIdJoin)}
ds5: {op5[4p] (parallel inserted tsort operator {key={value=MBR_SYS_ID,
subArgs={asc}}}(0) in IndvIdJoin)
[pp] eSame=>eCollectAny
op7[4p] (parallel APT_JoinSubOperatorNC in IndvIdJoin)}
[...]
op7[4p] {(parallel APT_JoinSubOperatorNC in IndvIdJoin)
on nodes (
node1[op7,p0]
node2[op7,p1]
node3[op7,p2]
node4[op7,p3]
)}
One potential problem with this particular dump score report is
that one of the two input links for that join stage (op7)
is partitioned using modulus order (ds0), while
the other input link is partitioned by hash partitioning (ds4).
The hash partitioning overrode the initial modulus partitioning request
(ds3). The first modulus insertion was overridden
because the engine detected that the job design did not supply the
required fields. The key fields are frequently supplied in the wrong
order, or the job uses different key fields that break the compatibility
of the data order requirements for the downstream stages. It is important
to review the APT_DUMP_SCORE
report and confirm
that your valid job design is interpreted correctly by the parallel
engine. Ensure that the intended design is correctly implemented.
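The mismatch is easy to reproduce outside the engine: modulus and hash partitioning generally place the same key value in different partitions, so two join inputs partitioned by different methods are not aligned partition-for-partition. A Python illustration, in which md5 is only a stand-in for the engine's hash function (the real APT_HashPartitioner hash differs):

```python
import hashlib

def modulus_partition(key: int, n: int) -> int:
    # Modulus partitioning on a numeric key.
    return key % n

def hash_partition(key: int, n: int) -> int:
    # Stand-in hash partitioner: a stable hash of the key reduced
    # modulo the partition count (the engine's real hash differs).
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n

keys = list(range(100))
mod = [modulus_partition(k, 4) for k in keys]
hsh = [hash_partition(k, 4) for k in keys]
# The two schemes disagree for many keys, so rows with the same key
# value can end up in different partitions on the two join inputs.
mismatches = sum(m != h for m, h in zip(mod, hsh))
```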
op6[4p] {(parallel buffer(0))
on nodes (
node1[op6,p0]
node2[op6,p1]
node3[op6,p2]
node4[op6,p3]
)}
The buffer operator manages the flow of data between operators:
- The buffer operator communicates with the upstream operator to slow down its sending of data.
- The buffer operator holds on to the data until the downstream operator is ready for the next block of data.
If your job is running slower than other jobs, look at the number of buffer operators. Buffer operators prevent race conditions between operators, which helps to ensure that the operators perform the job in the correct order to help prevent errors. Disabling buffering can cause severe problems that are difficult to analyze. However, better job design can reduce the amount of buffering that occurs.
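The two behaviors in the list above amount to a bounded buffer with backpressure. A Python sketch using a fixed-size queue (the buffer size of 3 is arbitrary, chosen only for illustration):

```python
import queue
import threading

# A bounded queue models the buffer operator: a full buffer blocks the
# upstream producer (the "slow down" signal); the downstream consumer
# takes the next block of data whenever it is ready.
buf = queue.Queue(maxsize=3)
consumed = []

def upstream():
    for record in range(10):
        buf.put(record)   # blocks while the buffer is full
    buf.put(None)         # end-of-data marker

def downstream():
    while True:
        record = buf.get()
        if record is None:
            break
        consumed.append(record)

producer = threading.Thread(target=upstream)
consumer = threading.Thread(target=downstream)
producer.start(); consumer.start()
producer.join(); consumer.join()
# All records arrive downstream, in order, despite the size-3 buffer.
```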
op1[2p] {(parallel APT_CombinedOperatorController:
(APT_TransformOperatorImplV0S1_TrafoTest1_Transformer_1 in
Transformer_1)
(Peek_2)
) on nodes (
node1[op1,p0]
node2[op1,p1]
)}
Data sets take up memory. As part of optimization, jobs try to combine multiple operators when the combined operators handle data in the same way that the separate operators would. For example, when there is no requirement to change the partitioning or sort order of the data flow, data is handed directly to the next operator as soon as processing completes in the prior operator, which reduces the memory impact.
In this example, two combined operators, a transform operator and a peek operator, are running on two partitions.
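Operator combination can be pictured as function fusion: rather than writing a full intermediate data set between the transform and the peek, each record streams through both steps. The following Python generator sketch uses made-up stand-ins for the Transformer_1 and Peek_2 stages above:

```python
def transformer(records):
    # Stand-in for the transform operator: derive a new value per record.
    for r in records:
        yield r * 2

def peek(records, seen):
    # Stand-in for the Peek stage: observe each record, pass it through.
    for r in records:
        seen.append(r)
        yield r

seen = []
# One fused pipeline: no intermediate data set is materialized between
# the two operators; records flow through one at a time.
result = list(peek(transformer(range(5)), seen))
```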
When the job log indicates that an error occurred in APT_CombinedOperator,
the APT_DUMP_SCORE report can help you identify which
of the combined operators is causing the problem. To isolate the
problem, enable the APT_DISABLE_COMBINATION environment variable,
which prevents operators from being combined and can therefore help
you identify which stage has the error.