For job designs that use connectors or BDFS to read or write data from data sources, you can use InfoSphere DataStage Balanced Optimization for greater control over the job. You design your job as you normally would, then use InfoSphere DataStage Balanced Optimization to redesign the job automatically to your stated preferences. This redesign process can maximize performance by minimizing the amount of input and output performed, and by balancing the processing against source, intermediate, and target environments. You can then examine the optimized job design and save it as a new job. Your original job design remains unchanged. InfoSphere DataStage Balanced Optimization enables you to take advantage of the power of the databases without becoming an expert in SQL and Hadoop clusters without becoming an expert in a MapReduce language, such as Jaql.
The following principles can lead to better performance of parallel jobs:
- Minimize I/O and data movement — Reduce the amount of source data read by the job by performing computations within the source data server. Where possible, move processing of data to the data server and avoid extracting data just to process it and write it back to the same data server.
- Maximize optimization within source or target data servers — Make use of the highly developed optimizations that data servers achieve by using local indices, statistics, and other specialized features.
- Maximize parallelism — Take advantage of default InfoSphere DataStage behavior when reading and writing data servers by using parallel interfaces and pipe the data through the job so data flows from source to target without being written to intermediate destinations.
InfoSphere DataStage Balanced Optimization uses these principles to improve the potential performance of a job. It does not change or optimize machine configurations, InfoSphere DataStage configurations, database configurations, or HDFS configurations.
You can influence how jobs are optimized by setting options in the InfoSphere DataStage Balanced Optimization window. Only options meaningful to the particular job being optimized are displayed. Optimization attempts to perform optimizations according to the options you select. If a job contains any unsupported functionality, it will not be optimized. Optimization options apply to the job as a whole, not to specific stages or links:
- Push processing to data targets— Select this to push processing implemented by Transformer, Sort, Aggregator, Join, and Lookup stages back into data targets where ever possible.
- Push processing to data sources— Select this to push processing implemented by Transformer, Sort, Aggregator, Remove Duplicates, Join, and Lookup stages back into data sources wherever possible.
- Push data reduction processing to data targets— Data-reduction processing involves reducing the volume of data before it is written to the data target. Aggregations and transforms that have constraint expressions set are examples of data-reduction processing. In normal circumstances, you do not want to push this processing to a target because it means that a greater volume of data is written to the target. However, pushing data-reduction processing to the target data server can be useful — for example, when the aggregation or filtering is performed on data already in the target data server.
- Push all processing into the target data server— If you have a job where all the data sources and targets are in the same data server, and the server is capable of carrying out the processing tasks, setting this option moves all the job logic to the data server. The optimized job comprises a dummy row-generator stage feeding a target database or Jaql stage. The dummy stage is required because InfoSphere DataStage does not support single-stage jobs. When the optimized job is run, all the data remains within the data server. There is no data server I/O and no actual data flows through the job. Complex SQL is generated to perform all the processing within the target data server.
- Use bulk loading of target staging tables (for database connectors only)— Select this to perform target database modification by using high-performance bulk loading of a temporary staging table in the target database. This is not available for Netezza or for Hadoop clusters. Data used to insert, update, or delete rows in the real target table is first bulk-loaded into a temporary staging table. After the staging table is bulk-loaded, SQL statements are run within the database engine (using the target connector after-SQL capability) to manage the actual destination table as originally specified and to post-process the data into its final destination table. By default, any staging table is created in the same target database space as the real target table, but you can specify a different database space for the staging table if required. By default, the staging table is always dropped if it exists, then is recreated on each run of the job. You can manually change these settings in the optimized job before you run it.
Default values for options and properties generally result in good optimization, but one may wish to change some of these to take advantage of specific environment settings. Table 1 shows the options and properties that may appear in a job involving the BDFS stage and how they should be set.
Table 1. Scope of options/property
|Option/Property||Scope||Appears when||How to set|
|Push processing to data targets||job||A database connector or BDFS stage is used as a data target||To push any processing possible into a data target (default).|
|Push data reductions to data targets||job||An aggregator or other stage that does significant data reduction appears in the job||To allow aggregations and other data-reduction processing to be pushed to a data target (default).|
|Push all processing into target data server||job||All data source and target stages in the job access the same data server||To allow the optimizer, after all other optimizations have been performed, to finally push all processing into the target data server with no data flowing in the job (default).|
|Name of a stage where optimization should stop||job||Any job with a processing stage||Select the name of a processing stage in the job where optimization should stop; leave blank to allow optimization of any processing stage.|
|Staging tablespace name||stage||A tablespace where the staging table resides; applicable to a target database connectors only||Name of an alternate tablespace where bulk staging tables will be created. Select a job parameter from the drop-down to use its value; leave blank to use the default tablespace for the staging table).|
|Maximum number of SQL nested joins||job||Any job containing one or more Join or Lookup stages||Database SQL processors can have difficulty handling large numbers of nested joins. Select "no limit" to allow InfoSphere DataStage Balanced Optimization to use as many joins as needed. Select a positive integer value N to limit optimization in any single database connector to no more than N nested joins. The default value is "no limit."|
|Hadoop language||stage||Any job containing a BDFS stage||Jaql is the only choice in this release.|
|Hadoop configuration folder||job||Any job containing a BDFS stage||This option needs to be set to a Hadoop configuration folder
accessible from all InfoSphere Information Server engine nodes. The default value is
|Hadoop temporary folder||job||Any job containing a BDFS stage||This option needs to be set to a folder path name in the Hadoop file
system. The default value is
|BalOp temporary folder||job||Any job containing a BDFS stage||This option needs to be set to a folder path name accessible
from all InfoSphere Information Server engine nodes. The default value is
|Jaql home||job||Any job containing a BDFS stage||This option needs to be set to the Jaql installation root folder. The
default value is |
|PushSequentialFile||job||Any job containing a BDFS stage||By default, the SequentialFile stage is not optimized. When the option is set to "Yes," SequentialFile stages can be pushed into the target BDFS stage.|
|StageAffinity||job||Any job containing a BDFS stage||The default stage affinity is set to "none." Select "source" if the stage should only be pushed to the source stage when possible. Select "target" if it should only be pushed to the target stage.|
Below are the supported stages for InfoSphere 9.1 release:
- Oracle Connector
- Teradata Connector
- DB2 Connector
- Netezza Connector
- Jaql Connector
InfoSphere DataStage Balanced Optimization process overview
The process overview below sets out the way to get the best results when using InfoSphere DataStage Balanced Optimization:
- Design an InfoSphere DataStage job to perform the required processing. Follow
these rules in your design:
- Express your logic in the native InfoSphere DataStage forms of stages, links, and expressions supported by the visual design canvas.
- Use the default values for stage properties wherever possible.
- Use simple target SQL queries that generated by the stage where possible. Express logic in multiple stages instead of specifying more complex SQL in a single stage. InfoSphere DataStage Balanced Optimization processes the job and produces the complex SQL for you.
- Compile and run the job with a representative set of source and target data, then verify the results. Be sure to run the job in a similar environment, including multi-node parallelism, that you will run the optimized job in. You must compile the job before trying to optimize it; compilation ensures that only valid jobs are submitted for optimization. Run the job to obtain performance and resource utilization baseline data.
- Examine the job log of the root job for warnings or errors. Some warnings about items like type or precision or length mismatches, duplicate updates, and nulls and non-null columns can lead to problems or errors in an optimized job, and these must be resolved before you optimize the job.
- Select your job design in the Designer client and select File > Optimize.
- In the InfoSphere DataStage Balanced Optimization window, select your optimization options and supply any optimization properties required.
- Generate an optimized job and save it.
- In the Designer client, examine the optimized job to see how the job design has changed.
- Compile the optimized job and run it with the same data you used when you verified the root job. Ensure that both jobs are run under the same conditions. For example, if your root job created tables, make sure to delete that table before you run the optimized job.
- Compare the results of the optimized job to the root job and see if there are improvements in performance and resource utilization. Use the performance analysis tools in the Designer client to help you.
Optimizing BDFS jobs
To execute BDFS jobs, set the following paths:
Note that JAR file versions may vary depending on the InfoSphere BigInsights version you use.
Create any sample job as, shown in Figure 1.
Figure 1. Sample job with source and target as BDFS
Before optimizing the job, create a
BalOpJaqlParams set and add following
HadoopTempspecifies a folder in the HDFS. All the temporary HDFS files created by the Jaql generated by
BalOpwill be placed in this folder.
BalOpTempspecifies a temporary folder accessible to all InfoSphere Information Server engine nodes. All the
BalOpruntime log files are placed in this folder.
JaqlServerHostName(default: not used)—
JaqlServerHostNamespecifies the host name where the Jaql server resides. When used, this value applies to all the generated Jaql stages in the job. In the Advanced Options area, there is a
JaqlServerHostNameoption shared by all the BDFS stages in the job. If the
JaqlServerHostNameoption is set as BDFS host, the
JaqlServerHostNameparameter will not be used. If the option is set as
JaqlServerHostNameparameter will be used, and its default value will be set as the BDFS host name specified in one of the optimized BDFS stages in the original job.
JaqlServerPortNumberspecifies the port number of the Jaql server. Note that the Jaql server host name is the same as the BDFS host name specified in the corresponding BDFS stage original job.
JaqlClientLibraryPathspecifies the full path of the Jaql client library. This needs to be accessible from all the nodes of InfoSphere DataStage. Note that there is no additional InfoSphere BigInsights component that needs to be accessible from the InfoSphere DataStage nodes. It is recommended that the Jaqlclient.jar file is copied from an InfoSphere BigInsights installation to a shared folder accessible from all InfoSphere DataStage nodes.
These Jaql parameters are required because when source and target as BDFS job is optimized with supported stages we get Jaql Connector stage.
Optimize the job by clicking Optimize or menu File > Optimize. The Optimization Prolog window is opened as shown in Figure 2.
BalOp optimizer window
You can select options as desired or optimize with default options and click Optimize. The job gets optimized as shown in Figure 3.
Figure 3. Optimized job
The Jaql Connector stage contains a Jaql query run by a Hadoop cluster. Jaql stages are designed to support only InfoSphere DataStage Balanced Optimization, so the Jaql stage in the optimized job cannot be customized.
If a job is partially optimized, you will see a BDFS next to the Jaql stage. The BDFS retrieves the data from the temporary HDFS file that contains the results of running the Jaql query and places the data on the output link. The implicit connection between two stages is defined by the Sync From Stage option set on the BDFS.
You can compare the designs of the original and optimized jobs by clicking the Compare tab, as shown in Figure 4.
Figure 4. Comparing original and optimized jobs
You can also see the optimization logs on the Logs tab, as shown in Figure 5. If any unsupported function is used in a stage, that stage will not be optimized, and warnings are written to the logs
Figure 5. Job optimization log on Logs tab
Save the optimized job by clicking Save Optimized Job As. The optimized job is saved, and a confirmation window is displayed, as shown in Figure 6.
Figure 6. Confirmation window of saved optimized job
If a job is fully optimized, as shown in Figure 7, you will see the optimized job, as shown in Figure 8.
Figure 7. Sample job consisting of source and target as BDFS
Figure 8. Fully optimized job
To view the results of a BDFS job, go to the InfoSphere BigInsights web console and browse through the output file.
Controlling optimization using Advanced Options tab
If you want a job to be partially optimized, select the stage at the point where the job needs to be optimized using the Advance Options tab in the Optimization window.
Figure 9. Advanced Options tab
If you have multiple Join or Lookup stages involved in a job, you can control how many nested stages of lookups should be optimized. By default, it is set to no limit, as shown in Figure 10.
Figure 10. Setting up nested joins for optimization
You can control a stage to be pushed to Source or Target by using the Stage Affinity property, as shown in Figure 11.
Figure 11. Setting up Stage Affinity for optimization
Optimizing SequentialFile stage job
The SequentialFile stage can only be pushed into a BDFS stage. There are certain
restrictions on how the SequentialFile stage is optimized. Let's say you have an
original job that involves one source sequential file, one Transformer stage, and
one BDFS target stage, as shown in Figure 10. In the optimized job, both the
sequential file stage and the transformer stage get pushed into the BDFS stage and
become a Jaql stage, as shown in Figure 11. A Jaql
read function is
generated to extract data from the sequential file. The result will then be
processed by the next Jaql statement — a
transform function that takes the
result from the read function and transforms it into data, which will then be written
to an HDFS file. All of these Jaql statements will be processed by the Jaql server
in the BigInsights/Hadoop host. Therefore, the SequentialFile needs to be accessible
from the Jaql server. Regardless, if the InfoSphere Information Server
server and InfoSphere BigInsights are on the same
machine or two different machines, the SequentialFile needs to reside on the
InfoSphere BigInsights machine or in a distributed file system that is accessible from the
InfoSphere BigInsights host.
While in the original job, the SequentialFile stage reads a file that is supposed to be only accessible from the InfoSphere Information Server engine tier. Therefore, by pushing a SequentialFile stage into a BDFS stage, it introduces the additional requirement that the SequentialFile also needs to be accessible from the InfoSphere BigInsights host. In a real customer environment, this might be desirable as SAN storage is commonly used and can be accessible from InfoSphere Information Server and InfoSphere BigInsights. A new BalOp Advanced option —Push sequential files into BDFS stage (default = No) — can be used to enable/disable this feature, as shown in Figure 12.
Figure 12. Sample SequentialFile stage job
Figure 13. Setting up SequentialFile stage for optimization
Figure 14. Optimized SequentialFile stage job using Push SequentialFile stage using Advanced option
Supported functions in Jaql
The following table lists functions that can be used in a Transformer stage and states whether they can be reproduced in SQL.
If you use any of these unsupported functions in any expression in a Transformer stage, the processing contained in that Transformer stage cannot be pushed into target or source SQL. If you want to use these items in your job, segregate them into a separate Transformer stage, which will not be pushed into a database source or target.
Table 2. Supported functions in Jaql
|Transformer function||Jaql function||Comment|
|AlNum(S)||alnum(S)||alnum is a user-defined Jaql module|
|Alpha(S)||alpha(S)||alpha is a user-defined Jaql module|
|Ceil(A)||if(A==long(A)) A else long(A)+1|
|Compare(S1,S2)||compare(S1,S2)||compare is a user-defined Jaql module|
|Compare(S1,S2,'L')||compare(S1,S2)||compare is a user-defined Jaql module|
|CompareNoCase(S1,S2)||compare(strLowerCase(S1), strLowerCase(S2))||compare is a user-defined Jaql module|
|CompareNum(S1,S2,N)||compare(substring(S1,0,N), substring(S2,0,N))||compare is a user-defined Jaql module|
|CompareNumNoCase(S1,S2,L)||compare( substring(strToLowerCase(S1),0,N) , substring(strToLowerCase(S2),0,N))||compare is a user-defined Jaql module|
|Convert(FL,TL,E)||convert(FL,TL,E)||convert is a user-defined Jaql module|
|CurrentTimestamp()||substring(serialize(now()),6,16) + " " +substring(serialize(now()),17,25)|
|CurrentTimestampMS()||substring(serialize(now()),6,16 )+ " " + substring(serialize(now()),17,29)|
|DSJobStartTimestamp()||substring(serialize(now()),6,16) + " " + substring(serialize(now()),17,25)|
|DateFromComponents(YR,MON,DY)||serialize(YR)+ "-" + serialize(MON) + "-" + serialize(DY)|
|DateToString(D)||D||Default date format is YYYY-MM-DD|
|DaysInMonth(D)||daysinmonth(D)||daysinmonth is a user defined Jaql module|
|DecimalToDate(A)||substring(serialize(A),0,4)+ "-" +substring(serialize(A),4,6)+ "-" +substring(serialize(A),6,8)|
|Div(A,B)||if(B==0) 0 else double(A)/B|
|Count(S,B)||count(strPosList(S,B))(S->main string ,B->substring)|
|DCount(S,DL)||count(strPosList(S,DL))+1||Supports all delimiters except for "."|
|IsValid(Type, A)||Not supported|
|Max(A,B)||if(A>B)A else B|
|MidnightSecondsFromTime(T)||long(substring(T,11,13))*3600 + long(substring(T,14,16))*60 + long(substring(T,17,19))|
|Min(A,B)||if(A <B)A else B|
|NullToEmpty(A)||if(isnull(A)) "" else A|
|NullToValue(A,B)||if(isnull(A)) B else A|
|NullToZero(A)||if(isnull(A)) 0 else A|
|Num(S)||num(S)||num is a user defined Jaql module|
|PadString(S,PS,PL)||S+padstring(PS,PSL)||padstring is a user-defined Jaql module|
|Right(S,N)||strJoin(slice(strSplit(S,""), (strLen(S)-N+1), strLen(S)),"")|
|Seq(N ,"allow8bits")||Not supported|
|Space(L)||padstring(" ",L)||padstring is a user-defined Jaql module|
|StringToDate(S)||S||Default date format is YYYY-MM-DD|
|StringToTime(S)||S||Default time format is HH:NN:SS.N|
|StringToTime(S,F) (F format string without microseconds)||Not supported|
|StringToTime(ARG__S,"%HH:%NN:%SS.N") (N can take values from 1 to 6)||S||Default time format is HH:NN:SS.N|
|StringToTimestamp(S)||S||Default time format is YYYY-MM-DD HH:NN:SS.N|
|StringToTimestamp(S,F) (F format string without microseconds)||Not supported|
|StringToTimestamp(S,F1) (F1 format string with microseconds)||Not supported|
|StringToTimestaamp(S,"%YYYY-%MM-%DD %HH:%NN:%SS.N") (N can take values from 1 to 6)||S|
|StripWhiteSpace(S)||strip_white_space(S)||strip_white_space is a user defined Jaql|
|Substring(S,N)||substring(S,0,N)||num is a user defined Jaql module|
|TimeDate()||time_date(now())||time_date is a new Jaql module included to enable this map|
|TimeFromComponents(HH,MI,SS,MS)||serialize(HH) + ":" +serialize(MI) + ":" + serialize(SS) + "." + serialize(MS)|
|TimeFromMidnightSeconds(S)||time_from_midnightseconds(S)||time_from_ midnightseconds is a user-defined Jaql|
|TimestampOffsetByComponents(TS, YYOFF,MMOFF,DYOFF,HHOFF,MIOFF,SSOFF)||Not supported|
|TimeToDecimal(T)||double(TimeToStr(T))||TimeToStr is a user-defined Jaql module|
|TimestampToDate(TS)||substring(TS,0,10)||Timestamp format is YYY-MM-DD HH:NN:SS.N|
|TimestampToDecimal(TS)||TStoDecimal(TS)||TStoDecimal is user-defined Jaql functions|
|TimestampToString(TS,F) (F format string without microseconds)||Not supported|
|TimeStampToString(TS,F1) (F1 format string with microseconds)||Not supported|
|TimeStampToString(TS,"%YYYY-%MM-%DD %HH:%NN:%SS.N") (N can take values from 1 to 6)||TS|
|TimestampToTime(TS)||substring(TS,11,26)||Timestamp format is YYY-MM-DD HH:NN:SS.N|
|YeardayFromDate(D)||yearday_from_date(D)||yearday_from_date is a user-defined Jaql module|
|YearweekFromDate(D)||if(mod(yearday_from_date(D),7)==0) yearday_from_date(D)/7 else long(yearday_from_date(D)/7)+1||yearday_from_date is a user-defined Jaql module|
Jaql queries do not provide equivalent functionality for all the operators you can use in job designs.
The following table lists operators that can be used in a Transformer stage and states whether they can be reproduced in Jaql.
Table 3. Supported operators in Jaql
|Transformer operator||Jaql operator|
|A!B||A or B|
|A # B||A!=B|
|A #> B||A<=B|
|A & B||A and B|
|A * B||A*B|
|A / B||double(A)/B|
|A : B||serialize(A)+serialize(B)|
|A ^ B||pow(A,B)|
|A + B||A+B|
|A < B||A < B|
|A <= B||A<=B|
|A <> B||A!=B|
|A = B||A == B|
|A =< B||A<=B|
|A => B||A => B|
|A > B||A > B|
|A >< B||A!=B|
|A >= B||A >= B|
|A and B||A and B|
|A eq B||A == B|
|A ge B||A>=B|
|A gt B||A > B|
|A le B||A <= B|
|A lt B||A<B|
|A ne B||A <> B|
|A or B||A OR B|
|IF A THEN B ELSE C||if(A) B else c|
|not A||not A|
|S[A]||if(long(N)>=0) substring(S,0,abs(N)) else (strJoin(slice(strSplit(S,""),(strLen(S)-abs(N)+1),strLen(S)),""))|
Limitations in Jaql
- The First line is column names option in BDFS and SequentialFile stage is not supported.
- For InfoSphere DataStage jobs where the SequentialFile stage can be pushed to a Jaql stage, relative paths for the file names are not supported. In Jaql, there is not a straightforward way to read files with relative paths from the local file system.
- The First line is column names option in BDFS and SequentialFile stage is not supported. Jaql can display only three digits in Timestamp Microseconds part.
- DecimalToDate Conversion function can show wrong results when the scale part of decimal column is greater than zero.
- Combined Operator expressions like If (lnkSeqFile.C1 And lnkSeqFile.C2 )<>> 0 Then Div(lnkSeqFile.C1,lnkSeqFile.C2) Else 100 will not work in Jaql. Expressions like these should split in to If ((lnkSeqFile.C1 >0) And (lnkSeqFile.C2 > 0)) Then Div(lnkSeqFile.C1,lnkSeqFile.C2) Else 100.
This article provides a close look at how to configure BDFS stage and Jaql environment required to optimize and execute a job. Also discussed InfoSphere DataStage Balanced Optimization process overview, various optimization patterns, optimization options and supported functions.
- Get more information about InfoSphere Information Server from Information Center.
- Stay current with developerWorks technical events and webcasts focused on a variety of IBM products and IT industry topics.
- Attend a free developerWorks Live! briefing to get up-to-speed quickly on IBM products and tools as well as IT industry trends.
- Follow developerWorks on Twitter.
- Watch developerWorks on-demand demos ranging from product installation and setup demos for beginners, to advanced functionality for experienced developers.
- Get involved in the My developerWorks community. Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.