Optimizing BDFS jobs using InfoSphere DataStage Balanced Optimization

This article explains how to use InfoSphere® DataStage® Balanced Optimization to rewrite Big Data File Stage (BDFS) jobs into Jaql. The BDFS stage, introduced in InfoSphere Information Server 9.1, operates on InfoSphere BigInsights. To optimize the performance of BDFS jobs, InfoSphere DataStage Balanced Optimization is required: it redesigns a job to maximize performance by minimizing the amount of input and output performed and by balancing the processing against source, intermediate, and target environments. Readers will learn how to use InfoSphere DataStage Balanced Optimization with the BDFS stage, and which configuration parameters the Jaql Connector requires when a job is optimized.

Rao Lella (raolella@in.ibm.com), Staff Software Engineer, IBM India

Rao Lella is a senior QA engineer working for the InfoSphere DataStage team at IBM India Software Lab, Hyderabad. He has eight years of experience at IBM, working for different QA teams in the Information Server product area.



20 February 2014

Overview

For job designs that use connectors or the BDFS stage to read or write data from data sources, you can use InfoSphere DataStage Balanced Optimization for greater control over the job. You design your job as you normally would, then use InfoSphere DataStage Balanced Optimization to redesign the job automatically according to your stated preferences. This redesign process can maximize performance by minimizing the amount of input and output performed, and by balancing the processing against source, intermediate, and target environments. You can then examine the optimized job design and save it as a new job. Your original job design remains unchanged. InfoSphere DataStage Balanced Optimization enables you to take advantage of the power of databases without becoming an expert in SQL, and of Hadoop clusters without becoming an expert in a MapReduce language such as Jaql.

The following principles can lead to better performance of parallel jobs:

  • Minimize I/O and data movement — Reduce the amount of source data read by the job by performing computations within the source data server. Where possible, move processing of data to the data server and avoid extracting data just to process it and write it back to the same data server.
  • Maximize optimization within source or target data servers — Make use of the highly developed optimizations that data servers achieve by using local indices, statistics, and other specialized features.
  • Maximize parallelism — Take advantage of default InfoSphere DataStage behavior when reading and writing data servers by using parallel interfaces and pipe the data through the job so data flows from source to target without being written to intermediate destinations.

InfoSphere DataStage Balanced Optimization uses these principles to improve the potential performance of a job. It does not change or optimize machine configurations, InfoSphere DataStage configurations, database configurations, or HDFS configurations.


Optimization options

You can influence how jobs are optimized by setting options in the InfoSphere DataStage Balanced Optimization window. Only options meaningful to the particular job being optimized are displayed. Optimization is attempted according to the options you select; if a job contains any unsupported functionality, it is not optimized. Optimization options apply to the job as a whole, not to specific stages or links:

  • Push processing to data targets— Select this to push processing implemented by Transformer, Sort, Aggregator, Join, and Lookup stages back into data targets wherever possible.
  • Push processing to data sources— Select this to push processing implemented by Transformer, Sort, Aggregator, Remove Duplicates, Join, and Lookup stages back into data sources wherever possible.
  • Push data reduction processing to data targets— Data-reduction processing involves reducing the volume of data before it is written to the data target. Aggregations and transforms that have constraint expressions set are examples of data-reduction processing. In normal circumstances, you do not want to push this processing to a target because it means that a greater volume of data is written to the target. However, pushing data-reduction processing to the target data server can be useful — for example, when the aggregation or filtering is performed on data already in the target data server.
  • Push all processing into the target data server— If you have a job where all the data sources and targets are in the same data server, and the server is capable of carrying out the processing tasks, setting this option moves all the job logic to the data server. The optimized job comprises a dummy row-generator stage feeding a target database or Jaql stage. The dummy stage is required because InfoSphere DataStage does not support single-stage jobs. When the optimized job is run, all the data remains within the data server. There is no data server I/O and no actual data flows through the job. Complex SQL is generated to perform all the processing within the target data server.
  • Use bulk loading of target staging tables (for database connectors only)— Select this to perform target database modification by using high-performance bulk loading of a temporary staging table in the target database. This is not available for Netezza or for Hadoop clusters. Data used to insert, update, or delete rows in the real target table is first bulk-loaded into a temporary staging table. After the staging table is bulk-loaded, SQL statements are run within the database engine (using the target connector after-SQL capability) to manage the actual destination table as originally specified and to post-process the data into its final destination table. By default, any staging table is created in the same target database space as the real target table, but you can specify a different database space for the staging table if required. By default, the staging table is always dropped if it exists, then is recreated on each run of the job. You can manually change these settings in the optimized job before you run it.

Default values for options and properties generally result in good optimization, but you might want to change some of them to take advantage of specific environment settings. Table 1 shows the options and properties that can appear in a job involving the BDFS stage and how they should be set.

Table 1. Scope of options and properties

  • Push processing to data targets (scope: job). Appears when a database connector or BDFS stage is used as a data target. How to set: set to push any processing possible into a data target (default).
  • Push data reductions to data targets (scope: job). Appears when an Aggregator or other stage that does significant data reduction appears in the job. How to set: set to allow aggregations and other data-reduction processing to be pushed to a data target (default).
  • Push all processing into target data server (scope: job). Appears when all data source and target stages in the job access the same data server. How to set: set to allow the optimizer, after all other optimizations have been performed, to push all processing into the target data server with no data flowing in the job (default).
  • Name of a stage where optimization should stop (scope: job). Appears in any job with a processing stage. How to set: select the name of a processing stage in the job where optimization should stop; leave blank to allow optimization of any processing stage.
  • Staging tablespace name (scope: stage). Appears for target database connectors only; names the tablespace where the staging table resides. How to set: the name of an alternate tablespace where bulk staging tables will be created. Select a job parameter from the drop-down list to use its value, or leave blank to use the default tablespace for the staging table.
  • Maximum number of SQL nested joins (scope: job). Appears in any job containing one or more Join or Lookup stages. How to set: database SQL processors can have difficulty handling large numbers of nested joins. Select "no limit" to allow InfoSphere DataStage Balanced Optimization to use as many joins as needed, or select a positive integer N to limit optimization in any single database connector to no more than N nested joins. The default value is "no limit."
  • Hadoop language (scope: stage). Appears in any job containing a BDFS stage. How to set: Jaql is the only choice in this release.
  • Hadoop configuration folder (scope: job). Appears in any job containing a BDFS stage. How to set: set to a Hadoop configuration folder accessible from all InfoSphere Information Server engine nodes. The default value is #BalOpJaqlParams.HadoopConf#.
  • Hadoop temporary folder (scope: job). Appears in any job containing a BDFS stage. How to set: set to a folder path name in the Hadoop file system. The default value is #BalOpJaqlParams.HadoopTemp#.
  • BalOp temporary folder (scope: job). Appears in any job containing a BDFS stage. How to set: set to a folder path name accessible from all InfoSphere Information Server engine nodes. The default value is #BalOpJaqlParams.BalOpTemp#.
  • Jaql home (scope: job). Appears in any job containing a BDFS stage. How to set: set to the Jaql installation root folder. The default value is #BalOpJaqlParams.JaqlHome#.
  • PushSequentialFile (scope: job). Appears in any job containing a BDFS stage. How to set: by default, the SequentialFile stage is not optimized. When set to "Yes," SequentialFile stages can be pushed into the target BDFS stage.
  • StageAffinity (scope: job). Appears in any job containing a BDFS stage. How to set: the default stage affinity is "none." Select "source" if the stage should only be pushed to the source stage when possible, or "target" if it should only be pushed to the target stage.

Supported stages

The following stages are supported in the InfoSphere Information Server 9.1 release:

  • Aggregator
  • Copy
  • Filter
  • Funnel
  • Join
  • Lookup
  • RemDup
  • Sort
  • Transformer
  • Oracle Connector
  • Teradata Connector
  • DB2 Connector
  • Netezza Connector
  • BDFS
  • SequentialFile
  • Jaql Connector

InfoSphere DataStage Balanced Optimization process overview

The process overview below sets out the way to get the best results when using InfoSphere DataStage Balanced Optimization:

  1. Design an InfoSphere DataStage job to perform the required processing. Follow these rules in your design:
    • Express your logic in the native InfoSphere DataStage forms of stages, links, and expressions supported by the visual design canvas.
    • Use the default values for stage properties wherever possible.
    • Use the simple SQL queries generated by the stage where possible. Express logic in multiple stages instead of specifying more complex SQL in a single stage. InfoSphere DataStage Balanced Optimization processes the job and produces the complex SQL for you.
  2. Compile and run the job with a representative set of source and target data, then verify the results. Be sure to run the job in an environment similar to the one in which you will run the optimized job, including multi-node parallelism. You must compile the job before trying to optimize it; compilation ensures that only valid jobs are submitted for optimization. Run the job to obtain performance and resource utilization baseline data.
  3. Examine the job log of the root job for warnings or errors. Warnings about items such as type, precision, or length mismatches, duplicate updates, and nulls written to non-nullable columns can lead to problems or errors in an optimized job, and these must be resolved before you optimize the job.
  4. Select your job design in the Designer client and select File > Optimize.
  5. In the InfoSphere DataStage Balanced Optimization window, select your optimization options and supply any optimization properties required.
  6. Generate an optimized job and save it.
  7. In the Designer client, examine the optimized job to see how the job design has changed.
  8. Compile the optimized job and run it with the same data you used when you verified the root job. Ensure that both jobs are run under the same conditions. For example, if your root job created tables, make sure to delete those tables before you run the optimized job.
  9. Compare the results of the optimized job to the root job and see if there are improvements in performance and resource utilization. Use the performance analysis tools in the Designer client to help you.

Optimizing BDFS jobs

To execute BDFS jobs, set the following environment variables:

  • export BIGINSIGHTS_HOME=/opt/ibm/biginsights
  • export CLASSPATH=$BIGINSIGHTS_HOME/IHC/hadoop-0.20.2-core.jar:$BIGINSIGHTS_HOME/IHC/lib/commons-cli-1.2.jar:$BIGINSIGHTS_HOME/IHC/lib/commons-codec-1.3.jar:$BIGINSIGHTS_HOME/IHC/lib/commons-el-1.0.jar:$BIGINSIGHTS_HOME/IHC/lib/commons-httpclient-3.0.1.jar:$BIGINSIGHTS_HOME/IHC/lib/commons-logging-1.0.4.jar:$BIGINSIGHTS_HOME/IHC/lib/commons-logging-api-1.0.4.jar:$BIGINSIGHTS_HOME/IHC/lib/commons-net-1.4.1.jar:$BIGINSIGHTS_HOME/IHC/lib/core-3.1.1.jar:$BIGINSIGHTS_HOME/IHC/lib/hadoop-0.20.2-fairscheduler.jar:$BIGINSIGHTS_HOME/IHC/lib/hsqldb-1.8.0.10.jar:$BIGINSIGHTS_HOME/IHC/lib/jasper-compiler-5.5.12.jar:$BIGINSIGHTS_HOME/IHC/lib/jasper-runtime-5.5.12.jar:$BIGINSIGHTS_HOME/IHC/lib/jets3t-0.6.1.jar:$BIGINSIGHTS_HOME/IHC/lib/jetty-6.1.14.jar:$BIGINSIGHTS_HOME/IHC/lib/jetty-util-6.1.14.jar:$BIGINSIGHTS_HOME/IHC/lib/junit-3.8.1.jar:$BIGINSIGHTS_HOME/IHC/lib/kfs-0.2.2.jar:$BIGINSIGHTS_HOME/IHC/lib/log4j-1.2.15.jar:$BIGINSIGHTS_HOME/IHC/lib/oro-2.0.8.jar:$BIGINSIGHTS_HOME/IHC/lib/servlet-api-2.5-6.1.14.jar:$BIGINSIGHTS_HOME/IHC/lib/slf4j-api-1.4.3.jar:$BIGINSIGHTS_HOME/IHC/lib/slf4j-log4j12-1.4.3.jar:$BIGINSIGHTS_HOME/IHC/lib/xmlenc-0.52.jar:$BIGINSIGHTS_HOME/hadoop-conf:$CLASSPATH
  • export LD_LIBRARY_PATH=$BIGINSIGHTS_HOME/IHC/c++/Linux-amd64-64/lib:$LD_LIBRARY_PATH
  • export PATH=$PATH:$BIGINSIGHTS_HOME/IHC/bin/

Note that JAR file versions may vary depending on the InfoSphere BigInsights version you use.

Create a sample job, as shown in Figure 1.

Figure 1. Sample job with source and target as BDFS
Image shows sample job with source and target as BDFS

Before optimizing the job, create a BalOpJaqlParams parameter set and add the following parameters:

  • HadoopTemp (default: /tmp/balop): Specifies a folder in HDFS. All temporary HDFS files created by the Jaql that BalOp generates are placed in this folder.
  • BalOpTemp (default: /home/biadmin/tmp): Specifies a temporary folder accessible to all InfoSphere Information Server engine nodes. All BalOp runtime log files are placed in this folder.
  • JaqlServerHostName (default: not used): Specifies the host name where the Jaql server resides. When used, this value applies to all generated Jaql stages in the job. In the Advanced Options area, there is a JaqlServerHostName option shared by all BDFS stages in the job. If that option is set to the BDFS host, the JaqlServerHostName parameter is not used. If the option is set to BalOpJaqlParams.JaqlServerHostName, the parameter is used, and its default value is set to the BDFS host name specified in one of the optimized BDFS stages in the original job.
  • JaqlServerPortNumber (default: 8200): Specifies the port number of the Jaql server. Note that the Jaql server host name is the same as the BDFS host name specified in the corresponding BDFS stage of the original job.
  • JaqlClientLibraryPath (default: /opt/ibm/biginsights/Jaqlserver/Jaqlclient.jar): Specifies the full path of the Jaql client library, which must be accessible from all InfoSphere DataStage nodes. No additional InfoSphere BigInsights component needs to be accessible from the InfoSphere DataStage nodes. It is recommended that you copy the Jaqlclient.jar file from an InfoSphere BigInsights installation to a shared folder accessible from all InfoSphere DataStage nodes.

These Jaql parameters are required because optimizing a job whose source and target are BDFS stages, together with the supported processing stages, produces a Jaql Connector stage.

Optimize the job by clicking Optimize or by selecting File > Optimize. The InfoSphere DataStage Balanced Optimization window opens, as shown in Figure 2.

Figure 2. BalOp optimizer window
Image shows BalOp optimizer window

You can select options as desired or optimize with the default options, then click Optimize. The job is optimized, as shown in Figure 3.

Figure 3. Optimized job
Image shows optimized job

The Jaql Connector stage contains a Jaql query that is run on the Hadoop cluster. Jaql stages are designed to support only InfoSphere DataStage Balanced Optimization, so the Jaql stage in the optimized job cannot be customized.
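For illustration, the following is a minimal sketch of the style of Jaql such a stage might contain for a simple filter-and-transform flow. The paths, column names, and schema are hypothetical assumptions, not taken from a real optimized job:

// Sketch of generated-style Jaql: read an HDFS file, apply a pushed-down
// constraint and derivations, and write the target HDFS file.
// All paths, columns, and the schema are illustrative assumptions.
read(del("hdfs:/user/biadmin/input/orders.del",
         {schema: schema {C1: string, C2: long}}))
  -> filter $.C2 > 0                               // constraint pushed from a Transformer
  -> transform {K1: strToUpperCase($.C1),          // UpCase(C1)
                K2: $.C2 * 2}                      // arithmetic derivation
  -> write(del("hdfs:/tmp/balop/result.del"));     // target HDFS file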

If a job is partially optimized, you will see a BDFS stage next to the Jaql stage. The BDFS stage retrieves the data from the temporary HDFS file that contains the results of running the Jaql query and places the data on the output link. The implicit connection between the two stages is defined by the Sync From Stage option set on the BDFS stage.

You can compare the designs of the original and optimized jobs by clicking the Compare tab, as shown in Figure 4.

Figure 4. Comparing original and optimized jobs
Image shows comparing original and optimized jobs

You can also see the optimization logs on the Logs tab, as shown in Figure 5. If any unsupported function is used in a stage, that stage is not optimized, and warnings are written to the logs.

Figure 5. Job optimization log on Logs tab
Image shows job optimization log on logs tab

Save the optimized job by clicking Save Optimized Job As. The optimized job is saved, and a confirmation window is displayed, as shown in Figure 6.

Figure 6. Confirmation window of saved optimized job
Image shows confirmation window of saved optimized job

If a job such as the one shown in Figure 7 is fully optimized, you will see the optimized job shown in Figure 8.

Figure 7. Sample job consisting of source and target as BDFS
Image shows sample job consisting of source and target as BDFS
Figure 8. Fully optimized job
Image shows fully optimized job

To view the results of a BDFS job, go to the InfoSphere BigInsights web console and browse through the output file.
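Alternatively, if you have shell access to the InfoSphere BigInsights host, you can inspect the output from the Jaql shell. The following is a minimal sketch; the output path and schema are hypothetical and depend on what your job wrote:

// Hypothetical check of a BDFS job's output from the Jaql shell.
// The HDFS path and schema below are assumptions for illustration.
read(del("hdfs:/user/biadmin/output/result.del",
         {schema: schema {K1: string, K2: long}}))
  -> top 5;    // display the first five records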


Controlling optimization using the Advanced Options tab

If you want a job to be partially optimized, use the Advanced Options tab in the Optimization window to select the stage at which optimization should stop.

Figure 9. Advanced Options tab
Image shows Advanced Options tab

If you have multiple Join or Lookup stages involved in a job, you can control how many nested joins or lookups are optimized. By default, this is set to "no limit," as shown in Figure 10.

Figure 10. Setting up nested joins for optimization
Image shows setting up nested joins for optimization

You can control whether a stage is pushed to the source or the target by using the Stage Affinity property, as shown in Figure 11.

Figure 11. Setting up Stage Affinity for optimization
Image shows setting up Stage Affinity for optimization

Optimizing SequentialFile stage job

The SequentialFile stage can only be pushed into a BDFS stage, and there are certain restrictions on how it is optimized. Say you have an original job that involves one source SequentialFile stage, one Transformer stage, and one BDFS target stage, as shown in Figure 12. In the optimized job, both the SequentialFile stage and the Transformer stage are pushed into the BDFS stage and become a Jaql stage, as shown in Figure 14. A Jaql read function is generated to extract data from the sequential file. The result is then processed by the next Jaql statement, a transform function that takes the result of the read function and transforms it; the transformed data is then written to an HDFS file. All of these Jaql statements are processed by the Jaql server on the InfoSphere BigInsights/Hadoop host, so the sequential file needs to be accessible from the Jaql server. Regardless of whether InfoSphere Information Server and InfoSphere BigInsights are on the same machine or on two different machines, the sequential file needs to reside on the InfoSphere BigInsights machine or in a distributed file system accessible from the InfoSphere BigInsights host.
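The shape of that generated Jaql might look like the following minimal sketch. The file path, schema, and derivations are hypothetical, and the sequential file is addressed with an absolute file:// URI, since relative local paths are not supported (see Limitations in Jaql below):

// Sketch of the Jaql generated when a SequentialFile + Transformer job is
// pushed into a BDFS target. Paths, columns, and schema are assumptions.
read(del("file:///data/shared/input.txt",             // sequential file; must be
         {schema: schema {C1: string, C2: long}}))    // readable from the BigInsights host
  -> transform {K1: strToLowerCase($.C1),             // DownCase(C1) from the Transformer
                K2: $.C2 * 2}                         // arithmetic derivation
  -> write(del("hdfs:/user/biadmin/target/out.del")); // BDFS target file in HDFS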

In the original job, the SequentialFile stage reads a file that only needs to be accessible from the InfoSphere Information Server engine tier. Pushing a SequentialFile stage into a BDFS stage therefore introduces the additional requirement that the sequential file also be accessible from the InfoSphere BigInsights host. In a real customer environment this can be acceptable, as SAN storage is commonly used and can be accessible from both InfoSphere Information Server and InfoSphere BigInsights. A new BalOp advanced option, Push sequential files into BDFS stage (default: No), can be used to enable or disable this feature, as shown in Figure 13.

Figure 12. Sample SequentialFile stage job
Sample SequentialFile stage job
Figure 13. Setting up SequentialFile stage for optimization
Image shows setting up SequentialFile stage for optimization
Figure 14. Optimized SequentialFile stage job using the Push SequentialFile advanced option
Image shows optimized SequentialFile stage job using the Push SequentialFile advanced option

Supported functions in Jaql

The following table lists functions that can be used in a Transformer stage and states whether they can be reproduced in Jaql.

If you use any of the unsupported functions in any expression in a Transformer stage, the processing contained in that Transformer stage cannot be pushed to the data source or target. If you want to use these items in your job, segregate them into a separate Transformer stage, which will not be pushed into a source or target. An example of composing the supported mappings follows the table.

Table 2. Supported functions in Jaql
Transformer function | Jaql function | Comment
ASin(A) | Not supported
Abs(A) | abs(A)
Acos(A) | Not supported
AlNum(S) | alnum(S) | alnum is a user-defined Jaql module
Alpha(S) | alpha(S) | alpha is a user-defined Jaql module
AsDouble(N) | double(N)
AsFloat(N) | double(N)
AsInteger(N) | long(N)
Atan(A) | Not supported
BitAnd(A,B) | Not supported
BitCompress(A) | Not supported
BitExpand(A) | Not supported
BitOr(A,B) | Not supported
Ceil(A) | if(A==long(A)) A else long(A)+1
Char(N) | Not supported
Char(N,"allow8bits") | Not supported
Compare(S1,S2) | compare(S1,S2) | compare is a user-defined Jaql module
Compare(S1,S2,'L') | compare(S1,S2) | compare is a user-defined Jaql module
Compare(S1,S2,'R') | Not supported
CompareNoCase(S1,S2) | compare(strToLowerCase(S1), strToLowerCase(S2)) | compare is a user-defined Jaql module
CompareNum(S1,S2,N) | compare(substring(S1,0,N), substring(S2,0,N)) | compare is a user-defined Jaql module
CompareNumNoCase(S1,S2,N) | compare(substring(strToLowerCase(S1),0,N), substring(strToLowerCase(S2),0,N)) | compare is a user-defined Jaql module
Convert(FL,TL,E) | convert(FL,TL,E) | convert is a user-defined Jaql module
Cos(A) | Not supported
Cosh(A) | 0.5*(exp(A)+exp(-A))
CurrentDate() | substring(serialize(now()),6,16)
CurrentTime() | substring(serialize(now()),17,25)
CurrentTimeMS() | substring(serialize(now()),17,29)
CurrentTimestamp() | substring(serialize(now()),6,16) + " " + substring(serialize(now()),17,25)
CurrentTimestampMS() | substring(serialize(now()),6,16) + " " + substring(serialize(now()),17,29)
DQuote(A) | '"'||A||'"'
SQuote(A) | Not supported
DSJobStartDate() | substring(serialize(now()),6,16)
DSJobStartTime() | substring(serialize(now()),17,25)
DSJobStartTimestamp() | substring(serialize(now()),6,16) + " " + substring(serialize(now()),17,25)
Date_from_days_since(DY) | Not supported
DateFromComponents(YR,MON,DY) | serialize(YR) + "-" + serialize(MON) + "-" + serialize(DY)
DateFromDaysSince(A) | Not supported
DateFromDaysSince(A,GD) | Not supported
DateFromJulianDay(DY) | Not supported
DateOffsetByComponents(D,YYOFF,MMOFF,DYOFF) | Not supported
DateOffsetByDays(D,DYOFF) | Not supported
DateToDecimal(D) | strJoin(strSplit(D,"-"),"")
DateToString(D) | D | Default date format is YYYY-MM-DD
DateToString(D,"%YYYY-%MM-%DD") | D
DateToString(D,F) | Not supported
Days_since_from_date(D) | Not supported
DaysInMonth(D) | daysinmonth(D) | daysinmonth is a user-defined Jaql module
DaysSinceFromDate(GD,SD) | Not supported
DecimalToDFloat(N) | N
DecimalToDFloat(N,"fix_zero") | N
DecimalToDate(A) | substring(serialize(A),0,4) + "-" + substring(serialize(A),4,6) + "-" + substring(serialize(A),6,8)
DecimalToDecimal(N,"trunc_zero") | Not supported
DecimalToDecimal(N) | N
DecimalToDecimal(N,"round_inf") | Not supported
DecimalToDecimal(N,"ceil") | Not supported
DecimalToDecimal(N,"floor") | Not supported
DecimalToString(N) | serialize(N)
DecimalToString(N,"fix_zero") | serialize(N)
DecimalToString(N,"suppress_zero") | serialize(double(N))
DfloatToDecimal(N) | N
DfloatToDecimal(N,"round_inf") | Not supported
DfloatToDecimal(N,"ceil") | Not supported
DfloatToDecimal(N,"floor") | Not supported
DfloatToDecimal(N,"trunc_zero") | Not supported
Div(A,B) | if(B==0) 0 else double(A)/B
DownCase(A) | strToLowerCase(A)
Count(S,B) | count(strPosList(S,B)) | S is the main string, B the substring
DCount(S,DL) | count(strPosList(S,DL))+1 | Supports all delimiters except "."
Exp(A) | exp(A)
Fabs(A) | abs(A)
Field(A,DL,O,N) | strJoin(slice(strSplit(A,DL),O-1,N),DL)
Field(S,DL,O) | strSplit(S,DL)[O-1]
Floor(A) | long(A)
HoursFromTime(T) | long(substring(T,0,2))
Index(S,SB,1) | strPosList(S,SB)[0]+1
Index(S,SB,N) | strPosList(S,SB)[N]+1
IsNotNull(A) | not isnull(A)
IsNull(A) | isnull(A)
IsValid(Type,A) | Not supported
JulianDayFromDate(D) | Not supported
Left(S,N) | substring(S,0,N)
Len(S) | strLen(S)
Llabs(A) | abs(A)
Ln(A) | ln(A)
Max(A,B) | if(A>B) A else B
MicroSecondsFromTime(T) | Not supported
MidnightSecondsFromTime(T) | long(substring(T,11,13))*3600 + long(substring(T,14,16))*60 + long(substring(T,17,19))
Min(A,B) | if(A<B) A else B
MinutesFromTime(T) | long(substring(T,3,5))
MissingValCount(A) | count(*)-count(A)
Mod(A,B) | mod(A,B)
MonthDayFromDate(D) | long(substring(D,8,10))
MonthFromDate(D) | long(substring(D,5,7))
Neg(A) | -A
NextWeekdayFromDate(SD,DW) | Not supported
NullToEmpty(A) | if(isnull(A)) "" else A
NullToValue(A,B) | if(isnull(A)) B else A
NullToZero(A) | if(isnull(A)) 0 else A
Num(S) | num(S) | num is a user-defined Jaql module
PadString(S,PS,PL) | S+padstring(PS,PL) | padstring is a user-defined Jaql module
PreviousWeekdayFromDate(SD,DW) | Not supported
Pwr(A,B) | pow(A,B)
Rand() | Not supported
Random() | Not supported
RawLength(A) | Not supported
Right(S,N) | strJoin(slice(strSplit(S,""), (strLen(S)-N+1), strLen(S)),"")
SecondsFromTime(T) | long(substring(T,6,8))
SecondsSinceFromTimestamp(TS1,TS2) | Not supported
Seq(N) | Not supported
Seq(N,"allow8bits") | Not supported
SetBit(A) | Not supported
SetNull() | null
SeqAt(S,I) | Not supported
Sin(A) | Not supported
Sinh(A) | 0.5*(exp(A)-exp(-A))
Soundex(A) | Not supported
Space(L) | padstring(" ",L) | padstring is a user-defined Jaql module
Sqrt(A) | pow(A,0.5)
Str(S,R) | Not supported
StringToDate(S) | S | Default date format is YYYY-MM-DD
StringToDate(S,"%YYYY-%MM-%DD") | S
StringToDate(S,F) | Not supported
StringToDecimal(S) | double(S)
StringToDecimal(S,"round_inf") | Not supported
StringToDecimal(S,"trunc_zero") | Not supported
StringToDecimal(S,"ceil") | Not supported
StringToDecimal(S,"floor") | Not supported
StringToRaw(S) | Not supported
StringToTime(S) | S | Default time format is HH:NN:SS.N
StringToTime(S,F) (F format string without microseconds) | Not supported
StringToTime(S,"%HH:%NN:%SS") | S
StringToTime(S,"%HH:%NN:%SS.N") (N can take values from 1 to 6) | S | Default time format is HH:NN:SS.N
StringToTimestamp(S) | S | Default timestamp format is YYYY-MM-DD HH:NN:SS.N
StringToTimestamp(S,F) (F format string without microseconds) | Not supported
StringToTimestamp(S,F1) (F1 format string with microseconds) | Not supported
StringToTimestamp(S,"%YYYY-%MM-%DD %HH:%NN:%SS") | S
StringToTimestamp(S,"%YYYY-%MM-%DD %HH:%NN:%SS.N") (N can take values from 1 to 6) | S
StripWhiteSpace(S) | strip_white_space(S) | strip_white_space is a user-defined Jaql module
Substring(S,N) | substring(S,0,N)
Substring(S,Start,Len) | substring(S,Start,Start+Len)
Tan(A) | Not supported
Tanh(A) | (exp(2*A)-1)/(exp(2*A)+1)
TimeDate() | time_date(now()) | time_date is a new Jaql module included to enable this mapping
TimeFromComponents(HH,MI,SS,MS) | serialize(HH) + ":" + serialize(MI) + ":" + serialize(SS) + "." + serialize(MS)
TimeFromMidnightSeconds(S) | time_from_midnightseconds(S) | time_from_midnightseconds is a user-defined Jaql module
TimeOffsetByComponents(T,HHOFF,MIOFF,SSOFF) | Not supported
TimestampOffsetByComponents(TS,YYOFF,MMOFF,DYOFF,HHOFF,MIOFF,SSOFF) | Not supported
TimeOffsetBySeconds(T,SSOFF) | Not supported
TimestampOffsetBySeconds(TS,SSOFF) | Not supported
TimeToDecimal(T) | double(TimeToStr(T)) | TimeToStr is a user-defined Jaql module
TimeToString(T) | T
TimeToString(T,F) | Not supported
TimestampFromDateTime(D,T) | D + " " + T
TimestampFromSecondsSince(S) | Not supported
TimestampFromSecondsSince(S,TS) | Not supported
TimestampToDate(TS) | substring(TS,0,10) | Timestamp format is YYYY-MM-DD HH:NN:SS.N
TimestampToDecimal(TS) | TStoDecimal(TS) | TStoDecimal is a user-defined Jaql function
TimestampToString(TS) | TS
TimestampToString(TS,F) (F format string without microseconds) | Not supported
TimestampToString(TS,F1) (F1 format string with microseconds) | Not supported
TimestampToString(TS,"%YYYY-%MM-%DD %HH:%NN:%SS") | TS
TimestampToString(TS,"%YYYY-%MM-%DD %HH:%NN:%SS.N") (N can take values from 1 to 6) | TS
TimestampToTime(TS) | substring(TS,11,26) | Timestamp format is YYYY-MM-DD HH:NN:SS.N
Trim(S,'E') | Not supported
Trim(S,'F') | Not supported
Trim(S,SC,'A') | Not supported
Trim(S,SC,'B') | Not supported
Trim(S,SC,'L') | Not supported
Trim(S,SC,'T') | Not supported
TrimB(S) | Not supported
TrimF(S) | Not supported
TrimLeadingTrailing(S) | Not supported
UpCase(A) | strToUpperCase(A)
Variance(A) | Not supported
VarianceNRecs(A) | Not supported
WeekdayFromDate(D) | Not supported
YearFromDate(D) | long(substring(D,0,4))
YeardayFromDate(D) | yearday_from_date(D) | yearday_from_date is a user-defined Jaql module
YearweekFromDate(D) | if(mod(yearday_from_date(D),7)==0) yearday_from_date(D)/7 else long(yearday_from_date(D)/7)+1 | yearday_from_date is a user-defined Jaql module
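
As an example of how these mappings compose, a Transformer derivation such as UpCase(NullToEmpty(lnk.C1)) is rewritten by combining the NullToEmpty and UpCase rows above; the link and column names here are hypothetical:

// Transformer derivation: UpCase(NullToEmpty(lnk.C1))
// Jaql, composed from the NullToEmpty and UpCase rows of Table 2:
strToUpperCase(if(isnull($.C1)) "" else $.C1)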

Jaql queries do not provide equivalent functionality for all the operators you can use in job designs.

The following table lists operators that can be used in a Transformer stage and states whether they can be reproduced in Jaql; an example follows the table.

Table 3. Supported operators in Jaql
Transformer operator | Jaql operator
-A | -A
A - B | A - B
A ! B | A or B
A # B | A != B
A #< B | A >= B
A #> B | A <= B
A & B | A and B
A * B | A * B
A / B | double(A)/B
A : B | serialize(A)+serialize(B)
A ^ B | pow(A,B)
A + B | A + B
A < B | A < B
A <= B | A <= B
A <> B | A != B
A = B | A == B
A =< B | A <= B
A => B | A >= B
A > B | A > B
A >< B | A != B
A >= B | A >= B
A and B | A and B
A eq B | A == B
A ge B | A >= B
A gt B | A > B
A le B | A <= B
A lt B | A < B
A ne B | A != B
A or B | A or B
A**B | pow(A,B)
IF A THEN B ELSE C | if(A) B else C
not A | not A
not(A) | not(A)
S[N] | if(long(N)>=0) substring(S,0,abs(N)) else (strJoin(slice(strSplit(S,""),(strLen(S)-abs(N)+1),strLen(S)),""))
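
For example, a Transformer constraint such as lnk.C1 <> 0 And lnk.C2 >= 10 maps operator by operator into a Jaql filter step; the link and column names are hypothetical:

// Transformer constraint: lnk.C1 <> 0 And lnk.C2 >= 10
// Equivalent Jaql filter step, built from the operator mappings in Table 3:
-> filter $.C1 != 0 and $.C2 >= 10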

Limitations in Jaql

  • The First line is column names option in BDFS and SequentialFile stage is not supported.
  • For InfoSphere DataStage jobs where the SequentialFile stage can be pushed to a Jaql stage, relative paths for the file names are not supported. In Jaql, there is not a straightforward way to read files with relative paths from the local file system.
  • Jaql can display only three digits in the microseconds part of a timestamp.
  • The DecimalToDate conversion function can return wrong results when the scale of the decimal column is greater than zero.
  • Combined operator expressions such as If (lnkSeqFile.C1 And lnkSeqFile.C2) <> 0 Then Div(lnkSeqFile.C1,lnkSeqFile.C2) Else 100 will not work in Jaql. Expressions like these should be split into the form If ((lnkSeqFile.C1 > 0) And (lnkSeqFile.C2 > 0)) Then Div(lnkSeqFile.C1,lnkSeqFile.C2) Else 100, as the sketch below shows.
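
The split expression then maps cleanly through Tables 2 and 3. A minimal sketch of the resulting Jaql, composed from the Div(A,B) and operator mappings (column names are from the example above):

// Transformer (after splitting):
//   If ((lnkSeqFile.C1 > 0) And (lnkSeqFile.C2 > 0)) Then Div(lnkSeqFile.C1, lnkSeqFile.C2) Else 100
// Jaql equivalent:
if($.C1 > 0 and $.C2 > 0) (if($.C2 == 0) 0 else double($.C1)/$.C2) else 100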

Conclusion

This article provided a close look at how to configure the BDFS stage and the Jaql environment required to optimize and run a job. It also discussed the InfoSphere DataStage Balanced Optimization process, various optimization patterns, optimization options, and the functions and operators supported in Jaql.
