Skip to main content

By clicking Submit, you agree to the developerWorks terms of use.

The first time you sign into developerWorks, a profile is created for you. Select information in your developerWorks profile is displayed to the public, but you may edit the information at any time. Your first name, last name (unless you choose to hide them), and display name will accompany the content that you post.

All information submitted is secure.

  • Close [x]

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerworks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

By clicking Submit, you agree to the developerWorks terms of use.

All information submitted is secure.

  • Close [x]

Optimize InfoSphere DataStage jobs with Netezza Connector using InfoSphere DataStage Balanced Optimization

Ritika Maheshwari (ritikam@us.ibm.com), Software Engineer, IBM
Ritika Maheshwari has worked for IBM for the past 13 years. Initially, she worked in the EJB Query group for WebSphere, and now has been working on Balanced Optimization since its beta phase. She implemented the support for storing metadata in the optimized job. Ritika worked on the prototype to substitute the Prolog engine in Balanced Optimizer with a Java Rules Engine. She was specifically involved in coming up with a Java parser for the Java Rules engine, and has been actively involved in enhancing the Prolog engine functionality as well as the GUI side functionality involving dotnet.

Summary:  In this article, learn how to use a DataStage extension called IBM® InfoSphere® DataStage® Balanced Optimization to rewrite a DataStage job to enhance performance by sharing or redistributing the processing load among InfoSphere DataStage and the source and target databases. Special emphasis has been given to features unique to Netezza® Connector, including Action Column, Unique Key Column for update support, and Temporary Work Table. The article also discusses additional features available in the InfoSphere Information Server V8.7 release, including the Filter stage, and multiple output links of Copy, Filter, and Transformer stages.

Date:  26 Jan 2012
Level:  Intermediate PDF:  A4 and Letter (856KB | 24 pages)Get Adobe® Reader®
Also available in:   Chinese

Activity:  10883 views
Comments:  

Introduction

IBM Information Server is a revolutionary new software platform that helps organizations derive more value from the complex heterogeneous information spread across their systems. IBM InfoSphere DataStage is a critical component of the IBM Information Server. The parallel framework of IBM InfoSphere DataStage is also the foundation for IBM InfoSphere QualityStage, and IBM InfoSphere Information Analyzer components. IBM InfoSphere DataStage enables organizations to design data flows that extract information from multiple source systems, transform it in ways that make it more valuable, and then deliver it to one or more target databases or applications. An IBM InfoSphere DataStage job consists of individual stages linked together that describe the flow of data from a data source to a data target.

For job designs that use connectors to read or write data from data sources, you can use Balanced Optimization to give you greater control over the job. You design your job as normal, then use Balanced Optimization to redesign the job automatically to your stated preferences. This redesign process can maximize performance by minimizing the amount of input and output performed, and by balancing the processing against source, intermediate, and target environments. You can then examine the new optimized job design, and save it as a new job. Your root job design remains unchanged. The Balanced Optimization takes advantage of the power of the databases without requiring the job designer to become an expert in native SQL.

Optimization technique

The Balanced Optimizer optimizes the job based on the optimization rules specified in the optimizer. To understand the optimization technique, look at the job in Figure 1.


Figure 1. Job NZ_unshipped5 to be optimized
Job NZ_unshipped5 showing the various stages

When you click the Optimization button for the NZ_unshipped5 job, the Balanced Optimizer scans through the job to see what optimization options are applicable to the current job. The optimization page opens up with the following optimization options for the job, also shown in Figure 1.

  • Pushing Processing to database sources: NZ_unshipped5 contains a source database and number of processing stages like Join, Transformer and Lookup. This is a possible optimization option where all the processing can be done in the source database.
  • Pushing processing to database targets: NZ_unshipped5 contains a target database, therefore this is a possible optimization option where all the processing can be pushed to a target database.
  • Push data reduction processing into database targets: Data reduction processing involves reducing the volume of data before it is written to the target database. Aggregations and transformers that have constraint expressions set, are both examples of data reduction processing. In normal circumstances, you do not want to push this processing to a target, because it means that a greater volume of data is written to the target. However, pushing data reduction processing to the target database can be useful; for example, when the aggregation or filtering is performed on data that is already in the target database. NZ_unshipped5 has a Transformer stage which can potentially be pushed to target database. By default this option is not selected.
  • Push all Processing into the(target) database: Where NZ_unshipped5 data source and target are both on same Netezza database, and Netezza is capable of carrying out the processing done in transformer, join, and Lookup stage, by applying this optimization option the job could potentially be optimized into just two stages a dummy row generator stage feeding into the target database stage. The benefit of this optimization is that the data remains within the database, and there is no database I/O, and no actual data flows through the job. Complex SQL is generated to perform all the processing within the target database.

Figure 2. Optimization options page for NZ_unshipped5 job
The optimization options available for optimizing the NZ_unshipped5 job

Balanced Optimizer has a priority attached to each of these optimization options. In general pushing the processing to the target database has higher priority than pushing the processing to the source database. It always tries to push the data reduction to the data source, even if you select the push processing to database targets option. However if you choose the Push data reduction to target database option, then the data reduction stages like Aggregator and Transformer are pushed to the target database.

In order to fully optimize the job, the Balanced Optimizer makes several iterations through the job, identifying the subgroup of stages or pattern within the job on which the optimization options are applied in the order of priority, until no further pattern can be identified within the job which can potentially be optimized. All these iterations are in memory and only the fully optimized job is displayed on the GUI.

Optimization iterations for NZ_Unshipped5

The sequence in which optimization patterns are identified and the optimization options applied to them, and the resultant intermediate in memory representation of the optimized jobs are depicted in figures 3 through 14.

The first sequence of stages to be optimized using PushProcessingToTarget option are depicted in Figure 3.


Figure 3. First pattern to be optimized with option PushProcessingToTarget
The first sequence of stages that will be optimized in NZ_unshipped5 job using PushProcessingToTarget Option

Figure 4 shows the details of the Lookup stage Lookup_salesperson.


Figure 4. Input link - output link mapping in the Lookup stage Lookup_salesperson
The mapping from Input link to Output Link in the Lookup stage Lookup_salesperson

After the job is optimized using the PushProcessingToTarget option, the processing of the Lookup stage Lookup_salesperson is pushed to the database, and therefore the Lookup_salesperson stage is removed. The in-memory representation of the job after the first pattern is optimized is shown in Figure 5.


Figure 5. In-memory representation of the job after optimizing the first pattern
In memory representation of the job after optimizing the first pattern using the PushProcessingToTarget option

As shown previously in Figure 5, the SQL in the Netezza_Connector_1 stage includes the processing of the Lookup_salesperson stage, which was removed during optimization. A condensed form of the SQL statement in the Netezza_Connector_1 stage is depicted in Listing 1.


Listing 1. SQL represented by Netezza_Connector_1 shown in Figure 5
  
insert into Late_customer_contacts (DaysLate, CompanyName, ContactName,
Phone,OrderID, RequiredDate, LastName, FirstName, TitleOfCourtesy)  
..
from  
..
from
(select ORCHESTRATE.CompanyName as CompanyName, ORCHESTRATE.ContactName 
as ContactName, ORCHESTRATE.Phone as Phone, ORCHESTRATE.OrderID as 
OrderID,ORCHESTRATE.DaysLate as DaysLate, ORCHESTRATE.RequiredDate as 
RequiredDate, ORCHESTRATE.EmployeeID as EmployeeID 
from ORCHESTRATE ORCHESTRATE) DSLink12 
left outer join  
(select DSLink16RemDup1.EmployeeID as EmployeeID,DSLink16RemDup1.LastName
as LastName, DSLink16RemDup1.FirstName as FirstName, 
DSLink16RemDup1.TitleOfCourtesy as TitleOfCourtesy 
from  
..
from 
..
(select DSLink16RemDup2.EmployeeID as EmployeeID, DSLink16RemDup2.LastName
as LastName,DSLink16RemDup2.FirstName as FirstName, 
DSLink16RemDup2.TitleOfCourtesy as TitleOfCourtesy,  
row_number() over (partition by DSLink16RemDup2.EmployeeID order by 
DSLink16RemDup2.EmployeeID)
as RemdupPivotColumn3 
from  
(select employees.EmployeeID as EmployeeID, employees.LastName 
as LastName,employees.FirstName as FirstName,
employees.TitleOfCourtesy as TitleOfCourtesy from employees employees)
DSLink16RemDup2) RemdupAliasB3 
where RemdupAliasB3.RemdupPivotColumn3 = 1) DSLink16RemDup1) DSLink16 
on DSLink12.EmployeeID = DSLink16.EmployeeID) DSLink15
        

Optimization now iterates through the optimized job shown previously in Figure 5 to identify the second pattern that can be optimized. Figure 6 shows the second pattern identified by the optimizer to optimize using the PushProcessingToTarget option.


Figure 6. Second pattern to be optimized with option PushProcessingToTarget
The second sequence of stages that will be optimized in NZ_unshipped5 job using PushProcessingToTarget Option

Figure 7 shows the details of the transformations in the Days_late Transformer stage.


Figure 7. Details of the Transformer stage Days_late
The details of the transformations done by Days_late transformer stage

As a result of optimizing the second pattern using the PushProcessingToTarget option, the processing of Days_late Transformer is pushed into Netezza_Connector_1, and Days_late stage is removed from the optimized job. The in-memory representation of the job after optimizing the second pattern is shown in Figure 8.


Figure 8. In memory representation of the job after optimizing the second pattern
In memory representation of the job after optimizing the second pattern using the PushProcessingToTarget option

As shown previously in Figure 8, the SQL in the Netezza_Connector_1 stage includes the transformations of the Days_late stage which was removed during optimization. A condensed form of the SQL statement in the Netezza_Connector_1 stage is depicted in Listing 2.


Listing 2. SQL represented by Netezza_Connector_1 as shown in Figure 8
  
insert into Late_customer_contacts (DaysLate, CompanyName,...
from  
...
from  
....
from  
(select ORCHESTRATE.CompanyName as CompanyName, ORCHESTRATE.ContactName 
as ContactName, ORCHESTRATE.Phone as Phone, ORCHESTRATE.OrderID as 
OrderID,CURRENT_DATE - TO_DATE(TO_CHAR(ORCHESTRATE.OrderDate,
'YYYY-MM-DD'),'YYYY-MM-DD') as DaysLate, ORCHESTRATE.RequiredDate as
RequiredDate,ORCHESTRATE.EmployeeID as EmployeeID 
from ORCHESTRATE) DSLink12) DSLink12 
left outer join  
from  
....
from 
.....
from  
......
         

The optimizer loops through the partially optimized job again to find the third pattern that can be optimized. Figure 9 represents the third pattern that would be optimized using the PushProcessingToDBSource option.


Figure 9. Third pattern to be optimized with option PushProcessingToDBSource
The third sequence of stages that will be optimized in NZ_unshipped5 job using PushProcessingToDBSource Option

Figure 10 shows the details of the transformations in the Null_ship_dates Transformer stage.


Figure 10. Details of the Transformer stage Null_ship_dates
The details of the transformations done by Null_ship_dates transformer stage

As a result of optimizing the third pattern using the PushProcessingToDBSource option, the processing of Null_ship_dates Transformer is pushed into Orders connector, and Null_ship_dates stage is removed from the optimized job. The in-memory representation of the job after optimizing the third pattern is shown in Figure 11.


Figure 11. In-memory representation of the job after optimizing the third pattern
In memory representation of the job after optimizing the third pattern using the PushProcessingToDBSource option

As shown previously in Figure 11, the SQL in the DSLink10 includes the transformations of the Null_ship_dates stage that was removed during optimization. The SQL statement in the DSLink10 is depicted in Listing 3.


Listing 3. SQL represented by DSLink10 as shown in Figure 11
                
                select DSLink5.OrderID as OrderID, DSLink5.CustomerID as CustomerID, 
                DSLink5.OrderDate as OrderDate, DSLink5.RequiredDate as RequiredDate,
                DSLink5.EmployeeID as EmployeeID  from  
                (select orders.OrderID as OrderID, orders.CustomerID as CustomerID, 
                orders.ShippedDate as ShippedDate, orders.OrderDate as OrderDate, 
                orders.RequiredDate as RequiredDate, orders.EmployeeID as EmployeeID 
                from orders orders) DSLink5 
                where DSLink5.ShippedDate is null  
            

The optimizer again loops through the partially optimized job, as shown previously in Figure 11, to find the fourth pattern that can be optimized. Figure 12 represents the fourth pattern that would be optimized using the PushProcessingToTarget option.


Figure 12. Fourth pattern to be optimized with option PushProcessingToTarget
The fourth sequence of stages that will be optimized in NZ_unshipped5 job using PushProcessingToTarget Option

As a result of optimizing the fourth pattern using the PushProcessingToTarget option, the processing of Join_3 stage is pushed into Netezza_Connector_1, and Join_3 stage is removed from the optimized job. The in-memory representation of the job after optimizing the fourth pattern is shown in Figure 13.


Figure 13. In memory representation of the job after optimizing the fourth pattern
In memory representation of the job after optimizing the fourth pattern using the PushProcessingToTarget option

As shown previously in Figure 13, the SQL in the Netezza_Connectr_1 includes the processing of the join stage Join_3, which was removed during optimization. The SQL statement in the Netezza_Connector_1 is depicted in the Listing 4.


Listing 4. SQL represented by Netezza_Connector_1 in Figure 13
                
                insert into Late_customer_contacts (DaysLate, CompanyName, ..  
                ..
                from  
                ..
                from  
                ...
                from  
                ...
                from  
                ...
                from  
                (select ORCHESTRATE.CustomerID as CustomerID, ORCHESTRATE.CompanyName
                as CompanyName, ORCHESTRATE.ContactName as ContactName, ORCHESTRATE.Phone
                as Phone from ORCHESTRATE ORCHESTRATE) DSLink4 
                inner join  
                (select DSLink5.OrderID as OrderID, DSLink5.CustomerID as CustomerID,
                DSLink5.OrderDate as OrderDate, DSLink5.RequiredDate as RequiredDate, 
                DSLink5.EmployeeID as EmployeeID 
                from  
                (select orders.OrderID as OrderID, orders.CustomerID as CustomerID,
                orders.ShippedDate as ShippedDate, orders.OrderDate as OrderDate,
                orders.RequiredDate as RequiredDate, orders.EmployeeID as EmployeeID 
                from orders orders) DSLink5 
                where DSLink5.ShippedDate is null ) DSLink10 
                on DSLink4.CustomerID = DSLink10.CustomerID 
                order by DSLink4.CustomerID asc) Unshipped_Orders) DSLink12) DSLink12 
                left outer join  
                ...
                from  
                ..	
                from 
                ..
                from  
                
            

The optimizer again loops through the partially optimized job, shown previously in Figure 13, to find the fifth pattern that can be optimized. Figure 14 represents the fifth pattern that would be optimized using the PushAllProcessingToDBTarget option.


Figure 14. Fifth pattern to be optimized with option PushAllProcessingToDBTarget
The fifth sequence of stages that will be optimized in NZ_unshipped5 job using PushAllProcessingToDBTarget Option

As a result of optimizing the fifth pattern using the PushAllProcessingToDBTarget option, the processing of Customers stage is pushed into Netezza_Connector_1, and Customers stage is removed from the optimized job. The in-memory representation of the job after optimizing the fifth pattern is shown in Figure 15.


Figure 15. Completely optimized job as it appears on Balanced Optimizer page
Completely optimized job as it appears in the optimizer window

SQL Statement in the Connector of optimized job

This is the complete insert statement that appears in the Netezza Connector of the optimized job. Figure 16 shows the original un-optimized job again.


Figure 16. Un-optimized job NZ_unshipped5
Job NZ_unshipped5 showing the various stages

Listing 5 shows the SQL statement in Netezza Connector of the fully optimized job. The various segments in the SQL statement are de-marked by stage names to which they correspond. As various stages got removed in the process of optimization, these SQL segments were added to the SQL of the optimized job.


Listing 5. SQL represented by Netezza_Connector_1 Fully optimized job as shown in Figure 15
                    
 [Begin:Sales_staff/Lookup_salesperson] 
    insert into Late_customer_contacts (DaysLate, CompanyName,
    ContactName, Phone,OrderID, RequiredDate, LastName, FirstName,
    TitleOfCourtesy) select CAST(DSLink15.DaysLate AS INTEGER) as 
    DaysLate,CAST(DSLink15.CompanyName AS VARCHAR(40)) as CompanyName,
    CAST (DSLink15.ContactName AS VARCHAR(30)) as ContactName, 
    CAST(DSLink15.Phone AS VARCHAR(24)) as Phone, CAST(DSLink15.
    OrderID AS INTEGER) as OrderID, CAST(DSLink15.RequiredDate 
    AS TIMESTAMP) as RequiredDate, CAST(DSLink15.LastName AS 
    VARCHAR(20)) as LastName, CAST(DSLink15.FirstName AS VARCHAR(20))
    as FirstName, CAST(DSLink15.TitleOfCourtesy AS VARCHAR(20)) as 
    TitleOfCourtesy 
    from  
    (select DSLink12.DaysLate as DaysLate, DSLink12.CompanyName as 
    CompanyName,DSLink12.ContactName as ContactName,
    DSLink12.Phone as Phone, DSLink12.OrderID as OrderID, DSLink12.
    RequiredDate as RequiredDate, DSLink16. LastName as LastName, 
    DSLink16.FirstName as FirstName, DSLink16.TitleOfCourtesy as 
    TitleOfCourtesy [End:Sales_staff/Lookup_salesperson]
    [Begin:Days_late]from  
    (select DSLink12.CompanyName as CompanyName, DSLink12.ContactName as 
    ContactName, DSLink12.Phone as Phone, DSLink12.OrderID as OrderID, 
    DSLink12.DaysLate as DaysLate, DSLink12.RequiredDate as RequiredDate,
    DSLink12.EmployeeID as EmployeeID 
    from  
    (select Unshipped_Orders.CompanyName as CompanyName, Unshipped_Orders.
    ContactName as ContactName, Unshipped_Orders.Phone as Phone, 
    Unshipped_Orders.OrderID as OrderID, CURRENT_DATE - TO_DATE(TO_CHAR
    (Unshipped_Orders.OrderDate,'YYYY-MM-DD'),'YYYY-MM-DD') as DaysLate, 
    Unshipped_Orders.RequiredDate as RequiredDate, 
    Unshipped_Orders.EmployeeID as EmployeeID
    [End:Days_late][Begin:Orders/Join_3] 
    from  
    (select DSLink4.CompanyName as CompanyName, DSLink4.ContactName as 
    ContactName, DSLink4.Phone as Phone, DSLink10.OrderID as OrderID, 
    DSLink10.OrderDate as OrderDate, DSLink10.RequiredDate as 
    RequiredDate, DSLink10.EmployeeID as EmployeeID 
    from  
    (select DSLink4.CustomerID as CustomerID, DSLink4.CompanyName as 
    CompanyName, DSLink4.ContactName as ContactName, DSLink4.Phone as 
    Phone  [End:Orders/Join_3] 
    [Begin:Customers]
    from  
    (select customers.CustomerID as CustomerID, customers.CompanyName as
    CompanyName, customers.ContactName as ContactName, customers.Phone 
    as Phone from customers customers)
    [End:Customers]
    [Begin:Orders/Join_3]
    DSLink4) DSLink4 
    inner join 
    [End:Orders/Join_3]
    [Begin:Null_ship_dates]
    (select DSLink5.OrderID as OrderID, DSLink5.CustomerID as CustomerID, 
    DSLink5.OrderDate as OrderDate, DSLink5.RequiredDate as RequiredDate,
    DSLink5.EmployeeID as EmployeeID 
    from  
    (select orders.OrderID as OrderID, orders.CustomerID as CustomerID, 
    orders.ShippedDate as ShippedDate, orders.OrderDate as OrderDate, 
    orders.RequiredDate as RequiredDate, orders.EmployeeID as EmployeeID 
    from orders orders) DSLink5 
    where DSLink5.ShippedDate is null ) DSLink10 
    [End:Null_ship_dates]
    [Begin:Orders/Join_3]
    on DSLink4.CustomerID = DSLink10.CustomerID 
    order by DSLink4.CustomerID asc)
    [End:Orders/Join_3]
    [Begin:Days_late]
    Unshipped_Orders) DSLink12)
    [End:Days_late]
     [Begin:Sales_staff/Lookup_salesperson]DSLink12 
    left outer join  
    (select DSLink16RemDup1.EmployeeID as EmployeeID, DSLink16RemDup1.
    LastName as LastName, DSLink16RemDup1.FirstName as FirstName, 
    DSLink16RemDup1.TitleOfCourtesy as TitleOfCourtesy 
    from  
    (select RemdupAliasB9.EmployeeID as EmployeeID,RemdupAliasB9.LastName
    as LastName, RemdupAliasB9.FirstName as FirstName, 
    RemdupAliasB9.TitleOfCourtesy as TitleOfCourtesy 
    from 
    (select DSLink16RemDup2.EmployeeID as EmployeeID, DSLink16RemDup2.
    LastName as LastName, DSLink16RemDup2.FirstName as FirstName, 
    DSLink16RemDup2.TitleOfCourtesy as TitleOfCourtesy,  
    row_number() over (partition by DSLink16RemDup2.EmployeeID order by
    DSLink16RemDup2.EmployeeID) as RemdupPivotColumn9 
    from  
    (select employees.EmployeeID as EmployeeID, employees.LastName as 
    LastName, employees.FirstName as FirstName, employees.TitleOfCourtesy
    as TitleOfCourtesy 
    from employees employees) DSLink16RemDup2) RemdupAliasB9 
    where RemdupAliasB9.RemdupPivotColumn9 = 1) DSLink16RemDup1) DSLink16 
    on DSLink12.EmployeeID = DSLink16.EmployeeID) DSLink15
     [End:Sales_staff/Lookup_salesperson]
        


Special features in Netezza Connector

There are some features which are currently available only in Netezza Connector, such as action column, unique key column for update support, and temporary work tables.

Action column

When the Action Column write mode is used, you can designate a column in the target Netezza Connector as the action column.


Table 1. Action Column support

Action valueSQL operation
IInsert
UUpdate
DDelete

When a processing stage or a join stage is pushed into a target Netezza Connector configured with the Action Column write mode, a sequence of DELETE/UPDATE/INSERT statements is generated in the After-SQL field in the target connector. These SQL statements combine the processing logic or the join logic with the original DELETE/UPDATE/INSERT operations implied by the Action Column write mode. Figure 17 shows a job were the Action Column support is specified in the target column connector.


Figure 17. Job with Action Column support in the target Netezza Connector
Job containing a Netezza Connector with Action Column Support

Figure 18 shows how Action Column support is specified in the target Netezza Connector page. The write mode is set to Action Column, and one of the columns of the connector is designated as the Action Column.


Figure 18. Target Netezza Connector page showing Action Column support
Target Netezza Connector page showing support for Action Column by setting the write mode to action column and one of the connectors columns as Action Column

Figure 19 shows the columns page of the target Netezza Connector where the column which was specified as Action Column is set to be of type char.


Figure 19. Target Netezza Connector column page showing action column support
Target Netezza Connector Column page showing support for Action Column by setting the  action column to type char

Figure 20 shows the Transformer stage Action_Flags of the Action Column job. It shows how the Action_Flag column is mapped to a stage variable.


Figure 20. Transformer page of the Action_Flags Transformer in the action column job
Transformer page of the Action_Flags transformer in the Action Column Job showing the action column mapped to a StageVariable

Listing 6 specifies the derivation specified for the Action Column StageVariable in the Action_Flags Transformer.


Listing 6. Derivation for the Action Column StageVariable in the transformer Action_Flags
                
if DSLink4.EmployeeID < 2 then 'I' 
else if DSLink4.EmployeeID < 4 then 'U'
else if DSLink4.EmployeeID < 6 then 'D'
else if DSLink.EmployeeID < 8 then 'R'
else 'M'
                

Figure 21 shows a fully optimized Action Column job. Listing 7 shows a condensed snippet of SQL in the Netezza Connector of the optimized job.


Figure 21. Optimized Action Column job
Fully optimized Action Column Job where all the processing is pushed to the target Netezza Connector

Listing 7. SQL in the Netezza Connector of the optimized Action Column job
                
delete from founding_employees 
where  
exists (select 1 
from  
.....
inner join  
.....
from 
.... 
where (DSLink5.action_flag in ('D', 'R')) and (founding_employees
.EmployeeID = DSLink5.EmployeeID)) ; 
update founding_employees 
set 
from  
.....
inner join  
.......
where DSLink5.action_flag in ('U', 'M')) DSLink5 
where founding_employees.EmployeeID 
= DSLink5.EmployeeID; 
insert into founding_employees 
from  
....
inner join  
.....
left outer join founding_employees founding_employees_action_insert 
..
where (DSLink5.action_flag = 'M') and (founding_employees_action_insert
.EmployeeID is null ) 
union 
....
inner join  
....
where DSLink5.action_flag in ('I', 'R'))
                

Unique key column for update support

When the Use Unique Key Column option is set to Yes in the target Netezza Connector, BalOp generates additional SQL expressions in the target SQL update statement. This ensures that only one row from each group of rows with the same value in the unique key column is used in the update statement.

Essentially a clause like the one shown in Listing 8 gets appended to the update SQL of the target connector.


Listing 8. SQL clause for unique key column for update support
                
SELECT < keys > Max(< unique_key > AS unique_key_max
    FROM
	GROUP BY < Keys >)Link2
	ON < key-expressions > AND Link1.unique_key = Link2.unique_key_max
				

Figure 22 shows how the unique keys for update column support are specified in the Netezza Connector page by setting the Use Unique Key property to Yes, and specifying the unique key column and update column.


Figure 22. Target Netezza Connector page showing unique keys for update column support
Target Netezza Connector page showing support for unique keys for update columns by setting  the Use unique key property to yes and specifying the unique key column and update column.

Listing 9 shows the snippet of SQL that appears in the target Netezza Connector of an optimized job with unique keys for update column support.


Listing 9. SQL in the target Netezza Connector of an optimized job with unique keys for update column support
                
               update customersusa 
set Phone=DSLink2_USA_Customers.Phone  
from  
(select  distinct DSLink2_USA_Customers.CompanyName as CompanyName,
 DSLink2_USA_Customers.ContactName as ContactName, DSLink2_USA_Customers.Phone as Phone,
 DSLink2_USA_Customers.CustomerID as CustomerID, DSLink2_USA_Customers.Country as Country 
from  
(select DSLink1_All_Customers.CompanyName as CompanyName, 
DSLink1_All_Customers.ContactName as ContactName, DSLink1_All_Customers.Phone as Phone, 
DSLink1_All_Customers.CustomerID as CustomerID, 'United States' as Country 
from  
(select CustomerID as CustomerID, CompanyName as CompanyName, ContactName as ContactName,
 Phone as Phone 
from customers customers) DSLink1_All_Customers) DSLink2_USA_Customers) 
DSLink2_USA_Customers 
inner join  
(select DSLink2_USA_Customers.CustomerID as CustomerID, max(ContactName) as 
ContactNameUniqueKeyMax2 
from  
(select DSLink1_All_Customers.CompanyName as CompanyName, 
DSLink1_All_Customers.ContactName as ContactName, DSLink1_All_Customers.Phone as Phone,
 DSLink1_All_Customers.CustomerID as CustomerID, 'United States' as Country 
from  
(select CustomerID as CustomerID, CompanyName as CompanyName, ContactName as ContactName,
 Phone as Phone 
from customers customers) DSLink1_All_Customers) DSLink2_USA_Customers 
group by DSLink2_USA_Customers.CustomerID) DSLink2_USA_CustomersUnique_1_ 
on (DSLink2_USA_Customers.CustomerID = DSLink2_USA_CustomersUnique_1_.CustomerID) and 
(DSLink2_USA_Customers.ContactName = 
DSLink2_USA_CustomersUnique_1_.ContactNameUniqueKeyMax2) 
where customersusa.CustomerID = DSLink2_USA_Customers.CustomerID
				

Temporary work table support and staging table management

Netezza Connector uses Temporary Work Tables (TWT) as a staging area for rows received from the input link of a target Netezza Connector. The TWT tables can be managed in Automatic and Existing modes. In the Automatic mode, the TWT tables are automatically created by the Netezza Connector at run time. In the Existing mode, the TWT name is explicitly specified in the TWT Name property in the connector. InfoSphere DataStage Balanced Optimization supports both Automatic and Existing modes in different situations.

Balanced Optimizer has a property called Staging table name on the Advanced Options page. For the other connectors which do not support TWT, staging table works as TWT. In case of Netezza, where both TWT and staging table property are available, the final temporary work table name is based on following criteria shown in Table 2.


Table 2. Temporary Work Table name in Netezza

TWT modeOptimization optionFinal temporary table name
ExistingPush all processing into the databaseTWT Name is not modified unless the BalOp staging table name is set to a different table name. If the BalOp staging table name is set to a different table name, the staging table is used as the new TWT table in the optimized job.
Automatic Push all processing into the databaseEvery property in the TWT table properties area in the target connector is copied from the original job to the optimized job regardless of whether a BalOp staging table name is set or not.
ExistingPush processing to database targets option, and the Push data reduction processing to database targets.TWT Name is copied from the original job to the optimized job unless the staging table name property is set to a different table name.
AutomaticPush processing to database targets option and the Push data reduction processing to database targets.If the Separate connection for TWT option is not specified in the target connector in the original job. If a staging table name is not specified, then a staging table name is automatically generated and used as the TWT table in the optimized job. If a staging table name is specified, then the staging table is used as the TWT table. If the Separate connection for TWT option is specified in the target connector in the original job, a staging table name must be specified in the staging table name property.

Multiple output links

BalOp version 8.7 supports the optimization of Filter, Copy, and Transform stages with multiple output links, as shown in Figure 23. This feature was not available in prior releases.


Figure 23. Un-optimized and optimized Netezza filter job with multiple output links
UnOptimized Netezza job with filter stage and multiple output links.Optimized version of teh job with multiple row generators going to multiple Netezza Connectors

Conclusion

This article provided a close look at how optimization patterns are identified iteratively in the DataStage job, and optimized by pushing the work done by various stages into the source or target connector. The article also highlighted features unique to Netezza Connector, such as Action Column support, which results in the generation of a sequence of DELETE/UPDATE/INSERT statements in the After-SQL field in the target connector. This includes Temporary Work Table support providing a staging area for rows received from the input link of a target Netezza Connector, and Unique Key Column for update support which ensures that only one row from each group of rows with the same value in the unique key column is used in the update statement. In summary, Balanced Optimization is a powerful tool to enhance the performance of the DataStage job.


Resources

Learn

Get products and technologies

  • Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.

Discuss

About the author

Ritika Maheshwari has worked for IBM for the past 13 years. Initially, she worked in the EJB Query group for WebSphere, and now has been working on Balanced Optimization since its beta phase. She implemented the support for storing metadata in the optimized job. Ritika worked on the prototype to substitute the Prolog engine in Balanced Optimizer with a Java Rules Engine. She was specifically involved in coming up with a Java parser for the Java Rules engine, and has been actively involved in enhancing the Prolog engine functionality as well as the GUI side functionality involving dotnet.

Report abuse help

Report abuse

Thank you. This entry has been flagged for moderator attention.


Report abuse help

Report abuse

Report abuse submission failed. Please try again later.


developerWorks: Sign in


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Select information in your developerWorks profile is displayed to the public, but you may edit the information at any time. Your first name, last name (unless you choose to hide them), and display name will accompany the content that you post.

Choose your display name

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


Rate this article

Comments

Help: Update or add to My dW interests

What's this?

This little timesaver lets you update your My developerWorks profile with just one click! The general subject of this content (AIX and UNIX, Information Management, Lotus, Rational, Tivoli, WebSphere, Java, Linux, Open source, SOA and Web services, Web development, or XML) will be added to the interests section of your profile, if it's not there already. You only need to be logged in to My developerWorks.

And what's the point of adding your interests to your profile? That's how you find other users with the same interests as yours, and see what they're reading and contributing to the community. Your interests also help us recommend relevant developerWorks content to you.

View your My developerWorks profile

Return from help

Help: Remove from My dW interests

What's this?

Removing this interest does not alter your profile, but rather removes this piece of content from a list of all content for which you've indicated interest. In a future enhancement to My developerWorks, you'll be able to see a record of that content.

View your My developerWorks profile

Return from help

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=789065
ArticleTitle=Optimize InfoSphere DataStage jobs with Netezza Connector using InfoSphere DataStage Balanced Optimization
publish-date=01262012

Tags

Help
Use the search field to find all types of content in My developerWorks with that tag.

Use the slider bar to see more or fewer tags.

For articles in technology zones (such as Java technology, Linux, Open source, XML), Popular tags shows the top tags for all technology zones. For articles in product zones (such as Info Mgmt, Rational, WebSphere), Popular tags shows the top tags for just that product zone.

For articles in technology zones (such as Java technology, Linux, Open source, XML), My tags shows your tags for all technology zones. For articles in product zones (such as Info Mgmt, Rational, WebSphere), My tags shows your tags for just that product zone.

Use the search field to find all types of content in My developerWorks with that tag. Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere). My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Try IBM PureSystems. No charge.

Special offers