Tuning the Oracle Connector performance in IBM InfoSphere DataStage, Part 2: Optimization of bulk load operation and considerations for reject links and data types

The Oracle® Connector is a connectivity component in IBM® InfoSphere® Information Server. It is utilized by IBM InfoSphere DataStage® and other products in the Information Server suite to perform extract, lookup, load, and metadata import operations on Oracle databases. This article is Part 2 of a series of two articles that provide a set of guidelines for tuning the Oracle Connector stages in DataStage parallel jobs with the goal of maximizing their performance.

Predrag Maksimovic (predragm@us.ibm.com), Senior Software Engineer, IBM

Predrag Maksimovic is a member of the InfoSphere Connectivity team at IBM in Boca Raton, Florida. He has 15 years of experience in the design and development of connectivity components and tools for enterprise data integration. His areas of interest include relational databases, enterprise messaging systems, and big data technologies.



04 April 2013


Introduction

In the Part 1 article in this series we provided a brief overview of the IBM InfoSphere DataStage product in the IBM Information Server suite of products and explained the role of the Oracle Connector stage in DataStage jobs. We explained the difference between conductor and player processes for the stage. We introduced the concepts of parallelism and partitioning and explained how their meaning differs between the DataStage environment and the Oracle database environment. We covered in detail the performance tuning and troubleshooting of the Oracle Connector stage configured to perform SQL statements on the database, including query statements used to fetch and lookup rows in the database as well as DML and PL/SQL statements used to insert, update and delete rows in the database.

In the Part 2 article of the series we will continue our coverage of the Oracle Connector performance tuning. First we will focus on performance aspects of the bulk load mode in the connector. In this mode the connector is used to load data to the database by utilizing the Oracle direct path interface. We will cover the operation of loading the records to the database as well as the operations that often need to be performed on the table indexes and constraints before and after loading the records. We will then present a few guidelines regarding the use of reject links with the connector. We will show how reject links can affect the connector performance in some cases and present some alternative approaches for your consideration. We will end the article by presenting you with a number of performance tuning strategies oriented towards the handling of different Oracle data types in the connector.

As was the case with the Part 1 article, the Part 2 article assumes the use of DataStage parallel jobs and Information Server Version 9.1, although some of the presented concepts would apply also to DataStage server jobs as well as the earlier Information Server versions.


Bulk load

When the Oracle Connector stage is configured to run in bulk load mode it utilizes the Oracle direct path interface to write data to the target database table. It receives records from the input link and passes them to the Oracle database, which formats them into blocks and appends the blocks to the target table, as opposed to storing them in the available free space in the existing blocks. To configure the stage to run in bulk load mode, set the Write mode property of the stage to the value Bulk load. In this section, we will look at the connector properties and environment variables that play an important role in tuning bulk load performance in the stage, as shown in Figure 1.

Figure 1. Connector properties that affect bulk load performance

Running the stage in Bulk load write mode typically results in better performance than when running it in Insert write mode. However, Bulk load write mode imposes several restrictions that are not present when the Insert write mode is used, especially when running the stage in parallel execution mode with Allow concurrent load sessions connector property set to value Yes. These restrictions are primarily related to the handling of triggers, indexes and constraints defined on the target table.

The information in the remainder of this section is intended to help you answer the following questions:

  • Should I configure the stage to run in Bulk load or Insert write mode?
  • If I choose to run the stage in Bulk load write mode, how should I configure the remaining settings in the stage in order to achieve optimal performance?

Data volume considerations

When deciding whether to configure the stage in Bulk load or Insert write mode, the critical aspect to consider is the data volume, in other words, how many records on average the stage writes to the target table in a single job invocation.

If the data volume is relatively small, the preferred option is the Insert write mode because of the greater flexibility that this write mode provides with respect to maintaining indexes, constraints and triggers of the target table. With the Insert write mode all the table indexes, constraints and triggers are enforced during the job execution and remain valid after the job completes. For example, if the stage running in Insert write mode tries to insert a record that violates the primary key constraint on the target table, the stage can be configured to handle this scenario and continue processing the remaining records in a number of ways that are not readily available when the stage is running in Bulk load write mode:

  • It can have a reject link defined and configured to process records that have violated constraints. Each input record that violates table constraints will be routed to the reject link, with optionally included error code and message text explaining why the record was rejected. The existing row in the table for which the primary key was violated will remain in the table.
  • It can be configured to run in Insert then update write mode in which case the existing row in the target table will be updated with the values from the record that violated the constraint.
  • It can be configured to run in Insert new rows only write mode, in which case the record that violated the constraint will be skipped (ignored).
  • It can be configured with the INSERT statement that contains LOG ERRORS INTO clause which will result in Oracle redirecting records that violated the constraint along with the error information to the specified error logging table.
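
To illustrate the last option, an error logging table can be created with the DBMS_ERRLOG package and then referenced from the INSERT statement. The table, column and batch-tag names below are hypothetical placeholders:

```sql
-- Create an error logging table for the (hypothetical) target table.
-- By default the generated log table is named ERR$_ followed by the
-- target table name.
BEGIN
  DBMS_ERRLOG.CREATE_ERROR_LOG(dml_table_name => 'TARGET_TABLE');
END;
/

-- Rows that violate constraints are diverted to the error logging table
-- instead of failing the whole statement.
INSERT INTO TARGET_TABLE (ID, NAME)
SELECT ID, NAME FROM STAGING_TABLE
LOG ERRORS INTO ERR$_TARGET_TABLE ('load batch 1') REJECT LIMIT UNLIMITED;
```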

If the data volume is relatively large and the primary goal is to append the data to the target table as quickly as possible then the Bulk load mode should be considered.

The question is what should be considered a small data volume as opposed to a large data volume. This decision depends, among other things, on how much performance you are willing to sacrifice when you choose the Insert write mode, which in turn depends on the size of the batch window that is available to push the data to the database and on the available system resources. As a rule of thumb, thousands of records would likely be considered a small data volume, while millions of records would likely be considered a large data volume.

If the table does not have any indexes, constraints and triggers defined then Bulk load write mode would make a good candidate irrespective of the data volume involved because many of the restrictions imposed by this write mode would not apply any more. This includes the scenario where the table does in fact have indexes, constraints and triggers defined but you have procedures in place to disable or drop them prior to loading the data and then enable or rebuild them after the load is complete.

A simple test like the following can help you decide which write mode to choose:

  1. Design a test job with Oracle Connector stage that has an input link. In the test job provide records similar in volume and structure to the records that will be processed by the actual job in production.
  2. Ensure that the target database table used in the test job has the same definition as the table that will be used in production. If the production job performs incremental data loads to an existing table in the database, ensure that the table used for the test job is initially populated with a similar number of rows as the table that will be used in production.
  3. Run the job a few times with the connector stage configured in Insert write mode and capture the average time it took the job to complete.
  4. Manually disable any indexes, constraints and triggers on the target table, repeat the job a few times with the connector stage configured in Bulk load write mode, and capture the average time it took the job to complete.
  5. Compare the results from the previous two sets of job runs. If the numbers are similar that would be an indication that the Insert write mode is sufficient. If bulk load completes much faster, determine which additional operations need to be performed on the table to restore the constraints and indexes. Issue these operations from a tool like SQL*Plus to see how much time they take to complete. If that time combined with the time to bulk load the data is still considerably shorter than the time that it took to insert the data, then the bulk load mode is likely to be a better choice.
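
For step 4, statements like the following can be used to disable the table objects manually. The object names are hypothetical; substitute the names of your own constraints, triggers and indexes:

```sql
-- Disable the primary key constraint (depending on how the underlying
-- unique index was created, this also drops or disables that index).
ALTER TABLE TARGET_TABLE DISABLE CONSTRAINT PK_TARGET_TABLE;

-- Disable all triggers defined on the table.
ALTER TABLE TARGET_TABLE DISABLE ALL TRIGGERS;

-- Mark a non-unique index unusable so it is not maintained during the load.
ALTER INDEX IDX_TARGET_TABLE_NAME UNUSABLE;
```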

Handling triggers

If the target table has triggers that need to fire for every row written to the table, then the bulk load mode should not be used because database triggers are not supported in this mode. To bulk load data to a table that has triggers, the triggers need to be disabled prior to the load. The connector provides an option to disable the triggers automatically before the load and enable them after the load. Once the triggers are re-enabled they will fire only for the rows inserted into the table after that point. They will not fire for the rows that were loaded while the triggers were disabled.
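
The statements the connector issues for this purpose are ordinary DDL. If you prefer to manage the triggers yourself, for example from the Before SQL statement and After SQL statement properties, statements like these (with hypothetical trigger and table names) accomplish the same thing:

```sql
-- Before the load: disable a single trigger ...
ALTER TRIGGER TRG_TARGET_TABLE_AUDIT DISABLE;
-- ... or all triggers on the table at once.
ALTER TABLE TARGET_TABLE DISABLE ALL TRIGGERS;

-- After the load: re-enable them. The triggers will not fire for the
-- rows that were loaded while they were disabled.
ALTER TRIGGER TRG_TARGET_TABLE_AUDIT ENABLE;
ALTER TABLE TARGET_TABLE ENABLE ALL TRIGGERS;
```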

Handling indexes

If the table has indexes, you can configure the stage to disable all the indexes when performing the load. To do this set the Index maintenance option connector property to the value Skip all. This is typically the best option to choose as it allows you to run the stage without restrictions in parallel execution mode and with any Partition type value selected on the Partitioning page for the input link. However, all indexes are marked UNUSABLE after the load completes and need to be rebuilt. This is where the data volume and the number of indexes play a critical role. If you are incrementally loading a relatively small number of records compared to the number of records already present in the table, it may be significantly faster to run the stage in Insert write mode and have the database maintain the indexes throughout the insert operation instead of having to rebuild all the indexes after the bulk load. On the other hand, if the target table is empty and you need to load many records into it, then you will likely benefit from using the Bulk load write mode: because the indexes need to be built for all the rows anyway, it may be faster to first load all the records without maintaining the indexes and then build the indexes from scratch.

Instead of rebuilding the indexes in an external tool after the load, you can configure the stage to request that the Oracle database maintain the indexes during the load by setting the Index maintenance option property to Do not skip unusable or Skip unusable. The use of these two options is highly restrictive though, and in most cases will not be the optimal choice. Here are the things that you need to keep in mind if you consider using one of these two options:

  • You will need to set Allow concurrent load sessions connector property to No, otherwise you will receive Oracle error "ORA-26002: Table string has index defined upon it" before any data is loaded. This requirement further implies that you will not be able to run the stage in parallel execution mode, except for one special case when loading to a partitioned table which is covered later in the article in the section that talks about loading records to a partitioned table.
  • If the table has a UNIQUE index, the index will remain VALID after the job completes, but only if none of the loaded rows have violated the uniqueness imposed by the index. Otherwise the index will automatically be marked UNUSABLE even though the job completes successfully. Subsequent attempts to run the job will result in failure and the reported Oracle error depends on the Index maintenance option property value. For the Do not skip unusable value the error will be "ORA-26028: index string.string initially in unusable state" and for the Skip unusable value the error will be "ORA-26026: unique index string.string initially in unusable state".
  • If the table has a NON-UNIQUE index, the index will remain VALID after the load. If the index was marked UNUSABLE before the load, the behavior will depend on the Index maintenance option property value. For Do not skip unusable value it will result in Oracle error "ORA-26002: Table string has index defined upon it" and for Skip unusable value the job will complete and the index will remain in UNUSABLE state.

You can configure the stage to automatically rebuild indexes after the load, and you can control the LOGGING and PARALLEL clauses of the index rebuild statements to optimize the index rebuild process. You can configure the stage to fail the job if any index rebuild statement fails or to only issue warning messages. To improve the performance the connector stage will rebuild the indexes only when necessary. It will inspect the state of each index before rebuilding it. If the index is marked VALID the connector will skip rebuilding it. This applies also to locally and globally partitioned indexes. In that case, the connector will inspect the index partitions and subpartitions (if the index is locally partitioned and the table is composite-partitioned) and rebuild only those index partitions and subpartitions that are marked UNUSABLE.
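
To see which indexes need rebuilding after a load, you can query the data dictionary and rebuild only the unusable ones, much as the connector itself does. The index and table names here are hypothetical; the PARALLEL and NOLOGGING clauses correspond to the clauses the connector can be configured to include:

```sql
-- List indexes on the target table that were left unusable by the load.
SELECT index_name, status
FROM user_indexes
WHERE table_name = 'TARGET_TABLE' AND status = 'UNUSABLE';

-- For locally partitioned indexes, check the individual partitions.
SELECT index_name, partition_name, status
FROM user_ind_partitions
WHERE status = 'UNUSABLE';

-- Rebuild an unusable index.
ALTER INDEX IDX_TARGET_TABLE_NAME REBUILD PARALLEL 4 NOLOGGING;
```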

Handling constraints

The presence of integrity and referential constraints on the table has implications on the bulk load operation similar to those imposed by the indexes.

NOT NULL and CHECK constraints do not need to be disabled prior to the load. If any input records violate the constraint the job will fail and no records will be loaded to the table. The exception to this is the case when the stage is configured to run in parallel execution mode and some player processes finish loading their record partitions before the player process that encountered the record in error. In that case, the records loaded by the player processes that have already finished will be committed to the table. From the performance point of view, a good approach for dealing with these two types of constraints is to leave them enabled on the table during the load but ensure that no records violate them by performing the necessary checking and error handling upstream of the connector stage. We provide some concrete suggestions later in the section that addresses reject links.

FOREIGN KEY, UNIQUE and PRIMARY KEY constraints also do not need to be disabled prior to the load but are effectively ignored by the load operation. Any records that violate these constraints will still be loaded and the constraints will remain VALIDATED and ENABLED. To ensure that these constraints are valid for the rows loaded by the connector stage you will need to disable them prior to the load and enable them after the load. Keep in mind that these types of constraints are usually accompanied by indexes. The database enforces UNIQUE and PRIMARY KEY constraints through unique indexes and FOREIGN KEY constraints often have user-defined indexes created for them. In the earlier section about handling indexes we explained the effects that indexes have on the load operation. The same effects apply in this case and need to be taken into consideration.
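
If you choose to manage these constraints yourself rather than through the connector, the typical sequence looks like the following. The table and constraint names are hypothetical:

```sql
-- Before the load: disable the primary key. CASCADE also disables any
-- foreign key constraints in other tables that reference it.
ALTER TABLE TARGET_TABLE DISABLE CONSTRAINT PK_TARGET_TABLE CASCADE;

-- After the load: enable and validate the constraints again. ENABLE
-- VALIDATE fails (for example with ORA-02437 for a primary key) if any
-- rows violate the constraint.
ALTER TABLE TARGET_TABLE ENABLE VALIDATE CONSTRAINT PK_TARGET_TABLE;
ALTER TABLE CHILD_TABLE ENABLE VALIDATE CONSTRAINT FK_CHILD_TARGET;
```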

The connector provides an option to disable constraints before the load. You can specify this by setting the Disable constraints property to Yes. When this is done, the load operation typically completes faster than the insert operation. However, you have to enable and validate the constraints following the load and deal with the loaded records that have violated the constraints. That part can have a major impact on the overall performance. If you have a mechanism in place outside of the DataStage job to enable and validate the constraints after the load, then having the connector disable them before the load will provide for the best performance of the job itself since there will be no restrictions during the load due to constraints and no time will be spent at the end of the job to enable the constraints.

You can configure the stage to enable and validate the constraints after the load by setting the Enable constraints property to Yes. However, if any constraint fails the validation at that time, the job will fail and the constraint will remain in the DISABLED and NOT VALIDATED state. You can specify an exceptions table in the Exceptions table name property, to which the ROWIDs of the rows that violated the constraints will be stored, along with the name of the constraint that they violated. Doing this will still result in job failure if any of the constraints cannot be enabled, but you will be able to determine which rows violated which constraint. You can further configure the stage to delete the rows that violated the constraints from the table and to try to enable the constraints again, but that will discard all the rows in the table found to be in violation of the constraints. To store them somewhere instead of discarding them, you can define a reject link for the stage and enable the SQL error – constraint violation condition on it. This will have a major impact on the performance because it enforces sequential execution of the stage in a single player process. Rather than accepting this restriction, you should consider running the stage in parallel execution mode with the Insert value for the Write mode property, as that may result in better overall performance. In general, it is best to avoid the use of reject links if possible. The section about reject links provides some concrete ideas in this regard.

If you configure the stage to disable table constraints prior to the load, enable them after the load, and reject the records that have violated constraints, all of the records in violation of the UNIQUE and PRIMARY KEY constraints defined on the table will be rejected. The connector does not make a distinction between the rows that existed in the table prior to the load and those that were loaded by the connector. For example, if a row loaded by the connector violated the primary key constraint because another row with the same primary key value existed in the table prior to the load, both rows will be deleted from the table and sent to the reject link. To have the stage delete and reject only those rows in violation of a constraint that were not in the table prior to the load, you will need to implement a custom solution such as the one shown in the next section.

Custom handling of indexes and constraints

When the connector stage is configured to handle triggers, indexes or constraints automatically before and after the load, the connector will accomplish this by issuing various Data Definition Language (DDL) statements on the database. For enabling and disabling triggers, it will use ALTER TRIGGER statements, for rebuilding indexes it will use ALTER INDEX statements and for enabling and disabling constraints it will use ALTER TABLE statements. You can locate these statements in the job log when you run the job with CC_MSG_LEVEL environment variable set to value 2. The connector provides properties that you can use to include or remove particular clauses in these statements. For example, it provides properties that allow you to control the presence of PARALLEL and LOGGING clauses in the ALTER INDEX … REBUILD statement.

In some cases, you may wish to issue your own statements for handling triggers, indexes and constraints instead of having the connector automatically generate and run them. For example, you may wish to reference table triggers, indexes and constraints directly by their names instead of having the connector query the database dictionary views to dynamically discover those objects. You may wish to manually include certain clauses in these statements that you cannot configure through the connector properties. You can achieve this outside of the stage by running external tools and scripts before and after running the job. You can do it through the stage by utilizing the Before SQL statement and After SQL statement connector properties. In each of these two properties you can specify either a list of semicolon-separated SQL statements or an anonymous PL/SQL block to be executed. The connector stage executes these statements in the conductor process. It will run the Before SQL statement statements before any player process has been created for the stage to process the records. It will run the After SQL statement statements after all the player processes have finished loading the records. The following example illustrates how this can be accomplished.

Suppose that you want to load records to the table TABLE_ITEMS, which stores information about a certain type of item. The table has the ID column that represents item identifiers and serves as the primary key for the table, which further means that the table has a unique index defined on this same column. The table also has the NAME column that stores item names and for which a non-unique index is defined on the table. Finally, the table has the LOAD_DATE column which represents the date and time when the item row was loaded. The statements shown in Listing 1 can be used to create this table.

Listing 1. Create table statement example
CREATE TABLE TABLE_ITEMS(ID NUMBER(10), NAME VARCHAR2(20), LOAD_DATE DATE);
ALTER TABLE TABLE_ITEMS ADD CONSTRAINT PK_TABLE_ITEMS_ID PRIMARY KEY (ID);
CREATE INDEX IDX_TABLE_ITEMS_NAME ON TABLE_ITEMS(NAME);

The load date is set when the job runs and is passed to the job as a job parameter in the format YYYY-MM-DD HH24:MI:SS. Various mechanisms can be used to set and pass this value to the job. For example, if you are invoking the job with the dsjob command from the command line, the system date command can be used to produce this value. On Linux, you can use the command shown in Listing 2 to run the job and pass the LOAD_DATE job parameter value based on the current system date and time.

Listing 2. dsjob command invocation
dsjob -run -jobstatus -userstatus -param LOAD_DATE="`date +'%F %T'`" project_name job_name

Another option for passing the current date and time is to create a sequence job that contains a single Job Activity stage, which in turn invokes the load job and initializes the LOAD_DATE job parameter to the value DSJobStartTimestamp, a built-in DataStage macro that returns the job start timestamp.

Suppose the table contains many rows already and that you are running your job to load additional rows. Further, most of the records that you are loading are for new items, but some records represent existing items, meaning they have the ID field value for which there is already a row in the table with that same ID value. You want to load the data as fast as possible, and you also want the constraints and indexes to be enabled and valid for all the table rows after the load. If any loaded rows violate the primary key constraint, you want to store them to an alternate table TABLE_ITEMS_DUPLICATES which has the same columns as the TABLE_ITEMS table but does not have any constraints and indexes defined on it, as shown in Listing 3.

Listing 3. Create table statement for duplicates
CREATE TABLE TABLE_ITEMS_DUPLICATES(ID NUMBER(10),
                                    NAME VARCHAR2(20),
                                    LOAD_DATE DATE);

You could achieve this task with the stage configured to disable constraints before the load, skip all indexes during the load and rebuild indexes and enable constraints after the load. You could define reject link on the stage, select the SQL error – constraint violations condition on the reject link and direct this link to another Oracle Connector stage configured to insert rows to the TABLE_ITEMS_DUPLICATES table. But this approach would have the following limitations:

  • Because you configured the stage to disable the constraints and reject rows that violated constraints after enabling them, the connector enforces sequential execution in a single player process. The rejecting of records on the reject link can only take place in a player process and in this case the rejecting is done when the connector enables the constraints after the load. For the player process to enable the constraints after it has finished loading the records, it needs to be certain that it is the only player process for the stage. But in practice, you may wish to run the stage in parallel execution mode so that you can utilize the available system resources on the engine tier.
  • The connector rejects all the rows that violated the primary key constraint including the rows that existed in the table prior to the load but that happen to have the same ID value as some of the rows that were loaded. This is because at the time the constraints are enabled all the rows are already in the table storage and no distinction is made at that time between the rows that were in the table prior to the load and those that were just loaded. In practice, you may wish to reject only the rows that were just loaded by the job and that violated the constraint but not any of the rows that existed in the table prior to running the job.

The following is one possible approach to avoid these limitations. Specify the statement shown in Listing 4 in the Before SQL statement property.

Listing 4. Statement to disable PK constraint
ALTER TABLE TABLE_ITEMS DISABLE CONSTRAINT PK_TABLE_ITEMS_ID;

This statement will explicitly disable the constraint PK_TABLE_ITEMS_ID on the table prior to loading records to the table.

Set the Index maintenance option connector property to Skip all. Set the Allow concurrent load sessions connector property to value Yes. Do not define reject link for the stage. Configure the stage to run in parallel execution mode.

Define the exceptions table TABLE_ITEMS_EXCEPTIONS. This table will be needed for the EXCEPTIONS INTO option of the ENABLE CONSTRAINT clause in the ALTER TABLE statement that we will want to invoke (explained further down). The format that the exceptions table needs to follow can be found in the SQL script UTLEXCPT.SQL included with the Oracle database product installation. Based on the information in this script, to create the TABLE_ITEMS_EXCEPTIONS exceptions table for this example execute the statement shown in Listing 5.

Listing 5. Statement to create the exceptions table
CREATE TABLE TABLE_ITEMS_EXCEPTIONS(ROW_ID ROWID,
                                    OWNER VARCHAR2(30),
                                    TABLE_NAME VARCHAR2(30),
                                    CONSTRAINT VARCHAR2(30));

Specify the PL/SQL anonymous block shown in Listing 6 in the After SQL statement property. The connector will submit it to the database once all the records have been loaded. See the comments embedded in the PL/SQL code for details about individual operations performed by this PL/SQL block.

Listing 6. Custom PL/SQL code for handling indexes and constraints
DECLARE
    -- Define the exception for handling constraint validation error ORA-02437.
    cannot_validate_constraint EXCEPTION;
    PRAGMA EXCEPTION_INIT(cannot_validate_constraint, -2437);
BEGIN
    -- Truncate the tables TABLE_ITEMS_DUPLICATES and TABLE_ITEMS_EXCEPTIONS in case
    -- they contain any rows from the previous job runs.
    EXECUTE IMMEDIATE 'TRUNCATE TABLE TABLE_ITEMS_DUPLICATES';
    EXECUTE IMMEDIATE 'TRUNCATE TABLE TABLE_ITEMS_EXCEPTIONS';
                
    -- Try to enable the PK_TABLE_ITEMS_ID constraint and to build the underlying
    -- unique index. If any rows are in violation of the constraint then store
    -- their ROWIDs to the TABLE_ITEMS_EXCEPTIONS table.
    EXECUTE IMMEDIATE 'ALTER TABLE TABLE_ITEMS
      ENABLE CONSTRAINT PK_TABLE_ITEMS_ID
      EXCEPTIONS INTO TABLE_ITEMS_EXCEPTIONS';
                
    EXCEPTION WHEN cannot_validate_constraint THEN
    -- The constraint could not be enabled. The constraint violations need to be
    -- handled before trying again to enable the constraint.
    BEGIN
      -- Set the default date format for the session to match the format of
      -- LOAD_DATE job parameter value.
      EXECUTE IMMEDIATE 'ALTER SESSION SET NLS_DATE_FORMAT=''YYYY-MM-DD HH24:MI:SS''';
                
      -- Copy rows from TABLE_ITEMS table to TABLE_ITEMS_DUPLICATES table that
      -- violated the constraint and were loaded by this job. The rows that were
      -- loaded by this job will have the LOAD_DATE column value that matches the
      -- LOAD_DATE job parameter value.
      INSERT INTO TABLE_ITEMS_DUPLICATES
        SELECT ID, NAME, LOAD_DATE
        FROM TABLE_ITEMS
        WHERE ROWID IN (SELECT ROW_ID FROM TABLE_ITEMS_EXCEPTIONS) AND
              LOAD_DATE='#LOAD_DATE#';
               
      -- Delete the rows from TABLE_ITEMS that were copied to TABLE_ITEMS_DUPLICATES.
      DELETE FROM TABLE_ITEMS
      WHERE ROWID IN (SELECT ROW_ID FROM TABLE_ITEMS_EXCEPTIONS) AND
            LOAD_DATE='#LOAD_DATE#';
                
      -- Try to enable the constraint again. This time the operation should be
      -- successful.
      EXECUTE IMMEDIATE 'ALTER TABLE TABLE_ITEMS ENABLE CONSTRAINT PK_TABLE_ITEMS_ID';
                
      -- Rebuild the non-unique index IDX_TABLE_ITEMS_NAME that was marked
      -- unusable during the load.
      EXECUTE IMMEDIATE 'ALTER INDEX IDX_TABLE_ITEMS_NAME REBUILD PARALLEL NOLOGGING';
   END;
END;

Utilizing Oracle date cache

You can configure the stage to take advantage of the Oracle date cache feature by setting the Use Oracle date cache property to value Yes. This feature is only available in bulk load mode and can significantly improve the load performance when the input link has one or more Date or Timestamp columns, those columns are used to load values to TIMESTAMP table columns and there are many repeating values for those columns.

The connector prepares values for TIMESTAMP table columns in a format that requires them to be converted prior to being stored in the table. When the date cache is utilized the result of the conversion is cached so when the same value appears again on the input the cached result is used instead of performing the conversion again.

You can specify the number of entries in the date cache through the Cache size property. To avoid cache misses when looking up values in the cache, set this property to a value no smaller than the total number of distinct values expected to be encountered on the input for each Date and Timestamp link column.
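
One way to estimate a suitable cache size is to count the distinct values in a representative sample of the input data, for example in a staging table. The table and column names below are hypothetical:

```sql
-- The largest per-column distinct count gives a lower bound for the
-- Cache size property value.
SELECT COUNT(DISTINCT ORDER_DATE) AS distinct_order_dates,
       COUNT(DISTINCT SHIPPED_TS) AS distinct_shipped_ts
FROM STAGING_ORDERS;
```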

You can also choose to disable the date cache if it becomes full during the load by setting the Disable cache when full property to Yes. One scenario where this can be useful is when you run the job repeatedly to load records to the target table, and most of the time the total number of distinct input values is smaller than the selected cache size, but occasionally most of the input values are unique. In that case, disabling the cache when it becomes full avoids performing lookups on the full date cache, which improves performance because each of those lookups would result in a cache miss.

Note that each player process for the stage will use its own date cache when the stage is configured to run in parallel execution mode.

In Information Server Version 9.1, when Date and Timestamp input link columns are used to load values to DATE table columns instead of TIMESTAMP table columns, the values provided by the connector are in a format that does not require conversion prior to storing them in the target table. Therefore, the date cache feature will not provide much benefit in this case. In Information Server releases prior to Version 9.1, the conversion is needed in this case as well, so the date cache provides benefits similar to those achieved when loading values to TIMESTAMP target table columns.

Disabling redo logging

The database redo log tracks changes made to the database and is utilized in the database recovery process. When the connector loads data to the target table, the redo log is updated to reflect the rows loaded to the table. In the case of a storage media failure, after the database is restored from the most recent backup, the rows loaded by the connector since that backup was taken are recovered from the redo log. This protection comes at a cost: the load operation takes longer to complete because the redo log must be maintained to capture the data loaded to the table.

To speed up the load operation performed by the connector you can disable redo logging by setting the NOLOGGING attribute on the target table. You can set this flag when you create the table, or, if you wish to set it only for the duration of the load, you can do so by using the Before SQL statement and After SQL statement connector properties. In the Before SQL statement property specify the statement ALTER TABLE table_name NOLOGGING and in the After SQL statement property specify the statement ALTER TABLE table_name LOGGING, where table_name is the name of the table to which the stage is loading data. Another option is to use the Disable logging connector property: when this property is set to No, the logging attribute of the table (LOGGING or NOLOGGING) is honored, and when it is set to Yes, redo logging is disabled.
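For example, assuming the target table is named TABLE_ITEMS (a placeholder name; substitute your own), the two properties could be set as follows:

```sql
-- Before SQL statement: disable redo logging for the duration of the load.
ALTER TABLE TABLE_ITEMS NOLOGGING;

-- After SQL statement: restore redo logging once the load completes.
ALTER TABLE TABLE_ITEMS LOGGING;
```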

Disabling redo logging for the bulk load can significantly improve the load performance. In the case of a media failure, the data that was loaded while redo logging was disabled can be restored from a database backup taken after that load, but it cannot be restored from the redo log. If restoring loaded data from the redo log in case of a media failure is not required, or if the data can be easily re-loaded in that case, then disabling redo logging should be considered for speeding up the load operation. Refer to the Resources section for the link to the Oracle Database online documentation, where you can find more information about the redo log and considerations related to disabling redo logging.

Loading records to a staging table

If you wish to use the connector in bulk load mode to quickly transfer the data to the target database, but at the same time you need the target table triggers, indexes and constraints to be maintained as if a conventional insert statement were used, one option to consider is to initially load the records to an intermediate staging table that has the same definition as the target table but does not contain any triggers, indexes or constraints. After the load to the staging table completes, issue a follow-up MERGE or INSERT statement that reads rows from the staging table and writes them to the target table. This statement can be specified in the After SQL statement connector property so that it is automatically invoked at the end of the job. The use of a staging table causes the data to be written twice (once when loaded by the connector to the staging table and once when copied to the target table by the follow-up statement), but it has the following beneficial characteristics:

  • The records will be processed only once by the job. This is important when the processing of records in the job is resource intensive.
  • The records will be moved from the DataStage engine hosts to the database host only once.
  • The INSERT statement will execute on the database side, so during that time the DataStage engine hosts will be available to process other workloads.
  • You can use database SQL functions in the INSERT statement to perform additional transformation on the values that were loaded to the staging table prior to storing them to the target table.
  • If the INSERT statement that copies the values fails for some reason, the loaded records remain available in the staging table, and you can issue another INSERT statement after handling the rows in the staging table that caused the error. You can utilize the LOG ERRORS INTO clause in the INSERT statement so that the rows that cause the statement to fail are directed to an error logging table along with details indicating the reason for the error.
  • The indexes, constraints and triggers on the target table will remain enabled, valid and enforced at all times.
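As a sketch of this approach, assume a staging table TABLE_ITEMS_STG with the same column layout as the target table TABLE_ITEMS, and an error logging table TABLE_ITEMS_ERR created beforehand with the DBMS_ERRLOG.CREATE_ERROR_LOG procedure (all three names are placeholders). The After SQL statement property could then contain a statement along these lines:

```sql
-- Copy the loaded rows from the staging table into the target table.
-- Rows that violate constraints on the target table are diverted to the
-- error logging table instead of failing the whole statement.
INSERT INTO TABLE_ITEMS
SELECT * FROM TABLE_ITEMS_STG
LOG ERRORS INTO TABLE_ITEMS_ERR ('staging copy') REJECT LIMIT UNLIMITED;
```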

Manual load

You can configure the Oracle Connector stage to run in manual load mode, in which case the stage writes records to a file on the engine tier host that can then be passed to the Oracle SQL*Loader utility to load the records into the target table. To configure the stage in this mode, set the Manual load property to Yes. Loading records in manual load mode is generally slower than loading them directly into the target table, but in some cases it can be very useful, as shown next.

One example is when you want to run the job to extract and process the data at a time when the target database is not available for load. You can configure the stage to store the data to a file and then later, when the database is available, load the data using the SQL*Loader utility. The data will already be processed at that time and ready to be stored to the target table. In cases when the processing phase of the job takes a long time to complete, for example when pulling data from a slow external source or when complex, CPU intensive transformations are involved, the manual load approach can result in significant time savings.

Note that the connector actually generates two files in this mode: the data file with the records to be loaded, and the control file used to pass options to SQL*Loader and inform it about the location and format of the data file. You can also transfer the generated files to a host that doesn't have DataStage installed and load the records from there, provided the SQL*Loader tool is available on that host. If that host is the Oracle server host, the load operation will be local to the database.

Restarting the load and backing up the data that needs to be loaded is also convenient with this approach because the generated data file is already in the format supported by the SQL*Loader tool.

Note that when the stage is configured to run in manual load mode, it will still try to connect to the database even though it will not be loading records to the database on that connection. If the target database is not available to accept connections at the time when you run the job, you can work around the problem by pointing the stage to another database that is available at that time, so that the stage can establish a connection and complete the manual load operation.

If it is critical to ensure the stage does not connect to any database in manual load mode, you can accomplish that by ensuring that the following conditions are met:

  • The connector is not collecting operational metadata. To ensure this, define the DS_NO_PROCESS_METADATA environment variable for the job and set it to TRUE.
  • The connector does not issue statements on the database to be interpreted as events by the IBM InfoSphere Guardium Activity Monitor. To ensure this, verify that the CC_GUARDIUM_EVENTS environment variable is not set for the job.
  • The Run before and after SQL statements property must be set to No. This ensures that the connector is not trying to perform any SQL statements on the database prior to and after loading the records to a file.
  • The Table action property must be set to Append. This ensures the connector is not trying to create, drop or truncate a database table prior to performing the load operation.
  • The Perform operations before bulk load and Perform operations after bulk load properties must not specify any index and constraint maintenance operations.

When all of these conditions are met, the connector will load records to the data file without connecting to the database. You will still need to provide Username and Password property values in order to compile the job. You can set them to dummy values, or you can set the Use external authentication property to Yes, which disables the Username and Password properties so that you will not need to provide values for them.

Note that even if you configure the stage not to connect to the database, you will still need to have Oracle Client product installed on the DataStage engine host on which the connector runs, otherwise the connector will fail the initialization when the job starts and the job will fail.

Array size and Buffer size

When the Oracle Connector stage is configured to run in bulk load mode, it reads records from the input link and stores them in the Oracle-specific direct path column array. Unlike the array that it uses for fetch and DML operations, the connector does not create this array in memory itself; instead, the Oracle client creates the array on the connector's behalf. The size of this array, in number of records, is specified through the Array size property. The default value is 2000.

When the column array becomes full, the connector requests that the Oracle client convert the array to the internal stream buffer format and send this buffer to the Oracle server. The size of this stream buffer (in kilobytes) is specified through the Buffer size connector property. The default value for this property is 1024 (1 megabyte).

Depending on the length of the records, it is possible that the specified Buffer size value will not be sufficient to store the Array size number of records in stream format. In that case, the Array size will be automatically reduced, and the Oracle Connector will log a message indicating the newly enforced number of records that it will load at a time. Note that regardless of the specified Buffer size value, the maximum Array size value is 4096.

When configuring the Array size and Buffer size values in order to achieve optimal performance, you can apply the following approach:

  1. Set the Array size to 4096 (maximum supported) and Buffer size to 1024.
  2. Run the job and check if the connector reported a message in the job log that it will load less than 4096 records at a time. If it did then go to step 3. If it did not then go to step 4.
  3. Continue trying progressively larger Buffer size values until the performance worsens or the connector stops reporting the message, whichever happens first.
  4. Continue trying progressively smaller Buffer size values until the performance worsens.

Loading records with the stage in parallel execution mode

The connector can be configured to run in parallel execution mode and load data from multiple player processes, with each player process loading a subset (partition) of the input link records to the target table. In this case, parallel load must be enabled for the target table segment. To do this, set the Allow concurrent load sessions connector property to Yes. If this property is set to No and you attempt to load records from multiple player processes, you will receive the Oracle error "ORA-00604: error occurred at recursive SQL level 1", possibly accompanied by the Oracle error "ORA-00054: resource busy and acquire with NOWAIT specified or timeout expired".

The exception to the above rule is when the stage is loading rows to a Range, List or Interval partitioned table (which can optionally be subpartitioned) and the Partition type setting for the link is set to Oracle Connector. This exception scenario is covered in the next section which discusses loading records to a partitioned table.

Setting Allow concurrent load sessions property to Yes has implications on how indexes are handled during load. For more information refer to the earlier section which discussed the handling of indexes.

Loading records to a partitioned table

When you wish to load records to a single partition or subpartition of the target table, set the Table scope connector property to Single partition or Single subpartition, and specify the name of the partition or subpartition in the Partition name or Subpartition name property. Doing this locks only the segment of the specified partition or subpartition, and also causes an error to be returned if any input record belongs to a different partition, which provides a verification mechanism that the records belong to the partition or subpartition specified in the stage.
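If you are unsure of the exact partition or subpartition names to specify in these properties, you can look them up in the data dictionary. For example, assuming a table named TABLE_ITEMS (a placeholder) owned by the current user:

```sql
-- List the partitions of the table.
SELECT PARTITION_NAME
  FROM USER_TAB_PARTITIONS
 WHERE TABLE_NAME = 'TABLE_ITEMS';

-- List the subpartitions of the table, if it is composite-partitioned.
SELECT PARTITION_NAME, SUBPARTITION_NAME
  FROM USER_TAB_SUBPARTITIONS
 WHERE TABLE_NAME = 'TABLE_ITEMS';
```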

When you wish to load records to multiple partitions of a table, set the Table scope connector property to Entire table. With respect to triggers, indexes and constraints defined on the table, the concepts that apply to non-partitioned tables apply in this case as well. When rebuilding indexes on partitioned tables, the connector automatically rebuilds only those index partitions or subpartitions that were marked unusable during the load. This applies to both locally and globally partitioned indexes.
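To verify which index partitions a load left unusable, or to rebuild one manually, you can use statements along these lines (the index and partition names are placeholders):

```sql
-- Find index partitions left in UNUSABLE state after a direct path load.
SELECT INDEX_NAME, PARTITION_NAME
  FROM USER_IND_PARTITIONS
 WHERE STATUS = 'UNUSABLE';

-- Rebuild an affected index partition.
ALTER INDEX IDX_TABLE_ITEMS_NAME REBUILD PARTITION P_2013_Q1;
```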

When you wish to load records in parallel and ensure that each player process loads records to a dedicated table partition, you can do that for a table partitioned with a Range, List or Interval partitioning strategy (optionally subpartitioned) by specifying the Oracle connector value for the Partition type setting of the input link. In this case, each player process of the stage is associated with a single dedicated table partition and loads data to that partition's segment (or its set of subpartition segments if the table is composite-partitioned). Because no two player processes access the same segment, the Allow concurrent load sessions connector property can be set to No. The exception is when the table has a globally partitioned index; in that case, the job will fail with the Oracle error "ORA-26017: global indexes not allowed for direct path load of table partition string".

When considering using Oracle connector value for Partition type when loading records to a partitioned table, consult the list provided in Part 1 article in the section that covered DML operations on a partitioned table.


Reject link considerations

Reject links are a convenient mechanism for handling input records for which the database reported an error when it tried to perform the operation specified in the stage. You can define conditions on the reject link under which records in error should be rejected, choose to include error code and error text information with the rejected records, and point the reject link to another stage to store the rejected records in a location of your choice, such as a file or another database table.

However, reject links can have a negative effect on performance. When the stage has a reject link defined, it needs to inspect the error code for each record that the database reports to be in error, compare it to the conditions defined on the reject link, and submit the record to the reject link, potentially accompanied by error code and error text information. These operations consume system resources and affect the performance of the job as a whole.

You saw earlier that in order to use reject links to handle records with constraint violations in bulk load write mode, you must run the stage in sequential execution mode. Doing this will result in rejecting all the rows from the table that violated the constraints, including those rows that were present in the table before the load. Also, when you run INSERT statements with the APPEND_VALUES optimization hint you cannot rely on the reject link functionality.

To avoid these limitations, consider avoiding reject links altogether, and instead eliminate the records in error BEFORE they reach the Oracle Connector stage, or eliminate them AFTER the stage has completed the operation for which it was configured.

As an example of handling records in error BEFORE they reach the Oracle Connector stage, let us consider the NOT NULL and CHECK constraints in a table created using the statement shown in Listing 7.

Listing 7. Statement to create test table
CREATE TABLE TABLE_TEST(C1 NUMBER(10) CONSTRAINT CONSTR_NOT_NULL_C1 NOT NULL,
                        C2 VARCHAR2(20),
                        C3 DATE,
                        CONSTRAINT CONSTR_CHK_C1 CHECK (C1 > 20000 AND C1 < 70000));

The constraints on this table require that the values for column C1 are not NULL and that they are between 20000 and 70000 (exclusive). Instead of configuring the Oracle Connector to use a reject link for handling these constraints, the constraint checking can be done in the stages upstream of the Oracle Connector stage. For example, you can use the Filter stage with one output link leading to the Oracle Connector stage and another reject link leading to a Sequential File stage (or any other stage to which you wish to direct the bad records). In the Where Clause property of the Filter stage, you would specify a condition such as the one shown in Listing 8.

Listing 8. Filter stage condition
C1 IS NOT NULL AND C1 > 20000 AND C1 < 70000

You would also set the Output Rejects property of the Filter stage to True. The Filter stage passes the records that meet this constraint to its output link and sends the remaining records to the reject link.

This configuration is shown in Figure 2.

Figure 2. Handling constraint violations with a Filter stage
Configuration with Filter removing bad records before they reach Oracle Connector stage.

For more complex constraint expressions or to annotate rejected records with the information describing the reason why they were rejected, you could use the Transformer stage. For the previous example, the Transformer stage would have one input link (for example, input_records) with columns C1, C2 and C3 and would have three output links. All three output links would copy the columns from the input link. The first output link would have the constraint IsNull(input_records.C1) defined and an extra column to contain the constant literal value "C1 value is null.". The second output link would have the constraint input_records.C1 <= 20000 Or input_records.C1 >= 70000 defined and an extra column to contain literal value "C1 value: " : input_records.C1 : " is not between 20000 and 70000.". The first and the second link would lead to the Funnel stage that would combine the records on them and store them to a reject file using the Sequential File stage. The third output link would lead to the Oracle Connector stage that would write them to the table. This configuration is shown in Figure 3.

Figure 3. Handling constraint violations with a Transformer stage
Configuration with Transformer removing bad records before they reach the Oracle Connector stage.


The following are some examples of handling records with errors AFTER the stage has completed moving the records to the database:

  • When bulk loading to a target table with constraints, load the records instead to a staging table that doesn't have the constraints defined on it, and then issue a follow-up INSERT statement to move the records to the target table. Specify the LOG ERRORS INTO clause in the INSERT statement so that the records in error are redirected to an error logging table. For more information, refer to the earlier section that covered loading of records to a staging table.
  • When bulk loading to a target table with constraints, configure the stage to disable the constraints before the load and enable them after the load, and specify the exceptions table, but do not configure the stage to automatically process the exceptions. Instead, process them manually after the job completes. Alternatively, perform custom handling of constraints and indexes using a strategy such as the one suggested earlier in the article.
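As a sketch of the manual exception processing mentioned above, the EXCEPTIONS table can be created with the Oracle-supplied utlexcpt.sql script, after which the constraint can be re-enabled and the offending rows located (the table and constraint names are placeholders):

```sql
-- Re-enable the constraint, recording the ROWIDs of the violating rows.
ALTER TABLE TABLE_ITEMS ENABLE CONSTRAINT PK_TABLE_ITEMS_ID
  EXCEPTIONS INTO EXCEPTIONS;

-- Inspect the rows that prevented the constraint from being enabled.
SELECT *
  FROM TABLE_ITEMS
 WHERE ROWID IN (SELECT ROW_ID FROM EXCEPTIONS);
```

Note that when violations exist, the ALTER TABLE statement itself fails, but the EXCEPTIONS table is still populated with the ROWIDs of the offending rows.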

Data types

In this section, we will look at how selection of data types for the link columns can affect the connector performance. The default link column data types are provided by the connector when it imports the table definition. They are also provided by the connector at runtime when the stage has an output link without columns defined on it and the runtime column propagation feature is enabled for the link.

The default link column data types represent the closest match for the corresponding table column data types in terms of the ranges of values that they support. In some cases, you may have an additional insight regarding the format and range of the actual data values processed by the stage which may in turn allow you to select data types for the link columns that result in better performance in the connector. In this section, a number of such scenarios are examined.

You should always start with link columns of the default data types. To determine the default data types, use the connector to import the table that you will be accessing from the stage and look at the column definitions in the imported table definition. If you configured the stage to issue a complex SELECT statement that references multiple tables, you will not be able to import a single table definition for that statement. What you can do in that case is create a temporary view based on that statement and import the view instead. When your stage is configured to execute a PL/SQL anonymous block, you will not be able to import a table or a view to determine the default link column data types. In this case, you can analyze the PL/SQL block and, for each bind parameter in the block (specified in :param_name or ORCHESTRATE.param_name format), determine the object in the database to which it corresponds. In some cases this object will be a column in a table or a view, in other cases it will be a parameter of a stored procedure, but in all cases the object will have an associated Oracle data type. Create a temporary table with columns of those same data types and import that table.
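For example, if the stage runs a complex join, a view based on the same statement can be created, imported with the connector to obtain the default link column data types, and then dropped (all names here are illustrative):

```sql
-- Create a view that mirrors the complex SELECT used by the stage.
CREATE VIEW TMP_IMPORT_VIEW AS
  SELECT I.ITEM_ID, I.ITEM_NAME, D.ITEM_DESCRIPTION
    FROM TABLE_ITEMS I
    JOIN TABLE_DETAILS D ON I.ITEM_ID = D.ITEM_ID;

-- After importing TMP_IMPORT_VIEW with the connector, remove the view.
DROP VIEW TMP_IMPORT_VIEW;
```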

Refer to the Resources section for the link to the Information Server Information Center, where you can find the table that shows how the Oracle Connector maps Oracle data types to DataStage data types by default.

Once you have placed columns of default data types on the link, run the job and measure the performance. Then apply the information from the remainder of this section to further optimize the job by changing link column data types in such a way so that the performance improves while no loss of data occurs.

Representing CHAR, VARCHAR2, NCHAR and NVARCHAR2 table column data types

In this section, we will look at the CHAR, VARCHAR2, NCHAR and NVARCHAR2 Oracle data types.

When you use Oracle Connector to import tables with columns of these four data types, the connector will model CHAR and VARCHAR2 table columns as DataStage Char and VarChar link columns, and it will model NCHAR and NVARCHAR2 table columns as DataStage NChar and NVarChar link columns. The exception to this rule is the case when CHAR and VARCHAR2 table column definitions are based on the character length semantics and the database character set is a multi-byte character set such as AL32UTF8. In that case, CHAR and VARCHAR2 columns are modeled as NChar and NVarChar link columns.

If your database character set is a multi-byte character set such as AL32UTF8, your table columns use character length semantics (for example, a column defined as VARCHAR2(10 CHAR)), and all of the actual data values use only single-byte characters, then use Char and VarChar columns on the link (with the Extended attribute left blank) instead of NChar and NVarChar columns.
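To check which length semantics a table column uses, you can query the data dictionary; the CHAR_USED column reports C for character semantics and B for byte semantics (the table name is a placeholder):

```sql
-- Check the length semantics of the columns in the target table.
SELECT COLUMN_NAME, DATA_TYPE, DATA_LENGTH, CHAR_USED
  FROM USER_TAB_COLUMNS
 WHERE TABLE_NAME = 'TABLE_ITEMS';
```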

Minimize character set conversions in the stage by ensuring that the effective NLS_LANG environment variable value for the job is compatible with the NLS map name specified for the job. This is important not just for performance reasons, but for the correct functioning of the stage in regard to the interpretation of the character set encoding of the values. To determine the effective NLS_LANG value for the job, run the job with the CC_MSG_LEVEL environment variable set to 2 and inspect the job log. The connector will log the effective NLS_LANG environment variable value. If you wish to use a different value, you will need to define the NLS_LANG environment variable for the job and set it to the desired value. The Oracle Connector stage editor does not have an NLS tab, so the NLS map name selected for the job is used for the stage. You can specify this value by opening the job in the DataStage Designer, selecting the Edit menu, clicking Job properties, selecting the NLS page in the dialog, and providing the value in the Default map for stages setting. Table 1 shows examples of compatible NLS map name and NLS_LANG values. Note that you should set the language and territory portions of the NLS_LANG value to match the locale of your DataStage engine host.

Table 1. Compatible NLS map name and NLS_LANG values
NLS map name     Compatible NLS_LANG
UTF-8            AMERICAN_AMERICA.AL32UTF8
Windows-1252     AMERICAN_AMERICA.WE8MSWIN1252
ISO-8859-1       GERMAN_GERMANY.WE8ISO8859P1
ASCL_ASCII       AMERICAN_AMERICA.US7ASCII
Shift-JIS        JAPANESE_JAPAN.JA16SJIS

If the values processed by the stage consist only of ASCII characters, set the NLS map name to ASCL_ASCII and set NLS_LANG to AMERICAN_AMERICA.US7ASCII. Otherwise, if possible, set the NLS_LANG value to match the database character set. For example, if the stage is writing records to a database with the AL32UTF8 database character set, use the NLS_LANG value AMERICAN_AMERICA.AL32UTF8 and make sure the NLS map name for the job is set to UTF-8. To determine the database character set, you can connect to the database from SQL*Plus and issue the statement shown in Listing 9.

Listing 9. Determining the database character set
SELECT VALUE FROM NLS_DATABASE_PARAMETERS WHERE PARAMETER='NLS_CHARACTERSET';

Always make sure that the values in the job represented by the Char and VarChar link columns are actually encoded as indicated in the NLS map name and NLS_LANG settings and that the values for NChar and NVarChar link columns are UTF-16 encoded. Note that the connector handles Char and VarChar link columns which have the Extended attribute set to Unicode the same way it handles NChar and NVarChar link columns.

Make sure that all Char, VarChar, NChar and NVarChar link columns have the Length attribute set to a concrete value. Leaving the Length attribute empty can have a severe negative impact on the performance of the stage. When the Length is not specified, the connector assumes that the length is the maximum allowed for the respective bind parameters, which is 4000 bytes, and allocates memory with that assumed column length, which may result in large memory consumption. Suppose that the connector is configured to insert rows into a database table with an Array size of 5000, and the link contains 50 VarChar columns that do not have the Length attribute set. The connector will allocate (in each player process) approximately 50 (columns) x 5000 (array size) x 4000 (maximum value length) bytes, which is close to 1 gigabyte. If the actual table columns in the database are defined as VARCHAR2(50 BYTE), then by setting the Length attribute of the link columns to 50, the Array size could be left at 5000 and the allocated memory per player process would drop to 50 (columns) x 5000 (array size) x 50 (maximum value length) bytes, which is about 12 megabytes.

Do not use LongVarBinary, LongVarChar and LongNVarChar link columns to represent CHAR, VARCHAR2, NCHAR and NVARCHAR2 table columns. These link columns are appropriate for handling table columns of LONG, LONG RAW, BLOB, CLOB, NCLOB and XMLType columns as discussed later in the text. Even in those cases, if the actual values for those columns are less than 4000 bytes in size, the Binary, VarBinary, Char, VarChar, NChar and NVarChar link columns should be used instead of LongVarBinary, LongVarChar and LongNVarChar link columns. When LongVarBinary, LongVarChar and LongNVarChar link columns are used, the connector disables array processing and enforces the Array size value of 1, even if you specify small Length attribute values for those link columns. Enforcing the Array size of 1 can have a major negative impact on the performance.

Ensure that the connector is using optimal character set form for the Char, VarChar, NChar and NVarChar link columns. Character set form is an Oracle database concept and can be one of two values: IMPLICIT and NCHAR. The optimal character set form for the connector to use for CHAR and VARCHAR2 table columns is IMPLICIT and for NCHAR and NVARCHAR2 table columns it is NCHAR.

The connector will automatically use optimal character set form if you use Char and VarChar link columns for CHAR and VARCHAR2 table columns and NChar and NVarChar link columns for NCHAR and NVARCHAR2 table columns. As stated earlier, note that Char and VarChar link columns with Extended attribute set to Unicode are treated as NChar and NVarChar link columns.

In some cases, you may end up using link columns that result in sub-optimal character set form selection in the connector. For example, when you use the connector to import a table with VARCHAR2 columns that are based on character length semantics and the database is based on AL32UTF8 (multi-byte) character set, the connector will import that column as NVarChar link column. By importing it this way the connector is able to set the Length attribute of the link column to match the length specified in the table column definition while ensuring that data truncation does not happen at runtime. At runtime the connector will choose the character set form NCHAR for this link column which will not be the optimal choice for the VARCHAR2 target table column. If this link column happens to be referenced in a WHERE clause in the statement specified for the stage then any index defined on this table column will not be utilized. This applies to SELECT statements used by the stage configured for sparse lookup mode, as well as UPDATE and DELETE statements when the connector is configured to write records to the database in Insert, Update, Delete, Insert then update, Update then insert or Delete then insert write modes. The skipping of the index in this case can have significant performance implications. To force the connector to use the optimal character set form in this case, you could apply one of the following strategies:

  • Change the link column data type from NVarChar to VarChar. In this case, you will also need to update the Length attribute for this link column by multiplying it by 4 because UTF-8 characters can consume up to 4 bytes. As we mentioned before, if you know that all the actual data values processed by the stage use single-byte characters, then you can leave the Length attribute unchanged.
  • Change the data type of the table column to NVARCHAR2. In this case, the NCHAR character set form selected by the connector will be the optimal choice. In many practical cases though, making changes to the existing column definitions in the database tables is not an option.
  • Utilize the CC_ORA_BIND_FOR_NCHARS environment variable. Set the value of this environment variable for the job to a comma-separated list of the link column names for which you wish the connector to use the NCHAR character set form. If you specify the special value (none) for this environment variable, the connector will use the IMPLICIT character set form for all columns on the link. Another special value is (all) (the parentheses are part of the value), which results in the connector using the NCHAR character set form for all the columns on the link. If none of the tables in your database have NCHAR, NVARCHAR2 or NCLOB (discussed later) columns, then you can define the CC_ORA_BIND_FOR_NCHARS environment variable at the DataStage project level and specify (none) as its default value (again, the parentheses are part of the value). That way the connector will always use the IMPLICIT form, which is always the optimal character set form.
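To illustrate the first strategy, consider a hypothetical table in an AL32UTF8 database (the table and column names here are assumptions, not taken from a real schema):

```sql
-- Hypothetical table in an AL32UTF8 database, character length semantics
CREATE TABLE customers (name VARCHAR2(100 CHAR));

-- Default import:  NVarChar(100) link column (NCHAR form; an index on
--                  "name" may be skipped when the column appears in a
--                  WHERE clause)
-- Alternative:     VarChar(400) link column (IMPLICIT form; 100
--                  characters x 4 bytes per UTF-8 character)
```

If the actual data is known to be single-byte, VarChar(100) would suffice, as noted above.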

Representing NUMBER, FLOAT, BINARY_DOUBLE and BINARY_FLOAT table column data types

In this section we will take a look at the NUMBER, FLOAT, BINARY_DOUBLE and BINARY_FLOAT Oracle data types.

For a NUMBER(p, s) table column where precision p and scale s are explicitly specified, the default, and at the same time the optimal, link column data type is Decimal with the Length attribute set to p and the Scale attribute set to s. The exceptions to this rule are the scenarios where s < 0 and where s > p. In those cases, the values need to be provided to the stage and retrieved from the stage in text format, so the link column should be of VarChar type with a Length attribute value large enough to accommodate all possible values of the respective table column.
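As a sketch of these rules (the table and column names are hypothetical):

```sql
-- Hypothetical NUMBER(p, s) columns and the suggested link column types
CREATE TABLE amounts (
  a NUMBER(10,2),   -- Decimal(10,2) link column (default and optimal)
  b NUMBER(5,-2),   -- negative scale (s < 0): use a VarChar link column
  c NUMBER(3,5)     -- s > p: use a VarChar link column
);
```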

For a NUMBER(p) table column where precision p is explicitly specified but scale s is omitted, the default link column data type is Decimal with the Length attribute set to p and the Scale attribute left empty. If the remaining stages in the job operate on integer (TinyInt, SmallInt, Integer or BigInt) values for these columns, then you may achieve better performance by preserving those integer data types for the columns on the link of the Oracle Connector stage than by changing them to the Decimal data type. This is especially the case when the connector stage writes to a partitioned table, the corresponding table column is part of the partition key of the table, and you have specified the Oracle connector value for the Partition type setting of the link. If you decide to replace Decimal(p) link columns with integer link columns, keep in mind the differences between these link data types. Although they all represent integer values, the ranges of values that they cover differ.

For example, if a table column is defined as NUMBER(5) and you choose to replace the default Decimal(5) link column with a SmallInt link column, the values supported by the NUMBER(5) table column will be in the range [-99999, 99999] while the values supported by the SmallInt link column will be in the range [-32768, 32767]. If you set the Extended attribute of the link column to Unsigned, the values supported by the link column will be in the range [0, 65535]. In both cases, the range of the link column is a sub-range of the respective table column, so you should only use this approach if you are sure that all the values processed by the stage belong to both ranges. In the previous example, you may choose to use a SmallInt link column (signed or unsigned) when writing values to the database, but an Integer column (signed) when reading values from the database, because the range of supported values for the Integer (signed) link column is [-2147483648, 2147483647] and is therefore sufficient for any value supported by the NUMBER(5) table column. Note that if you choose to use a BigInt column on the link, the range of supported values will be [-9223372036854775808, 9223372036854775807], or [0, 18446744073709551615] if the Extended attribute is set to Unsigned. However, the choice of a BigInt link column will likely result in smaller performance gains than a TinyInt, SmallInt or Integer link column, and may even produce worse performance than the Decimal link column. This is because the connector exchanges BigInt link column values with the database as text values, which involves additional data type conversions.

For a NUMBER table column where both precision p and scale s are omitted, the default link column data type is Double. But if you are certain that the actual values in the table are all integer values, you may choose to use a link column of one of the integer data types (TinyInt, SmallInt, Integer, BigInt), or you may represent the values as decimal values with a specific precision and scale by using a link column of the Decimal data type. Keep in mind that the supported ranges of these link column data types are not fully compatible, so only use them in cases when they will not result in the loss of data precision due to rounding, or when such rounding is acceptable.

For FLOAT, BINARY_DOUBLE and BINARY_FLOAT table columns, the default and optimal column types to use for the corresponding link columns are Float, Double and Float, respectively.

Representing DATE, TIMESTAMP and INTERVAL table column data types

In this section we will take a look at the DATE, TIMESTAMP and INTERVAL Oracle data types.

For DATE, TIMESTAMP(0), TIMESTAMP(0) WITH TIME ZONE and TIMESTAMP(0) WITH LOCAL TIME ZONE table columns, the default link column data type is Timestamp with the Extended attribute left empty. This is the optimal link column type to use when the hour, minute and second portions of the values need to be preserved. Otherwise, the Date link column will likely result in better performance because the stage does not need to handle hours, minutes and seconds in the values. Note that if you need to preserve time zone information in the values, you need to use character-based link columns such as VarChar. This is covered in more detail later in this section.

For TIMESTAMP(p), TIMESTAMP(p) WITH TIME ZONE and TIMESTAMP(p) WITH LOCAL TIME ZONE table columns where precision p > 0, the default link column data type is Timestamp with the Extended attribute set to Microseconds. The processing of Timestamp link columns with the Extended attribute set to Microseconds takes considerably more time in the Oracle Connector than the processing of Timestamp link columns that have the Extended attribute left empty. This is especially the case for SELECT and DML (INSERT, UPDATE, DELETE) statements and, to a lesser extent, for the bulk load operation. Avoid Timestamp link columns with the Extended attribute set to Microseconds unless you absolutely need to support fractional seconds in the timestamp values.

The connector models TIMESTAMP(p), TIMESTAMP(p) WITH TIME ZONE and TIMESTAMP(p) WITH LOCAL TIME ZONE table columns with p > 6 as Timestamp link columns with the Extended attribute set to Microseconds. In this case the connector truncates fractional seconds to microsecond precision. To handle timestamp values with fractional second precision greater than 6, up to and including 9, use a VarChar link column. In that case, the connector exchanges timestamp values with the Oracle client as text values. Ensure that the NLS_TIMESTAMP_FORMAT session parameter is set to match the format of the actual values provided to the stage on its input link; the same format is used by the stage for the values it provides on its output link. You can set this session parameter using the Before SQL (node) statement property, by issuing an ALTER SESSION statement. For example, to handle timestamp values as text values with 9 fractional digits, a statement like the one shown in Listing 10 could be specified in the Before SQL (node) statement property.

Listing 10. Setting the timestamp format with 9 fractional seconds for the session
ALTER SESSION SET NLS_TIMESTAMP_FORMAT='YYYY-MM-DD HH24:MI:SS.FF9'

You can apply this same approach to TIMESTAMP(p) WITH TIME ZONE, INTERVAL YEAR(yp) TO MONTH and INTERVAL DAY(dp) TO SECOND(sp) table columns. For TIMESTAMP(p) WITH TIME ZONE table columns, set the NLS_TIMESTAMP_TZ_FORMAT session parameter to the appropriate format, which includes the time zone format if the time zone needs to be preserved in the values. The INTERVAL YEAR(yp) TO MONTH and INTERVAL DAY(dp) TO SECOND(sp) values should be specified as interval literals in the format defined by the Oracle database. Refer to the Resources section for the link to the Oracle Database online documentation, where you can find more details and examples for the INTERVAL Oracle data types and interval literals.
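For example, a statement like the following, specified in the Before SQL (node) statement property, sets a session format that preserves 9 fractional digits and the time zone offset (the exact format model you need depends on your data):

```sql
ALTER SESSION SET NLS_TIMESTAMP_TZ_FORMAT='YYYY-MM-DD HH24:MI:SS.FF9 TZH:TZM'
```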

If you are using Timestamp (with Microseconds) link columns because you need to preserve microseconds in timestamp values, you can again apply this same approach, as it can potentially lead to improved performance. For example, if you are fetching values from a TIMESTAMP table column and the downstream stage is capable of handling timestamp values in text format, replace the Timestamp (with Microseconds) column on the output link of the Oracle Connector stage with a VarChar column so that the connector provides timestamp values in text format. Likewise, if you are inserting values into a TIMESTAMP table column and the upstream stage can provide the timestamp values in text format, replace the Timestamp (with Microseconds) column on the input link of the Oracle Connector stage with a VarChar column so that the connector accepts timestamp values in text format. The Length attribute of the VarChar column and the NLS_TIMESTAMP_FORMAT session parameter must be appropriate for the format of the timestamp text values. For example, to use the default YYYY-MM-DD hh:mm:ss.ffffff format for the timestamp values, set the Length attribute of the VarChar column to 26 and specify the statement shown in Listing 11 in the Before SQL (node) statement connector property.

Listing 11. Setting the timestamp format with 6 fractional seconds for the session
ALTER SESSION SET NLS_TIMESTAMP_FORMAT='YYYY-MM-DD HH24:MI:SS.FF6'

Even in the cases when the remaining stages of the job are not capable of handling timestamp values with microseconds in text format and require that the Timestamp (with Microseconds) link column is used, you can choose to map this column to a VarChar column explicitly for the Oracle Connector stage by using a Transformer stage as shown below:

  • If the link column is located on the output link of the Oracle Connector stage and the link name is out_link, insert a Transformer stage between the Oracle Connector stage and the next stage in the job flow and map the VarChar(26) column on the input link of the Transformer stage to the Timestamp (with Microseconds) column on the output link of the Transformer stage using the derivation shown in Listing 12.
    Listing 12. Converting string to timestamp
    StringToTimestamp(out_link.C1,"%yyyy-%mm-%dd %hh:%nn:%ss.6")
  • If the column is located on the input link of the Oracle Connector stage and the link name is in_link, insert a Transformer stage between the previous stage in the job flow and the Oracle Connector stage and map the Timestamp(with Microseconds) column on the input link of the Transformer stage to the VarChar(26) column on the output link of the transformer stage using the derivation shown in Listing 13.
    Listing 13. Converting timestamp to string
    TimestampToString(in_link.C1,"%yyyy-%mm-%dd %hh:%nn:%ss.6")

Representing RAW table data type

In this section we will take a look at the RAW Oracle data type.

Use VarBinary link columns to read and write RAW table column values in their native raw binary form. For the Length attribute of the link columns specify the sizes of the respective RAW table columns.

Use VarChar link columns if you need to read or write binary values as sequences of hexadecimal digit pairs. In this case, make sure to set the Length attribute of the link columns to twice the size of the respective RAW table columns, because each byte in a RAW table column value is represented by a pair of hexadecimal digit (0-9 and A-F) characters. The use of VarChar link columns allows you to read and write RAW table column values in hexadecimal digit-pair form without the need to handle the conversions in another stage in the job.
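As a sketch of the sizing rule with a hypothetical RAW(16) column (the table and column names are assumptions; the hexadecimal form matches what Oracle's own RAWTOHEX and HEXTORAW functions produce and accept):

```sql
-- Hypothetical table with a RAW(16) column; the matching link column is
-- VarBinary(16) for native binary form, or VarChar(32) for the
-- hexadecimal digit-pair form (16 bytes -> 32 hexadecimal characters)
CREATE TABLE checksums (id NUMBER(10), digest RAW(16));
```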

Representing LONG, LOB and XMLType table column data types

In this section we will take a look at the LONG, LONG RAW, BLOB, CLOB, NCLOB and XMLType Oracle data types.

The preferred link column data types for these table column data types are shown in Table 2.

Table 2. Preferred link column data types for LONG, LOB and XMLType table column data types
Table column data type   | Preferred link column data type
-------------------------|--------------------------------
BLOB, LONG RAW           | LongVarBinary
CLOB, LONG, XMLType      | LongVarChar
NCLOB                    | LongNVarChar

You can use other combinations of link and table column data types as well but doing that may negatively affect the performance and will likely require you to use additional data type conversion functions in your SQL statements in order to avoid data type mismatch errors.

Every time you have LongVarBinary, LongVarChar or LongNVarChar columns defined on the link, the connector enforces an Array size value of 1 for that link, which has a negative effect on performance. This is the main reason why LongVarBinary, LongVarChar and LongNVarChar link columns should only be used when handling the table column data types covered in this section.

When the values represented by LongVarBinary, LongVarChar and LongNVarChar link columns are 4000 bytes long or less, you should use VarBinary, VarChar and NVarChar link columns instead, respectively, even when the corresponding table columns are of the data types discussed in this section. When you do that, the connector will not enforce the Array size value of 1, which typically results in significant performance improvements. Also, the guidelines presented earlier for character and binary table column data types will then apply instead of the guidelines presented in the remainder of this section.
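For example, assuming a hypothetical CLOB column whose values are known never to exceed 4000 bytes:

```sql
-- Hypothetical table: "body" values never exceed 4000 bytes
CREATE TABLE notes (id NUMBER(10), body CLOB);

-- Prefer a VarChar(4000) link column over LongVarChar for "body" so that
-- the connector does not force an Array size of 1 for the link
```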

When you configure the connector stage to fetch values from a table column represented by a LongVarBinary, LongVarChar or LongNVarChar link column, you have two choices for how the connector should pass the fetched values for that column to the output link. You can configure the stage to pass the values inline which means that each value is passed in its original form to the output link. Alternatively, you can configure the stage to pass the values by reference which means that a locator string representing the location of each value is passed instead of the actual value. The actual value represented by the locator string is then fetched by the last downstream stage in the job that stores this value to its intended destination. The locator string contains information that specifies how to retrieve the actual value from its source.

By default, the Oracle Connector stage passes values inline. To pass the values by reference, you need to set the Enable LOB references connector property to Yes, and then in the Columns for LOB references connector property specify the names of the link columns for which you wish the connector to pass the values by reference. You can use the following guidelines when deciding whether to pass values by reference or inline for LongVarBinary, LongVarChar and LongNVarChar link columns:

  • Do the values only need to be passed through the job or do they also need to be interpreted (parsed or modified) by certain stages in the job? In the latter case, do not pass the values by reference because it is not the values that are passed in that case but the locator strings that represent them.
  • Which stage in the job is the consumer of the values? Only stages of connector-based stage types such as WebSphere MQ Connector, DB2 Connector, ODBC Connector, Teradata Connector and Oracle Connector are capable of recognizing locator strings and resolving them to actual values. Stages of other stage types will treat locator strings as the actual field values. For example, you can use Oracle Connector stage to pass CLOB value by reference to a DB2 Connector stage to store the value to a CLOB column in the DB2 database. But if you pass that value by reference to a Sequential File stage, the locator string will be stored in the target file and not the actual CLOB value. You can use the latter scenario to check the size of the locator strings so that you can compare them to the average size of the actual values and estimate the savings in terms of the amount of data transferred through the job when the locator strings are passed instead of the actual values. Typically the locator strings will be less than 2000 characters in length.
  • How large are the actual values? The larger the values, the more likely that passing them by reference will result in performance gains. As a rule of thumb, values up to about 100 kilobytes should be passed inline, while larger values should be passed by reference. Understanding the specifics of your environment, in combination with actual testing, should help you with this decision. For example, if the Oracle Connector stage that reads the values from the database runs on a DataStage engine host which is different from the host on which the downstream stage consumes those values, then passing values by reference may provide performance benefits even for smaller actual values (as long as they are still larger than the locator strings) because the values will need to be transferred over the network as they are exchanged between the player processes of the respective stages.

Note that DataStage imposes a limit on the size of records transferred between the player processes in the job. The APT_DEFAULT_TRANSPORT_BLOCK_SIZE built-in DataStage environment variable can be used to control this limit; by default it is set to 131072 bytes (128 kilobytes). For example, if you pass link column values inline and the values consume hundreds of kilobytes, then you will need to override this environment variable and set it to a sufficiently large value, otherwise you will receive an error message at runtime indicating that the record is too big to fit in a block. Set the value of this environment variable to accommodate the largest record that will be passed through the job. Do not set it to a value larger than necessary, because the value applies to all the player connections on all the links in the job, so increasing it may lead to a significant increase in the memory usage of the job.

Oracle Connector stage handles values represented by LongVarBinary, LongVarChar and LongNVarChar link columns as follows:

  • For the read mode (fetching) it utilizes one of the following two Oracle mechanisms - piecewise fetch or LOB locators. The connector inspects the source table column data type and automatically selects the mechanism that is more suitable for that data type.
  • For the bulk load write mode it utilizes the Oracle direct path interface when loading large values. Again the connector does this automatically and there are no additional options to consider.
  • For the write modes other than bulk load, it utilizes one of the following two Oracle mechanisms - piecewise inserts/updates or OCI LOB locators. By default it uses piecewise inserts/updates. To force the connector to use OCI LOB locators for some of the link columns, specify those link column names in a comma-separated list and set that list as the value of the CC_ORA_LOB_LOCATOR_COLUMNS environment variable for the job. To use the OCI LOB locator mechanism for all LongVarBinary, LongVarChar and LongNVarChar link columns, specify the special value (all) for this environment variable (the parentheses are part of the value). Note that for link columns that represent LONG and LONG RAW target table columns, the connector must be configured to use the piecewise insert/update mechanism. Also note that for link columns that represent XMLType target table columns, the connector must be configured to use the OCI LOB locator mechanism if the values are larger than 4000 bytes. Do not confuse LOB locators with the locator strings used when passing values by reference: the former is an Oracle mechanism for accessing large values in the database, while the latter is a DataStage mechanism for propagating large values through the job.

The character set form and character set conversion considerations that affect the performance and that were presented earlier in the section which covered CHAR, VARCHAR2, NCHAR and NVARCHAR2 table column data types apply here as well. Simply assume the use of LongVarChar and LongNVarChar link columns instead of the VarChar and NVarChar link columns as you take those considerations into account.

As we saw in this section, to populate a target table with records that contain LongVarBinary, LongVarChar and LongNVarChar link columns, you have three different options: bulk load mode, insert mode with the piecewise insert mechanism, and insert mode with the OCI LOB locator mechanism. All three are available when the values arrive at the stage as inline values as well as when they arrive as locator strings (values passed by reference). Typically, bulk load mode will have the best performance, followed by insert mode with the piecewise insert mechanism and then insert mode with the OCI LOB locator mechanism. Depending on the specifics of your environment this may not always be the case, and one way to determine which option provides the best performance is to run a series of tests and compare the results. The exception is the case when the target table has columns of the XMLType data type, in which case bulk load mode should not be considered if the BINARY XML or OBJECT RELATIONAL storage type (as opposed to the CLOB storage type) is used for those XMLType columns, or if the table is defined as an XMLType object table.

When fetching, inserting or updating XMLType values, utilize the Oracle XMLType package functions GETCLOBVAL, GETBLOBVAL and CREATEXML. Refer to the Resources section for the link to the Information Server Information Center where you can find example statements for accessing XMLType columns with the connector.
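As a hedged sketch of this approach (the table and column names are assumptions, and the ORCHESTRATE bind placeholders follow the connector's usual parameter convention; consult the Information Center examples for the exact statements expected by your connector version):

```sql
-- Hypothetical table: docs (id NUMBER(10), payload XMLTYPE)

-- Fetching: return the XMLType value as a CLOB
SELECT d.id, d.payload.GETCLOBVAL() FROM docs d;

-- Inserting: construct the XMLType value from a character input column
INSERT INTO docs (id, payload)
VALUES (ORCHESTRATE.ID, XMLTYPE.CREATEXML(ORCHESTRATE.PAYLOAD));
```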


Conclusion

In this article we covered various aspects of performance tuning of the Oracle Connector stage in DataStage jobs. We discussed running Oracle Connector in bulk load mode taking into account data volume considerations, handling table triggers, indexes, and constraints. We also looked at reject link considerations for handling input records that have errors. Finally, we went over how the data types for link columns can affect the connector performance.

In combination with the information listed in the Resources section and the information you collected through your own research, this article should help you achieve optimal performance of Oracle Connector stages in your DataStage jobs.


Acknowledgments

Many thanks to the following contributors for their valuable input and review of this article:

  • Paul Stanley, InfoSphere Senior Architect
  • Tony Curcio, InfoSphere Product Manager
  • Fayaz Adam, InfoSphere Staff Software Engineer

Resources

Learn

Get products and technologies

  • Evaluate IBM software products in a way most convenient for you: by downloading trial product versions, by trying products online for few hours in a SOA sandbox environment or by accessing products in a cloud computing environment.

Discuss

  • Connect with other developerWorks users in developerWorks Community as you explore developer oriented blogs, forums, groups, wikis and more.
