In the Part 1 article in this series we provided a brief overview of the IBM InfoSphere DataStage product in the IBM Information Server suite of products and explained the role of the Oracle Connector stage in DataStage jobs. We explained the difference between conductor and player processes for the stage. We introduced the concepts of parallelism and partitioning and explained how their meaning differs between the DataStage environment and the Oracle database environment. We covered in detail the performance tuning and troubleshooting of the Oracle Connector stage configured to perform SQL statements on the database, including query statements used to fetch and lookup rows in the database as well as DML and PL/SQL statements used to insert, update and delete rows in the database.
In the Part 2 article of the series we will continue our coverage of the Oracle Connector performance tuning. First we will focus on performance aspects of the bulk load mode in the connector. In this mode the connector is used to load data to the database by utilizing the Oracle direct path interface. We will cover the operation of loading the records to the database as well as the operations that often need to be performed on the table indexes and constraints before and after loading the records. We will then present a few guidelines regarding the use of reject links with the connector. We will show how reject links can affect the connector performance in some cases and present some alternative approaches for your consideration. We will end the article by presenting you with a number of performance tuning strategies oriented towards the handling of different Oracle data types in the connector.
As was the case with the Part 1 article, the Part 2 article assumes the use of DataStage parallel jobs and Information Server Version 9.1, although some of the presented concepts would apply also to DataStage server jobs as well as the earlier Information Server versions.
When the Oracle Connector stage is configured to run in bulk load mode it utilizes the Oracle direct path interface to write data to the target database table. It receives records from the input link and passes them to Oracle database which formats them into blocks and appends the blocks to the target table as opposed to storing them in the available free space in the existing blocks. To configure the stage to run in bulk load mode the Write mode property of the stage needs to be set to value Bulk load. In this section, we will look at the connector properties and environment variables that play important role for tuning bulk load operation performance in the stage, as shown in Figure 1.
Figure 1. Connector properties that affect bulk load performance
Running the stage in Bulk load write mode typically results in better performance than when running it in Insert write mode. However, Bulk load write mode imposes several restrictions that are not present when the Insert write mode is used, especially when running the stage in parallel execution mode with Allow concurrent load sessions connector property set to value Yes. These restrictions are primarily related to the handling of triggers, indexes and constraints defined on the target table.
The information in the remainder of this section is intended to help you answer the following questions:
- Should I configure the stage to run in Bulk load or Insert write mode?
- If I choose to run the stage in Bulk load write mode, how should I configure the remaining settings in the stage in order to achieve optimal performance?
Data volume considerations
When deciding to configure the stage in Bulk load or Insert write mode, the critical aspect to consider is the data volume, in other words how many records on average is the stage writing to the target table in a single job invocation.
If the data volume is relatively small, the preferred option would be to use Insert write mode due to more flexibility that this write mode provides in respect to maintaining indexes, constraints and triggers of the target table. With the Insert write mode all the table indexes, constraints and triggers are enforced during the job execution and remain valid after the job completes. For example, if the stage that is running in Insert write mode tries to insert a record to the target table that violates the primary key constraint in the table, the stage can be configured to handle this scenario and to continue processing the remaining records in a number of ways that are not readily available when the stage is running in Bulk load write mode.
- It can have a reject link defined and configured to process records that have violated constraints. Each input record that violates table constraints will be routed to the reject link, with optionally included error code and message text explaining why the record was rejected. The existing row in the table for which the primary key was violated will remain in the table.
- It can be configured to run in Insert then update write mode in which case the existing row in the target table will be updated with the values from the record that violated the constraint.
- It can be configured to run in Insert new rows only write mode, in which case the record that violated the constraint will be skipped (ignored).
- It can be configured with the
INSERTstatement that contains
LOG ERRORS INTOclause which will result in Oracle redirecting records that violated the constraint along with the error information to the specified error logging table.
If the data volume is relatively large and the primary goal is to append the data to the target table as quickly as possible then the Bulk load mode should be considered.
The question is what should be considered a small data volume as opposed to large data volume. This decision will depend upon other things like how much performance you are willing to sacrifice when you choose the Insert write mode, which in turn depends on the size of the batch window that is available to push the data to the database and the available system resources. As a rule of thumb, thousands of records would likely be considered a small data volume, while millions of records would likely be considered a large data volume.
If the table does not have any indexes, constraints and triggers defined then Bulk load write mode would make a good candidate irrespective of the data volume involved because many of the restrictions imposed by this write mode would not apply any more. This includes the scenario where the table does in fact have indexes, constraints and triggers defined but you have procedures in place to disable or drop them prior to loading the data and then enable or rebuild them after the load is complete.
A simple test like the following can help you decide which write mode to choose:
- Design a test job with Oracle Connector stage that has an input link. In the test job provide records similar in volume and structure to the records that will be processed by the actual job in production.
- Ensure that the target database table used in the test job has the same definition as the table that will be used in production. If the production job performs incremental data loads to an existing table in the database, ensure that the table used for the test job is initially populated with a similar number of rows as the table that will be used in production.
- Run the job a few times in insert mode and capture the average time it took the job to complete.
- Manually disable any indexes, constraints and triggers on the target table and repeat the job a few times with the connector stage configured in bulk load write mode and capture the average time it took the job to complete.
- Compare the results from the previous two sets of job runs. If the numbers are similar that would be an indication that the Insert write mode is sufficient. If bulk load completes much faster, determine which additional operations need to be performed on the table to restore the constraints and indexes. Issue these operations from a tool like SQL*Plus to see how much time they take to complete. If that time combined with the time to bulk load the data is still considerably shorter than the time that it took to insert the data, then the bulk load mode is likely to be a better choice.
If the target table has triggers that need to be fired for every row written to the table then the bulk load mode should not be used because database triggers are not supported in this mode. To bulk load data to a table that has triggers, the triggers need to be disabled prior to the load. The connector provides an option to disable the triggers automatically before to load and enable them after the load. Once the triggers are enabled they will fire only for the rows inserted to the table after that point. They will not fire for the rows that were loaded while the triggers were disabled.
If the table has indexes, you can configure the stage to disable all the indexes
when performing the load. To do this set the Index maintenance connector
property to value Skip all. This is typically the best option to
choose as it allows you to run the stage without restrictions in parallel execution
mode and with any Partition type value selected on the
Partitioning page for the input link. However, all indexes are marked
UNUSABLE after the load completes and should be rebuilt. This is
where the data volume and the number of indexes play a critical role. If you are
incrementally loading a relatively small number of records compared to the number of
records already present in the table, it may be significantly faster to run the
stage in Insert write mode and have the database maintain indexes
throughout the insert operation instead of having to rebuild all the indexes after
bulk load. On the other hand, if the target table is empty and you need to load many
records to it then you will likely be benefited in using the Bulk load
write mode because the indexes will need to be built on all the rows, and in that
case, it may be faster to first load all the records without maintaining the indexes
and then build the indexes from scratch.
Instead of rebuilding the indexes in an external tool after the load, you can configure the stage to request from Oracle database to maintain the indexes during the load by setting the Index maintenance option property to Do not skip unusable or Skip unusable. The use of these two options is highly restrictive though and in most cases will not be the most optimal choice. Here are the things that you need to keep in mind if you consider using one of these two options:
- You will need to set Allow concurrent load sessions connector property to No, otherwise you will receive Oracle error "ORA-26002: Table string has index defined upon it" before any data is loaded. This requirement further implies that you will not be able to run the stage in parallel execution mode, except for one special case when loading to a partitioned table which is covered later in the article in the section that talks about loading records to a partitioned table.
- If the table has a
UNIQUEindex, the index will remain
VALIDafter the job completes but only if none of the loaded rows have violated the uniqueness imposed by the index. Otherwise the index will automatically be marked
UNUSABLEeven though the job completes successfully. Subsequent attempts to run the job will result in failure and the reported Oracle error depend on the Index maintenance option property value. For Do not skip unusable value the error will be "ORA-26028: index string.string initially in unusable state" and for Skip unusable value the error will be "ORA-26026: unique index string.string initially in unusable state".
- If the table has a
NON-UNIQUEindex, the index will remain
VALIDafter the load. If the index was marked
UNUSABLEbefore the load, the behavior will depend on the Index maintenance option property value. For Do not skip unusable value it will result in Oracle error "ORA-26002: Table string has index defined upon it" and for Skip unusable value the job will complete and the index will remain in
You can configure the stage to automatically rebuild indexes after the load, and you
can control the
PARALLEL clauses of the index
rebuild statements to optimize the index rebuild process. You can configure the
stage to fail the job if any index rebuild statement fails or to only issue warning
messages. To improve the performance the connector stage will rebuild the indexes
only when necessary. It will inspect the state of each index before rebuilding it.
If the index is marked
VALID the connector will skip rebuilding it.
This applies also to locally and globally partitioned indexes. In that case, the
connector will inspect the index partitions and subpartitions (if the index is
locally partitioned and the table is composite-partitioned) and rebuild only those
index partitions and subpartitions that are marked
The presence of integrity and referential constraints on the table has implications on the bulk load operation similar to those imposed by the indexes.
NOT NULL and
CHECK constraints do not need to be disabled
prior to the load. If any input records violate the constraint the job will fail and
no records will be loaded to the table. The exception to this is the case when the
stage is configured to run in parallel execution mode and some player processes
finish loading their record partitions before the player process that encountered
the record in error. In that case, the records loaded by the player processes that
have already finished will be committed to the table. From the performance point of
view, a good approach for dealing with these two types of constraints is to leave
them enabled on the table during the load but ensure that no records violate them by
performing the necessary checking and error handling upstream of the connector
stage. We provide some concrete suggestions later in the section that addresses reject links.
constraints also do not need to be disabled prior to the load but are effectively
ignored by the load operation. Any records that violate these constraints will still
be loaded and the constraints will remain
ENABLED. To ensure that these constraints are valid for the rows
loaded by the connector stage you will need to disable them prior to the load and
enable them after the load. Keep in mind that these types of constraints are usually
accompanied by indexes. The database enforces
KEY constraints through unique indexes and
constraints often have user-defined indexes created for them. In the earlier section
about handling indexes we explained the effects that indexes
have on the load operation. The same effects apply in this case and need to be taken
The connector provides an option to disable constraints before the load. You can specify this by setting the Disable constraints property to Yes. When this is done, the load operation typically completes faster than the insert operation. However, you have to enable and validate the constraints following the load and deal with the loaded records that have violated the constraints. That part can have a major impact on the overall performance. If you have a mechanism in place outside of the DataStage job to enable and validate the constraints after the load, then having the connector disable them before the load will provide for the best performance of the job itself since there will be no restrictions during the load due to constraints and no time will be spent at the end of the job to enable the constraints.
You can configure the stage to enable and validate the constraints after the load by
setting Enable constraints property to Yes. But, if any
constraint fails the validation at that time, the job will fail and the constraint
will remain in
NOT VALIDATED state. You can
specify the exceptions table in the Exceptions table name property to which
the ROWIDs of the rows that violated the constraints will be stored, along with the
name of the constraint that they have violated. Doing this will still result in the
job failure if any of the constraints cannot be enabled, but you will be able to
determine which rows violated which constraint. You can further configure the stage
to delete the rows from the table that violated the constraints and to try to enable
the constraints again but that will discard all the rows in the table found to be in
violation of the constraints. To store them somewhere instead of discarding them you
can define a reject link for the stage and enable SQL error – constraint
violation condition on it. This will have a major impact on the performance
because it enforces sequential execution of the stage in a single player process.
Rather than accepting this restriction you should consider running the stage in
parallel execution mode and specifying Insert value for the Write
mode as that will potentially result in better overall performance. In
general, it is best to avoid the use of reject links if possible. The section about
reject links provides some concrete ideas in this
If you configure the stage to disable table constraints prior to the load and enable
them after the load and reject the records that have violated constraints, all of
the records in violation of
constraints defined on the table will be rejected. The connector doesn't make
distinction between the rows that existed in the table prior to load and those that
were loaded by the connector. For example, if a row loaded by the connector violated
the primary key constraint because another row with the same primary key value
existed in the table prior to the load, both rows will be deleted from the table and
sent to the reject link. To have the stage delete and reject only those rows in
violation of constraint that were not in the table prior to the load, you will need
to implement a custom solution such as the one shown in the next section.
Custom handling of indexes and constraints
When the connector stage is configured to handle triggers, indexes or constraints
automatically before and after the load, the connector will accomplish this by
issuing various Data Definition Language (DDL) statements on the database. For
enabling and disabling triggers, it will use
ALTER TRIGGER statements,
for rebuilding indexes it will use
ALTER INDEX statements and for
enabling and disabling constraints it will use
ALTER TABLE statements.
You can locate these statements in the job log when you run the job with
CC_MSG_LEVEL environment variable set to value
connector provides properties that you can use to include or remove particular
clauses in these statements. For example, it provides properties that allow you to
control the presence of
LOGGING clauses in
ALTER INDEX … REBUILD statement.
In some cases, you may wish to issue your own statements for handling triggers, indexes and constraints instead of having the connector automatically generate and run them. For example, you may wish to reference table triggers, indexes and constraints directly by their names instead of having the connector query the database dictionary views to dynamically discover those objects. You may wish to manually include certain clauses in these statements that you cannot configure through the connector properties. You can achieve this outside of the stage by running external tools and scripts before and after running the job. You can do it through the stage by utilizing the Before SQL statement and After SQL statement connector properties. In each of these two properties you can specify either a list of semicolon-separated SQL statements or an anonymous PL/SQL block to be executed. The connector stage executes these statements in the conductor process. It will run the Before SQL statement statements before any player process has been created for the stage to process the records. It will run the After SQL statement statements after all the player processes have finished loading the records. The following example illustrates how this can be accomplished.
Suppose that you want to load records to the table
stores information about certain type of items. The table has the
column that represents item identifiers and serves as the primary key for the table,
which further means that the table has a unique index defined on this same column.
The table also has the
NAME column that stores item names and for which
a non-unique index is defined on the table. Finally, the table has the
LOAD_DATE column which represents the date and time when the item
row was loaded. The statement shown in Listing 1 can be
used to create this table.
Listing 1. Create table statement example
CREATE TABLE TABLE_ITEMS(ID NUMBER(10), NAME VARCHAR2(20), LOAD_DATE DATE); ALTER TABLE TABLE_ITEMS ADD CONSTRAINT PK_TABLE_ITEMS_ID PRIMARY KEY (ID); CREATE INDEX IDX_TABLE_ITEMS_NAME ON TABLE_ITEMS(NAME);
The load date is set when the job runs and is passed to the job as a job parameter
YYYY-MM-SS HH24:MI:SS. Various mechanisms can be used to set
and pass this value to the job. For example, if you are invoking the job with
dsjob command from command line, the system date command can be
used to produce this value. For example, on Linux you can use the command shown in
Listing 2 to run the job and pass the
LOAD_DATE job parameter value based on the current system date and
Listing 2. dsjob command invocation
dsjob -run -jobstatus -userstatus -param LOAD_DATE="`date +'%F %T'`" project_namejob_name
Another option for passing the current date and time would be to create a sequence
job that contains a single activity stage which in turn invokes the load job and
LOAD_DATE job parameter to value
DSJobStartTimestamp, which is a built-in DataStage macro that
returns the job start timestamp.
Suppose the table contains many rows already and that you are running your job to
load additional rows. Further, most of the records that you are loading are for new
items, but some records represent existing items, meaning they have the
ID field value for which there is already a row in the table with
ID value. You want to load the data as fast as possible, and
you also want the constraints and indexes to be enabled and valid for all the table
rows after the load. If any loaded rows violate the primary key constraint, you want
to store them to an alternate table
TABLE_ITEMS_DUPLICATES which has
the same columns as the
TABLE_ITEMS table but does not have any
constraints and indexes defined on it, as shown in Listing 3.
Listing 3. Create table statement for duplicates
CREATE TABLE TABLE_ITEMS_DUPLICATES(ID NUMBER(10), NAME VARCHAR2(20), LOAD_DATE DATE);
You could achieve this task with the stage configured to disable constraints before
the load, skip all indexes during the load and rebuild indexes and enable
constraints after the load. You could define reject link on the stage, select the
SQL error – constraint violations condition on the reject link and
direct this link to another Oracle Connector stage configured to insert rows to the
TABLE_ITEMS_DUPLICATES table. But this approach would have the
- Because you configured the stage to disable the constraints and reject rows that violated constraints after enabling them, the connector enforces sequential execution in a single player process. The rejecting of records on the reject link can only take place in a player process and in this case the rejecting is done when the connector enables the constraints after the load. For the player process to enable the constraints after it has finished loading the records, it needs to be certain that it is the only player process for the stage. But in practice, you may wish to run the stage in parallel execution mode so that you can utilize the available system resources on the engine tier.
- The connector rejects all the rows that violated the primary key constraint
including the rows that existed in the table prior to the load but that happen
to have the same
IDvalue as some of the rows that were loaded. This is because at the time the constraints are enabled all the rows are already in the table storage and no distinction is made at that time between the rows that were in the table prior to the load and those that were just loaded. In practice, you may wish to reject only the rows that were just loaded by the job and that violated the constraint but not any of the rows that existed in the table prior to running the job.
The following is one possible approach to avoid these limitations. Specify the statement shown in Listing 4 in the Before SQL statement property.
Listing 4. Statement to disable PK constraint
ALTER TABLE TABLE_ITEMS DISABLE CONSTRAINT PK_TABLE_ITEMS_ID;
This statement will explicitly disable the constraint
the table prior to loading records to the table.
Set the Index maintenance option connector property to Skip all. Set the Allow concurrent load sessions connector property to value Yes. Do not define reject link for the stage. Configure the stage to run in parallel execution mode.
Define the exceptions table
TABLE_ITEMS_EXCEPTION. This table will be
needed for the
EXCEPTIONS INTO option of the
CONSTRAINT clause in the
ALTER TABLE statement that we will
want to invoke (explained further down). The format that the exceptions table needs
to follow can be found in the SQL script
UTLEXCPT.SQL included with the
Oracle database product installation. Based on the information in this script, to
TABLE_ITEMS_EXCEPTION exceptions table for this example
execute the statement shown in Listing 5.
Listing 5. Statement to create the exceptions table
CREATE TABLE TABLE_ITEMS_EXCEPTIONS(ROW_ID ROWID, OWNER VARCHAR2(30), TABLE_NAME VARCHAR2(30), CONSTRAINT VARCHAR2(30));
Specify the PL/SQL anonymous block shown in Listing 6 in the After SQL statement property. The connector will submit it to the database once all the records have been loaded. See the comments embedded in the PL/SQL code for details about individual operations performed by this PL/SQL block.
Listing 6. Custom PL/SQL code for handling indexes and constraints
DECLARE -- Define the exception for handling constraint validation error ORA-02437. cannot_validate_constraint EXCEPTION; PRAGMA EXCEPTION_INIT(cannot_validate_constraint, -2437); BEGIN -- Truncate the tables TABLE_ITEMS_DUPLICATES and TABLE_ITEMS_EXCEPTIONS in case -- they contain any rows from the previous job runs. EXECUTE IMMEDIATE 'TRUNCATE TABLE TABLE_ITEMS_DUPLICATES'; EXECUTE IMMEDIATE 'TRUNCATE TABLE TABLE_ITEMS_EXCEPTIONS'; -- Try to enable the PK_TABLE_ITEMS_ID constraint and to build the underlying -- unique index. If any rows are in violation of the constraint then store -- their ROWIDs to the TABLE_ITEMS_EXCEPTIONS table. EXECUTE IMMEDIATE 'ALTER TABLE TABLE_ITEMS ENABLE CONSTRAINT PK_TABLE_ITEMS_ID EXCEPTIONS INTO TABLE_ITEMS_EXCEPTIONS'; EXCEPTION WHEN cannot_validate_constraint THEN -- The constraint could not be enabled. The constraint violations need to be -- handled before trying again to enable the constraint. BEGIN -- Set the default date format for the session to match the format of -- LOAD_DATE job parameter value. EXECUTE IMMEDIATE 'ALTER SESSION SET NLS_DATE_FORMAT=''YYYY-MM-DD HH24:MI:SS'''; -- Copy rows from TABLE_ITEMS table to TABLE_ITEMS_DUPLICATES table that -- violated the constraint and were loaded by this job. The rows that were -- loaded by this job will have the LOAD_DATE column value that matches the -- LOAD_DATE job parameter value. INSERT INTO TABLE_ITEMS_DUPLICATES SELECT ID, NAME, LOAD_DATE FROM TABLE_ITEMS WHERE ROWID IN (SELECT ROW_ID FROM TABLE_ITEMS_EXCEPTIONS) AND LOAD_DATE='#LOAD_DATE#'; -- Delete the rows from TABLE_ITEMS that were copied to TABLE_ITEMS_DUPLICATES. DELETE FROM TABLE_ITEMS WHERE ROWID IN (SELECT ROW_ID FROM TABLE_ITEMS_EXCEPTIONS) AND LOAD_DATE='#LOAD_DATE#'; -- Try to enable the constraint again. This time the operation should be -- successful. EXECUTE IMMEDIATE 'ALTER TABLE TABLE_ITEMS ENABLE CONSTRAINT PK_TABLE_ITEMS_ID'; -- Rebuild the non-unique index IDX_TBL_ITEMS_COL_NAME that was disabled -- during the load. EXECUTE IMMEDIATE 'ALTER INDEX IDX_TABLE_ITEMS_NAME REBUILD PARALLEL NOLOGGING'; END; END;
Utilizing Oracle date cache
You can configure the stage to take advantage of the Oracle date cache feature by
setting the Use Oracle date cache property to value Yes.
This feature is only available in bulk load mode and can significantly improve the
load performance when the input link has one or more
Timestamp columns, those columns are used to load values to
TIMESTAMP table columns and there are many repeating values for
The connector prepares values for
TIMESTAMP table columns in a format
that requires them to be converted prior to being stored in the table. When the date
cache is utilized the result of the conversion is cached so when the same value
appears again on the input the cached result is used instead of performing the
You can specify the number of entries in the date cache through the Cache
size property. To avoid cache misses when looking up the values in the
cache, set this property a value no smaller than the total number of distinct values
expected to be encountered on the input per each
Timestamp link column.
You can also choose to disable the date cache if it becomes full during load by setting Disable cache when full property to Yes. One scenario where this can be useful is when you run the job repeatedly to load records to the target table. Most of the times the total number of distinct input values is smaller than the selected cache size but occasionally most of the input values are unique. In that case, disabling the cache when it becomes full will avoid performing lookups on the full date cache, which will improve the performance because each of the those lookups would result in a cache miss.
Note that each player process for the stage will use its own date cache when the stage is configured to run in parallel execution mode.
In Information Server Version 9.1 when
input link columns are used to load values to
DATE table columns
TIMESTAMP table columns, the values provided by the
connector are in the format that does not require conversion prior to storing them
in the target table. Therefore, the use of date cache feature will not provide much
benefit in this case. In Information Server releases prior to Version 9.1, the
conversion will be needed in this case so the use of date cache will provide
benefits similar to when loading values to
TIMESTAMP target table
Disabling redo logging
Database redo log tracks changes made to the database and is utilized in the database recovery process. When the connector is loading data to the target table, the redo log is updated to reflect the rows loaded to the table. In case of the database storage media failure, after the database is restored from the most recent backup, the rows loaded by the connector since the backup was taken are restored from the redo log. The ability to restore loaded rows from the redo log leads to potential decrease in performance because the load operation takes longer to complete when the redo log needs to be maintained and capture the data loaded to the table.
To speed up the load operation performed by the connector you can disable the redo
logging by setting the
NOLOGGING attribute on the target table. You can
set this flag when you create the table, or if you wish to set it only during the
load you can do that by using the Before SQL statement and After SQL
statement connector properties. In the Before SQL statement
property specify the statement
ALTER TABLE table_name
NOLOGGING and in the After SQL statement> property specify
ALTER TABLE table_name LOGGING, where
table_name is the name of the table to which the stage
is loading data. Another option is to use the Disable logging connector
property – when this property is set to No then the logging flag
NOLOGGING) is used, and when it is set to
Yes, the redo logging is disabled.
Disabling redo logging for the bulk load can significantly improve the load performance. In case of the media failure the data that was loaded while redo logging was disabled can be restored from the database backup if the backup was taken after that load but it cannot be restored from the redo log. If restoring loaded data from the redo log in case of a media failure is not required, or if the data can be easily re-loaded in that case, then disabling redo logging should be considered for speeding up the load operation. Refer to the Resources section for the link to Oracle Database online documentation where you can find more information about redo log and considerations related to disabling the redo logging.
Loading records to a staging table
If you wish to use the connector in bulk load mode to quickly transfer the data to
the target database but at the same time you need the target table triggers, indexes
and constraints to be maintained as if a conventional insert statement was used, one
option for you to consider is to initially load the records to an intermediate
staging table that has the same definition as the target table, but does not contain
any triggers, indexes and constraints. After the load to the staging table
completes, issue a follow-up
that reads rows from the staging table and writes them to the target table. This
statement can be specified in the After SQL statement connector property so
that it is automatically invoked at the end of the job. The use of the staging table
results will cause data to be written twice (once when loaded by the connector to
the staging table and once when copied to the target table by the follow-up
statement) but will, at the same time, have the following beneficial
- The records will be processed only once by the job. This is important when the processing of records in the job is resource intensive.
- The records will be moved from the DataStage engine hosts to the database host only once.
INSERTstatement will be executing on the database side so during that time the DataStage engine hosts will be available to process other workloads.
- You can use database SQL functions in the
INSERTstatement to perform additional transformation on the values that were loaded to the staging table prior to storing them to the target table.
- If the
INSERTstatement that copies the values fails for some reason, the loaded records will remain available in the staging table and you can issue another
INSERTstatement after handling the rows in the staging table that have caused the error. You can utilize the
LOG ERRORS INTOclause in the
INSERTstatement where the rows that cause the
INSERTstatement to fail are directed to the error logging table along with the details indicating the reason for their error.
- The indexes, constraints and triggers on the target table will remain enabled, valid and enforced at all times.
You can configure Oracle Connector stage to run in manual load mode in which case the stage will write records to a file on the engine tier host that can then be passed to the Oracle SQL*Loader utility to load the records to the target table. To configure the stage in this mode set the Manual load property to Yes. Loading records in manual load mode is generally slower than loading them directly to the target table but in some cases can be very useful, as shown next.
One example would be if you want to run the job to extract and process the data at a time when the target database is not available for load. You can configure the stage to store the data to a file and then later, when the database is available, you can load the data to it using the SQL*Loader utility. The data will be already processed at that time and ready to be stored to the target table. In cases, when the processing phase of the job takes a long time to complete, such as for example when pulling data from an external source that is slow or when complex and CPU intensive transformations are involved in the job, the manual load approach can result in significant time savings.
Note that the connector actually generates two files in this mode – the data file with the records to be loaded and the control file used to pass options to the SQL*Loader and inform it about the location and format of the data file. You can also transfer the generated files to a host that doesn't have DataStage installed and load the records from there, presuming that you have SQL*Loader tool available on that host. If that host is the Oracle server host the load operation will be local to the database.
Restarting the load and backing up the data that needs to be loaded is also convenient with this approach because the generated data file is already in the format supported by the SQL*Loader tool.
Note that when the stage is configured to run in manual load mode it will try to connect to the database even if it will not be loading records to the database on that connection. If the target database is not available to accept connections at the time when you run the job, you can work around the problem by pointing the stage to another database that is available at that time, so that the stage can establish a connection and complete the manual load operation.
If it is critical to ensure the stage does not connect to any database in manual load mode, you can accomplish that by ensuring that the following conditions are met:
- The connector is not collecting operational metadata. To ensure this define the
DS_NO_PROCESS_METADATAenvironment variable for the job and set it to value
- The connector does not issue statements on the database to be interpreted as
events by the IBM InfoSphere Guardium Activity Monitor. To ensure this, ensure
that you don't have
CC_GUARDIUM_EVENTSenvironment variable set for the job.
- The Run before and after SQL statements property must be set to No. This ensures that the connector is not trying to perform any SQL statements on the database prior to and after loading the records to a file.
- The Table action property must be set to Append. This ensures the connector is not trying to create, drop or truncate a database table prior to performing the load operation.
- The Perform operations before bulk load and Perform operations after bulk load properties must not specify any index and constraint maintenance operations.
When all of these conditions are met then the connector will load records to the data file without connecting to the database. You will still need to provide Username and Password property values though in order to be able to compile the job. You can set them to some dummy values or you can set the Use external authentication property to value Yes which will disable the Username and Password properties so you will not need to provide values for them.
Note that even if you configure the stage not to connect to the database, you will still need to have Oracle Client product installed on the DataStage engine host on which the connector runs, otherwise the connector will fail the initialization when the job starts and the job will fail.
Array size and Buffer size
When the Oracle Connector stage is configured to run in bulk load mode, it reads records from the input link and stores them in the Oracle specific direct path column array. The connector does not create this array in memory like the array that it uses for fetch and DML operations. Instead, Oracle client creates this array on behalf of the connector. The size of this array in number of records is specified through the Array size property. The default value is 2000.
When the column array becomes full the connector requests from Oracle client to convert the array to the internal stream buffer format and to send this buffer to the Oracle server. The size of this stream buffer (in kilobytes) is specified through the Buffer size connector property. The default value for this property is 1024 (1 megabyte).
Depending on the length of the records, it is possible that the specified Buffer size value will not be sufficient to store the Array size number of records in stream format. In that case, the Array size will be automatically reduced to a smaller value and Oracle Connector will log a message to indicate the newly enforced number of records that it will load at a time. Note that regardless of the specified Buffer size value the maximum Array size value will be 4096.
When configuring the Array size and Buffer size values in order to achieve optimal performance, you can apply the following approach:
- Set the Array size to 4096 (maximum supported) and Buffer size to 1024.
- Run the job and check if the connector reported a message in the job log that it will load less than 4096 records at a time. If it did then go to step 3. If it did not then go to step 4.
- Continue trying with increasingly larger Buffer size values until the performance worsens or the connector stops reporting the message, whichever takes place first.
- Continue trying with decreasingly smaller Buffer size values until the performance worsens.
Loading records with the stage in parallel execution mode
The connector can be configured to run in parallel execution mode and load data from multiple player processes with each player process loading a subset (partition) of input link records to the target table. In this case, parallel load must be enabled for the target table segment. To do this set the Allow concurrent load sessions connector property to value Yes. If this property is set to No and you attempt to load records from multiple player processes, you will receive Oracle error "ORA-00604: error occurred at recursive SQL level 1" possibly accompanied by the Oracle error "ORA-00054: resource busy and acquire with NOWAIT specified or timeout expired".
The exception to the above rule is when the stage is loading rows to a Range, List or Interval partitioned table (which can optionally be subpartitioned) and the Partition type setting for the link is set to Oracle Connector. This exception scenario is covered in the next section which discusses loading records to a partitioned table.
Setting Allow concurrent load sessions property to Yes has implications on how indexes are handled during load. For more information refer to the earlier section which discussed the handling of indexes.
Loading records to a partitioned table
When you wish to load records to a single partition or subpartition of the target table, set the Table scope connector property to Single partition or Single subpartition value, and specify the name of the partition or subpartition in the Partition name or Subpartition name property. Doing this will result in locking only the segment of the specified partition or subpartition and will also return error if any input record needs to be stored to another partition, that way providing verification mechanism that the records belong to the partition or subpartition specified in the stage.
When you wish to load records to multiple partitions of a table, set the Table scope connector property to value Entire table. In respect to triggers, indexes and constraints defined on the table, the concepts that apply to non-partitioned tables apply in this case as well. When rebuilding indexes on partitioned tables, the connector will automatically rebuild only those index partitions or subpartitions that have been marked unusable during the load process. This applies to both locally and globally partitioned indexes.
When you wish to load records in parallel and make sure that each player process is loading records to a dedicated table partition, you can do that for a table partitioned based on Range, List or Interval partitioning strategy (and is optionally subpartitioned) by specifying the Oracle connector value for the Partition type setting for the input link. In this case, each player process of the stage will be associated with a single dedicated table partition and will load data to that partition's segment (or its set of subpartition segments if the table is composite-partitioned) and because no two player processes will be accessing the same segment, the Allow concurrent load sessions connector property can be set to No. The exception is when the table has a globally partitioned index. In that case, the job will fail with the Oracle error "ORA-26017: global indexes not allowed for direct path load of table partition string".
When considering using Oracle connector value for Partition type when loading records to a partitioned table, consult the list provided in Part 1 article in the section that covered DML operations on a partitioned table.
Reject link considerations
Reject links are a convenient mechanism for handling input records for which the database reported an error when it tried to perform the operation specified in the stage. You can enable reject conditions on the reject link under which the records in error should be rejected, you can choose to include error code and error text information with the rejected records and you can point the reject link to another stage to store the rejected records to a location of your choice, such as a file or another database table.
However, reject links can have negative effect on the performance. When the stage has a reject link defined, it needs to inspect the error code for each record reported by the database to be in error, compare it to the conditions defined on the reject link and submit the record to the reject link, potentially accompanied by the error code and error text information. These operations consume system resources and affect the job performance as a whole.
You saw earlier that in order to use reject links to handle records with constraint
violations in bulk load write mode, you must run the stage in sequential execution
mode. Doing this will result in rejecting all the rows from the table that violated
the constraints, including those rows that were present in the table before the
load. Also, when you run
INSERT statements with the
APPEND_VALUES optimization hint you cannot rely on the reject link
To avoid these limitations associated with the use of reject links, avoid using the reject links altogether and try to eliminate the records in error BEFORE they reach the Oracle Connector stage or to eliminate them AFTER the stage has completed the operation for which it was configured.
As an example of handling records in error BEFORE they reach the Oracle Connector
stage, let us consider the
NOT NULL and
in a table created using the statement shown in Listing 7.
Listing 7. Statement to create test table
CREATE TABLE TABLE_TEST(C1 NUMBER(10) CONSTRAINT CONSTR_NOT_NULL_C1 NOT NULL, C2 VARCHAR2(20), C3 DATE, CONSTRAINT CONSTR_CHK_C1 CHECK (C1 > 20000 AND C1 < 70000));
The constraints on this table require that the values for field
NULL and that they are between 20000 and 70000 (exclusive). Instead
of configuring the Oracle Connector to use a reject link for handling these
constraints, the constraint checking can be done in the stages upstream of the
Oracle Connector stage. For example, you can use the Filter stage with one output
link leading to the Oracle Connector stage and another reject link leading to a
Sequential File stage (or any other stage to which you wish to direct the bad
records). In the Where Clause property of the Filter stage you would
specify a condition as the one shown in Listing 8.
Listing 8. Filter stage condition
C1 IS NOT NULL AND C1 > 20000 AND C1 < 70000
You would also set the Output Rejects property of the Filter stage to value True. The Filter stage would pass to its output link the records that meet this constraint, and would send the remaining records to the reject link.
This configuration is shown in Figure 2.
Figure 2. Handling constraint violations with a Filter stage
For more complex constraint expressions or to annotate rejected records with the
information describing the reason why they were rejected, you could use the
Transformer stage. For the previous example, the Transformer stage would have one
input link (for example, input_records) with columns
C3 and would have three output links. All three
output links would copy the columns from the input link. The first output link would
have the constraint
IsNull(input_records.C1) defined and an extra
column to contain the constant literal value
"C1 value is null.". The
second output link would have the constraint
input_records.C1 <= 20000 Or
input_records.C1 >= 70000 defined and an extra column to contain
"C1 value: " : input_records.C1 : " is not between 20000 and
70000.". The first and the second link would lead to the Funnel stage
that would combine the records on them and store them to a reject file using the
Sequential File stage. The third output link would lead to the Oracle Connector
stage that would write them to the table. This configuration is shown in Figure 3.
Figure 3. Handling constraint violations with a Transformer stage
The following are some examples of handling records with errors AFTER the stage has completed moving the records to the database:
- When bulk loading to the target table with constraints, load the records to a
staging table instead that doesn't have the constraints defined on it and then
issue a follow up
INSERTstatement to move the records to the target table. Specify
LOG ERRORS INTOclause in the insert statement so that the records in error are directed by to an alternate table. For more information, refer to the earlier section that covered loading of records to a staging table.
- When bulk loading to the target table with constraints, configure the stage to disable the constraints before the load and enable them after the load, specify the exceptions table but do not configure the stage to automatically process the exceptions. Instead, process them manually after the job completes. Alternatively, perform custom handling of constraints and indexes using a strategy such as the one suggested earlier in the article.
In this section, we will look at how selection of data types for the link columns can affect the connector performance. The default link column data types are provided by the connector when it imports the table definition. They are also provided by the connector at runtime when the stage has an output link without columns defined on it and the runtime column propagation feature is enabled for the link.
The default link column data types represent the closest match for the corresponding table column data types in terms of the ranges of values that they support. In some cases, you may have an additional insight regarding the format and range of the actual data values processed by the stage which may in turn allow you to select data types for the link columns that result in better performance in the connector. In this section, a number of such scenarios are examined.
You should always start with the link columns of default data types. To determine
the default data types, use the connector to import the table that you will be
accessing from the stage and look at the column definitions in the imported table
definition. If you configured the stage to issue a complex
statement that references multiple tables then you will not be able to import a
single table definition for that statement. What you can do in that case is create a
temporary view based on that statement and then import the view instead. When your
stage is configured to execute a PL/SQL anonymous block, you will not be able to
import a table or a view to determine the default link column data types. In this
case, you can analyze the PL/SQL block and for each bind parameter in the PL/SQL
block (specified in
format) determine the object in the database to which it corresponds. In some cases,
this object will be a column in a table or a view, in some cases it will be a
parameter of a stored procedure, but in all cases that object will have an
associated Oracle data type. Create a temporary table with the columns of those same
data types and import that table.
Refer to the Resources section for the link to the Information Server Information Center where you can find the table that shows how Oracle Connector maps Oracle data types map to DataStage data types by default.
Once you have placed columns of default data types on the link, run the job and measure the performance. Then apply the information from the remainder of this section to further optimize the job by changing link column data types in such a way so that the performance improves while no loss of data occurs.
Representing CHAR, VARCHAR2, NCHAR and NVARCHAR2 table column data types
In this section, we will look at the
NVARCHAR2 Oracle data types.
When you use Oracle Connector to import tables with columns of these four data
types, the connector will model
columns as DataStage
VarChar link columns, and it
NVARCHAR2 table columns as DataStage
NVarChar link columns. The exception to this
rule is the case when
VARCHAR2 table column
definitions are based on the character length semantics and the database
character set is a multi-byte character set such as
AL32UTF8. In that
VARCHAR2 columns are modeled as
NVarChar link columns.
If your database character set is a multi-byte character set such as
AL32UTF8 and your table columns use character length semantics,
such as a column defined to be of
VARCHAR2(10 CHAR) type, and all the
actual data values use only single-byte characters, then use the
VarChar columns on the link (with Extended attribute left blank)
instead of the
Minimize the character set conversions in the stage by ensuring that the effective
NLS_LANG environment variable value for the job is compatible with
the NLS map name specified for the job. This is important not just for the
performance reasons, but for the correct functioning of the stage in regard to
interpretation of character set encoding of the values. To determine the effective
NLS_LANG value for the job, run the job with the
CC_MSG_LEVEL environment variable set to
2 and inspect
the job log. The connector will log the effective
variable value. If you wish to use a different value, you will need to define
NLS_LANG environment variable for the job and set it to the desired
value. Oracle Connector stage editor does not have the NLS tab so the NLS
map name selected for the job is used for the stage. You can specify this value for
the job by opening the job in the DataStage Designer, selecting the menu option
Edit and then clicking Job properties and the
NLS page in the dialog and providing the value in
Default map for stages setting. Table 1
shows examples of compatible NLS map name and
NLS_LANG values. Note
that you should set the language and category portions of the
value to match the locale of your DataStage engine host.
Table 1. Compatible NLS map name and NLS_LANG values
|NLS map name||Compatible NLS_LANG|
If the values processed by the stage are based on only ASCII characters, set the NLS
map name to
ASCL_ASCII and set
AMERICAN_AMERICA.US7ASCII. Otherwise if possible set the
NLS_LANG value to match the database character set. For example, if
the stage is writing records to a database with the
character set, use the
NLS_LANG value of
AMERICAN_AMERICA.AL32UTF8 and make sure the NLS map name for the
job is set to
UTF-8. To determine the database character set you can
connect to the database from SQL*Plus and issue the statement shown in Listing 9.
Listing 9. Determining the database character set
SELECT VALUE FROM NLS_DATABASE_PARAMETERS WHERE PARAMETER='NLS_CHARACTERSET';
Always make sure that the values in the job represented by the
VarChar link columns are actually encoded as indicated in the NLS
map name and
NLS_LANG settings and that the values for
NVarChar link columns are UTF-16 encoded.
Note that the connector handles
columns which have the Extended attribute set to Unicode the same way it handles
NVarChar link columns.
Make sure that all
NVarChar link columns have Length attribute set to a concrete
value. Leaving the Length attribute empty can have severe negative impact on the
performance of the stage. When the Length is not specified, the connector makes
assumption that the length is the maximum one allowed for the respective bind
parameters, which is 4000 bytes. The connector will therefore allocate the memory
with the assumed column lengths, which may result in large memory consumption.
Suppose that the connector is configured to insert rows to a database table with the
default Array size of 5000, and the link contains 50
columns that do not have the Length attribute set. The connector will allocate (in
each player process) approximately the total of 50 (columns) x 5000 (array size) x
4000 (maximum value length) bytes which is close to 1 gigabyte. If the actual table
columns in the database are defined as
VARCHAR2(50 BYTE) then by
setting the Length attribute of the link columns to 50, the Array size
could be left set to 5000 and the size of the allocated memory per player process
would drop to 50 (columns) x 5000 (array size) x 50 (maximum value length) bytes
which is about 12 megabytes.
Do not use
LongNVarChar link columns to represent
columns. These link columns are appropriate for handling table columns of
XMLType columns as discussed later in the text. Even in those cases, if the actual
values for those columns are less than 4000 bytes in size, the
NVarChar link columns should be used instead
LongNVarChar link columns. When
LongNVarChar link columns are used,
the connector disables array processing and enforces the Array size value
of 1, even if you specify small Length attribute values for those link columns.
Enforcing the Array size of 1 can have a major negative impact on the
Ensure that the connector is using optimal character set form for the
NVarChar link columns. Character set form is an Oracle database
concept and can be one of two values:
The optimal character set form for the connector to use for
VARCHAR2 table columns is
IMPLICIT and for
NVARCHAR2 table columns it is
The connector will automatically use optimal character set form if you use
VarChar link columns for
VARCHAR2 table columns and
NVarChar link columns for
NVARCHAR2 table columns. As stated earlier, note that
VarChar link columns with Extended attribute
set to Unicode are treated as
In some cases, you may end up using link columns that result in sub-optimal
character set form selection in the connector. For example, when you use the
connector to import a table with
VARCHAR2 columns that are based on
character length semantics and the database is based on
(multi-byte) character set, the connector will import that column as
NVarChar link column. By importing it this way the connector is
able to set the Length attribute of the link column to match the length specified in
the table column definition while ensuring that data truncation does not happen at
runtime. At runtime the connector will choose the character set form
NCHAR for this link column which will not be the optimal choice for
VARCHAR2 target table column. If this link column happens to be
referenced in a
WHERE clause in the statement specified for the stage
then any index defined on this table column will not be utilized. This applies to
SELECT statements used by the stage configured for sparse lookup
mode, as well as
DELETE statements when the
connector is configured to write records to the database in Insert,
Update, Delete, Insert then update, Update then
insert or Delete then insert write modes. The skipping of the
index in this case can have significant performance implications. To force the
connector to use the optimal character set form in this case, you could apply one of
the following strategies:
- Change the link column data type from
VarChar. In this case, you will also need to update the Length attribute for this link column by multiplying it by 4 because UTF-8 characters can consume up to 4 bytes. As we mentioned before, if you know that all the actual data values processed by the stage use single-byte characters, then you can leave the Length attribute unchanged.
- Change the data type of the table column to
NVARCHAR2. In this case, the
NCHARcharacter set form selected by the connector will be the optimal choice. In many practical cases though, making changes to the existing column definitions in the database tables is not an option.
- Utilize the
CC_ORA_BIND_FOR_NCHARSenvironment variable. Set the value of this environment variable for the job to contain comma-separated list of link column names for which you wish the connector to use
NCHARcharacter set form. If you specify the special value
(none)for this environment variable, then the connector will use
IMPLICITcharacter set form for all columns on the link. Another special value is
(all)(the parenthesis are part of the value) which will result in the connector using
NCHARcharacter set form for all the columns on the link. If none of the tables in your database have
NCLOB(discussed later) columns then you can define
CC_ORA_BIND_FOR_NCHARSenvironment variable at the DataStage project level and specify
(none)as its default value (again the parenthesis are the part of the value). That way the connector will always use
IMPLICITform which will always be the optimal character set form.
Representing NUMBER, FLOAT, BINARY_DOUBLE and BINARY_FLOAT table column data types
In this section we will take a look at the NUMBER, FLOAT, BINARY_DOUBLE and BINARY_FLOAT Oracle data types.
NUMBER(p, s) table column where precision
p and scale
s are explicitly
specified, the default and at the same time the optimal link column data type is
Decimal with Length attribute set to
Scale attribute set to
s. The exception to this rule are the
s < 0 and where
p. In those cases, the values need to be provided to the
stage and retrieved from the stage in text format so the link column should be of
VarChar type with the Length attribute value sufficiently large
value to accommodate all possible values for the respective column in the table.
NUMBER(p) table column where precision
p is explicitly specified but scale
s is omitted, the default link column data type is
Decimal with the Length attribute set to
Scale attribute left empty. If the remaining stages in the
job are operating on integer (
BigInt) values for these columns then you
may be able to achieve better performance if you preserve those integer data types
for the columns on the link of the Oracle Connector stage than if you change them to
Decimal data type. This is especially the case when the connector
stage is writing to a partitioned table and the corresponding table column is used
as part of the partition key of the table and you specified Oracle
connector value for the Partition type setting of the link. If you
decide to replace
Decimal(p) link columns with integer link
columns keep in mind the differences between these link data types. Although they
all represent integer values, the ranges of values that they cover differ.
For example, if a table column is defined as
NUMBER(5) and you choose
to replace the default
Decimal(5) link column with a
SmallInt link column, the values supported by the
NUMBER(5) table column will be in the range [-99999, 99999] and the
values supported by the
SmallInt link column will be in the range
[-32768, 32767]. If you set the Extended attribute of the link column to Unsigned,
the values that are supported by the link column will be in the range [0, 65535]. In
both cases, the range of the link column will be a sub-range of the respective table
column. So you should only use this approach if you are sure that all the values
that will be processed by the stage belong to both ranges. In the previous example,
you may choose to use
SmallInt link column (signed or unsigned) when
writing values to the database, but to use
Integer column (signed) when
reading values from the database because the range of the supported values for the
Integer (signed) link column is [-2147483648, 2147483647] and is
therefore sufficient to support any value supported by the
table column. Note that if you choose to use
BigInt column on the link,
the range of supported values for the link column will be [-9223372036854775808,
9223372036854775807]. And if the Signed column attribute is set to value Unsigned
the range will be [0, 18446744073709551615]. However, the choice of
BigInt link column will likely result in smaller performance gains
column is used, and may even produce worse performance than the
link column. This is because the connector will exchange
columns values with the database as text values which will involve additional data
NUMBER table column where both precision
p and scale
s are omitted, the
default link column data type is
Double. But if you are certain that
the actual values in the table are all integer values, you may choose to use link
column of one of the integer data types (
BigInt) or you may choose
to represent the values as decimal values with a specific precision and scale by
using a link column of
Decimal data type. Keep in mind that the
supported ranges of these link column data types are not fully compatible so only
use these link column data types in cases when they will not result in the loss of
data precision due to the rounding or in cases when such a rounding is acceptable.
table columns, the default and optimal column types to use for the corresponding
link columns are
Representing DATE, TIMESTAMP and INTERVAL table column data types
In this section we will take a look at the DATE, TIMESTAMP and INTERVAL Oracle data types.
TIMESTAMP(0) WITH TIME
TIMESTAMP(0) WITH LOCAL TIME ZONE table columns the
default link column data type is
Timestamp with Extended attribute left
empty. This is the optimal link column type to use in case when hour, minute and
second portions of the values need to be preserved. Otherwise, the
link column will likely result in better performance because the stage does not need
to handle hours, minutes and seconds in the values. Note that if you need to
preserve time zone information in the values you need to use character based link
columns such as
VarChar. This is covered in more detail later in this
TIMESTAMP(p) WITH TIME
TIMESTAMP(p) WITH LOCAL TIME ZONE table
columns where precision
p > 0, the default link column data
Timestamp with Extended attribute set to Microseconds. The
Timestamp link columns with Extended attribute set to
Microseconds takes considerably more time in the Oracle Connector than the
Timestamp link columns which have the Extended attribute
left empty. This is especially the case for
SELECT and DML
DELETE) statements and
to a lesser extent bulk load operation. Avoid the use of
columns with Extended attribute set to Microseconds unless you absolutely need to
support fractional seconds in the timestamp values.
The connector models
WITH TIME ZONE and
TIMESTAMP(p) WITH LOCAL TIME
ZONE table columns with
p > 6 as
Timestamp link columns with Extended attribute set to Microseconds.
The connector will in this case perform truncation of fractional seconds to the
microsecond precision. To handle timestamp values with fractional second precision
greater than 6 up to and including 9, use the
VarChar link column. In
that case, the connector will exchange timestamp values with Oracle client as text
values. Ensure in this case that the
parameter is set to match the format of the actual values provided to the stage on
its input link. The same format will be used by the stage for the values that it
provides on its output link. You can set this session parameter using the Before
SQL (node) statement property, by issuing an
statement. For example, to handle timestamp values as text values with 9 fractional
digits, a statement like the one shown in Listing 10 could be
specified in the Before SQL (node) statement property.
Listing 10. Setting the timestamp format with 9 fractional seconds for the session
ALTER SESSION SET NLS_TIMESTAMP_FORMAT='YYYY-MM-DD HH24:MI:SS.FF9'
You can apply this same approach for
TIMESTAMP(p) WITH TIME
INTERVAL YEAR (yp) TO MONTH and
INTERVAL DAY(dp) TO SECOND(sp) table columns. For
TIMESTAMP(p) WITH TIME ZONE table columns set the
NLS_TIMESTAMP_TZ_FORMAT session parameter to the appropriate
format, which includes the time zone format if the time zone needs to be preserved
in the values. The
INTERVAL(yp) TO MONTH and
DAY (dp) to SECOND(sp) values should be specified as
interval literals in the format defined in the Oracle database. Refer to the Resources section for the link to Oracle Database online
documentation where you can find more details and examples for
Oracle data types and interval literals.
If you are using
Timestamp (with Microseconds) link columns because you
need to preserve microseconds in timestamp values, you can again apply this same
approach as it can potentially lead to improved performance. For example if you are
fetching values from a
TIMESTAMP table column and the downstream stage
is capable of handling timestamp values in text format, then replace the
Timestamp (with Microseconds) column on the output link of Oracle
Connector stage with a
VarChar column so that the connector provides
timestamp values in text format. Likewise, if you are inserting values to a
TIMESTAMP table column and the upstream stage can provide the
timestamp values in text format, replace the
Microseconds) column on the input link of Oracle Connector stage
VarChar column so that the connector accepts timestamp values in
text format. The Length attribute of the
VarChar column and the
NLS_TIMESTAMP_FORMAT session variable must be appropriate for the
format of the timestamp text values. For example, to use the default
YYYY-MM-DD hh:mm:ss.ffffff format for the timestamp values, set the
Length attribute of the
VarChar column to 26 and specify the statement
shown in Listing 11 in the Before SQL (node)
statement connector property.
Listing 11. Setting the timestamp format with 6 fractional seconds for the session
ALTER SESSION SET NLS_TIMESTAMP_FORMAT='YYYY-MM-DD HH24:MI:SS.FF6'
Even in the cases when the remaining stages of the job are not capable of handling
timestamp values with microseconds in text format and require that the
Microseconds) link column is used, you
can choose to map this column to a
VarChar column explicitly for the
Oracle Connector stage by using a Transformer stage as shown below:
- If the link column is located on the output link of the Oracle Connector stage
and the link name is out_link, insert a Transformer stage between the Oracle
Connector stage and the next stage in the job flow and map the
VarChar(26)column on the input link of the Transformer to the
Timestamp(with Microseconds) column on the output link of the Transformer stage using the derivation shown in Listing 12.
Listing 12. Converting string to timestamp
- If the column is located on the input link of the Oracle Connector stage and
the link name is in_link, insert a Transformer stage between the previous stage
in the job flow and the Oracle Connector stage and map the
Timestamp(with Microseconds) column on the input link of the Transformer stage to the
VarChar(26)column on the output link of the transformer stage using the derivation shown in Listing 13.
Listing 13. Converting timestamp to string
Representing RAW table data type
In this section we will take a look at the
RAW Oracle data type.
VarBinary link columns to read and write
column values in their native raw binary form. For the Length attribute of the link
columns specify the sizes of the respective
RAW table columns.
VarChar link columns if you need to read or write binary values
as sequences of hexadecimal digit pairs. In this case, make sure to set the Length
attribute of the links columns to a value twice the size of the respective
RAW table columns because each byte in a
column value will be represented by a pair of hexadecimal digit (0-9 and A-F)
characters. The use of
VarChar link columns will allow you to read and
RAW table column values in hexadecimal digit pair's form without
a need to handle the conversions in another stage in the job.
Representing LONG, LOB and XMLType table column data types
In this section we will take a look at the
XMLType Oracle data types.
The preferred link column data types for these table column data types are shown in Table 2.
Table 2. Preferred link column data types for LONG, LOB and XMLType table column data types
|Table column data type||Preferred link column data type|
|BLOB, LONG RAW||LongVarBinary|
|CLOB, LONG, XMLType||LongVarChar|
You can use other combinations of link and table column data types as well but doing that may negatively affect the performance and will likely require you to use additional data type conversion functions in your SQL statements in order to avoid data type mismatch errors.
Every time you have
LongNVarChar columns defined on the link the connector enforces
Array size value of 1 for that link which has negative effect
on performance. This is the main reason why
LongNVarChar link columns should only
be used when handling the table column data types covered in this section.
When the values represented by the
LongNVarChar link columns are 4000
bytes long or less, you should use
NVarChar link columns instead, respectively, even when the
corresponding table columns are of the data types discussed in this section. When
you do that the connector will not enforce the Array size value of 1 which
will typically result in significant performance improvements. Also the guidelines
presented earlier for character and binary table column data types will apply in that case instead of the
guidelines presented in the remainder of this section.
When you configure the connector stage to fetch values from a table column
represented by a
LongNVarChar link column, you have two choices for how the
connector should pass the fetched values for that column to the output link. You can
configure the stage to pass the values inline which means that each value is passed
in its original form to the output link. Alternatively, you can configure the stage
to pass the values by reference which means that a locator string representing the
location of each value is passed instead of the actual value. The actual value
represented by the locator string is then fetched by the last downstream stage in
the job that stores this value to its intended destination. The locator string
contains information that specifies how to retrieve the actual value from its
By default, the Oracle Connector stage passes values inline. To pass the values by
reference you need to set the Enable LOB references connector property to
Yes, and then in the Columns for LOB references
connector property specify the names of the input link columns for which you wish
the connector to pass the values by reference. You can use the following guidelines
when deciding to pass values by reference or inline for the
LongNVarChar link columns:
- Do the values only need to be passed through the job or do they also need to be interpreted (parsed or modified) by certain stages in the job? In the latter case, do not pass the values by reference because it is not the values that are passed in that case but the locator strings that represent them.
- Which stage in the job is the consumer of the values? Only stages of
connector-based stage types such as WebSphere MQ Connector, DB2 Connector, ODBC
Connector, Teradata Connector and Oracle Connector are capable of recognizing
locator strings and resolving them to actual values. Stages of other stage types
will treat locator strings as the actual field values. For example, you can use
Oracle Connector stage to pass
CLOBvalue by reference to a DB2 Connector stage to store the value to a
CLOBcolumn in the DB2 database. But if you pass that value by reference to a Sequential File stage, the locator string will be stored in the target file and not the actual
CLOBvalue. You can use the latter scenario to check the size of the locator strings so that you can compare them to the average size of the actual values and estimate the savings in terms of the amount of data transferred through the job when the locator strings are passed instead of the actual values. Typically the locator strings will be less than 2000 characters in length.
- How large are the actual values? The larger the values are the more likely that passing them by reference will result in performance gains. As a rule of thumb values that measure about 100 kilobytes should be passed inline otherwise they should be passed by reference. Understanding the specifics of your environment in combination with the actual testing that you perform should help you with this decision. For example, if the Oracle Connector stage that reads the value from the database runs on a DataStage engine host which is different from the host in which the downstream stage consumes those values, then passing values by reference may provide performance benefits even for smaller actual values (as long as they are still larger than the locator strings) because they will be need to transferred over the network as they get exchanged between the player processes of the respective stages.
Note that DataStage imposes a limit on the size of records transferred between the
player processes in the job. The
built-in DataStage environment variable can be used to control this limit and by
default it is set to
131072 bytes (128 kilobytes). For example, if you
pass link column values inline and the values consume hundreds of kilobytes then you
will need to override this environment variable and set it to a sufficiently large
value, otherwise you will receive error message at runtime indicating that the
record is too big to fit in a block. Set the value for this environment variable to
accommodate the largest record that will be passed through the job. Do not set it to
a value larger than necessary because the value will apply to all the player
connections on all the links in the job so the increase of this value may lead to a
significant increase of the memory usage in the job.
Oracle Connector stage handles values represented by
LongNVarChar link columns as follows:
- For the read mode (fetching) it utilizes one of the following two Oracle mechanisms - piecewise fetch or LOB locators. The connector inspects the source table column data type and automatically selects the mechanism that is more suitable for that data type.
- For the bulk load write mode it utilizes the Oracle direct path interface when loading large values. Again the connector does this automatically and there are no additional options to consider.
- For the write modes other than bulk load it utilizes one of the following two
Oracle mechanisms - piecewise inserts/updates or OCI LOB locators. By default it
uses piecewise inserts/updates. To force the connector to use OCI LOB locators
for some of the link columns, specify those link column names in a
comma-separated list and set that list as the value of the
CC_ORA_LOB_LOCATOR_COLUMNSenvironment variable for the job. To use OCI locator string mechanism for all
LongNVarCharlink columns, specify the special value
(all)for this environment variable (the parenthesis are included with the value). Note that for the link columns that represent
LONG RAWtarget table columns the connector must be configured to use piecewise insert/update mechanism. Also note that for the link columns that represent
XMLTypetarget table columns the connector must be configured to use OCI locator string mechanism if the values are larger than 4000 bytes. Do not confuse LOB locators with the locator strings used when passing values by reference. The former is an Oracle mechanism for accessing large values in the database, the latter is a DataStage mechanism for propagating large values through the job.
The character set form and character set conversion considerations that affect the
performance and that were presented earlier in the
section which covered
NVARCHAR2 table column data types apply here as well. Simply assume
the use of
LongNVarChar link columns
instead of the
NVarChar link columns as you
take those considerations into account.
As we saw in this section, to populate target table with records that contain
LongNVarChar link columns you have three different options: bulk
load mode, insert mode with piecewise insert mechanism and insert mode with OCI LOB
locator mechanism. The three modes are available when the values arrive to the stage
as inline values as well as when they arrive as locator strings (values passed by
reference). Typically, the bulk load mode will have the best performance, followed
by the insert mode with piecewise insert mechanism and then the insert mode with OCI
LOB locator mechanism. Depending on the specifics of your environment this may not
always be the case and one way to determine which option provides the best
performance is to run a series of tests and compare the results. The exception is
the case when the target table has columns of
XMLType data type in
which case bulk load mode should not be considered if
BINARY XML or
OBJECT RELATIONAL storage type (as opposed to
storage type) is used for those
XMLType columns or if the table is
defined as a
XMLType object table.
When fetching, inserting or updating
XMLType values, utilize the Oracle
XMLType package functions
CREATEXML. Refer to the Resources section for the link to the Information Server
Information Center where you can find example statements for accessing XMLType
columns with the connector.
In this article we covered various aspects of performance tuning of the Oracle Connector stage in DataStage jobs. We discussed running Oracle Connector in bulk load mode taking into account data volume considerations, handling table triggers, indexes, and constraints. We also looked at reject link considerations for handling input records that have errors. Finally, we went over how the data types for link columns can affect the connector performance.
In combination with the information listed in the Resources section and the information you collected through your own research, this article should help you achieve optimal performance of Oracle Connector stages in your DataStage jobs.
Many thanks to the following contributors for their valuable input and review of this article:
- Paul Stanley, InfoSphere Senior Architect
- Tony Curcio, InfoSphere Product Manager
- Fayaz Adam, InfoSphere Staff Software Engineer
- Use an RSS feed to request notification for the upcoming articles in this series. (Find out more about RSS feeds of developerWorks content.)
- Explore the IBM InfoSphere Information Server Version 9.1 Information Center where you will find technical documentation for Information Server tools and components such as DataStage and Oracle Connector.
- Visit IBM Information Management Redbooks page and download redbook titles such as InfoSphere DataStage Parallel Framework Standard Practices and IBM InfoSphere DataStage Data Flow and Job Design.
- Find details about Oracle Database concepts mentioned in this article in the Oracle Database Documentation Library 11g Release 2 (11.2).
- Access a variety of technical resources for Information Management products in developerWorks Information Management zone.
- Stay up to date with IBM products and IT industry by attending developerWorks Events.
- Follow developerWorks on Twitter.
Get products and technologies
- Evaluate IBM software products in a way most convenient for you: by downloading trial product versions, by trying products online for few hours in a SOA sandbox environment or by accessing products in a cloud computing environment.
- Connect with other developerWorks users in developerWorks Community as you explore developer oriented blogs, forums, groups, wikis and more.