In the Part 1 article in this series we provided a brief overview of the IBM InfoSphere DataStage product in the IBM Information Server suite of products and explained the role of the Oracle Connector stage in DataStage jobs. We explained the difference between conductor and player processes for the stage. We introduced the concepts of parallelism and partitioning and explained how their meaning differs between the DataStage environment and the Oracle database environment. We covered in detail the performance tuning and troubleshooting of the Oracle Connector stage configured to perform SQL statements on the database, including query statements used to fetch and lookup rows in the database as well as DML and PL/SQL statements used to insert, update and delete rows in the database.
In the Part 2 article of the series we will continue our coverage of the Oracle Connector performance tuning. First we will focus on performance aspects of the bulk load mode in the connector. In this mode the connector is used to load data to the database by utilizing the Oracle direct path interface. We will cover the operation of loading the records to the database as well as the operations that often need to be performed on the table indexes and constraints before and after loading the records. We will then present a few guidelines regarding the use of reject links with the connector. We will show how reject links can affect the connector performance in some cases and present some alternative approaches for your consideration. We will end the article by presenting you with a number of performance tuning strategies oriented towards the handling of different Oracle data types in the connector.
As was the case with the Part 1 article, the Part 2 article assumes the use of DataStage parallel jobs and Information Server Version 9.1, although some of the presented concepts would apply also to DataStage server jobs as well as the earlier Information Server versions.
When the Oracle Connector stage is configured to run in bulk load mode it utilizes the Oracle direct path interface to write data to the target database table. It receives records from the input link and passes them to the Oracle database, which formats them into blocks and appends the blocks to the target table, as opposed to storing them in the available free space in the existing blocks. To configure the stage to run in bulk load mode, the Write mode property of the stage needs to be set to the value Bulk load. In this section, we will look at the connector properties and environment variables that play an important role in tuning bulk load performance in the stage, as shown in Figure 1.
Figure 1. Connector properties that affect bulk load performance
Running the stage in Bulk load write mode typically results in better performance than when running it in Insert write mode. However, Bulk load write mode imposes several restrictions that are not present when the Insert write mode is used, especially when running the stage in parallel execution mode with Allow concurrent load sessions connector property set to value Yes. These restrictions are primarily related to the handling of triggers, indexes and constraints defined on the target table.
The information in the remainder of this section is intended to help you answer the following questions:
- Should I configure the stage to run in Bulk load or Insert write mode?
- If I choose to run the stage in Bulk load write mode, how should I configure the remaining settings in the stage in order to achieve optimal performance?
Data volume considerations
When deciding whether to configure the stage in Bulk load or Insert write mode, the critical aspect to consider is the data volume, in other words, how many records on average the stage writes to the target table in a single job invocation.
If the data volume is relatively small, the preferred option would be to use Insert write mode due to more flexibility that this write mode provides in respect to maintaining indexes, constraints and triggers of the target table. With the Insert write mode all the table indexes, constraints and triggers are enforced during the job execution and remain valid after the job completes. For example, if the stage that is running in Insert write mode tries to insert a record to the target table that violates the primary key constraint in the table, the stage can be configured to handle this scenario and to continue processing the remaining records in a number of ways that are not readily available when the stage is running in Bulk load write mode.
- It can have a reject link defined and configured to process records that have violated constraints. Each input record that violates table constraints will be routed to the reject link, with optionally included error code and message text explaining why the record was rejected. The existing row in the table for which the primary key was violated will remain in the table.
- It can be configured to run in Insert then update write mode in which case the existing row in the target table will be updated with the values from the record that violated the constraint.
- It can be configured to run in Insert new rows only write mode, in which case the record that violated the constraint will be skipped (ignored).
- It can be configured with an INSERT statement that contains the LOG ERRORS INTO clause, which will result in Oracle redirecting the records that violated the constraint, along with the error information, to the specified error logging table.
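To illustrate the last option, the following is a minimal sketch. The table name MY_TABLE, its columns, and the tag string are hypothetical; ORCHESTRATE.column_name is the placeholder syntax the connector uses to refer to input link columns in a user-defined statement.

```sql
-- One-time setup: create the error logging table for the hypothetical target
-- table MY_TABLE. DBMS_ERRLOG derives the default name ERR$_MY_TABLE.
BEGIN
  DBMS_ERRLOG.CREATE_ERROR_LOG('MY_TABLE');
END;
/

-- An INSERT statement of the kind the stage could be configured with. Rows
-- that violate constraints are diverted to the error logging table instead of
-- failing the statement; REJECT LIMIT UNLIMITED keeps the statement from
-- aborting after a fixed number of errors.
INSERT INTO MY_TABLE (ID, NAME)
  VALUES (ORCHESTRATE.ID, ORCHESTRATE.NAME)
  LOG ERRORS INTO ERR$_MY_TABLE ('daily load') REJECT LIMIT UNLIMITED;
```

After the job completes, the ERR$_MY_TABLE table can be queried to inspect the rejected rows and the Oracle error reported for each of them.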
If the data volume is relatively large and the primary goal is to append the data to the target table as quickly as possible then the Bulk load mode should be considered.
The question is what should be considered a small data volume as opposed to a large data volume. This decision will depend, among other things, on how much performance you are willing to sacrifice when you choose the Insert write mode, which in turn depends on the size of the batch window that is available to push the data to the database and the available system resources. As a rule of thumb, thousands of records would likely be considered a small data volume, while millions of records would likely be considered a large data volume.
If the table does not have any indexes, constraints and triggers defined then Bulk load write mode would make a good candidate irrespective of the data volume involved because many of the restrictions imposed by this write mode would not apply any more. This includes the scenario where the table does in fact have indexes, constraints and triggers defined but you have procedures in place to disable or drop them prior to loading the data and then enable or rebuild them after the load is complete.
A simple test like the following can help you decide which write mode to choose:
- Design a test job with Oracle Connector stage that has an input link. In the test job provide records similar in volume and structure to the records that will be processed by the actual job in production.
- Ensure that the target database table used in the test job has the same definition as the table that will be used in production. If the production job performs incremental data loads to an existing table in the database, ensure that the table used for the test job is initially populated with a similar number of rows as the table that will be used in production.
- Run the job a few times in insert mode and capture the average time it took the job to complete.
- Manually disable any indexes, constraints and triggers on the target table and repeat the job a few times with the connector stage configured in bulk load write mode and capture the average time it took the job to complete.
- Compare the results from the previous two sets of job runs. If the numbers are similar that would be an indication that the Insert write mode is sufficient. If bulk load completes much faster, determine which additional operations need to be performed on the table to restore the constraints and indexes. Issue these operations from a tool like SQL*Plus to see how much time they take to complete. If that time combined with the time to bulk load the data is still considerably shorter than the time that it took to insert the data, then the bulk load mode is likely to be a better choice.
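For the timing comparison in the last step, SQL*Plus can report the elapsed time of each statement directly. The index and constraint names below are placeholders for your own objects.

```sql
-- Have SQL*Plus print the elapsed time of each statement that follows.
SET TIMING ON

-- Rebuild an index that the bulk load left in the UNUSABLE state.
ALTER INDEX my_index REBUILD;

-- Re-enable and validate a constraint that was disabled before the load.
ALTER TABLE my_table ENABLE CONSTRAINT my_constraint;
```

Adding the reported times to the average bulk load job duration gives the total figure to compare against the average Insert write mode job duration.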
If the target table has triggers that need to be fired for every row written to the table then the bulk load mode should not be used because database triggers are not supported in this mode. To bulk load data to a table that has triggers, the triggers need to be disabled prior to the load. The connector provides an option to disable the triggers automatically before the load and enable them after the load. Once the triggers are enabled they will fire only for the rows inserted to the table after that point. They will not fire for the rows that were loaded while the triggers were disabled.
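If you prefer to control this yourself rather than use the connector option, the statements involved are plain DDL, which you could place in the Before SQL statement and After SQL statement properties. The trigger name is hypothetical.

```sql
-- Before the load: stop the trigger from firing, since the direct path
-- interface does not support enabled triggers on the target table.
ALTER TRIGGER my_trigger DISABLE;

-- After the load: resume firing for subsequently inserted rows. Rows loaded
-- while the trigger was disabled are not retroactively processed.
ALTER TRIGGER my_trigger ENABLE;
```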
If the table has indexes, you can configure the stage to disable all the
indexes when performing the load. To do this set the Index
maintenance connector property to value Skip
all. This is typically the best option to choose as it allows you
to run the stage without restrictions in parallel execution mode and with
any Partition type value selected on the Partitioning
page for the input link. However, all indexes are marked
UNUSABLE after the load completes and should be
rebuilt. This is where the data volume and the number of indexes play a
critical role. If you are incrementally loading a relatively small number
of records compared to the number of records already present in the table,
it may be significantly faster to run the stage in Insert write
mode and have the database maintain indexes throughout the insert
operation instead of having to rebuild all the indexes after bulk load. On
the other hand, if the target table is empty and you need to load many records to it then you will likely benefit from using the Bulk load write mode: the indexes will need to be built for all the rows in either case, and it may be faster to first load all the records without maintaining the indexes and then build the indexes from scratch.
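To see what the load left behind before deciding what to rebuild, the data dictionary can be queried; USER_INDEXES is the standard view, and the table and index names below are placeholders.

```sql
-- List the indexes on the target table that the load marked UNUSABLE.
SELECT INDEX_NAME, STATUS
  FROM USER_INDEXES
 WHERE TABLE_NAME = 'MY_TABLE'
   AND STATUS = 'UNUSABLE';

-- Rebuild one of them; a PARALLEL clause can shorten the rebuild of a
-- large index when spare CPU and I/O capacity is available.
ALTER INDEX my_index REBUILD PARALLEL 4;
```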
Instead of rebuilding the indexes in an external tool after the load, you can configure the stage to request that the Oracle database maintain the indexes during the load by setting the Index maintenance option property to Do not skip unusable or Skip unusable. The use of these two options is highly restrictive though, and in most cases they will not be the optimal choice. Here are the things that you need to keep in mind if you are considering one of these two options:
- You will need to set Allow concurrent load sessions connector property to No, otherwise you will receive Oracle error "ORA-26002: Table string has index defined upon it" before any data is loaded. This requirement further implies that you will not be able to run the stage in parallel execution mode, except for one special case when loading to a partitioned table which is covered later in the article in the section that talks about loading records to a partitioned table.
- If the table has a UNIQUE index, the index will remain VALID after the job completes, but only if none of the loaded rows have violated the uniqueness imposed by the index. Otherwise the index will automatically be marked UNUSABLE even though the job completes successfully. Subsequent attempts to run the job will then fail, and the reported Oracle error depends on the Index maintenance option property value: for the Do not skip unusable value the error will be "ORA-26028: index string.string initially in unusable state" and for the Skip unusable value the error will be "ORA-26026: unique index string.string initially in unusable state".
- If the table has a NON-UNIQUE index, the index will remain VALID after the load. If the index was marked UNUSABLE before the load, the behavior will depend on the Index maintenance option property value: the Do not skip unusable value will result in Oracle error "ORA-26002: Table string has index defined upon it", while with the Skip unusable value the job will complete and the index will remain in the UNUSABLE state.
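A quick way to check whether a unique index survived the load is to query its status after the job completes; the index name here is a placeholder.

```sql
-- VALID means no loaded row violated the uniqueness imposed by the index.
-- UNUSABLE means some rows did, and the index must be rebuilt after the
-- duplicate rows have been dealt with.
SELECT STATUS
  FROM USER_INDEXES
 WHERE INDEX_NAME = 'MY_UNIQUE_INDEX';
```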
You can configure the stage to automatically rebuild indexes after the load, and you can control the PARALLEL clauses of the index rebuild statements to optimize the index rebuild process. You can configure the stage to fail the job if any index rebuild statement fails, or to only issue warning messages. To improve the performance, the connector stage will rebuild the indexes only when necessary. It will inspect the state of each index before rebuilding it, and if the index is marked VALID the connector will skip rebuilding it. This applies also to locally and globally partitioned indexes. In that case, the connector will inspect the index partitions and subpartitions (if the index is locally partitioned and the table is composite-partitioned) and rebuild only those index partitions and subpartitions that are marked UNUSABLE.
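The per-partition inspection that the connector performs can be reproduced manually with the dictionary views, for example to verify its work after the job; the index and partition names are placeholders.

```sql
-- Status of each partition of a locally partitioned index; only the
-- partitions reported UNUSABLE need to be rebuilt.
SELECT PARTITION_NAME, STATUS
  FROM USER_IND_PARTITIONS
 WHERE INDEX_NAME = 'MY_LOCAL_INDEX'
   AND STATUS = 'UNUSABLE';

-- Rebuild only the affected partition instead of the whole index.
ALTER INDEX my_local_index REBUILD PARTITION p1;
```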
The presence of integrity and referential constraints on the table has implications on the bulk load operation similar to those imposed by the indexes.
NOT NULL and CHECK constraints do not need to be disabled prior to the load. If any input records violate these constraints the job will fail and no records will be
loaded to the table. The exception to this is the case when the stage is
configured to run in parallel execution mode and some player processes
finish loading their record partitions before the player process that
encountered the record in error. In that case, the records loaded by the
player processes that have already finished will be committed to the
table. From the performance point of view, a good approach for dealing
with these two types of constraints is to leave them enabled on the table
during the load but ensure that no records violate them by performing the
necessary checking and error handling upstream of the connector stage. We
provide some concrete suggestions later in the section that addresses reject links.
PRIMARY KEY, UNIQUE KEY and FOREIGN KEY constraints also do not need to be disabled prior to the load but are effectively ignored by the load operation. Any records that violate these constraints will still be loaded and the constraints will remain ENABLED. To ensure that these constraints are valid for the rows loaded by the connector stage you will need to disable them prior to the load and enable them after the load. Keep in mind that these types of constraints are usually accompanied by indexes. The PRIMARY KEY and UNIQUE KEY constraints are enforced through unique indexes, and FOREIGN KEY constraints often have user-defined indexes created for them. In the earlier section about handling indexes we explained the effects that indexes have on the load operation. The same effects apply in this case and need to be taken into consideration.
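As a sketch of the disable-then-enable sequence for such a constraint, the following statements could be issued around the load, for example through the Before SQL statement and After SQL statement properties; the table and constraint names are hypothetical.

```sql
-- Before the load: disable the primary key constraint. Note that by default
-- this also drops the unique index that enforces it, unless KEEP INDEX is
-- specified (in which case the index would still restrict the direct path load).
ALTER TABLE my_table DISABLE CONSTRAINT my_pk;

-- After the load: enable and validate the constraint against all rows now in
-- the table. This fails with ORA-02437 if the loaded rows contain duplicates.
ALTER TABLE my_table ENABLE VALIDATE CONSTRAINT my_pk;
```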
The connector provides an option to disable constraints before the load. You can specify this by setting the Disable constraints property to Yes. When this is done, the load operation typically completes faster than the insert operation. However, you have to enable and validate the constraints following the load and deal with the loaded records that have violated the constraints. That part can have a major impact on the overall performance. If you have a mechanism in place outside of the DataStage job to enable and validate the constraints after the load, then having the connector disable them before the load will provide for the best performance of the job itself since there will be no restrictions during the load due to constraints and no time will be spent at the end of the job to enable the constraints.
You can configure the stage to enable and validate the constraints after
the load by setting Enable constraints property to
Yes. But, if any constraint fails the validation at
that time, the job will fail and the constraint will remain in
NOT VALIDATED state. You can specify the
exceptions table in the Exceptions table name property to which
the ROWIDs of the rows that violated the constraints will be stored, along
with the name of the constraint that they have violated. Doing this will
still result in the job failure if any of the constraints cannot be
enabled, but you will be able to determine which rows violated which
constraint. You can further configure the stage to delete the rows from
the table that violated the constraints and to try to enable the
constraints again but that will discard all the rows in the table found to
be in violation of the constraints. To store them somewhere instead of
discarding them you can define a reject link for the stage and enable
SQL error – constraint violation condition on it. This will
have a major impact on the performance because it enforces sequential
execution of the stage in a single player process. Rather than accepting
this restriction you should consider running the stage in parallel
execution mode and specifying Insert value for the Write
mode as that will potentially result in better overall
performance. In general, it is best to avoid the use of reject links if
possible. The section about reject links provides
some concrete ideas in this regard.
If you configure the stage to disable table constraints prior to the load
and enable them after the load and reject the records that have violated
constraints, all of the records in violation of the constraints defined on the table will be rejected. The connector doesn't make a distinction between the rows that existed in the table prior to the load and
those that were loaded by the connector. For example, if a row loaded by
the connector violated the primary key constraint because another row with
the same primary key value existed in the table prior to the load, both
rows will be deleted from the table and sent to the reject link. To have
the stage delete and reject only those rows in violation of constraint
that were not in the table prior to the load, you will need to implement a
custom solution such as the one shown in the next section.
Custom handling of indexes and constraints
When the connector stage is configured to handle triggers, indexes or
constraints automatically before and after the load, the connector will
accomplish this by issuing various Data Definition Language (DDL)
statements on the database. For enabling and disabling triggers, it will use ALTER TRIGGER statements, for rebuilding indexes it will use ALTER INDEX … REBUILD statements, and for enabling and disabling constraints it will use ALTER TABLE statements. You can locate these statements in the job log when you run the job with the CC_MSG_LEVEL environment variable set to value 2. The connector provides properties that you can use to include or remove particular clauses in these statements. For example, it provides properties that allow you to control the presence of the PARALLEL and NOLOGGING clauses in the ALTER INDEX … REBUILD statements.
In some cases, you may wish to issue your own statements for handling triggers, indexes and constraints instead of having the connector automatically generate and run them. For example, you may wish to reference table triggers, indexes and constraints directly by their names instead of having the connector query the database dictionary views to dynamically discover those objects. You may wish to manually include certain clauses in these statements that you cannot configure through the connector properties. You can achieve this outside of the stage by running external tools and scripts before and after running the job. You can do it through the stage by utilizing the Before SQL statement and After SQL statement connector properties. In each of these two properties you can specify either a list of semicolon-separated SQL statements or an anonymous PL/SQL block to be executed. The connector stage executes these statements in the conductor process. It will run the Before SQL statement statements before any player process has been created for the stage to process the records. It will run the After SQL statement statements after all the player processes have finished loading the records. The following example illustrates how this can be accomplished.
Suppose that you want to load records to the table
TABLE_ITEMS which stores information about
a certain type of item. The table has the ID column, which represents item identifiers and serves as the primary key for
the table, which further means that the table has a unique index defined
on this same column. The table also has the
NAME column that stores item names and for
which a non-unique index is defined on the table. Finally, the table has
LOAD_DATE column which represents the date
and time when the item row was loaded. The statement shown in Listing 1 can be used to create this table.
Listing 1. Create table statement example
CREATE TABLE TABLE_ITEMS(ID NUMBER(10), NAME VARCHAR2(20), LOAD_DATE DATE); ALTER TABLE TABLE_ITEMS ADD CONSTRAINT PK_TABLE_ITEMS_ID PRIMARY KEY (ID); CREATE INDEX IDX_TABLE_ITEMS_NAME ON TABLE_ITEMS(NAME);
The load date is set when the job runs and is passed to the job as a job parameter in the format YYYY-MM-DD HH24:MI:SS. Various mechanisms can be used to set and pass this value to the job. For example, if you are invoking the job with the dsjob command from the command line, the system date command can be used to produce this value. On Linux, for instance, you can use the command shown in Listing 2 to run the job and pass the LOAD_DATE job parameter value based on the current system date and time.
Listing 2. dsjob command invocation
dsjob -run -jobstatus -userstatus -param LOAD_DATE="`date +'%F %T'`" project_name job_name
Another option for passing the current date and time would be to create a sequence job that contains a single job activity stage which in turn invokes the load job and initializes the LOAD_DATE parameter to the value DSJobStartTimestamp, which is a built-in DataStage macro that returns the job start timestamp.
Suppose the table contains many rows already and that you are running your
job to load additional rows. Further, most of the records that you are
loading are for new items, but some records represent existing items,
meaning they have the
ID field value for which
there is already a row in the table with that same
ID value. You want to load the data as fast as
possible, and you also want the constraints and indexes to be enabled and
valid for all the table rows after the load. If any loaded rows violate
the primary key constraint, you want to store them to an alternate table
TABLE_ITEMS_DUPLICATES which has the same
columns as the
TABLE_ITEMS table but does not
have any constraints and indexes defined on it, as shown in Listing 3.
Listing 3. Create table statement for duplicates
CREATE TABLE TABLE_ITEMS_DUPLICATES(ID NUMBER(10), NAME VARCHAR2(20), LOAD_DATE DATE);
You could achieve this task with the stage configured to disable
constraints before the load, skip all indexes during the load and rebuild
indexes and enable constraints after the load. You could define a reject link on the stage, select the SQL error – constraint violations
condition on the reject link and direct this link to another Oracle
Connector stage configured to insert rows to the
TABLE_ITEMS_DUPLICATES table. But this approach
would have the following limitations:
- Because you configured the stage to disable the constraints and reject rows that violated constraints after enabling them, the connector enforces sequential execution in a single player process. The rejecting of records on the reject link can only take place in a player process and in this case the rejecting is done when the connector enables the constraints after the load. For the player process to enable the constraints after it has finished loading the records, it needs to be certain that it is the only player process for the stage. But in practice, you may wish to run the stage in parallel execution mode so that you can utilize the available system resources on the engine tier.
- The connector rejects all the rows that violated the primary key
constraint including the rows that existed in the table prior to the
load but that happen to have the same
ID value as some of the rows that were loaded. This is because at the time the constraints are enabled all the rows are already in the table storage and no distinction is made at that time between the rows that were in the table prior to the load and those that were just loaded. In practice, you may wish to reject only the rows that were just loaded by the job and that violated the constraint but not any of the rows that existed in the table prior to running the job.
The following is one possible approach to avoid these limitations. Specify the statement shown in Listing 4 in the Before SQL statement property.
Listing 4. Statement to disable PK constraint
ALTER TABLE TABLE_ITEMS DISABLE CONSTRAINT PK_TABLE_ITEMS_ID;
This statement will explicitly disable the constraint PK_TABLE_ITEMS_ID on the table prior to loading
records to the table.
Set the Index maintenance option connector property to Skip all. Set the Allow concurrent load sessions connector property to value Yes. Do not define a reject link for the stage. Configure the stage to run in parallel execution mode.
Define the exceptions table TABLE_ITEMS_EXCEPTIONS. This table will be needed for the EXCEPTIONS INTO option of the ENABLE CONSTRAINT clause in the ALTER TABLE statement that we will want to invoke (explained further down). The format that the exceptions table needs to follow can be found in the SQL script UTLEXCPT.SQL included with the Oracle database product installation. Based on the information in this script, to create the TABLE_ITEMS_EXCEPTIONS exceptions table for this example execute the statement shown in Listing 5.
Listing 5. Statement to create the exceptions table
CREATE TABLE TABLE_ITEMS_EXCEPTIONS(ROW_ID ROWID, OWNER VARCHAR2(30), TABLE_NAME VARCHAR2(30), CONSTRAINT VARCHAR2(30));
Specify the PL/SQL anonymous block shown in Listing 6 in the After SQL statement property. The connector will submit it to the database once all the records have been loaded. See the comments embedded in the PL/SQL code for details about individual operations performed by this PL/SQL block.
Listing 6. Custom PL/SQL code for handling indexes and constraints
DECLARE
  -- Define the exception for handling constraint validation error ORA-02437.
  cannot_validate_constraint EXCEPTION;
  PRAGMA EXCEPTION_INIT(cannot_validate_constraint, -2437);
BEGIN
  -- Truncate the tables TABLE_ITEMS_DUPLICATES and TABLE_ITEMS_EXCEPTIONS in case
  -- they contain any rows from the previous job runs.
  EXECUTE IMMEDIATE 'TRUNCATE TABLE TABLE_ITEMS_DUPLICATES';
  EXECUTE IMMEDIATE 'TRUNCATE TABLE TABLE_ITEMS_EXCEPTIONS';
  BEGIN
    -- Try to enable the PK_TABLE_ITEMS_ID constraint and to build the underlying
    -- unique index. If any rows are in violation of the constraint then store
    -- their ROWIDs to the TABLE_ITEMS_EXCEPTIONS table.
    EXECUTE IMMEDIATE 'ALTER TABLE TABLE_ITEMS ENABLE CONSTRAINT PK_TABLE_ITEMS_ID
                       EXCEPTIONS INTO TABLE_ITEMS_EXCEPTIONS';
  EXCEPTION
    WHEN cannot_validate_constraint THEN
      -- The constraint could not be enabled. The constraint violations need to be
      -- handled before trying again to enable the constraint.
      -- Set the default date format for the session to match the format of the
      -- LOAD_DATE job parameter value.
      EXECUTE IMMEDIATE 'ALTER SESSION SET NLS_DATE_FORMAT=''YYYY-MM-DD HH24:MI:SS''';
      -- Copy the rows that violated the constraint and were loaded by this job from
      -- the TABLE_ITEMS table to the TABLE_ITEMS_DUPLICATES table. The rows that
      -- were loaded by this job will have a LOAD_DATE column value that matches the
      -- LOAD_DATE job parameter value.
      INSERT INTO TABLE_ITEMS_DUPLICATES
        SELECT ID, NAME, LOAD_DATE FROM TABLE_ITEMS
         WHERE ROWID IN (SELECT ROW_ID FROM TABLE_ITEMS_EXCEPTIONS)
           AND LOAD_DATE = '#LOAD_DATE#';
      -- Delete the rows from TABLE_ITEMS that were copied to TABLE_ITEMS_DUPLICATES.
      DELETE FROM TABLE_ITEMS
       WHERE ROWID IN (SELECT ROW_ID FROM TABLE_ITEMS_EXCEPTIONS)
         AND LOAD_DATE = '#LOAD_DATE#';
      -- Try to enable the constraint again. This time the operation should be
      -- successful.
      EXECUTE IMMEDIATE 'ALTER TABLE TABLE_ITEMS ENABLE CONSTRAINT PK_TABLE_ITEMS_ID';
  END;
  -- Rebuild the non-unique index IDX_TABLE_ITEMS_NAME that was marked unusable
  -- during the load.
  EXECUTE IMMEDIATE 'ALTER INDEX IDX_TABLE_ITEMS_NAME REBUILD PARALLEL NOLOGGING';
END;
Utilizing Oracle date cache
You can configure the stage to take advantage of the Oracle date cache
feature by setting the Use Oracle date cache property to value
Yes. This feature is only available in bulk load mode
and can significantly improve the load performance when the input link has
one or more
Timestamp columns, those columns are used to
load values to
TIMESTAMP table columns and
there are many repeating values for those columns.
The connector prepares values for TIMESTAMP table columns in a format that requires them to be converted prior to
being stored in the table. When the date cache is utilized the result of
the conversion is cached so when the same value appears again on the input
the cached result is used instead of performing the conversion again.
You can specify the number of entries in the date cache through the
Cache size property. To avoid cache misses when looking up the values in the cache, set this property to a value no smaller than the
total number of distinct values expected to be encountered on the input
Timestamp link column.
You can also choose to disable the date cache if it becomes full during the load by setting the Disable cache when full property to Yes. One scenario where this can be useful is when you run the job repeatedly to load records to the target table and most of the time the total number of distinct input values is smaller than the selected cache size, but occasionally most of the input values are unique. In that case, disabling the cache when it becomes full will avoid performing lookups on the full date cache, which will improve the performance because each of those lookups would result in a cache miss.
Note that each player process for the stage will use its own date cache when the stage is configured to run in parallel execution mode.
In Information Server Version 9.1 when
Timestamp input link columns are used to load
DATE table columns instead of
TIMESTAMP table columns, the values provided by
the connector are in the format that does not require conversion prior to
storing them in the target table. Therefore, the use of date cache feature
will not provide much benefit in this case. In Information Server releases
prior to Version 9.1, the conversion will be needed in this case so the
use of date cache will provide benefits similar to when loading values to
TIMESTAMP target table columns.
Disabling redo logging
The database redo log tracks changes made to the database and is utilized in the database recovery process. When the connector is loading data to the target table, the redo log is updated to reflect the rows loaded to the table. In the case of a database storage media failure, after the database is restored from the most recent backup, the rows loaded by the connector since the backup was taken are restored from the redo log. This recoverability comes at a potential cost in performance, because the load operation takes longer to complete when the redo log needs to be maintained to capture the data loaded to the table.
To speed up the load operation performed by the connector you can disable the redo logging by setting the NOLOGGING attribute on the target table. You can set this attribute when you create the table, or if you wish to set it only during the load you can do that by using the Before SQL statement and After SQL statement connector properties. In the Before SQL statement property specify the statement ALTER TABLE table_name NOLOGGING and in the After SQL statement property specify the statement ALTER TABLE table_name LOGGING, where table_name is the name of the table to which the stage is loading data. Another option is to use the Disable logging connector property – when this property is set to Yes, the connector performs the load with redo logging disabled, without requiring you to change the logging attribute of the table.
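As a sketch, the two properties would carry statements like the following; the table name is a placeholder.

```sql
-- Before SQL statement property: suspend redo logging for direct path writes
-- to the table for the duration of the load.
ALTER TABLE my_table NOLOGGING;

-- After SQL statement property: restore the logging attribute once the load
-- completes, so subsequent changes are again captured in the redo log.
ALTER TABLE my_table LOGGING;
```

A backup of the table taken shortly after the load would then restore the recoverability that was given up while logging was disabled.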
Disabling redo logging for the bulk load can significantly improve the load performance. In case of the media failure the data that was loaded while redo logging was disabled can be restored from the database backup if the backup was taken after that load but it cannot be restored from the redo log. If restoring loaded data from the redo log in case of a media failure is not required, or if the data can be easily re-loaded in that case, then disabling redo logging should be considered for speeding up the load operation. Refer to the Resources section for the link to Oracle Database online documentation where you can find more information about redo log and considerations related to disabling the redo logging.
Loading records to a staging table
If you wish to use the connector in bulk load mode to quickly transfer the data to the target database, but at the same time you need the target table triggers, indexes and constraints to be maintained as if a conventional insert statement were used, one option to consider is to initially load the records to an intermediate staging table that has the same definition as the target table but does not contain any triggers, indexes or constraints. After the load to the staging table completes, issue an INSERT statement that reads rows from the staging table and writes them to the target table. This statement can be specified in the After SQL statement connector property so that it is automatically invoked at the end of the job. The use of the staging table will cause data to be written twice (once when loaded by the connector to the staging table and once when copied to the target table by the follow-up statement) but will, at the same time, have the following benefits:
- The records will be processed only once by the job. This is important when the processing of records in the job is resource intensive.
- The records will be moved from the DataStage engine hosts to the database host only once.
- The INSERT statement will be executing on the database side, so during that time the DataStage engine hosts will be available to process other workloads.
- You can use database SQL functions in the INSERT statement to perform additional transformations on the values that were loaded to the staging table prior to storing them in the target table.
- If the INSERT statement that copies the values fails for some reason, the loaded records will remain available in the staging table, and you can issue another INSERT statement after handling the rows in the staging table that caused the error. You can utilize the LOG ERRORS INTO clause in the INSERT statement, in which case the rows that cause the INSERT statement to fail are directed to an error logging table along with details indicating the reason for the error.
- The indexes, constraints and triggers on the target table will remain enabled, valid and enforced at all times.
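As a sketch of this approach, assuming hypothetical table names TARGET_TBL (the target) and TARGET_STG (the staging table), the setup and the follow-up statement might look like this:

```sql
-- Hypothetical names: TARGET_TBL (target), TARGET_STG (staging).
-- Create the staging table once, with the same columns but no
-- triggers, indexes or constraints:
CREATE TABLE TARGET_STG AS SELECT * FROM TARGET_TBL WHERE 1 = 0;

-- Create the error logging table (named ERR$_TARGET_TBL by default):
BEGIN
  DBMS_ERRLOG.CREATE_ERROR_LOG('TARGET_TBL');
END;
/

-- Follow-up statement (After SQL statement property): copy the loaded
-- rows with a conventional INSERT so that triggers, indexes and
-- constraints are maintained; rows in error go to the logging table.
INSERT INTO TARGET_TBL
  SELECT * FROM TARGET_STG
  LOG ERRORS INTO ERR$_TARGET_TBL ('staging copy') REJECT LIMIT UNLIMITED;
```

The LOG ERRORS INTO clause lets the copy continue past rows that violate constraints, leaving those rows in the error logging table for later inspection.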
You can configure Oracle Connector stage to run in manual load mode in which case the stage will write records to a file on the engine tier host that can then be passed to the Oracle SQL*Loader utility to load the records to the target table. To configure the stage in this mode set the Manual load property to Yes. Loading records in manual load mode is generally slower than loading them directly to the target table but in some cases can be very useful, as shown next.
One example would be if you want to run the job to extract and process the data at a time when the target database is not available for load. You can configure the stage to store the data in a file and then later, when the database is available, load the data using the SQL*Loader utility. The data will already be processed at that time and ready to be stored in the target table. In cases when the processing phase of the job takes a long time to complete, for example when pulling data from an external source that is slow or when complex, CPU-intensive transformations are involved, the manual load approach can result in significant time savings.
Note that the connector actually generates two files in this mode – the data file with the records to be loaded and the control file used to pass options to the SQL*Loader and inform it about the location and format of the data file. You can also transfer the generated files to a host that doesn't have DataStage installed and load the records from there, presuming that you have SQL*Loader tool available on that host. If that host is the Oracle server host the load operation will be local to the database.
Restarting the load and backing up the data that needs to be loaded is also convenient with this approach because the generated data file is already in the format supported by the SQL*Loader tool.
Note that when the stage is configured to run in manual load mode it will try to connect to the database even if it will not be loading records to the database on that connection. If the target database is not available to accept connections at the time when you run the job, you can work around the problem by pointing the stage to another database that is available at that time, so that the stage can establish a connection and complete the manual load operation.
If it is critical to ensure the stage does not connect to any database in manual load mode, you can accomplish that by ensuring that the following conditions are met:
- The connector is not collecting operational metadata. To ensure this, define the DS_NO_PROCESS_METADATA environment variable for the job and set it to a value that turns off the collection of operational metadata.
- The connector does not issue statements on the database that would be interpreted as events by the IBM InfoSphere Guardium Activity Monitor. To ensure this, make sure that you don't have the CC_GUARDIUM_EVENTS environment variable set for the job.
- The Run before and after SQL statements property must be set to No. This ensures that the connector is not trying to perform any SQL statements on the database prior to and after loading the records to a file.
- The Table action property must be set to Append. This ensures the connector is not trying to create, drop or truncate a database table prior to performing the load operation.
- The Perform operations before bulk load and Perform operations after bulk load properties must not specify any index and constraint maintenance operations.
When all of these conditions are met then the connector will load records to the data file without connecting to the database. You will still need to provide Username and Password property values though in order to be able to compile the job. You can set them to some dummy values or you can set the Use external authentication property to value Yes which will disable the Username and Password properties so you will not need to provide values for them.
Note that even if you configure the stage not to connect to the database, you will still need to have Oracle Client product installed on the DataStage engine host on which the connector runs, otherwise the connector will fail the initialization when the job starts and the job will fail.
Array size and Buffer size
When the Oracle Connector stage is configured to run in bulk load mode, it reads records from the input link and stores them in the Oracle specific direct path column array. The connector does not create this array in memory like the array that it uses for fetch and DML operations. Instead, Oracle client creates this array on behalf of the connector. The size of this array in number of records is specified through the Array size property. The default value is 2000.
When the column array becomes full the connector requests from Oracle client to convert the array to the internal stream buffer format and to send this buffer to the Oracle server. The size of this stream buffer (in kilobytes) is specified through the Buffer size connector property. The default value for this property is 1024 (1 megabyte).
Depending on the length of the records, it is possible that the specified Buffer size value will not be sufficient to store the Array size number of records in stream format. In that case, the Array size will be automatically reduced to a smaller value and Oracle Connector will log a message to indicate the newly enforced number of records that it will load at a time. Note that regardless of the specified Buffer size value the maximum Array size value will be 4096.
When configuring the Array size and Buffer size values in order to achieve optimal performance, you can apply the following approach:
1. Set the Array size to 4096 (the maximum supported) and the Buffer size to 1024.
2. Run the job and check whether the connector reported a message in the job log stating that it will load fewer than 4096 records at a time. If it did, go to step 3. If it did not, go to step 4.
3. Try increasingly larger Buffer size values until the performance worsens or the connector stops reporting the message, whichever takes place first.
4. Try increasingly smaller Buffer size values until the performance worsens.
Loading records with the stage in parallel execution mode
The connector can be configured to run in parallel execution mode and load data from multiple player processes with each player process loading a subset (partition) of input link records to the target table. In this case, parallel load must be enabled for the target table segment. To do this set the Allow concurrent load sessions connector property to value Yes. If this property is set to No and you attempt to load records from multiple player processes, you will receive Oracle error "ORA-00604: error occurred at recursive SQL level 1" possibly accompanied by the Oracle error "ORA-00054: resource busy and acquire with NOWAIT specified or timeout expired".
The exception to the above rule is when the stage is loading rows to a Range, List or Interval partitioned table (which can optionally be subpartitioned) and the Partition type setting for the link is set to Oracle Connector. This exception scenario is covered in the next section which discusses loading records to a partitioned table.
Setting Allow concurrent load sessions property to Yes has implications on how indexes are handled during load. For more information refer to the earlier section which discussed the handling of indexes.
Loading records to a partitioned table
When you wish to load records to a single partition or subpartition of the target table, set the Table scope connector property to the Single partition or Single subpartition value, and specify the name of the partition or subpartition in the Partition name or Subpartition name property. Doing this will result in locking only the segment of the specified partition or subpartition, and the connector will also report an error if any input record needs to be stored in another partition, thereby providing a mechanism for verifying that the records belong to the partition or subpartition specified in the stage.
When you wish to load records to multiple partitions of a table, set the Table scope connector property to the value Entire table. With respect to triggers, indexes and constraints defined on the table, the concepts that apply to non-partitioned tables apply in this case as well. When rebuilding indexes on partitioned tables, the connector will automatically rebuild only those index partitions or subpartitions that were marked unusable during the load process. This applies to both locally and globally partitioned indexes.
When you wish to load records in parallel and make sure that each player process is loading records to a dedicated table partition, you can do that for a table partitioned based on Range, List or Interval partitioning strategy (and is optionally subpartitioned) by specifying the Oracle connector value for the Partition type setting for the input link. In this case, each player process of the stage will be associated with a single dedicated table partition and will load data to that partition's segment (or its set of subpartition segments if the table is composite-partitioned) and because no two player processes will be accessing the same segment, the Allow concurrent load sessions connector property can be set to No. The exception is when the table has a globally partitioned index. In that case, the job will fail with the Oracle error "ORA-26017: global indexes not allowed for direct path load of table partition string".
When considering using Oracle connector value for Partition type when loading records to a partitioned table, consult the list provided in Part 1 article in the section that covered DML operations on a partitioned table.
Reject link considerations
Reject links are a convenient mechanism for handling input records for which the database reported an error when it tried to perform the operation specified in the stage. You can specify conditions on the reject link under which the records in error should be rejected, you can choose to include error code and error text information with the rejected records, and you can point the reject link to another stage to store the rejected records in a location of your choice, such as a file or another database table.
However, reject links can have negative effect on the performance. When the stage has a reject link defined, it needs to inspect the error code for each record reported by the database to be in error, compare it to the conditions defined on the reject link and submit the record to the reject link, potentially accompanied by the error code and error text information. These operations consume system resources and affect the job performance as a whole.
You saw earlier that in order to use reject links to handle records with
constraint violations in bulk load write mode, you must run the stage in
sequential execution mode. Doing this will result in rejecting all the
rows from the table that violated the constraints, including those rows
that were present in the table before the load. Also, when you run
INSERT statements with the
APPEND_VALUES optimization hint you cannot rely
on the reject link functionality.
To avoid these limitations, consider not using reject links at all, and instead try to eliminate the records in error BEFORE they reach the Oracle Connector stage or AFTER the stage has completed the operation for which it was configured.
As an example of handling records in error BEFORE they reach the Oracle
Connector stage, let us consider the
CHECK constraints in a table created using
the statement shown in Listing 7.
Listing 7. Statement to create test table
CREATE TABLE TABLE_TEST(C1 NUMBER(10) CONSTRAINT CONSTR_NOT_NULL_C1 NOT NULL, C2 VARCHAR2(20), C3 DATE, CONSTRAINT CONSTR_CHK_C1 CHECK (C1 > 20000 AND C1 < 70000));
The constraints on this table require that the values for field C1 are not null and that they are between 20000 and 70000 (exclusive). Instead of configuring
the Oracle Connector to use a reject link for handling these constraints,
the constraint checking can be done in the stages upstream of the Oracle
Connector stage. For example, you can use the Filter stage with one output
link leading to the Oracle Connector stage and another reject link leading
to a Sequential File stage (or any other stage to which you wish to direct
the bad records). In the Where Clause property of the Filter
stage you would specify a condition as the one shown in Listing 8.
Listing 8. Filter stage condition
C1 IS NOT NULL AND C1 > 20000 AND C1 < 70000
You would also set the Output Rejects property of the Filter stage to value True. The Filter stage would pass to its output link the records that meet this constraint, and would send the remaining records to the reject link.
This configuration is shown in Figure 2.
Figure 2. Handling constraint violations with a Filter stage
For more complex constraint expressions, or to annotate rejected records with information describing why they were rejected, you could use the Transformer stage. For the previous example, the Transformer stage would have one input link (for example, input_records) with columns C1, C2 and C3, and would have three output links. All three output links would copy the columns from the input link. The first output link would have the constraint IsNull(input_records.C1) defined and an extra column containing the constant literal value "C1 value is null.". The second output link would have the constraint input_records.C1 <= 20000 Or input_records.C1 >= 70000 defined and an extra column containing the literal value "C1 value: " : input_records.C1 : " is not between 20000 and 70000.".
The first and the second link would lead to the Funnel stage that would
combine the records on them and store them to a reject file using the
Sequential File stage. The third output link would lead to the Oracle
Connector stage that would write them to the table. This configuration is
shown in Figure 3.
Figure 3. Handling constraint violations with a Transformer stage
The following are some examples of handling records with errors AFTER the stage has completed moving the records to the database:
- When bulk loading to the target table with constraints, load the records instead to a staging table that doesn't have the constraints defined on it, and then issue a follow-up INSERT statement to move the records to the target table. Specify the LOG ERRORS INTO clause in the INSERT statement so that the records in error are directed to an error logging table. For more information, refer to the earlier section that covered loading of records to a staging table.
- When bulk loading to the target table with constraints, configure the stage to disable the constraints before the load and enable them after the load, specify the exceptions table but do not configure the stage to automatically process the exceptions. Instead, process them manually after the job completes. Alternatively, perform custom handling of constraints and indexes using a strategy such as the one suggested earlier in the article.
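As a sketch of the manual exceptions processing mentioned above, using the TABLE_TEST table and CONSTR_CHK_C1 constraint from Listing 7 (the EXCEPTIONS table name is the default created by the utlexcpt.sql script shipped with the database):

```sql
-- One-time setup: create the EXCEPTIONS table using the script
-- shipped with the Oracle database.
@?/rdbms/admin/utlexcpt.sql

-- Re-enable the constraint after the load. The statement reports an
-- error if violating rows exist, but those rows are recorded in the
-- EXCEPTIONS table so that they can be processed manually.
ALTER TABLE TABLE_TEST
  ENABLE VALIDATE CONSTRAINT CONSTR_CHK_C1
  EXCEPTIONS INTO EXCEPTIONS;

-- Inspect the offending rows through the recorded row identifiers.
SELECT t.*
FROM TABLE_TEST t
WHERE t.ROWID IN (SELECT ROW_ID FROM EXCEPTIONS);
```

After correcting or removing the offending rows, the ALTER TABLE statement can be issued again to bring the constraint back into the enabled, validated state.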
In this section, we will look at how selection of data types for the link columns can affect the connector performance. The default link column data types are provided by the connector when it imports the table definition. They are also provided by the connector at runtime when the stage has an output link without columns defined on it and the runtime column propagation feature is enabled for the link.
The default link column data types represent the closest match for the corresponding table column data types in terms of the ranges of values that they support. In some cases, you may have an additional insight regarding the format and range of the actual data values processed by the stage which may in turn allow you to select data types for the link columns that result in better performance in the connector. In this section, a number of such scenarios are examined.
You should always start with the link columns of default data types. To
determine the default data types, use the connector to import the table
that you will be accessing from the stage and look at the column
definitions in the imported table definition. If you configured the stage
to issue a complex
SELECT statement that
references multiple tables then you will not be able to import a single
table definition for that statement. What you can do in that case is
create a temporary view based on that statement and then import the view
instead. When your stage is configured to execute a PL/SQL anonymous
block, you will not be able to import a table or a view to determine the
default link column data types. In this case, you can analyze the PL/SQL
block and for each bind parameter in the PL/SQL block (specified in
ORCHESTRATE.param_name format) determine the
object in the database to which it corresponds. In some cases, this object
will be a column in a table or a view, in some cases it will be a
parameter of a stored procedure, but in all cases that object will have an
associated Oracle data type. Create a temporary table with the columns of
those same data types and import that table.
Refer to the Resources section for the link to the Information Server Information Center where you can find the table that shows how the Oracle Connector maps Oracle data types to DataStage data types by default.
Once you have placed columns of default data types on the link, run the job and measure the performance. Then apply the information from the remainder of this section to further optimize the job by changing link column data types in such a way so that the performance improves while no loss of data occurs.
Representing CHAR, VARCHAR2, NCHAR and NVARCHAR2 table column data types
In this section, we will look at the CHAR, VARCHAR2, NCHAR and NVARCHAR2 Oracle data types. When you use Oracle Connector to import tables with columns of these four data types, the connector will model CHAR and VARCHAR2 table columns as Char and VarChar DataStage link columns, and it will model NCHAR and NVARCHAR2 table columns as NChar and NVarChar DataStage link columns. The exception to this rule is the case when the table column definitions are based on the character length semantics and the database character set is a multi-byte character set such as AL32UTF8. In that case, CHAR and VARCHAR2 columns are modeled as NChar and NVarChar link columns.
If your database character set is a multi-byte character set such as AL32UTF8, your table columns use character length semantics (for example, a column whose length is defined with the CHAR qualifier), and all the actual data values use only single-byte characters, then use Char and VarChar columns on the link (with the Extended attribute left blank) instead of the default NChar and NVarChar columns.
Minimize the character set conversions in the stage by ensuring that the
NLS_LANG environment variable value
for the job is compatible with the NLS map name specified for the job.
This is important not just for the performance reasons, but for the
correct functioning of the stage in regard to interpretation of character
set encoding of the values. To determine the effective
NLS_LANG value for the job, run the job with
CC_MSG_LEVEL environment variable set to
2 and inspect the job log. The connector will
log the effective
NLS_LANG environment variable
value. If you wish to use a different value, you will need to define
NLS_LANG environment variable for the job and
set it to the desired value. The Oracle Connector stage editor does not have an NLS tab, so the NLS map name selected for the job is used for the stage. You can specify this value for the job by opening the job in the DataStage Designer, selecting the Edit menu option, clicking Job properties, opening the NLS page in the dialog and providing the value in the Default map for stages setting. Table 1 shows examples of compatible NLS map name and NLS_LANG values. Note that you should set the language and territory portions of the NLS_LANG value to match the locale of your DataStage engine host.
Table 1. Compatible NLS map name and NLS_LANG values
|NLS map name|Compatible NLS_LANG|
If the values processed by the stage are based on only ASCII characters, set the NLS map name to ASCL_ASCII and set the NLS_LANG value to AMERICAN_AMERICA.US7ASCII. Otherwise, if possible, set the NLS_LANG value to match the database character set. For example, if the stage is writing records to a database with the AL32UTF8 database character set, use the NLS_LANG value of AMERICAN_AMERICA.AL32UTF8 and make sure the NLS map name for the job is set to UTF-8. To determine the database character set you can connect to the database from SQL*Plus and issue the statement shown in Listing 9.
Listing 9. Determining the database character set
SELECT VALUE FROM NLS_DATABASE_PARAMETERS WHERE PARAMETER='NLS_CHARACTERSET';
Always make sure that the values in the job represented by Char and VarChar link columns are actually encoded as indicated by the NLS map name and NLS_LANG settings, and that the values for NChar and NVarChar link columns are UTF-16 encoded. Note that the connector handles Char and VarChar link columns which have the Extended attribute set to Unicode the same way as NChar and NVarChar link columns.
Make sure that all VarChar and NVarChar link columns have the Length attribute set to a concrete value. Leaving the Length attribute empty can have a severe negative impact on the performance of the stage. When the Length is not specified, the connector assumes that the length is the maximum allowed for the respective bind parameters, which is 4000 bytes. The connector will therefore allocate memory based on the assumed column lengths, which may result in large memory consumption. Suppose that the connector is configured to insert rows into a database table with an Array size of 5000, and the link contains 50 VarChar columns that do not have the Length attribute set. The connector will allocate (in each player process) approximately 50 (columns) x 5000 (array size) x 4000 (maximum value length) bytes, which is close to 1 gigabyte. If the actual table columns in the database are defined as VARCHAR2(50 BYTE), then by setting the Length attribute of the link columns to 50, the Array size could be left at 5000 and the size of the allocated memory per player process would drop to 50 (columns) x 5000 (array size) x 50 (maximum value length) bytes, which is about 12 megabytes.
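The arithmetic from the example above can be verified with a quick query from any SQL prompt:

```sql
-- Per-player memory estimate: columns x array size x assumed value length.
SELECT 50 * 5000 * 4000 AS bytes_length_unset,  -- 1000000000, close to 1 GB
       50 * 5000 * 50   AS bytes_length_50      -- 12500000, about 12 MB
FROM DUAL;
```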
Do not use LongVarChar and LongNVarChar link columns to represent VARCHAR2 and NVARCHAR2 table columns. These link columns are appropriate for handling table columns that hold large values, such as XMLType columns, as discussed later in the text. Even in those cases, if the actual values for those columns are less than 4000 bytes in size, VarChar and NVarChar link columns should be used instead of LongVarChar and LongNVarChar link columns. When LongVarChar and LongNVarChar link columns are used, the connector disables array processing and enforces an Array size value of 1, even if you specify small Length attribute values for those link columns. Enforcing an Array size of 1 can have a major negative impact on the performance.
Ensure that the connector is using the optimal character set form for the link columns. Character set form is an Oracle database concept and can be one of two values: IMPLICIT and NCHAR. The optimal character set form for the connector to use for CHAR and VARCHAR2 table columns is IMPLICIT, and for NCHAR and NVARCHAR2 table columns it is NCHAR. The connector will automatically use the optimal character set form if you use Char and VarChar link columns for CHAR and VARCHAR2 table columns and NChar and NVarChar link columns for NCHAR and NVARCHAR2 table columns. As stated earlier, Char and VarChar link columns with the Extended attribute set to Unicode are treated as NChar and NVarChar link columns.
In some cases, you may end up using link columns that result in a sub-optimal character set form selection in the connector. For example, when you use the connector to import a table with VARCHAR2 columns that are based on character length semantics and the database is based on the AL32UTF8 (multi-byte) character set, the connector will import those columns as NVarChar link columns. By importing them this way the connector is able to set the Length attribute of the link column to match the length specified in the table column definition while ensuring that data truncation does not happen at runtime. At runtime the connector will choose the NCHAR character set form for such a link column, which is not the optimal choice for a VARCHAR2 table column. If this link column happens to be referenced in a WHERE clause in the statement specified for the stage, then any index defined on this table column will not be utilized. This applies to SELECT statements used by the stage configured for sparse lookup mode, as well as DML statements when the connector is configured to write records to the database in Insert, Update, Delete, Insert then update, Update then insert or Delete then insert write modes. The skipping of the index in this case can have significant performance implications. To force the connector to use the optimal character set form in this case, you could apply one of the following approaches:
- Change the link column data type from NVarChar to VarChar. In this case, you will also need to update the Length attribute for this link column by multiplying it by 4, because UTF-8 characters can consume up to 4 bytes. As we mentioned before, if you know that all the actual data values processed by the stage use single-byte characters, then you can leave the Length attribute unchanged.
- Change the data type of the table column to NVARCHAR2. In this case, the NCHAR character set form selected by the connector will be the optimal choice. In many practical cases though, making changes to the existing column definitions in the database tables is not an option.
- Utilize the CC_ORA_BIND_FOR_NCHARS environment variable. Set the value of this environment variable for the job to a comma-separated list of link column names for which you wish the connector to use the NCHAR character set form. If you specify the special value (none) for this environment variable, then the connector will use the IMPLICIT character set form for all columns on the link. Another special value is (all) (the parentheses are part of the value), which will result in the connector using the NCHAR character set form for all the columns on the link. If none of the tables in your database have NCLOB columns (discussed later), then you can define the CC_ORA_BIND_FOR_NCHARS environment variable at the DataStage project level and specify (none) as its default value (again, the parentheses are part of the value). That way the connector will always use the IMPLICIT form, which will always be the optimal character set form.
Representing NUMBER, FLOAT, BINARY_DOUBLE and BINARY_FLOAT table column data types
In this section we will take a look at the NUMBER, FLOAT, BINARY_DOUBLE and BINARY_FLOAT Oracle data types.
For a NUMBER(p, s) table column where both the precision p and scale s are explicitly specified, the default and at the same time the optimal link column data type is Decimal with the Length attribute set to p and the Scale attribute set to s. The exception to this rule are the cases where s < 0 and where s > p. In those cases, the values need to be provided to the stage and retrieved from the stage in text format, so the link column should be of the VarChar type, with the Length attribute set to a value sufficiently large to accommodate all possible values for the respective column in the table.
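To illustrate the s < 0 exception case, consider a column with a negative scale (the table name here is hypothetical):

```sql
-- NUMBER(5,-2) stores up to 5 significant digits, rounded to hundreds.
CREATE TABLE NUM_SCALE_TEST (C1 NUMBER(5,-2));

INSERT INTO NUM_SCALE_TEST VALUES (12345);

-- The value is stored as 12300: the database rounds it to a multiple of 100.
SELECT C1 FROM NUM_SCALE_TEST;
```

Values like these fall outside what a Decimal link column with the same precision and scale can represent, which is why such columns are exchanged as text through a suitably sized VarChar link column.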
For a NUMBER(p) table column where the precision p is explicitly specified but the scale s is omitted, the default link column data type is Decimal with the Length attribute set to p and the Scale attribute left empty. If the remaining stages in the job are operating on integer values for these columns, then you may be able to achieve better performance if you preserve those integer data types for the columns on the link of the Oracle Connector stage than if you change them to the Decimal data type. This is especially the case when the connector stage is writing to a partitioned table, the corresponding table column is used as part of the partition key of the table, and you specified the Oracle connector value for the Partition type setting of the link. If you decide to replace Decimal(p) link columns with integer link columns, keep in mind the differences between these link data types. Although they all represent integer values, the ranges of values that they support are different.
For example, if a table column is defined as NUMBER(5) and you choose to replace the default Decimal(5) link column with a SmallInt link column, the values supported by the NUMBER(5) table column will be in the range [-99999, 99999] and the values supported by the SmallInt link column will be in the range [-32768, 32767]. If you set the Extended attribute of the link column to Unsigned, the values supported by the link column will be in the range [0, 65535]. In both cases, the range of the link column will be a sub-range of the respective table column, so you should only use this approach if you are sure that all the values that will be processed by the stage belong to both ranges. In the previous example, you may choose to use a SmallInt link column (signed or unsigned) when writing values to the database, but to use an Integer column (signed) when reading values from the database, because the range of supported values for the Integer (signed) link column is [-2147483648, 2147483647] and is therefore sufficient to support any value supported by the NUMBER(5) table column. Note that if you choose to use a BigInt column on the link, the range of supported values for the link column will be [-9223372036854775808, 9223372036854775807], and if the Extended attribute is set to Unsigned the range will be [0, 18446744073709551615]. However, the choice of a BigInt link column will likely result in smaller performance gains than if a SmallInt or Integer link column is used, and may even produce worse performance than the Decimal link column. This is because the connector exchanges BigInt link column values with the database as text values, which involves additional data conversions.
NUMBER table column where both
p and scale
s are omitted, the default link column
data type is
Double. But if you are certain
that the actual values in the table are all integer values, you may choose
to use a link column of one of the integer data types. Alternatively,
you may choose to represent the values as decimal values with a specific
precision and scale by using a link column of
Decimal data type. Keep in mind that the
supported ranges of these link column data types are not fully compatible
so only use these link column data types in cases when they will not
result in the loss of data precision due to the rounding or in cases when
such a rounding is acceptable.
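The range checks above can be sketched outside DataStage. The following Python snippet (illustrative only, with hypothetical helper names) tests whether sample values fit the narrower link data type ranges quoted above before you replace the default Decimal(5) link column:

```python
# Illustrative sketch (not part of DataStage): verify that sample values from a
# NUMBER(5) table column fit the range of a narrower integer link data type
# before replacing the default Decimal(5) link column.

# Value ranges quoted in the article for the DataStage link data types.
RANGES = {
    "SmallInt":          (-32768, 32767),
    "SmallInt Unsigned": (0, 65535),
    "Integer":           (-2147483648, 2147483647),
    "Decimal(5)":        (-99999, 99999),   # range of a NUMBER(5) table column
}

def fits(values, link_type):
    """Return True if every value fits the range of the given link data type."""
    lo, hi = RANGES[link_type]
    return all(lo <= v <= hi for v in values)

sample = [-120, 0, 45000]
print(fits(sample, "SmallInt"))   # False: 45000 exceeds 32767
print(fits(sample, "Integer"))    # True: Integer covers the full NUMBER(5) range
```

As the output shows, SmallInt would silently be too narrow for this sample, while Integer is safe for any NUMBER(5) value.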
For BINARY_FLOAT and BINARY_DOUBLE table columns, the default and
optimal column types to use for the corresponding link columns are
Float and Double, respectively.
Representing DATE, TIMESTAMP and INTERVAL table column data types
In this section we will take a look at the DATE, TIMESTAMP and INTERVAL Oracle data types.
For DATE, TIMESTAMP(0),
TIMESTAMP(0) WITH TIME ZONE and
TIMESTAMP(0) WITH LOCAL TIME ZONE table columns
the default link column data type is
Timestamp with the Extended attribute left empty. This is the optimal link column type
to use in case when hour, minute and second portions of the values need to
be preserved. Otherwise, the
Date link column
will likely result in better performance because the stage does not need
to handle hours, minutes and seconds in the values. Note that if you need
to preserve time zone information in the values you need to use
character-based link columns such as
VarChar. This is
covered in more detail later in this section.
For TIMESTAMP(p),
TIMESTAMP(p) WITH TIME ZONE and
TIMESTAMP(p) WITH LOCAL TIME ZONE
table columns where precision p is greater than
0, the default link column data type is
Timestamp with the Extended attribute set to
Microseconds. The processing of Timestamp link
columns with the Extended attribute set to Microseconds takes considerably
more time in the Oracle Connector than the processing of
Timestamp link columns which have the Extended
attribute left empty. This is especially the case for
SELECT and DML
(INSERT, UPDATE and DELETE) statements and to a lesser extent the bulk
load operation. Avoid the use of Timestamp link
columns with the Extended attribute set to Microseconds unless you absolutely
need to support fractional seconds in the timestamp values.
The connector models
TIMESTAMP(p),
TIMESTAMP(p) WITH TIME ZONE and
TIMESTAMP(p) WITH LOCAL TIME ZONE
table columns with
p > 6 as
Timestamp link columns with Extended attribute
set to Microseconds. The connector will in this case perform truncation of
fractional seconds to the microsecond precision. To handle timestamp
values with fractional second precision greater than 6 up to and including
9, use the
VarChar link column. In that case,
the connector will exchange timestamp values with Oracle client as text
values. Ensure in this case that the
NLS_TIMESTAMP_FORMAT session parameter is set
to match the format of the actual values provided to the stage on its
input link. The same format will be used by the stage for the values that
it provides on its output link. You can set this session parameter using
the Before SQL (node) statement property, by issuing an
ALTER SESSION statement. For example, to handle
timestamp values as text values with 9 fractional digits, a statement like
the one shown in Listing 10 could be specified in
the Before SQL (node) statement property.
Listing 10. Setting the timestamp format with 9 fractional seconds for the session
ALTER SESSION SET NLS_TIMESTAMP_FORMAT='YYYY-MM-DD HH24:MI:SS.FF9'
You can apply this same approach for
INTERVAL YEAR(yp) TO MONTH and
INTERVAL DAY(dp) TO SECOND(fp)
table columns. For
TIMESTAMP(p) WITH TIME ZONE table
columns set the
NLS_TIMESTAMP_TZ_FORMAT session
parameter to the appropriate format, which includes the time zone format
if the time zone needs to be preserved in the values. The
INTERVAL YEAR(yp) TO MONTH and
INTERVAL DAY(dp) TO SECOND(fp)
values should be specified as interval literals in the format defined in
the Oracle database. Refer to the Resources
section for the link to the Oracle Database online documentation where you can
find more details and examples for these
Oracle data types and interval literals.
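As an illustration of the interval literal text that would flow through a VarChar link column, the following Python sketch (a hypothetical helper, not part of the connector) renders a duration in the day-to-second layout that Oracle interval literals use:

```python
# Illustrative sketch: format a duration as the text of an Oracle
# DAY TO SECOND interval literal, suitable for a VarChar link column.
# The exact layout accepted by the database depends on the column's
# declared day and fractional-second precision.
from datetime import timedelta

def day_to_second_literal(delta: timedelta) -> str:
    """Render a timedelta as signed 'DD HH:MI:SS' interval text."""
    total = int(delta.total_seconds())
    sign = "-" if total < 0 else "+"
    total = abs(total)
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{sign}{days:02d} {hours:02d}:{minutes:02d}:{seconds:02d}"

print(day_to_second_literal(timedelta(days=3, hours=4, minutes=5, seconds=6)))
# +03 04:05:06
```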
If you are using
Timestamp (with Microseconds)
link columns because you need to preserve microseconds in timestamp
values, you can again apply this same approach as it can potentially lead
to improved performance. For example if you are fetching values from a
TIMESTAMP table column and the downstream stage
is capable of handling timestamp values in text format, then replace the
Timestamp (with Microseconds) column on the
output link of Oracle Connector stage with a
VarChar column so that the connector provides
timestamp values in text format. Likewise, if you are inserting values to
TIMESTAMP table column and the upstream stage
can provide the timestamp values in text format, replace the
Timestamp (with Microseconds) column on the input link of the
Oracle Connector stage with a
VarChar column so
that the connector accepts timestamp values in text format. The Length
attribute of the
VarChar column and the
NLS_TIMESTAMP_FORMAT session variable must be
appropriate for the format of the timestamp text values. For example, to
use the default YYYY-MM-DD HH24:MI:SS.FF6
format for the timestamp values, set the Length attribute of the
VarChar column to 26 and specify the statement
shown in Listing 11 in the Before SQL (node)
statement connector property.
Listing 11. Setting the timestamp format with 6 fractional seconds for the session
ALTER SESSION SET NLS_TIMESTAMP_FORMAT='YYYY-MM-DD HH24:MI:SS.FF6'
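As a quick sanity check on the Length attribute of 26, the following Python sketch (illustrative only, not connector code) renders a timestamp in the same layout that Listing 11 configures and confirms the text is 26 characters long:

```python
# Illustrative check: a timestamp rendered in the YYYY-MM-DD HH24:MI:SS.FF6
# layout from Listing 11 is always 26 characters long, which is why the
# Length attribute of the VarChar link column is set to 26.
from datetime import datetime

def to_link_text(ts: datetime) -> str:
    """Render a timestamp the way it would travel as text on the link."""
    return ts.strftime("%Y-%m-%d %H:%M:%S.%f")  # %f yields 6 fractional digits

value = datetime(2013, 7, 4, 12, 30, 45, 123456)
text = to_link_text(value)
print(text)        # 2013-07-04 12:30:45.123456
print(len(text))   # 26
```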
Even in the cases when the remaining stages of the job are not capable of
handling timestamp values with microseconds in text format and require
that the Timestamp (with Microseconds) link column is used, you can
choose to map this column to a VarChar(26) column
explicitly for the Oracle Connector stage by using a Transformer stage as follows:
- If the link column is located on the output link of the Oracle
Connector stage and the link name is out_link, insert a Transformer
stage between the Oracle Connector stage and the next stage in the job
flow and map the
VarChar(26) column on the input link of the Transformer stage to the
Timestamp (with Microseconds) column on the output link of the Transformer stage using the derivation shown in Listing 12.
Listing 12. Converting string to timestamp
- If the column is located on the input link of the Oracle Connector
stage and the link name is in_link, insert a Transformer stage between
the previous stage in the job flow and the Oracle Connector stage and map the
Timestamp (with Microseconds) column on the input link of the Transformer stage to the
VarChar(26) column on the output link of the Transformer stage using the derivation shown in Listing 13.
Listing 13. Converting timestamp to string
Representing RAW table data type
In this section we will take a look at the
RAW Oracle data type.
Use VarBinary link columns to read and write
RAW table column values in their native raw
binary form. For the Length attribute of the link columns specify the
sizes of the respective
RAW table columns.
Use VarChar link columns if you need to
read or write binary values as sequences of hexadecimal digit pairs. In
this case, make sure to set the Length attribute of the link columns to a
value twice the size of the respective
RAW table columns because each byte in a
RAW column value will be represented by a pair of hexadecimal digit (0-9 and
A-F) characters. The use of VarChar link
columns will allow you to read and write RAW
table column values in hexadecimal digit pair form without a need to
handle the conversions in another stage in the job.
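To illustrate the length doubling, the following Python sketch (hypothetical helper names, not connector code) converts raw bytes to and from the hexadecimal digit pairs a VarChar link column would carry:

```python
# Illustrative sketch: a RAW value carried as hexadecimal digit pairs.
# Each byte becomes two characters (0-9, A-F), which is why the VarChar
# link column Length must be twice the size of the RAW table column.
import binascii

def raw_to_hex(value: bytes) -> str:
    """Encode raw bytes as uppercase hexadecimal digit pairs."""
    return binascii.hexlify(value).decode("ascii").upper()

def hex_to_raw(text: str) -> bytes:
    """Decode hexadecimal digit pairs back into raw bytes."""
    return binascii.unhexlify(text)

raw = b"\x01\xab\xff"
hex_text = raw_to_hex(raw)
print(hex_text)                       # 01ABFF
print(len(hex_text) == 2 * len(raw))  # True
print(hex_to_raw(hex_text) == raw)    # True
```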
Representing LONG, LOB and XMLType table column data types
In this section we will take a look at the
LONG, LONG RAW, LOB and XMLType Oracle data types.
The preferred link column data types for these table column data types are shown in Table 2.
Table 2. Preferred link column data types for LONG, LOB and XMLType table column data types
| Table column data type | Preferred link column data type |
| --- | --- |
| BLOB, LONG RAW | LongVarBinary |
| CLOB, LONG, XMLType | LongVarChar |
You can use other combinations of link and table column data types as well but doing that may negatively affect the performance and will likely require you to use additional data type conversion functions in your SQL statements in order to avoid data type mismatch errors.
Every time you have
LongVarBinary, LongVarChar or LongNVarChar columns defined on the link the
connector enforces an Array size value of 1 for that link,
which has a negative effect on performance. This is the main reason why
these link columns should only be used
when handling the table column data types covered in this section.
When the values represented by the
LongVarBinary or LongVarChar link columns are 4000 bytes long
or less, you should use VarBinary or VarChar
link columns instead, respectively, even when the corresponding table
columns are of the data types discussed in this section. When you do that
the connector will not enforce the Array size value of 1 which
will typically result in significant performance improvements. Also the
guidelines presented earlier for character and
binary table column data types will apply in that
case instead of the guidelines presented in the remainder of this section.
When you configure the connector stage to fetch values from a table column
represented by a
LongNVarChar link column, you have two choices
for how the connector should pass the fetched values for that column to
the output link. You can configure the stage to pass the values inline
which means that each value is passed in its original form to the output
link. Alternatively, you can configure the stage to pass the values by
reference which means that a locator string representing the location of
each value is passed instead of the actual value. The actual value
represented by the locator string is then fetched by the last downstream
stage in the job that stores this value to its intended destination. The
locator string contains information that specifies how to retrieve the
actual value from its source.
By default, the Oracle Connector stage passes values inline. To pass the
values by reference you need to set the Enable LOB references
connector property to Yes, and then in the Columns
for LOB references connector property specify the names of the
link columns for which you wish the connector to pass the values by
reference. You can use the following guidelines when deciding to pass
values by reference or inline for the
LongNVarChar link columns:
- Do the values only need to be passed through the job or do they also need to be interpreted (parsed or modified) by certain stages in the job? In the latter case, do not pass the values by reference because it is not the values that are passed in that case but the locator strings that represent them.
- Which stage in the job is the consumer of the values? Only stages of
connector-based stage types such as WebSphere MQ Connector, DB2
Connector, ODBC Connector, Teradata Connector and Oracle Connector are
capable of recognizing locator strings and resolving them to actual
values. Stages of other stage types will treat locator strings as the
actual field values. For example, you can use the Oracle Connector stage to pass a
CLOB value by reference to a DB2 Connector stage to store the value to a
CLOB column in the DB2 database. But if you pass that value by reference to a Sequential File stage, the locator string will be stored in the target file and not the actual
CLOB value. You can use the latter scenario to check the size of the locator strings so that you can compare them to the average size of the actual values and estimate the savings in terms of the amount of data transferred through the job when the locator strings are passed instead of the actual values. Typically the locator strings will be less than 2000 characters in length.
- How large are the actual values? The larger the values are, the more likely that passing them by reference will result in performance gains. As a rule of thumb, values up to about 100 kilobytes should be passed inline; larger values should be passed by reference. Understanding the specifics of your environment in combination with the actual testing that you perform should help you with this decision. For example, if the Oracle Connector stage that reads the values from the database runs on a DataStage engine host which is different from the host on which the downstream stage consumes those values, then passing values by reference may provide performance benefits even for smaller actual values (as long as they are still larger than the locator strings) because they will need to be transferred over the network as they are exchanged between the player processes of the respective stages.
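The trade-off in the last guideline can be estimated with simple arithmetic. The following Python sketch (illustrative only, assuming the roughly 2000-character locator string size mentioned above) compares the bytes moved between player processes when values are passed inline versus by reference:

```python
# Illustrative back-of-the-envelope estimate (not a connector API): compare
# the bytes moved through a link for one LOB column when values are passed
# inline versus by reference. The 2000-byte locator size is an assumption
# based on the typical locator string length quoted above.
def transferred_bytes(row_count, avg_value_bytes, by_reference,
                      locator_bytes=2000):
    """Estimate link traffic for one LOB column across all rows."""
    per_row = locator_bytes if by_reference else avg_value_bytes
    return row_count * per_row

rows = 10000
avg_lob = 500 * 1024  # assume a 500 KB average CLOB value
inline = transferred_bytes(rows, avg_lob, by_reference=False)
by_ref = transferred_bytes(rows, avg_lob, by_reference=True)
print(inline // by_ref)  # 256: inline moves ~256x more data in this scenario
```

For values smaller than the locator string itself, the same arithmetic shows that passing by reference would move more data, not less.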
Note that DataStage imposes a limit on the size of records transferred
between the player processes in the job. The
APT_DEFAULT_TRANSPORT_BLOCK_SIZE
DataStage environment variable can be used to control this limit and by
default it is set to
131072 bytes (128
kilobytes). For example, if you pass link column values inline and the
values consume hundreds of kilobytes then you will need to override this
environment variable and set it to a sufficiently large value, otherwise
you will receive error message at runtime indicating that the record is
too big to fit in a block. Set the value for this environment variable to
accommodate the largest record that will be passed through the job. Do not
set it to a value larger than necessary because the value will apply to
all the player connections on all the links in the job so the increase of
this value may lead to a significant increase of the memory usage in the
job. The Oracle Connector stage handles values represented by
LongVarBinary, LongVarChar and LongNVarChar link columns as follows:
- For the read mode (fetching) it utilizes one of the following two Oracle mechanisms - piecewise fetch or LOB locators. The connector inspects the source table column data type and automatically selects the mechanism that is more suitable for that data type.
- For the bulk load write mode it utilizes the Oracle direct path interface when loading large values. Again the connector does this automatically and there are no additional options to consider.
- For the write modes other than bulk load it utilizes one of the
following two Oracle mechanisms - piecewise inserts/updates or OCI LOB
locators. By default it uses piecewise inserts/updates. To force the
connector to use OCI LOB locators for some of the link columns,
specify those link column names in a comma-separated list and set that
list as the value of the
CC_ORA_LOB_LOCATOR_COLUMNS environment variable for the job. To use the OCI LOB locator mechanism for all
LongVarBinary, LongVarChar and LongNVarChar link columns, specify the special value
(all) for this environment variable (the parentheses are included with the value). Note that for the link columns that represent
LONG RAW target table columns the connector must be configured to use the piecewise insert/update mechanism. Also note that for the link columns that represent
XMLType target table columns the connector must be configured to use the OCI LOB locator mechanism if the values are larger than 4000 bytes. Do not confuse LOB locators with the locator strings used when passing values by reference. The former is an Oracle mechanism for accessing large values in the database, the latter is a DataStage mechanism for propagating large values through the job.
The character set form and character set conversion considerations that
affect the performance and that were presented earlier in the section which covered
character-based table column data types apply here as well. Simply assume the use of
LongVarChar and LongNVarChar link columns instead of the
VarChar and NVarChar link columns as you take those considerations into account.
As we saw in this section, to populate a target table with records that
contain LongVarBinary, LongVarChar or LongNVarChar link columns you have three
different options: bulk load mode, insert mode with piecewise insert
mechanism and insert mode with OCI LOB locator mechanism. The three modes
are available when the values arrive to the stage as inline values as well
as when they arrive as locator strings (values passed by reference).
Typically, the bulk load mode will have the best performance, followed by
the insert mode with piecewise insert mechanism and then the insert mode
with OCI LOB locator mechanism. Depending on the specifics of your
environment this may not always be the case and one way to determine which
option provides the best performance is to run a series of tests and
compare the results. The exception is the case when the target table has
columns of the XMLType data type in which case bulk
load mode should not be considered if the
BINARY XML or
OBJECT RELATIONAL storage type (as opposed to
the CLOB storage type) is used for those
XMLType columns or if the table is defined as an
XMLType object table.
When fetching, inserting or updating
XMLType values, utilize Oracle XML functions such as
CREATEXML. Refer to the Resources section for the link to the Information Server
Information Center where you can find example statements for accessing
XMLType columns with the connector.
In this article we covered various aspects of performance tuning of the Oracle Connector stage in DataStage jobs. We discussed running Oracle Connector in bulk load mode taking into account data volume considerations, handling table triggers, indexes, and constraints. We also looked at reject link considerations for handling input records that have errors. Finally, we went over how the data types for link columns can affect the connector performance.
In combination with the information listed in the Resources section and the information you collected through your own research, this article should help you achieve optimal performance of Oracle Connector stages in your DataStage jobs.
Many thanks to the following contributors for their valuable input and review of this article:
- Paul Stanley, InfoSphere Senior Architect
- Tony Curcio, InfoSphere Product Manager
- Fayaz Adam, InfoSphere Staff Software Engineer
- Use an RSS feed to request notification for the upcoming articles in this series. (Find out more about RSS feeds of developerWorks content.)
- Explore the IBM InfoSphere Information Server Version 9.1 Information Center where you will find technical documentation for Information Server tools and components such as DataStage and Oracle Connector.
- Visit IBM Information Management Redbooks page and download redbook titles such as InfoSphere DataStage Parallel Framework Standard Practices and IBM InfoSphere DataStage Data Flow and Job Design.
- Find details about Oracle Database concepts mentioned in this article in the Oracle Database Documentation Library 11g Release 2 (11.2).
- Access a variety of technical resources for Information Management products in developerWorks Information Management zone.
- Stay up to date with IBM products and the IT industry by attending developerWorks Events.
- Follow developerWorks on Twitter.
Get products and technologies
- Evaluate IBM software products in a way most convenient for you: by downloading trial product versions, by trying products online for a few hours in a SOA sandbox environment, or by accessing products in a cloud computing environment.
- Connect with other developerWorks users in developerWorks Community as you explore developer oriented blogs, forums, groups, wikis and more.