Performance tuning tips

Use the following tips to improve performance of ETLs with Spoon.

Summary

  1. When you are finished getting your ETL to do what you want it to do, take a baseline performance measurement.
  2. Using Spoon, run the ETL against a database where you have thousands of rows added to your fact table.
  3. Make sure that you are using a JNDI connection and running Spoon on the network where the database lives so that you do not have network latency. Do not run it through a VPN.
  4. Get the completed step metrics from your run, for example from a run of the triSpacePeopleFact ETL.

Analysis

  1. Give the JavaScript and DB Procedure (Get Next Spec ID) steps multiple copies. Right-click the step and change the number of copies to start.
    • For the triSpacePeopleFact ETL in the preceding example run, changing the JavaScript and DB Procedure (Get Next Spec ID) steps to three copies of each reduced the run times (in seconds):
    • Unaltered: 12.9, 12.7, 12.5, 12.6
    • Three copies of each: 11.4, 11.6, 12.3, 12.2
  2. Change the row set size from the default of 1000 to 10000. Right-click the ETL canvas and open the properties of the transformation. New transformations have this set automatically.
  3. Analyze the run. Is there a bottleneck? Is there a step that is slower than the others? Possibly other steps can have multiple copies for better throughput.
  4. Is the Data Input step a bottleneck? Would a database index help? If so, add an index and rerun. Is the performance better? Consider using a Filter step instead of having the database filter down the result set.
  5. Analysis is an iterative process. Always have multiple copies of the JavaScript and DB Procedure (Get Next Spec ID) steps.
  6. An ETL run at 300-800 rows per second is performing well and within the acceptable performance range.

For the triSpacePeopleFact ETL, substantial improvements after initial development were achieved by Steps 1 and 2 alone.

For the triSpaceFact ETL, by contrast, substantial improvements required Steps 1, 2, and 4.

The following shows the triSpacePeopleFact ETL run with Steps 1 and 2:
Query for Space People: Time = 11.6 sec; Speed (r/s) = 743.6
The following shows the triSpaceFact ETL run with Steps 1 and 2:
Query for Space: Time = 313.9 sec; Speed (r/s) = 24.0

Notice that the Query for Space step, which is the Data Input step, is a bottleneck at 24 rows per second.

Notice that the Query for Space People step is not a bottleneck like the Query for Space step. The triSpacePeopleFact ETL runs well with no modifications beyond Steps 1 and 2, getting over 700 rows per second.

For Step 4 on the triSpaceFact ETL, look at the SQL for the Query for Space step. Notice that the SQL contains SUMs. SUMs are expensive, especially since there are two of them and none of the fields involved are indexed.
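The actual Query for Space SQL is not reproduced here, but the expensive pattern is roughly of the following shape. The table and column names come from the index discussion; the summed column name and the status literal are assumptions for illustration only:

```sql
-- Hypothetical sketch of the expensive pattern: a SUM whose WHERE
-- columns are unindexed, forcing a full scan per lookup. The summed
-- column (TRIALLOCPERCENTNU) and the status value are assumed names.
SELECT SUM(a.TRIALLOCPERCENTNU)
  FROM T_TRIORGANIZATIONALLOCATION a
 WHERE a.TRILOCATIONLOOKUPTXOBJID = 12345
   AND a.TRISTATUSCL = 'triActive'
```

With no index on TRILOCATIONLOOKUPTXOBJID, each execution of a query of this shape scans the whole table, which is why the Data Input step crawls at 24 rows per second.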

Add an index to T_TRIORGANIZATIONALLOCATION.TRILOCATIONLOOKUPTXOBJID. Only TRILOCATIONLOOKUPTXOBJID needs to be indexed, even though TRISTATUSCL also appears in the WHERE clause of the SELECT SUM. TRISTATUSCL is a 1000-character field; including it made the index slow and, on SQL Server, not even viable.

CREATE INDEX IDX01_TRIORGALLOC ON T_TRIORGANIZATIONALLOCATION 
(TRILOCATIONLOOKUPTXOBJID) NOPARALLEL;
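Note that NOPARALLEL is an Oracle-specific clause. On SQL Server, assuming the same table and column names, the equivalent index simply omits it:

```sql
-- SQL Server variant of the same index: no NOPARALLEL clause;
-- the syntax is otherwise identical.
CREATE INDEX IDX01_TRIORGALLOC
    ON T_TRIORGANIZATIONALLOCATION (TRILOCATIONLOOKUPTXOBJID);
```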

Rerun the ETL.

The following shows the triSpaceFact ETL run with Steps 1, 2, and 4.
Query for Space: Time = 3.2 sec; Speed (r/s) = 2378.3

Notice the change in the Data Input step rows per second (2378.3) and how long the ETL took to run (3.2 seconds for 7544 rows).

Important: Things to keep in mind while you are developing your ETLs:
  • Avoid complex SQL and aggregate functions like COUNT, MIN, MAX, and SUM. If you need these functions, see whether an index helps the Data Input step. Do not create an index on a large varchar field; SQL Server limits index keys to 900 bytes.
  • Avoid OR and NOT and using views (M_TableName in TRIRIGA® databases) if possible.
  • Use the Calculator step instead of JavaScript where possible. The JavaScript step can be expensive.
  • Have only one JavaScript scripting step per transformation.
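If a wide varchar column such as TRISTATUSCL must still be covered by an index on SQL Server, one option is to keep it out of the index key and add it as an included column, since included columns do not count toward the 900-byte key limit. A sketch, reusing the table from the earlier example (the index name is illustrative):

```sql
-- Sketch: the key stays on the narrow lookup column; the wide
-- varchar rides along as an INCLUDE column (SQL Server only), so
-- the 900-byte key limit applies only to TRILOCATIONLOOKUPTXOBJID.
CREATE INDEX IDX02_TRIORGALLOC
    ON T_TRIORGANIZATIONALLOCATION (TRILOCATIONLOOKUPTXOBJID)
    INCLUDE (TRISTATUSCL);
```

Whether the included column pays off depends on the query; test it against a baseline run, as with any other index change.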