Topic
IC4NOTICE: developerWorks Community will be offline May 29-30, 2015 while we upgrade to the latest version of IBM Connections. For more information, read our upgrade FAQ.
2 replies Latest Post - ‏2013-01-28T18:26:30Z by RKRathinavel
RKRathinavel
RKRathinavel
2 Posts
ACCEPTED ANSWER

Pinned topic Performance issue with derived expression

‏2013-01-28T15:05:01Z |
We are using CDC to replicate some larger data mart tables between DB2 UDB (Source) and Oracle 11g (target) database. The tables have more than 200 million rows of data. We are using a derived column on one of the field on the source table. The expression for the derived column has %GETCOL, %CONCAT and %IF statement. We are doing the initial refresh of these tables to the target environment. With the derived column CDC the performance of the CDC refresh degraded a lot. Without the derived column the table was refresh at a rate of 9 million rows per hour. Once we had the derived column the table was refreshing at a rate of 2.5 million rows per hour.
The derived expression is given below. We use a lookup table  (LOOKUP_TBL) to get the column GL_DEPT_NO using GL_DEPT as the key field if the GL_DEPT field value is not blank in the replication table. If the GL_DEPT value is non blank and if it is not on the lookup table then the GL_DEPT value is prefixed with  "0" to get the derived value.
 

%IF(GL_DEPT=' ',' ', %GETCOL(GL_DEPT_NO,"PSYSADM.LOOKUP_TBL",%CONCAT("0",GL_DEPT),GL_DEPT,GL_DEPT))

 
Is there anything we can do to improve the performance of the initial load?. Any help is greatly appreciated. 
 
Updated on 2013-01-28T18:26:30Z at 2013-01-28T18:26:30Z by RKRathinavel
  • Rphilo
    Rphilo
    371 Posts
    ACCEPTED ANSWER

    Re: Performance issue with derived expression

    ‏2013-01-28T16:49:20Z  in response to RKRathinavel
     Hi
     
    The real cost in the derived  column expression is the %GETCOL. I would check that the PSYSADM.LOOKUP_TBL has a unique index based on the GL_DEPT column. Presumably there are a significant number of rows in the table, otherwise if the table is not changing very often you could look at loading the table into some sort of memory structure and accessing it through a derived expression which calls a Java user exit. Incidentally, if the table has a volume of updates in line with the number of rows, it may be that the %GETCOL also impacts on mirror performance as well as refresh performance.
     
    Another option would be to cut out the %GETCOL and replicate your PSYSADM.LOOKUP_TBL table to the target in its own right. You could then just perform a join in the the target processes to consume the data via a view of some sort. I would refrain from suggesting that you have a target %GETCOL as this will probably stop a refresh via bulk load into the target database.
     
    If you find that the issue is only with a refresh, and mirroring with the source %GETCOL in place is not an issue, then it may be that you can look at parallelising the replication with multiple subscriptions: you could then be refreshing (and mirroring) other tables at the same time that the big table is being slowly refreshed. How feasible this is for you depends on whether there are any referential integrity constraints you need to consider as well as whether there is the spare capacity to make it beneficial, as with multiple subscriptions there is no longer any guarantee that transactions will be applied in the same order as they are committed on the source, the subscriptions running independently.
     
    If you do not have any other derived columns or expressions, another option might be to perform a bulk extract of the table on the source database and then perform a bulk load on the target separately from CDC. You would then need to establish the point for starting mirroring yourself, and you would need to create a procedure to populate the GL_DEPT_NO column separately. If you have a time when the source database is not active this is not too difficult, If the database is always active, then you would have to mark the capture point for mirroring at the start of the extract process and then work out how you can tolerate having some rows applied to the target side twice (once from the refreshm and once because transactions for those rows were committed during the extract period). If you have audit mode, this is probably OK as you will have multiple versions of the same row anyway. If you have adaptive apply or row consolidation,, this will be handled automatically by the mapping method. If you have have standard replication you could use conflict detection and resolution to force through the source change (e.g. in the case of a new row captured by the extract and by the log scraper), but you should then end replication and remove the CDR if you are only using it for this purpose as presumably you would want any failure after the "in-doubt" period to be processed as normal rather than being automatically handled. II know of a site with a massive database (significant number of LOBs) who is currently testing the extract/upload approach as the standard refresh would probably be too small.
     
    Hope this gives you some ideas
     
    Regards
     
    Robert
    • RKRathinavel
      RKRathinavel
      2 Posts
      ACCEPTED ANSWER

      Re: Performance issue with derived expression

      ‏2013-01-28T18:26:30Z  in response to Rphilo
       Robert,
       
      Thanks a lot for the detailed response. The LOOKUP_TBL does have a unique index based on the GL_DEPT column. The look up table is a very small table with approximately 100 rows. As you suggested, I will try to use some sort of memory structure and evalate the performance. We think the problem is with only the initial refresh. Approximately 300,000 rows gets inserted into the replication table every week. There are very few updates. We think that mirroing may catch up in few minutes for the 300,000 rows that get's inserted. The initial refresh is the time consuming and problematic one for us.
       
      We have some 3 tables with more than 200 millions rows and we are already running them in parallel.  We are also looking at the possiblilty of loading the table outisde of CDC during the weekend (as these tables are hardly updated during weekends) and use table capture point to mirror the tables after the initial load. I will update you as soon as we apply one of the above recommendations. Again thanks a lot for the detailed response. I really appreciate your help.