Change Data Capture

Change Data Capture identifies and processes only the data that changed in each of the tables in a database and makes the changed data available to the Data Warehouse.

Data needs to be extracted periodically from one or more source systems and transformed to the data warehouse. This process is commonly referred to as refreshing the data warehouse. The most efficient refresh method is to extract and transform only the data that changed since the last extraction. BIA Reporting is designed with the intention that the refresh takes place on a nightly basis. However, the implementation is flexible and it is possible to run the refresh at a different frequency.

BIA Reporting 'Change Data Capture' techniques include using a control table, which stores a last written date for each table that is being populated. When an ETL runs, the last written field for that table is also updated. The next time the ETL runs, it first reads from this control table and then extracts the data that is updated since the previous ETL run.

It is important to note that for change data capture to work in BIA Reporting, all the last written fields must be populated in the source tables that the reporting solution extract data from.

Three control tables that are provided with BIA Reporting. Each of the control tables contains a list of all the tables that are populated in that database:

  • Staging ETL Control table : This table is created in the Staging database and is used to extract data from tables in the source database to tables in the Staging database using the last written field to extract only the data that is changed since the last extraction. This table includes a truncate flag which, when set to 'Y', truncates the destination table in the Staging database before you run the ETL. When set to 'N' the table is not be truncated before the ETL runs. The default is 'Y' as there is no need to build up a history of changes in the Staging database. 'N' is used for a reference table, CODETABLEITEM, as some ETLs use this table to retrieve descriptions of codes.
  • CDW ETL Control table : This table is in the CDW and is used to extract data from tables in the Staging database, by using the last written field to extract only the data that is changed since the last extraction.
  • Datamart ETL Control table : This table is in the Datamart and is used to extract data from tables in the CDW, using the last written field to extract only the data that is changed since the last extraction.

As already stated, a row in the ETL Control table is updated before and after every ETL run for the table, which is being updated. The ETLs work by calling a pre-mapping transformation to read the previous last written date and setting the extract time. After the ETL is run, a post-mapping transformation is called which updates the last written date to the current date. The transformations are not supported by the ETL tool. The transformations are custom BIA Reporting transformations that are written in java called from the ETL tool.

After the ETL Control table is initially populated with data, the last written date is reset to a start date to ensure that the ETLs extract all data that is updated after this date. The developer can manually set the last written date or use the resetetl.XXX build command (where x is staging, central, or datamarts) which resets the last written date for all tables in that database to the 1st of January, 1934.