Extract and load data

You can develop jobs to move data between various sources and in various formats. These examples show how you can extract data, identify changes between files, and load data to a database or a data warehouse.

Extract data from a database to a file or data set

You can extract bulk data from a database. For example, the Oracle Connector stage can use customized SQL in the Oracle database to extract the customer address, phone number, and account balance in parallel. You can develop a similar job to extract data from any relational database source.

In the first example, the extracted data is written to a sequential file to share with other departments in the company. In the second example, the extracted data is written to a data set for use downstream in another job. A data set stores data in a persistent form that uses parallel storage; that is, partitioned data storage. Data sets optimize I/O performance by preserving the degree of partitioning.

Figure 1. Extract data from a database to a sequential file
The figure shows a job that extracts data from a database by using an Oracle Connector stage and transforms the data by using a Transformer stage. The job writes the data to a target file by using a Sequential File stage.
Figure 2. Extract data from a database to a data set
The figure shows a job that extracts data from a database by using an Oracle Connector stage and transforms the data by using a Transformer stage. The job writes the data to a target data set by using a Data Set stage.
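The extract pattern can be sketched in a few lines of Python. This is a minimal stand-in, not the product's implementation: an in-memory SQLite table plays the role of the Oracle source, a CSV string plays the role of the sequential file, and a simple round-robin split of the rows stands in for a partitioned data set. All table and column names are invented for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical customer table standing in for the Oracle source.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (name TEXT, address TEXT, phone TEXT, balance REAL)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        ("Ada", "1 Elm St", "555-0100", 120.50),
        ("Grace", "2 Oak Ave", "555-0101", 89.99),
        ("Alan", "3 Pine Rd", "555-0102", 250.00),
    ],
)

# Customized SQL extracts only the columns of interest, much as the
# connector stage runs SQL in the source database.
rows = conn.execute("SELECT address, phone, balance FROM customers").fetchall()

# Target 1: a sequential file (CSV) to share with other departments.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["address", "phone", "balance"])
writer.writerows(rows)
sequential_file = buffer.getvalue()

# Target 2: a "data set" analogue -- the rows are kept in partitions
# so that a downstream job can process each partition in parallel.
num_partitions = 2
partitions = [rows[i::num_partitions] for i in range(num_partitions)]
```

The partitioned variant is why a data set is preferred for downstream jobs: the split is preserved, so the next job does not need to repartition the data.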

Load data with high-speed native connectivity

You can read data from a file, sort the data numerically to optimize storage, and bulk load the data to a database in native format. For example, you can load third-party data that your organization receives to a Netezza® table for analysis by using the Netezza Connector stage. Like each of the database connector stages, the Netezza Connector stage in this example uses the native API of the database to maximize performance.

Figure 3. Load to a Netezza table
The figure shows a job that extracts data from the source file by using a Sequential File stage and sorts the data by using a Sort stage. The job loads the data to the target database by using a Netezza Connector stage.
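The read-sort-load flow can be sketched as follows. This is an illustrative stand-in only: an in-memory SQLite table and `executemany` take the place of the Netezza table and its native bulk-load API, and the record layout is invented.

```python
import sqlite3

# Illustrative source records, as if read from the sequential file.
# The first field is the numeric sort key, mirroring the Sort stage.
source_rows = [
    (42, "shipment"),
    (7, "order"),
    (19, "invoice"),
]

# Sort numerically on the key before loading, so the target table
# stores rows in key order.
source_rows.sort(key=lambda row: row[0])

# Bulk load in one call -- a stand-in for the connector's native
# bulk-load path (the real Netezza Connector uses the database API).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (id INTEGER, kind TEXT)")
conn.executemany("INSERT INTO facts VALUES (?, ?)", source_rows)
conn.commit()

loaded = conn.execute("SELECT id FROM facts ORDER BY rowid").fetchall()
```

Sorting before the load is the point of the Sort stage in the job: the rows arrive at the target already ordered, so the database does not pay for the sort.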

Load data in real time to a data warehouse

You can develop a job to move data in real time to a data warehouse. For example, you can capture real-time information about sales, inventory, and shipments from your company’s transactional systems on disparate platforms. You can then standardize and integrate the information, and deliver it to a data warehouse that supports a single reporting framework.

In this example, the job loads data from orders in real time by using the WebSphere® MQ Connector stage. The job parses the data by using the XML stage and transforms the data by using the Transformer stage. The job then updates the reference table and loads data to a Db2® database by using the Slowly Changing Dimension stage. The Slowly Changing Dimension stage is a processing stage that works within the context of a star schema database.

Figure 4. Load data to a data warehouse
The figure shows the job that is described in the text.

Extract data from SAP

You can extract data from SAP and load it to SAP Business Warehouse (SAP BW), to an existing enterprise data warehouse, or to a file or other target. For example, you can extract data by using the Advanced Business Application Programming (ABAP) Extract stage. The ABAP Extract stage generates an ABAP program and uploads it to the SAP system. The ABAP program runs an SQL query and extracts the data.

In this sample job, the extracted data is formatted for the target stage by a Transformer stage. The formatted data is loaded to an SAP business warehouse by using the SAP BW stage. The SAP stages are provided by IBM® InfoSphere® Information Server Pack for SAP Applications and InfoSphere Information Server Pack for SAP BW.

Figure 5. Extract data from SAP
The figure shows a job that extracts data by using an ABAP Extract stage and formats the data by using a Transformer stage. The job loads the data by using an SAP BW stage.
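At a very high level, the ABAP Extract pattern is: build an extraction query from a table name and column list, run it in the source system, then format the rows for the target stage (the Transformer step). The sketch below shows only that generic pattern in Python against an in-memory SQLite table; it uses no real SAP API, and the table and column names are invented.

```python
import sqlite3

def build_query(table, columns):
    """Assemble the extraction SQL from a table and column list."""
    return f"SELECT {', '.join(columns)} FROM {table}"

# Stand-in for the SAP source table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_doc (doc_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales_doc VALUES (?, ?)",
    [("D001", 10.0), ("D002", 25.5)],
)

query = build_query("sales_doc", ["doc_id", "amount"])
extracted = conn.execute(query).fetchall()

# Format for the target, as the Transformer stage would:
# here, render amounts as fixed-point strings.
formatted = [(doc, f"{amt:.2f}") for doc, amt in extracted]
```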

Fetch data from a web service to enrich data

You can enrich data by using a web service to retrieve data from an application. For example, you can retrieve data from an HR application like Workday by using the Web Services Transformer stage. This sample job retrieves the data and then parses it by using the XML stage to generate rows and columns.

Figure 6. Fetch data from a web service
The figure shows a job that extracts data by using a Sequential File stage and retrieves the data by using a Web Services Transformer stage. The job parses the data by using an XML stage and loads the data by using another Sequential File stage.
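The parsing step can be sketched with the standard library. In the job, the XML payload would come back from the Web Services Transformer stage; here it is inlined so the sketch is self-contained, and the element names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical web-service response payload.
response_xml = """
<employees>
  <employee><id>101</id><name>Ada</name><dept>HR</dept></employee>
  <employee><id>102</id><name>Alan</name><dept>IT</dept></employee>
</employees>
"""

# Parse the XML into rows and columns, as the XML stage does:
# each <employee> element becomes one row.
root = ET.fromstring(response_xml)
columns = ["id", "name", "dept"]
rows = [
    tuple(emp.findtext(col) for col in columns)
    for emp in root.iter("employee")
]
```

Once the repeating elements are flattened into rows and columns, the data can flow into any downstream stage, such as the Sequential File stage in the sample job.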

Identify database changes

You can identify database changes and apply the changes to a target database.

This sample job reads data that is captured by InfoSphere Change Data Capture (CDC) and applies the change data to a target database by using the CDC Transaction stage.

InfoSphere CDC is part of IBM InfoSphere Data Replication, which is a replication technology that captures database changes as they happen and delivers them to InfoSphere DataStage®, target databases, or message queues. InfoSphere Data Replication collects data changes by using low-impact, low-latency database log analysis.

This solution provides several advantages. Overhead on the database system is lower because processing the data is log-based. Only changed data is moved across the network. The total processing on the ETL server is lighter because the server does not need to find the changes.

Figure 7. Identify changes with the CDC Transaction stage
The figure shows a job that reads captured data by using a CDC Transaction stage. The job loads change data and bookmark information to the target database by using a Db2 Connector stage.
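The apply side of this flow can be sketched as replaying a stream of change records against a target table, with a bookmark that records how far the apply has progressed so a restart can resume after the last applied change. The record layout below is illustrative, not the CDC Transaction stage's actual format.

```python
# Target table keyed by primary key; the starting state is arbitrary.
target = {"k1": {"balance": 100}}

# Captured change records, in log order. Only changed rows arrive.
changes = [
    {"seq": 1, "op": "insert", "key": "k2", "row": {"balance": 50}},
    {"seq": 2, "op": "update", "key": "k1", "row": {"balance": 75}},
    {"seq": 3, "op": "delete", "key": "k2", "row": None},
]

bookmark = 0
for change in changes:
    if change["op"] == "insert":
        target[change["key"]] = change["row"]
    elif change["op"] == "update":
        target[change["key"]].update(change["row"])
    elif change["op"] == "delete":
        target.pop(change["key"], None)
    # Advance the bookmark only after the change is applied.
    bookmark = change["seq"]
```

Because only changed rows cross the network and the apply is a simple replay, the ETL server never has to scan the source to find what changed, which is the advantage described above.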