Extract and load data
You can develop jobs that move data between various sources and formats. These examples show how you can extract data, identify changes between files, and load data to a database or a data warehouse.
- Extract data from a database to a file or data set
- Load data with high-speed native connectivity
- Load data in real time to a data warehouse
- Extract data from SAP
- Fetch data from a web service to enrich data
- Identify database changes
Extract data from a database to a file or data set
You can extract bulk data from a database. For example, the Oracle Connector stage can use customized SQL in the Oracle database to extract the customer address, phone number, and account balance in parallel. You can develop a similar job to extract data from any relational database source.
In the first example, the extracted data is written to a sequential file to share with other departments in the company. In the second example, the extracted data is written to a data set for use downstream in another job. A data set stores data in a persistent, parallel form; that is, the data remains partitioned on disk. Data sets optimize I/O performance by preserving the degree of partitioning.
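The first example can be sketched in Python. This is a conceptual analogue, not the DataStage job itself: sqlite3 stands in for the Oracle database, the SQL query plays the role of the customized SQL in the Oracle Connector stage, and the CSV file plays the role of the sequential file. The table name, columns, and sample rows are hypothetical.

```python
import csv
import sqlite3

# Hypothetical customer table; in the sample job this data lives in Oracle.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, phone TEXT, balance REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Ada", "555-0100", 120.50), ("Grace", "555-0101", 75.00)],
)

# Customized SQL, as the connector stage would run it, selecting only the
# columns that downstream consumers need.
rows = conn.execute(
    "SELECT name, phone, balance FROM customers ORDER BY name"
).fetchall()

# Write the extract to a delimited sequential file for other departments.
with open("customer_extract.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "phone", "balance"])
    writer.writerows(rows)
```

In the real job, the connector runs the query in parallel across partitions; this sketch shows only the extract-and-write flow.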
Load data with high-speed native connectivity
You can read data from a file, sort the data numerically to optimize storage, and bulk load the data to a database in native format. For example, you can load third-party data that your organization receives to a Netezza® table for analysis by using the Netezza Connector stage. Like each of the database connector stages, the Netezza Connector stage in this example uses the native API of the database to maximize performance.
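The read-sort-load pattern can be sketched as follows. This is a rough analogue under stated assumptions: the feed records and table layout are hypothetical, sqlite3 stands in for Netezza, and `executemany` plays the role of a single native bulk-load operation rather than row-by-row inserts.

```python
import sqlite3

# Hypothetical third-party feed: (product_id, amount) records in arbitrary order.
feed = [(30, 9.99), (10, 4.50), (20, 7.25)]

# Sort numerically on the key before loading, as the sample job does to
# optimize storage.
feed.sort(key=lambda rec: rec[0])

# sqlite3 stands in for the database's native API; one bulk operation
# loads the whole sorted batch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", feed)

loaded = conn.execute(
    "SELECT product_id, amount FROM sales ORDER BY product_id"
).fetchall()
```

The native API of each database (here represented by the driver call) is what lets the connector stages maximize load throughput.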
Load data in real time to a data warehouse
You can develop a job to move data in real time to a data warehouse. For example, you can capture real-time information about sales, inventory, and shipments from your company’s transactional systems on disparate platforms. You can then standardize and integrate the information, and deliver it to a data warehouse that supports a single reporting framework.
In this example, the job loads data from orders in real time by using the WebSphere® MQ Connector stage. The job parses the data by using the XML stage and transforms the data by using the Transformer stage. The job then updates the reference table and loads data to a Db2® database by using the Slowly Changing Dimension stage. The Slowly Changing Dimension stage is a processing stage that works within the context of a star schema database.
Extract data from SAP
You can extract data from SAP and load it to SAP Business Warehouse (SAP BW), to an existing enterprise data warehouse, or to a file or other target. For example, you can extract data by using the Advanced Business Application Programming (ABAP) Extract stage. The ABAP Extract stage generates an ABAP program and uploads it to the SAP system. The ABAP program runs an SQL query and extracts the data.
In this sample job, the extracted data is formatted for the target stage by a Transformer stage. The formatted data is loaded to SAP Business Warehouse by using the SAP BW stage. The SAP stages are provided by IBM® InfoSphere® Information Server Pack for SAP Applications and InfoSphere Information Server Pack for SAP BW.
Fetch data from a web service to enrich data
You can enrich data by using a web service to retrieve data from an application. For example, you can retrieve data from an HR application like Workday by using the Web Services Transformer stage. This sample job retrieves the data and then parses it by using the XML stage to generate rows and columns.
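The enrichment pattern can be sketched as follows. To keep the example self-contained, a local function stands in for the web service call; a real job would invoke the HR application's SOAP or REST endpoint through the Web Services Transformer stage. The employee IDs, response fields, and payloads are hypothetical.

```python
def fetch_employee(employee_id):
    """Stand-in for the web service call; returns fields to merge into each row."""
    directory = {  # hypothetical response payloads
        "E1": {"department": "Finance", "manager": "Ada"},
        "E2": {"department": "IT", "manager": "Grace"},
    }
    return directory.get(employee_id, {})

# Source rows are enriched with the fields the service returns for each key.
rows = [{"employee_id": "E1", "hours": 38}, {"employee_id": "E2", "hours": 40}]
enriched = [{**row, **fetch_employee(row["employee_id"])} for row in rows]
```

The parsing step in the sample job (the XML stage) corresponds to turning the service response into the flat field dictionaries merged here.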
Identify database changes
You can identify database changes and apply the changes to a target database.
This sample job reads data that is captured by InfoSphere Change Data Capture (CDC) and applies the change data to a target database by using the CDC Transaction stage.
InfoSphere CDC is part of IBM InfoSphere Data Replication, which is a replication technology that captures database changes as they happen and delivers them to InfoSphere DataStage®, target databases, or message queues. InfoSphere Data Replication collects data changes by using low-impact, low-latency database log analysis.
This solution provides several advantages. Overhead on the database system is lower because processing the data is log-based. Only changed data is moved across the network. The total processing on the ETL server is lighter because the server does not need to find the changes.
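The apply step can be sketched as follows. This is a conceptual illustration, not the CDC Transaction stage itself: the change records, their format, and the target table are hypothetical, and sqlite3 stands in for the target database. The point is that only operation-tagged changes arrive, so the target is updated without ever diffing full extracts.

```python
import sqlite3

# Hypothetical change records as a log-based capture tool might deliver them:
# (operation, key, value). Only changed rows cross the network.
changes = [
    ("INSERT", 1, "open"),
    ("INSERT", 2, "open"),
    ("UPDATE", 1, "closed"),
    ("DELETE", 2, None),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, status TEXT)")

# Apply each change in order; the ETL server never has to find the changes.
for op, key, value in changes:
    if op == "INSERT":
        conn.execute("INSERT INTO accounts VALUES (?, ?)", (key, value))
    elif op == "UPDATE":
        conn.execute("UPDATE accounts SET status = ? WHERE id = ?", (value, key))
    elif op == "DELETE":
        conn.execute("DELETE FROM accounts WHERE id = ?", (key,))

final = conn.execute("SELECT id, status FROM accounts ORDER BY id").fetchall()
```

Applying the stream in commit order is what keeps the target transactionally consistent with the source.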