Big data processing

You can develop jobs that exchange data with big data sources. These examples show how you can access files on the Hadoop Distributed File System (HDFS) and augment data with Hadoop-based analytics.

Access data on HDFS

You can access files on HDFS. This sample job accesses orders from HDFS files by using the Big Data File stage. The job uses a Transformer stage to select a subset of the orders, combines the orders with order details, and writes the ordered items to output HDFS files. You can deploy this job directly in InfoSphere® DataStage®, which provides massive scalability by running jobs on the InfoSphere Information Server parallel engine. Alternatively, you can use IBM® InfoSphere DataStage Balanced Optimization to push this logic into the Hadoop cluster. The job logic is then expressed as MapReduce code that runs on the cluster nodes.

Figure 1. Access data on HDFS
The figure shows a job that reads input data by using Big Data File stages, processes the data by using a Transformer stage and a Join stage, and writes the result to another Big Data File stage.
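
To make the data flow concrete, the following sketch performs a comparable filter-and-join pass directly against HDFS with the standard Hadoop FileSystem API. It is a minimal illustration of the logic, not the DataStage implementation: the file paths, the comma-delimited record layout, the SHIPPED filter condition, and the HdfsOrderFilter class name are all assumptions made for the example.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsOrderFilter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml and hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Load order details keyed by order ID (hypothetical layout: first field is the ID).
        Map<String, String> details = new HashMap<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/data/order_details.csv"))))) {
            String line;
            while ((line = in.readLine()) != null) {
                details.put(line.split(",", 2)[0], line);
            }
        }

        // Read orders, keep a subset (here: status field equals "SHIPPED"),
        // join each order with its details, and write the result back to HDFS.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/data/orders.csv"))));
             PrintWriter out = new PrintWriter(new OutputStreamWriter(
                fs.create(new Path("/data/ordered_items.csv"), true)))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split(",");  // assumed layout: orderId,status,...
                String detail = details.get(fields[0]);
                if ("SHIPPED".equals(fields[1]) && detail != null) {
                    out.println(line + "," + detail);
                }
            }
        }
    }
}

In the sample job, the same effect is achieved declaratively, and the parallel engine (or, with Balanced Optimization, the Hadoop cluster itself) handles partitioning and scale instead of a single JVM process.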

Augment data with Hadoop-based analytics

You can augment data in a data warehouse with Hadoop-based analytical results. This sample job moves analytical results from a Hive data warehouse system to a Netezza® data warehouse.

The Hive stage runs on top of the Java™ Integration stage and provides a Hive connector for InfoSphere DataStage.

Figure 2. Augmenting data with Hadoop-based analytics
The figure shows a job that extracts data from a Hive warehouse system and loads it to a Netezza data warehouse by using a Netezza Connector stage.
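
Because the Hive stage is backed by the Java Integration stage, the same movement can be sketched in plain Java with JDBC. The following is a minimal sketch, assuming the standard Hive JDBC driver (jdbc:hive2://...) and the Netezza JDBC driver (jdbc:netezza://...); the host names, credentials, and the churn_scores table and its columns are illustrative assumptions, not part of the sample job.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveToNetezza {
    public static void main(String[] args) throws Exception {
        // Older Netezza drivers may need an explicit Class.forName("org.netezza.Driver").
        try (Connection hive = DriverManager.getConnection(
                 "jdbc:hive2://hive-host:10000/default", "user", "");
             Connection nz = DriverManager.getConnection(
                 "jdbc:netezza://nz-host:5480/WAREHOUSE", "user", "password")) {

            nz.setAutoCommit(false);
            try (Statement read = hive.createStatement();
                 ResultSet rs = read.executeQuery(
                     "SELECT customer_id, churn_score FROM analytics.churn_scores");
                 PreparedStatement write = nz.prepareStatement(
                     "INSERT INTO CHURN_SCORES (CUSTOMER_ID, CHURN_SCORE) VALUES (?, ?)")) {
                int batched = 0;
                while (rs.next()) {
                    write.setLong(1, rs.getLong(1));
                    write.setDouble(2, rs.getDouble(2));
                    write.addBatch();
                    if (++batched % 1000 == 0) {
                        write.executeBatch(); // flush in batches to bound memory use
                    }
                }
                write.executeBatch();
                nz.commit();
            }
        }
    }
}

Batching the inserts keeps memory use bounded in this sketch; the Netezza Connector stage would typically use Netezza's bulk-load interfaces rather than row-at-a-time inserts, which is one reason to prefer the connector for large volumes.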