A closer look at WebSphere DataStage

In its simplest form, WebSphere® DataStage™ performs data transformation and movement from source systems to target systems in batch and in real time.

The data sources might include indexed files, sequential files, relational databases, archives, external data sources, enterprise applications, and message queues. Some of the following transformations might be involved:

String and numeric formatting and data type conversions.
Business derivations and calculations that apply business rules and algorithms to the data. Examples range from straightforward currency conversions to more complex profit calculations.
Reference data checks and enforcement to validate customer or product identifiers. This process is used in building a normalized data warehouse.
Conversion of reference data from disparate sources to a common reference set, creating consistency across these systems. This technique is used to create a master data set (or conformed dimensions) for data about products, customers, suppliers, and employees.
Aggregations for reporting and analytics.
Creation of analytical or reporting databases, such as data marts or cubes. This process involves denormalizing data into such structures as star or snowflake schemas to improve performance and ease of use for business users.
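As a rough illustration of several transformations in this list, the following Python sketch applies string formatting, a data type conversion, a currency derivation, a reference data check, conversion to a common reference set, and an aggregation to a few sample rows. It is not WebSphere DataStage code. The names KNOWN_PRODUCTS, COUNTRY_MAP, EUR_TO_USD, transform, and aggregate, the sample rows, and the exchange rate are all assumptions made up for the example.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical reference data (assumptions for illustration only).
KNOWN_PRODUCTS = {"P-100", "P-200"}
COUNTRY_MAP = {"UK": "GB", "GBR": "GB", "USA": "US", "US": "US"}
EUR_TO_USD = Decimal("1.10")  # assumed fixed rate, not a real quote


def transform(row):
    """Format, convert, derive, and validate one source row.

    Returns None when the product identifier fails the reference check.
    """
    product = row["product"].strip().upper()      # string formatting
    if product not in KNOWN_PRODUCTS:             # reference data check
        return None
    amount_eur = Decimal(row["amount_eur"])       # data type conversion
    country_key = row["country"].strip().upper()
    return {
        "product": product,
        # Map source-specific codes to one common reference set.
        # An unmapped code raises KeyError in this sketch.
        "country": COUNTRY_MAP[country_key],
        # Business derivation: currency conversion, rounded to cents.
        "amount_usd": (amount_eur * EUR_TO_USD).quantize(Decimal("0.01")),
    }


def aggregate(rows):
    """Sum amount_usd per (product, country) pair for reporting."""
    totals = defaultdict(Decimal)
    for r in rows:
        totals[(r["product"], r["country"])] += r["amount_usd"]
    return dict(totals)


source = [
    {"product": " p-100 ", "country": "uk", "amount_eur": "10.00"},
    {"product": "P-100", "country": "GBR", "amount_eur": "5.50"},
    {"product": "P-999", "country": "US", "amount_eur": "1.00"},  # unknown product
]

clean = [t for t in (transform(r) for r in source) if t is not None]
totals = aggregate(clean)
```

In this sketch the two valid rows use different source codes for the same country, and both are conformed to a single code before aggregation, which is the point of building a common reference set.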

WebSphere DataStage can also treat the data warehouse as the source system that feeds a data mart as the target system, usually with localized, subset data such as customers, products, and geographic territories.

WebSphere DataStage delivers four core capabilities:

Connectivity to a wide range of mainframe, legacy, and enterprise applications, databases, and external information sources
Prebuilt library of more than 300 functions
Maximum throughput by using a parallel, high-performance processing architecture
Enterprise-class capabilities for development, deployment, maintenance, and high availability

Where WebSphere DataStage fits within the IBM Information Server architecture

WebSphere DataStage is composed of client-based design, administration, and operation tools that access a set of server-based data integration capabilities through a common services layer. Figure 1 shows the clients that make up the WebSphere DataStage user interface layer.

Figure 1. WebSphere DataStage clients

Figure 2 shows the elements that make up the server architecture.

Figure 2. Server architecture

[Image: IBM Information Server architecture with Transform highlighted]

WebSphere DataStage architecture includes the following components:

Common user interface

The following client applications make up the WebSphere DataStage user interface:

WebSphere DataStage and QualityStage Designer: A graphical design interface that is used to create WebSphere DataStage applications (known as jobs). Because transformation is an integral part of data quality, the WebSphere DataStage and QualityStage Designer is the design interface for both WebSphere DataStage and WebSphere QualityStage.
Each job specifies the data sources, the required transformations, and the destination of the data. Jobs are compiled to create executables that are scheduled by the WebSphere DataStage and QualityStage Director and run on the WebSphere DataStage server. The Designer client writes development metadata to the dynamic repository while compiled execution data that is required for deployment is written to the WebSphere Metadata Server repository.
WebSphere DataStage and QualityStage Director: A graphical user interface that is used to validate, schedule, run, and monitor WebSphere DataStage job sequences. The Director client views data about jobs in the operational repository and sends project metadata to WebSphere Metadata Server to control the flow of WebSphere DataStage jobs.
WebSphere DataStage and WebSphere QualityStage Administrator: A graphical user interface that is used for administration tasks such as setting up IBM® Information Server users; logging, creating, and moving projects; and setting up criteria for purging records.

Common services

The multiple discrete services of WebSphere DataStage provide the flexibility that is needed to configure systems that support increasingly varied user environments and tiered architectures. The common services provide flexible, configurable interconnections among the many parts of the architecture:

Metadata services such as impact analysis and search
Execution services that support all WebSphere DataStage functions
Design services that support development and maintenance of WebSphere DataStage tasks

Common repository

The common repository holds three types of metadata that are required to support WebSphere DataStage:

Project metadata: All the project-level metadata components, including WebSphere DataStage jobs, table definitions, built-in stages, reusable subcomponents, and routines, are organized into folders.
Operational metadata: The repository holds metadata that describes the operational history of integration process runs, success or failure of jobs, parameters that were used, and the time and date of these events.
Design metadata: The repository holds design time metadata that is created by the WebSphere DataStage and QualityStage Designer and WebSphere Information Analyzer.
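To make the operational metadata description more concrete, the following Python sketch models one job run with the attributes named above: success or failure, the parameters that were used, and the time and date of the run. It does not reflect the actual repository schema. The JobRun class, its field names, the failed_runs helper, and the sample history are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class JobRun:
    """One entry of operational history (hypothetical structure)."""
    job_name: str
    succeeded: bool
    parameters: dict
    started_at: datetime
    finished_at: datetime

    @property
    def duration(self) -> timedelta:
        """Elapsed time of the run."""
        return self.finished_at - self.started_at


def failed_runs(history):
    """Return the runs in the history that did not succeed."""
    return [run for run in history if not run.succeeded]


# Made-up sample history, for illustration only.
history = [
    JobRun("load_customers", True, {"region": "EMEA"},
           datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 2, 15)),
    JobRun("load_sales", False, {"region": "EMEA"},
           datetime(2024, 1, 1, 2, 15), datetime(2024, 1, 1, 2, 20)),
]

failed_names = [run.job_name for run in failed_runs(history)]
```

A record of this kind is enough to answer simple operational questions, such as which jobs failed in a given window and how long each run took.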

Common parallel processing engine

The engine runs executable jobs that extract, transform, and load data in a wide variety of settings. The engine uses parallelism and pipelining to handle high volumes of work more quickly.
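The two ideas named here, pipelining and parallelism, can be sketched in a few lines of Python. This is a conceptual illustration only and not how the engine is implemented. The stage functions (extract, transform, load), the round-robin partitioning scheme, and the use of a thread pool are assumptions chosen for brevity. Threads in CPython show the structure of partition parallelism but do not speed up CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def extract(records):
    """Stage 1: yield source records one at a time."""
    for rec in records:
        yield rec


def transform(stream):
    """Stage 2: process each record as soon as the upstream stage yields it."""
    for rec in stream:
        yield rec * 2


def load(stream):
    """Stage 3: collect the records (stands in for writing to a target)."""
    return list(stream)


def run_pipeline(partition):
    # Chained generators pass each record through all stages without
    # materializing intermediate results: a simple form of pipelining.
    return load(transform(extract(partition)))


def partition_round_robin(records, n):
    """Split records into n partitions by position."""
    return [records[i::n] for i in range(n)]


def run_parallel(records, n_partitions=2):
    """Run the same pipeline on every partition concurrently."""
    partitions = partition_round_robin(records, n_partitions)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        # pool.map returns results in partition order.
        results = list(pool.map(run_pipeline, partitions))
    return [row for part in results for row in part]


result = run_parallel([1, 2, 3, 4, 5, 6], n_partitions=2)
```

Because the records are split by position, the combined output is grouped by partition rather than in the original input order. A job that needs a particular order would have to sort or partition by key.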

Common connectors

The connectors provide connectivity to a large number of external resources and access to the common repository from the processing engine. Any data source that is supported by IBM Information Server can be used as input to or output from a WebSphere DataStage job.