In its simplest form, WebSphere® DataStage™ performs data transformation
and movement from source systems to target systems in batch and in real time.
The data sources might include indexed files, sequential files, relational
databases, archives, external data sources, enterprise applications, and message
queues. Some of the following transformations might be involved:
- String and numeric formatting and data type conversions.
- Business derivations and calculations that apply business rules and algorithms
to the data. Examples range from straightforward currency conversions to more
complex profit calculations.
- Reference data checks and enforcement to validate customer or product
identifiers. This process is used in building a normalized data warehouse.
- Conversion of reference data from disparate sources to a common reference
set, creating consistency across these systems. This technique is used to
create a master data set (or conformed dimensions) for data about products,
customers, suppliers, and employees.
- Aggregations for reporting and analytics.
- Creation of analytical or reporting databases, such as data marts or cubes.
This process involves denormalizing data into such structures as star or snowflake
schemas to improve performance and ease of use for business users.
WebSphere DataStage can also treat the data warehouse as the source system
that feeds a data mart as the target system, usually with localized, subset
data such as customers, products, and geographic territories.
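The transformation types listed above can be illustrated with a small sketch in plain Python. This is not DataStage code and does not use any DataStage API. The field names, the customer identifiers, and the exchange rates are all hypothetical values invented for the example. The sketch shows string formatting, a data type conversion, a currency conversion as a business derivation, a reference data check, and an aggregation for reporting.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical reference data and exchange rates, invented for illustration.
VALID_CUSTOMERS = {"C001", "C002"}
RATES_TO_USD = {"USD": Decimal("1.00"), "EUR": Decimal("1.10")}


def transform(rows):
    """Apply formatting, type conversion, a derivation, and a reference check."""
    accepted, rejected = [], []
    for row in rows:
        customer = row["customer_id"].strip().upper()  # string formatting
        amount = Decimal(row["amount"])                # data type conversion
        # Reference data enforcement: unknown identifiers are rejected.
        if customer not in VALID_CUSTOMERS or row["currency"] not in RATES_TO_USD:
            rejected.append(row)
            continue
        # Business derivation: convert every amount to a common currency.
        amount_usd = amount * RATES_TO_USD[row["currency"]]
        accepted.append({"customer_id": customer, "amount_usd": amount_usd})
    return accepted, rejected


def aggregate(rows):
    """Sum the converted amounts per customer for reporting."""
    totals = defaultdict(Decimal)
    for row in rows:
        totals[row["customer_id"]] += row["amount_usd"]
    return dict(totals)


source = [
    {"customer_id": " c001 ", "amount": "10.00", "currency": "EUR"},
    {"customer_id": "C002", "amount": "5.50", "currency": "USD"},
    {"customer_id": "C001", "amount": "2.00", "currency": "USD"},
    {"customer_id": "C999", "amount": "1.00", "currency": "USD"},
]
accepted, rejected = transform(source)
totals = aggregate(accepted)
```

In a DataStage job, each of these steps would typically be expressed as a stage in a graphical design rather than as hand-written code. The sketch only makes the categories concrete.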
WebSphere DataStage delivers four core capabilities:
- Connectivity to a wide range of mainframe, legacy, and enterprise applications,
databases, and external information sources
- Prebuilt library of more than 300 functions
- Maximum throughput by using a parallel, high-performance
processing architecture
- Enterprise-class capabilities for development, deployment, maintenance,
and high availability
Where WebSphere DataStage fits within the IBM Information Server architecture
WebSphere DataStage is composed of client-based
design, administration, and operation tools that access a set of server-based
data integration capabilities through a common services layer. Figure 1 shows
the clients that comprise the WebSphere DataStage user interface layer.
Figure 1. WebSphere DataStage clients
Figure 2 shows the elements that make up the server architecture.
Figure 2. Server architecture
The WebSphere DataStage architecture includes the following components:
- Common user interface
  The following client applications make up the WebSphere DataStage user
  interface:
  - WebSphere DataStage and QualityStage Designer
    A graphical design interface that is used to create WebSphere DataStage
    applications (known as jobs). Because transformation is an integral part
    of data quality, the WebSphere DataStage and QualityStage Designer is the
    design interface for both WebSphere DataStage and WebSphere QualityStage.
    Each job specifies the data sources, the required transformations, and
    the destination of the data. Jobs are compiled to create executables that
    are scheduled by the WebSphere DataStage and QualityStage Director and
    run on the WebSphere DataStage server. The Designer client writes
    development metadata to the dynamic repository, while compiled execution
    data that is required for deployment is written to the WebSphere Metadata
    Server repository.
  - WebSphere DataStage and QualityStage Director
    A graphical user interface that is used to validate, schedule, run, and
    monitor WebSphere DataStage job sequences. The Director client views data
    about jobs in the operational repository and sends project metadata to
    WebSphere Metadata Server to control the flow of WebSphere DataStage jobs.
  - WebSphere DataStage and WebSphere QualityStage Administrator
    A graphical user interface that is used for administration tasks such
    as setting up IBM® Information Server users; logging, creating, and
    moving projects; and setting up criteria for purging records.
- Common services
  The multiple discrete services of WebSphere DataStage provide the
  flexibility that is needed to configure systems that support increasingly
  varied user environments and tiered architectures. The common services
  provide flexible, configurable interconnections among the many parts of
  the architecture:
  - Metadata services such as impact analysis and search
  - Execution services that support all WebSphere DataStage functions
  - Design services that support development and maintenance of WebSphere
    DataStage tasks
- Common repository
  The common repository holds three types of metadata that are required
  to support WebSphere DataStage:
  - Project metadata
    All the project-level metadata components, including WebSphere DataStage
    jobs, table definitions, built-in stages, reusable subcomponents, and
    routines, are organized into folders.
  - Operational metadata
    The repository holds metadata that describes the operational history of
    integration process runs, success or failure of jobs, parameters that
    were used, and the time and date of these events.
  - Design metadata
    The repository holds design-time metadata that is created by the
    WebSphere DataStage and QualityStage Designer and WebSphere Information
    Analyzer.
- Common parallel processing engine
  The engine runs executable jobs that extract, transform, and load data
  in a wide variety of settings. The engine uses parallelism and pipelining
  to handle high volumes of work more quickly.
- Common connectors
  The connectors provide connectivity to a large number of external resources
  and access to the common repository from the processing engine. Any data
  source that is supported by IBM Information Server can be used as input to
  or output from a WebSphere DataStage job.
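The parallelism and pipelining that the common parallel processing engine is described as using can be sketched in plain Python. This is a conceptual illustration only, not how the engine is implemented, and every name in it (`extract`, `transform`, `load`, `partition_by_key`, `run_pipeline`) is hypothetical. Pipelining is modeled with chained generators, so each row moves through all stages without an intermediate data set. Partition parallelism is modeled by splitting the input by key and running one copy of the pipeline per partition. Threads are used only to keep the sketch short and portable.

```python
from concurrent.futures import ThreadPoolExecutor


def extract(records):
    """Stage 1: yield records one at a time instead of materializing them."""
    for record in records:
        yield record


def transform(records):
    """Stage 2: process each record as soon as the upstream stage yields it."""
    for record in records:
        yield {"key": record["key"], "value": record["value"] * 2}


def load(records):
    """Stage 3: collect rows; a real target would be a database or a file."""
    return list(records)


def run_pipeline(partition):
    # Chained generators: rows flow through all three stages one by one,
    # with no intermediate data set between stages (pipelining).
    return load(transform(extract(partition)))


def partition_by_key(records, num_partitions):
    """Split records into partitions by key modulo the partition count."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[record["key"] % num_partitions].append(record)
    return partitions


data = [{"key": k, "value": k * 10} for k in range(8)]
partitions = partition_by_key(data, num_partitions=4)

# Run one copy of the pipeline per partition (partition parallelism).
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_pipeline, partitions))

# Merge the per-partition results back into one ordered data set.
output = sorted(
    (row for part in results for row in part), key=lambda row: row["key"]
)
```

The two techniques are independent: pipelining shortens the time each row spends waiting between stages, while partitioning spreads the rows across workers. Combining them is what lets the same job design scale with data volume.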