WebSphere® QualityStage uses out-of-the-box, customizable rules to prepare complex information about your business entities for a variety of transactional, operational, and analytical purposes.
WebSphere QualityStage automates the conversion of data into verified standard formats by using probabilistic matching, in which variables that are common to records (for example, given name, date of birth, or sex) are matched when unique identifiers are not available.
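The idea behind probabilistic matching can be illustrated with a minimal sketch in which each shared field contributes an agreement or disagreement weight to a composite score. The field names, weight values, and scoring function below are illustrative assumptions, not QualityStage's actual algorithm or API:

```python
# Minimal sketch of probabilistic record matching. The weights are
# made-up illustrative values, not QualityStage's tuned parameters.

def match_score(rec_a, rec_b, weights):
    """Sum agreement/disagreement weights over fields common to both records."""
    score = 0.0
    for field, (agree_w, disagree_w) in weights.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # a missing value contributes no evidence either way
        score += agree_w if a == b else disagree_w
    return score

# Hypothetical weights: date of birth is strong evidence, sex is weak.
WEIGHTS = {
    "given_name": (4.0, -2.0),
    "birth_date": (6.0, -4.0),
    "sex": (1.0, -1.0),
}

a = {"given_name": "ANN", "birth_date": "1970-02-01", "sex": "F"}
b = {"given_name": "ANN", "birth_date": "1970-02-01", "sex": "F"}
c = {"given_name": "JOHN", "birth_date": "1965-07-12", "sex": "M"}

print(match_score(a, b, WEIGHTS))  # high score: likely the same entity
print(match_score(a, c, WEIGHTS))  # low score: likely different entities
```

In a typical probabilistic matcher, pairs whose composite score exceeds an upper cutoff are treated as matches, pairs below a lower cutoff as nonmatches, and pairs in between are flagged for review.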
WebSphere QualityStage components include the Match Designer, for designing and testing match passes, and a set of data-cleansing operations called stages. Information is extracted from the source system, measured, cleansed, enriched, consolidated, and loaded into the target system.
At run time, data cleansing jobs consist of the following sequence of stages:
- Investigate stage
- Gives you complete visibility into the actual condition of data.
- Standardize stage
- Reformats data from multiple systems to ensure that each data type has
the correct content and format.
- Match stages
- Ensure data integrity by linking records from one or more data sources
that correspond to the same customer, supplier, or other entity. Matching
can be used to identify duplicate entities that are caused by data entry variations
or account-oriented business practices. Unduplicate match jobs group records
into sets that have similar attributes. The Reference Match stage matches
reference data to source data by using a variety of match processes.
- Survive stage
- Ensures that the best available data survives and is correctly prepared
for the target.
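The flow of a cleansing job through these stages can be sketched as follows. The helper functions, match key, and survivorship rule are simplified illustrations, not the QualityStage stages themselves:

```python
# Illustrative standardize -> match -> survive sequence, assuming
# made-up helper names; this is not the QualityStage API.

def standardize(record):
    """Reformat fields into a consistent form (trim, collapse spaces, uppercase)."""
    return {k: " ".join(str(v).upper().split()) for k, v in record.items()}

def unduplicate(records):
    """Group records that share a simple match key (name + postal code)."""
    groups = {}
    for rec in records:
        key = (rec["name"], rec["postcode"])
        groups.setdefault(key, []).append(rec)
    return list(groups.values())

def survive(group):
    """Keep the 'best' record in each group; here, the most complete one."""
    return max(group, key=lambda r: sum(1 for v in r.values() if v))

source = [
    {"name": " ann smith ", "postcode": "10001", "phone": ""},
    {"name": "Ann  Smith", "postcode": "10001", "phone": "555-0100"},
    {"name": "Bob Jones", "postcode": "94105", "phone": "555-0199"},
]

cleansed = [survive(g) for g in unduplicate([standardize(r) for r in source])]
for rec in cleansed:
    print(rec)
```

Here the unduplicate step groups records with similar attributes by a simple match key, and the survive step keeps the most complete record from each group, mirroring the roles of the Match and Survive stages.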
Business intelligence packages that are available with WebSphere QualityStage provide data enrichment that is based on business rules. These rules can resolve common data quality problems, such as invalid address fields, across multiple geographies. The following packages are available:
- Worldwide Address Verification and Enhancement System (WAVES)
- Matches address data against standard postal reference data, which helps you verify address information for 233 countries and regions.
- Multinational geocoding
- Supports spatial information management and location-based services by adding longitude, latitude, and census information to location data.
- Postal certification rules
- Provide certified address verification and enhancement of address fields so that mailers can meet local requirements and qualify for postal discounts.
Where WebSphere QualityStage fits in the IBM Information Server architecture
WebSphere QualityStage is built around a service-oriented approach to structuring the data quality tasks that many new enterprise system architectures use. As part of the integrated IBM® Information Server platform, it is supported by a broad range of shared services and benefits from the reuse of several suite components.
WebSphere QualityStage and WebSphere DataStage™ share the same infrastructure for importing and exporting data; for designing, deploying, and running jobs; and for reporting. The developer uses the same design canvas to specify the flow of data from preparation to transformation and delivery.
Multiple discrete services give WebSphere QualityStage the flexibility to match increasingly varied customer environments and tiered architectures. Figure 1 shows how the WebSphere DataStage and QualityStage Designer (labeled "Development interface") interacts with other elements of the platform to deliver enterprise data analysis services.
Figure 1. IBM Information Server product architecture
The following suite components are shared:
- Common user interface
- The WebSphere DataStage and QualityStage Designer provides a development environment. The WebSphere DataStage and QualityStage Administrator provides access to deployment and administrative functions. WebSphere QualityStage is tightly integrated with WebSphere DataStage and shares the same design canvas, which enables users to design jobs with data transformation stages and data quality stages in the same session.
- Common services
- WebSphere QualityStage uses the common services in IBM Information Server for logging and security. Because metadata is shared “live” across tools, you can access services such as impact analysis without leaving the design environment. You can also access domain-specific services for enterprise data cleansing, such as investigate, standardize, match, and survive, from this layer.
- Common repository
- The repository holds data to be shared by multiple projects. Clients can access metadata and results of data analysis from the respective service layers.
- Common parallel processing engine
- The parallel processing engine addresses high-throughput requirements for analyzing large quantities of source data and handling increasing volumes of work in decreasing time frames.
- Common connectors
- Any data source that is supported by IBM Information Server can be used as input to a WebSphere QualityStage job by using connectors. The connectors also enable access to the common repository from the processing engine.