WebSphere® QualityStage uses out-of-the-box, customizable rules to prepare complex information about your business entities for a variety of transactional, operational, and analytical purposes.
WebSphere QualityStage automates the conversion of data into verified standard formats by using probabilistic matching, in which variables that are common to records (for example, given name, date of birth, or sex) are matched when unique identifiers are not available.
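The idea behind probabilistic matching can be illustrated with a minimal sketch in which each shared field contributes an agreement or disagreement weight to a composite score. The field names, weight values, and scoring function below are illustrative assumptions, not QualityStage's actual algorithm or API:

```python
# Minimal sketch of probabilistic record matching. The weights are
# made-up illustrative values, not QualityStage's tuned parameters.

def match_score(rec_a, rec_b, weights):
    """Sum agreement/disagreement weights over fields common to both records."""
    score = 0.0
    for field, (agree_w, disagree_w) in weights.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # a missing value contributes no evidence either way
        score += agree_w if a == b else disagree_w
    return score

# Hypothetical weights: date of birth is strong evidence, sex is weak.
WEIGHTS = {
    "given_name": (4.0, -2.0),
    "birth_date": (6.0, -4.0),
    "sex": (1.0, -1.0),
}

a = {"given_name": "ANN", "birth_date": "1970-02-01", "sex": "F"}
b = {"given_name": "ANN", "birth_date": "1970-02-01", "sex": "F"}
c = {"given_name": "JOHN", "birth_date": "1965-07-12", "sex": "M"}

print(match_score(a, b, WEIGHTS))  # high score: likely the same entity
print(match_score(a, c, WEIGHTS))  # low score: likely different entities
```

In a typical probabilistic matcher, pairs whose composite score exceeds an upper cutoff are treated as matches, pairs below a lower cutoff as nonmatches, and pairs in between are flagged for review.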
WebSphere QualityStage components include the Match Designer, for designing and testing match passes, and a set of data-cleansing operations called stages. Information is extracted from the source system, measured, cleansed, enriched, consolidated, and loaded into the target system.
At run time, data cleansing jobs consist of the following sequence of stages:
- Investigate stage
- Gives you complete visibility into the actual condition of data.
- Standardize stage
- Reformats data from multiple systems to ensure that each data type has
the correct content and format.
- Match stages
- Ensure data integrity by linking records from one or more data sources
that correspond to the same customer, supplier, or other entity. Matching
can be used to identify duplicate entities that are caused by data entry variations
or account-oriented business practices. Unduplicate match jobs group records
into sets that have similar attributes. The Reference Match stage matches
reference data to source data by using a variety of match processes.
- Survive stage
- Ensures that the best available data survives and is correctly prepared
for the target.
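The flow of a cleansing job through these stages can be sketched as follows. The helper functions, match key, and survivorship rule are simplified illustrations, not the QualityStage stages themselves:

```python
# Illustrative standardize -> match -> survive sequence, assuming
# made-up helper names; this is not the QualityStage API.

def standardize(record):
    """Reformat fields into a consistent form (trim, collapse spaces, uppercase)."""
    return {k: " ".join(str(v).upper().split()) for k, v in record.items()}

def unduplicate(records):
    """Group records that share a simple match key (name + postal code)."""
    groups = {}
    for rec in records:
        key = (rec["name"], rec["postcode"])
        groups.setdefault(key, []).append(rec)
    return list(groups.values())

def survive(group):
    """Keep the 'best' record in each group; here, the most complete one."""
    return max(group, key=lambda r: sum(1 for v in r.values() if v))

source = [
    {"name": " ann smith ", "postcode": "10001", "phone": ""},
    {"name": "Ann  Smith", "postcode": "10001", "phone": "555-0100"},
    {"name": "Bob Jones", "postcode": "94105", "phone": "555-0199"},
]

cleansed = [survive(g) for g in unduplicate([standardize(r) for r in source])]
for rec in cleansed:
    print(rec)
```

Here the unduplicate step groups records with similar attributes by a simple match key, and the survive step keeps the most complete record from each group, mirroring the roles of the Match and Survive stages.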
Business intelligence packages that are available with WebSphere QualityStage provide data enrichment that is based on business rules. These rules can resolve common data quality problems, such as invalid address fields, across multiple geographies. The following packages are available:
- Worldwide Address Verification and Enhancement System (WAVES)
- Matches address data against standard postal reference data, which helps you verify address information for 233 countries and regions.
- Multinational geocoding
- Supports spatial information management and location-based services by adding longitude, latitude, and census information to location data.
- Postal certification rules
- Provide certified address verification and enhancement of address fields so that mailers can meet local requirements and qualify for postal discounts.
Where WebSphere QualityStage fits in the IBM Information Server architecture
WebSphere QualityStage is built around a service-oriented approach to structuring the data quality tasks that many new enterprise system architectures use. As part of the integrated IBM® Information Server platform, it is supported by a broad range of shared services and benefits from the reuse of several suite components.
WebSphere QualityStage and WebSphere DataStage™ share the same infrastructure for importing and exporting data; for designing, deploying, and running jobs; and for reporting. The developer uses the same design canvas to specify the flow of data from preparation to transformation and delivery.
Multiple discrete services give WebSphere QualityStage the flexibility to match increasingly varied customer environments and tiered architectures. Figure 1 shows how the WebSphere DataStage and QualityStage Designer (labeled "Development interface") interacts with other elements of the platform to deliver enterprise data analysis services.
Figure 1. IBM Information Server product architecture
The following suite components are shared:
- Common user interface
- The WebSphere DataStage and QualityStage Designer provides a development environment. The WebSphere DataStage and QualityStage Administrator provides access to deployment and administrative functions. WebSphere QualityStage is tightly integrated with WebSphere DataStage and shares the same design canvas, which enables users to design jobs with data transformation stages and data quality stages in the same session.
- Common services
- WebSphere QualityStage uses the common services in IBM Information Server for logging and security. Because metadata is shared “live” across tools, you can access services such as impact analysis without leaving the design environment. You can also access domain-specific services for enterprise data cleansing, such as investigate, standardize, match, and survive, from this layer.
- Common repository
- The repository holds data to be shared by multiple projects. Clients can access metadata and results of data analysis from the respective service layers.
- Common parallel processing engine
- The parallel processing engine addresses high-throughput requirements for analyzing large quantities of source data and handling increasing volumes of work in decreasing time frames.
- Common connectors
- Any data source that is supported by IBM Information Server can be used as input to a WebSphere QualityStage job by using connectors. The connectors also enable access to the common repository from the processing engine.