Cleansing data with InfoSphere QualityStage jobs

The cleansing process can include, but is not limited to, eliminating redundant, obsolete, or inaccurate data. Clean data is a critical component of accurate information, reports, and analyses. Throughout your organization, people make business decisions based on the data that is provided to them. By cleansing that data, you ensure that those decisions are based on high-quality information.

Introduction to data cleansing
IBM® InfoSphere® QualityStage® provides a methodology and development environment for cleansing and improving data quality for any domain.

Analyzing source data
You use the Investigate stage to analyze the quality of the source data. The Investigate stage helps you determine the business rules that you can use in designing your data cleansing project.
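A common investigation technique is to reduce each column value to a character pattern and count how often each pattern occurs, which reveals inconsistent formats in the source data. InfoSphere QualityStage performs this kind of analysis through its Investigate stage GUI; the following Python sketch (an illustration of the idea, not the product's implementation, with made-up sample data) shows the concept:

```python
from collections import Counter

def char_pattern(value: str) -> str:
    """Map each character to a pattern symbol: 'n' for a digit,
    'a' for a letter; other characters pass through unchanged."""
    return "".join(
        "n" if ch.isdigit() else "a" if ch.isalpha() else ch
        for ch in value
    )

def investigate(values):
    """Count how often each character pattern occurs in a column."""
    return Counter(char_pattern(v) for v in values)

# Hypothetical phone-number column with inconsistent formats.
phones = ["555-1234", "555 9876", "5551234", "KL5-0199"]
print(investigate(phones))
```

Patterns that occur rarely, or that deviate from the dominant format, are candidates for the business rules you define later in the cleansing project.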

Standardizing data
Standardizing data helps you make the source data internally consistent; that is, each data type has the same kind of content and format.
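At its core, standardization maps the many variant spellings of a token to a single canonical form. As a minimal sketch of that idea (the lookup table and function below are hypothetical, not part of the QualityStage rule sets), consider normalizing street-type abbreviations in an address column:

```python
# Hypothetical lookup table mapping variant tokens to canonical forms.
STREET_TYPES = {"st": "STREET", "st.": "STREET", "rd": "ROAD", "ave": "AVENUE"}

def standardize_address(raw: str) -> str:
    """Uppercase the address and replace known abbreviations
    with their canonical spellings."""
    tokens = raw.upper().replace(",", " ").split()
    return " ".join(STREET_TYPES.get(tok.lower(), tok) for tok in tokens)

print(standardize_address("123 Main st."))   # canonicalized form
```

Once every record uses the same content and format for each data type, downstream matching becomes far more reliable.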

Matching data
Matching in IBM InfoSphere QualityStage is a probabilistic record linkage system that automates the process of identifying records that are likely to represent the same entity. By using the matching process, you can identify duplicates in your data and group records based on any set of criteria. You can also build relationships between records in multiple files despite variations in the representation of the data and missing or inaccurate information.  
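Probabilistic record linkage of this kind is commonly formulated as a sum of per-field log-likelihood ratios: agreement on a field adds weight proportional to how discriminating that field is, and disagreement subtracts weight. The sketch below illustrates that general technique with invented m- and u-probabilities; it is not the QualityStage matching engine:

```python
import math

# Illustrative per-field probabilities (assumptions, not values
# derived from any real data set): m = P(agree | same entity),
# u = P(agree | different entities).
FIELDS = {
    "last_name":  (0.95, 0.02),
    "zip":        (0.90, 0.10),
    "birth_year": (0.85, 0.05),
}

def match_weight(rec_a: dict, rec_b: dict) -> float:
    """Sum log-likelihood ratios: agreement contributes log2(m/u),
    disagreement contributes log2((1-m)/(1-u))."""
    total = 0.0
    for field, (m, u) in FIELDS.items():
        if rec_a.get(field) == rec_b.get(field):
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total

a = {"last_name": "SMITH", "zip": "02139", "birth_year": "1970"}
b = {"last_name": "SMITH", "zip": "02139", "birth_year": "1971"}
print(match_weight(a, b))
```

Record pairs scoring above a chosen cutoff are treated as matches; pairs in a middle band can be routed for clerical review. Because disagreement only reduces the score rather than vetoing the match, the approach tolerates missing or inaccurate values in individual fields.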

Consolidating duplicate records to create the representative record
The Survive stage constructs column values from groups of related or duplicate records and stores them in the survived record, the best representative result for each group.
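The underlying idea is that each column of the representative record can be taken from whichever record in the group scores best under a column-specific rule (longest value, non-empty value, most recent update, and so on). The following is a minimal sketch of that technique with hypothetical rules and data, not the Survive stage's actual rule syntax:

```python
def survive(group, rules):
    """Build one representative record from a group of duplicates.
    `rules` maps each column to a scoring function; the record that
    scores highest for a column donates its value for that column."""
    return {
        col: max(group, key=score)[col]
        for col, score in rules.items()
    }

# Two duplicate records describing the same person.
group = [
    {"name": "J. Smith",   "phone": "",         "updated": 2021},
    {"name": "John Smith", "phone": "555-1234", "updated": 2019},
]
rules = {
    "name":    lambda r: len(r["name"]),    # prefer the longest name
    "phone":   lambda r: bool(r["phone"]),  # prefer a non-empty phone
    "updated": lambda r: r["updated"],      # prefer the latest date
}
print(survive(group, rules))
```

Note that the survived record can combine values from different source records, which is exactly why it is usually better than any single input record.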

Rule sets applied in the data cleansing process
Rule sets check and normalize input data. You can apply cleansing rules to correct data as it enters the system or to correct data that already resides in multiple databases.
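Conceptually, a rule set is an ordered list of transformations applied to each incoming value. The sketch below illustrates that idea with a few invented regular-expression rules; QualityStage rule sets have their own format and are far richer than this:

```python
import re

# Hypothetical rule set: (pattern, replacement) pairs applied in order.
RULES = [
    (re.compile(r"\s+"), " "),                   # collapse whitespace
    (re.compile(r"^MR\.?\s+", re.I), ""),        # strip an honorific
    (re.compile(r"\bINC\b\.?", re.I), "INCORPORATED"),
]

def apply_rules(value: str) -> str:
    """Run every rule in the set, in order, against one input value."""
    value = value.strip()
    for pattern, repl in RULES:
        value = pattern.sub(repl, value)
    return value

print(apply_rules("  Mr.  John   Doe "))
print(apply_rules("Acme inc."))
```

Because the rules run in a fixed order, each rule can assume the normalizations performed by the rules before it, which keeps individual rules simple.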

Reference: Stage Editor user interface
The parallel job stage editors all use a generic user interface. This section lists the available stage types and gives a quick guide to their function.