Source data preparation
As you plan your project, prepare the source data so that you get the best results.
IBM® InfoSphere® QualityStage® accepts all basic data types (non-vector, non-aggregate) other than binary. Non-basic data types cannot be acted upon in InfoSphere QualityStage except for vectors in the match stages. However, non-basic data types can be passed through the InfoSphere DataStage® and QualityStage stages.
You can use various processing stages to construct columns before you use them in a stage that you use for data cleansing. In particular, create overlay column definitions, vector columns, and concatenated columns as explicit columns in the data before you use them.
For example, instead of declaring the first three characters of a five-character postal code column as a separate additional column, use a Transformer stage to add that column to the source data explicitly before you use it in a stage that you use for data cleansing.
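The following Python sketch illustrates the idea of materializing a derived column before cleansing. It is not InfoSphere DataStage or QualityStage code. The column names PostalCode and PostalPrefix and the function add_postal_prefix are hypothetical, and in a real job a Transformer stage derivation would do the equivalent work.

```python
# Illustrative only: a plain-Python analogue of adding an explicit column
# in a Transformer stage. Column and function names are assumptions.

def add_postal_prefix(record, source="PostalCode", target="PostalPrefix", length=3):
    """Return a copy of the record with an explicit prefix column added."""
    out = dict(record)  # copy so the input record is left unchanged
    value = record.get(source)
    # Keep nulls as nulls; otherwise take the leading characters.
    out[target] = value[:length] if value is not None else None
    return out


rows = [{"PostalCode": "02134"}, {"PostalCode": None}]
prepared = [add_postal_prefix(r) for r in rows]
```

After this step, each record carries the prefix as its own column, so a later stage can reference it directly instead of depending on an overlay of the original column.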
Conform the actual data to be matched to the following practices:
- Make the codes used in columns the same for both the data source and the reference source.
For example, if the Gender column in the data source uses M and F as gender codes, the corresponding column in the reference source should also use M and F as gender codes (not, for example, 1 or 0 as gender codes).
- Convert whatever missing value condition you use (for example, spaces or 99999) to the null character in advance. You can do the conversion with the InfoSphere DataStage Transformer stage. If you are extracting data from a database, make sure that nulls are not converted to spaces.
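The two practices above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not product code: the mapping of 1 to M and 0 to F, the set of missing value markers, and the function names are all hypothetical, and every column value is assumed to be a string or None. In a real job, a Transformer stage would perform these conversions.

```python
# Illustrative only: conform a reference-source row to the data source's
# conventions before matching. All names and mappings are assumptions.

GENDER_MAP = {"1": "M", "0": "F"}   # assumed recoding of reference codes
MISSING_MARKERS = {"", "99999"}     # assumed markers; "" covers all-space values


def normalize_missing(value):
    """Map missing value markers (spaces, 99999) to None; keep other values."""
    if value is None:
        return None
    if value.strip() in MISSING_MARKERS:
        return None
    return value


def conform_reference_row(row):
    """Return a copy of the row with nulls normalized and gender codes recoded."""
    out = {name: normalize_missing(value) for name, value in row.items()}
    gender = out.get("Gender")
    if gender is not None:
        # Codes that are already M or F pass through unchanged.
        out["Gender"] = GENDER_MAP.get(gender, gender)
    return out


result = conform_reference_row({"Gender": "1", "PostalCode": "99999", "Name": "   "})
```

In practice, a marker such as 99999 is usually specific to a column, so a per-column marker table would be safer than the single shared set used in this sketch.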
Use the Standardize stage to standardize data such as individual names or postal addresses. Complex conditions can be handled by creating new columns before matching begins.