Analyzing source data

You use the Investigate stage to analyze the quality of the source data. The Investigate stage helps you determine the business rules that you can use in designing your data cleansing project.

The Investigate stage indicates the degree of processing needed to create the target cleansed data. Investigating data identifies errors and validates the contents of fields in a data file. This investigation lets you identify and correct data problems before they corrupt new systems.

The Investigate stage analyzes data by determining the number and frequency of unique values, and classifying or assigning a business meaning to each occurrence of a value within a column. The Investigate stage has the following capabilities:

Assesses the content of the source data. This stage organizes, parses, classifies, and analyzes patterns in the source data. It operates on both single-domain data columns and free-form text columns, such as address columns.
Accepts a single input link from any database connector supported by InfoSphere® DataStage®, from a flat file or data set, or from any processing stage. It is not necessary to restrict the data to fixed-length columns, but all input data must be alphanumeric.
Produces output on one or two output links, depending on whether you are preparing information for one or two reports. Character investigations produce information for a column frequency report, and word investigations produce information for both pattern and token reports. The Investigate stage performs a single investigation.
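
The idea of counting the number and frequency of unique values in a column can be illustrated outside the product. The following Python sketch is not how the Investigate stage is implemented and does not use any InfoSphere API. The function names, the sample postal codes, and the pattern symbols (a for a letter, n for a digit) are assumptions chosen for illustration only.

```python
from collections import Counter


def value_frequencies(values):
    """Return (value, count, percent) rows, most frequent value first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [
        (value, count, round(100.0 * count / total, 1))
        for value, count in counts.most_common()
    ]


def char_pattern(value):
    """Replace each letter with 'a' and each digit with 'n'; keep other characters."""
    return "".join(
        "a" if ch.isalpha() else "n" if ch.isdigit() else ch
        for ch in value
    )


# Hypothetical single-domain column: postal codes with a few data problems.
postal_codes = ["02111", "02111", "O2111", "02111-1234", "0211"]

# Frequency of each distinct value in the column.
value_rows = value_frequencies(postal_codes)

# Frequency of each distinct character pattern in the column.
pattern_rows = value_frequencies(char_pattern(v) for v in postal_codes)

for row in pattern_rows:
    print(row)
```

In this sample, the rare patterns stand out against the dominant five-digit pattern: one value starts with the letter O instead of a zero, and one value is only four digits long. Findings of this kind are what suggest the business rules for a cleansing project.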

The Investigation reports, which you can generate from the IBM® InfoSphere Information Server Web console by using data processed in the investigation job, can help you evaluate your data and develop better business practices.