Phase two: Analyze source data
In phase two of the data cleansing workflow, you learn about your source data, prepare it, and assess its quality. In this phase, you:
- Identify whether the source data has the basic structure that your target data requires
- Understand the content of the source data
- Create the input data used in the next phase
Phase two helps you begin to understand the size and complexity of the project for creating cleansed data. If the granularity and structure of the source data closely match your initial impression of the structure and requirements of the target data, then data cleansing will be less complex. The greater the difference, the more complex the project.
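As a rough illustration of gauging that difference, the following Python sketch compares the field names of a source with the fields a target requires. It is a generic example, not part of InfoSphere QualityStage; the function name `compare_structure` and all field names are hypothetical.

```python
def compare_structure(source_fields, target_fields):
    """Report how the source field names differ from what the target requires."""
    source, target = set(source_fields), set(target_fields)
    return {
        "missing_from_source": sorted(target - source),
        "extra_in_source": sorted(source - target),
        "shared": sorted(source & target),
    }


# Hypothetical field names, for illustration only.
report = compare_structure(
    source_fields=["Name", "Address", "Phone"],
    target_fields=["FirstName", "LastName", "Address", "Phone"],
)
print(report)
```

In this made-up case the single source field Name would have to be split into two target fields, which is the kind of structural gap that adds to project complexity.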
Most organizations think they know what data they have. But if you analyzed your data to determine how complete it is, how much of the information is duplicated, and what types of anomalies exist within each data field, you might be surprised. Over time, data integrity weakens. The contents of fields stray from their original intent. The label might say Name, but the field might also contain a title, a tax ID number, or a status, such as Deceased. This information is useful, but not if you cannot locate it.
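A minimal Python sketch of this kind of check might measure how complete a field is and how many of its values are duplicated. It is an illustration only and does not use any QualityStage API; the function `profile_field`, the sample values, and the choice to treat whitespace-only values as blank and to compare values case-insensitively are all assumptions.

```python
from collections import Counter


def profile_field(values):
    """Summarize completeness and duplication for one field's values."""
    total = len(values)
    # Assumption: None and whitespace-only strings count as blank.
    non_blank = [v.strip() for v in values if v is not None and v.strip()]
    # Assumption: values that differ only by case count as duplicates.
    counts = Counter(v.lower() for v in non_blank)
    duplicated = sum(n - 1 for n in counts.values() if n > 1)
    return {
        "total": total,
        "blank": total - len(non_blank),
        "completeness": len(non_blank) / total if total else 0.0,
        "duplicate_values": duplicated,
    }


# Hypothetical Name field whose contents have strayed from the label.
names = ["Ann Lee", "ann lee", "", "Dr. Raj Patel", None, "Deceased"]
profile = profile_field(names)
print(profile)
```

A profile like this shows that a third of the sample values are blank and one is a repeat, but it does not reveal that "Deceased" is a status rather than a name. Finding that kind of anomaly requires looking at the content itself.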
- Step one: Prepare for data cleansing
  Preparing to work in IBM® InfoSphere® QualityStage® entails:
  - Having general knowledge about the information in the source data
  - Knowing the format of the source data
  - Developing business rules, based on the data structure and content, to use iteratively throughout the data cleansing process
- Step two: Investigate the source data
  Investigating helps you understand the quality of the source data and clarifies the direction of subsequent phases of the workflow. It also indicates how much processing you will need to create the cleansed data.
  Investigating data helps you:
  - Gain a better understanding of the quality of the data
  - Identify problem areas, such as blanks, errors, or formatting issues
  - Prove or disprove any assumptions you might have about the data
  - Learn enough about the data to help you establish business rules at the data level
  Investigation involves these activities:
  - Organizing
  - Parsing
  - Classifying
  - Analyzing patterns
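To make the idea of analyzing patterns concrete, the following Python sketch reduces each value in a field to a character pattern and counts how often each pattern occurs. It is a generic illustration and not how QualityStage itself is implemented; the pattern symbols (`a` for a letter, `n` for a digit, `b` for a blank), the function names, and the sample values are assumptions chosen for this example.

```python
from collections import Counter


def char_pattern(value):
    """Map each character to a class: a=letter, n=digit, b=blank; others kept."""
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("a")
        elif ch.isdigit():
            out.append("n")
        elif ch.isspace():
            out.append("b")
        else:
            out.append(ch)  # punctuation is kept as-is so formats stay visible
    return "".join(out)


def pattern_frequencies(values):
    """Count how often each character pattern occurs in a field."""
    return Counter(char_pattern(v) for v in values)


# Hypothetical phone field with inconsistent formatting.
phones = ["555-0100", "555-0199", "5550123", "n/a"]
freq = pattern_frequencies(phones)
for pattern, count in freq.most_common():
    print(pattern, count)
```

Rare patterns in the resulting frequency list point to formatting issues or misused fields, which can then feed the business rules you establish at the data level.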