Data Preparation Overview

Data preparation is one of the most important, and often one of the most time-consuming, aspects of data mining. It is commonly estimated to take 50-70% of a project's time and effort. Devoting adequate energy to the earlier business understanding and data understanding phases can reduce this overhead, but you still need to expend considerable effort preparing and packaging the data for mining.

Depending on your organization and its goals, data preparation typically involves the following tasks:

  • Merging data sets and/or records
  • Selecting a sample subset of data
  • Aggregating records
  • Deriving new attributes
  • Sorting the data for modeling
  • Removing or replacing blank or missing values
  • Splitting into training and test data sets
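
As a rough illustration, the tasks above can be sketched in a few lines of standard-library Python. This is a minimal sketch, not a prescribed workflow: the customer and purchase records, the field names (id, region, age, amount), the mean-fill strategy for missing values, and the 75/25 split are all assumptions made up for the example. The order of the steps is also only one possibility.

```python
import random
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: one record per customer, many per purchase.
customers = [
    {"id": 3, "region": "north", "age": 51},
    {"id": 1, "region": "north", "age": 34},
    {"id": 2, "region": "south", "age": None},  # missing value
    {"id": 4, "region": "south", "age": 27},
]
purchases = [
    {"id": 1, "amount": 20.0},
    {"id": 1, "amount": 35.0},
    {"id": 2, "amount": 15.0},
    {"id": 3, "amount": 50.0},
    {"id": 4, "amount": 10.0},
    {"id": 4, "amount": 30.0},
]

# Aggregate records: total spend and purchase count per customer id.
totals = defaultdict(float)
counts = defaultdict(int)
for p in purchases:
    totals[p["id"]] += p["amount"]
    counts[p["id"]] += 1

# Merge data sets: attach the aggregated purchase data to each customer.
merged = [
    {**c, "total": totals[c["id"]], "n_purchases": counts[c["id"]]}
    for c in customers
]

# Replace missing values: fill unknown ages with the mean of the known ages
# (one assumed strategy among many).
mean_age = mean(r["age"] for r in merged if r["age"] is not None)
for r in merged:
    if r["age"] is None:
        r["age"] = mean_age

# Derive a new attribute: average amount per purchase.
for r in merged:
    r["avg_purchase"] = r["total"] / r["n_purchases"] if r["n_purchases"] else 0.0

# Sort the data by customer id.
merged.sort(key=lambda r: r["id"])

# Select a sample subset: three records drawn at random (seeded for repeatability).
rng = random.Random(0)
sample = rng.sample(merged, k=3)

# Split into training and test data sets (assumed 75/25 split).
shuffled = merged[:]
rng.shuffle(shuffled)
cut = int(len(shuffled) * 0.75)
train, test = shuffled[:cut], shuffled[cut:]
```

With real data, each of these steps usually needs project-specific decisions, such as which key to merge on or how to treat missing values, so the sketch is best read as a checklist in code form.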