Usage of data transformation

Data transformation is a form of data preprocessing. It is an important step to complete before you examine the data and create prediction models.

Data imputation

Data imputation is suitable for predictive modeling tasks, such as classification, regression, and clustering. For these tasks, missing values cannot be ignored without distorting the result. Moreover, many algorithms either have no sophisticated built-in technique for handling missing values, or the technique they have is too time-consuming for large data sets.

Many statistical analysis methods exclude records that have missing values. Data imputation, in contrast, makes it possible to use all of the data by replacing missing values with estimated values. The estimated values are based on other available information. After the missing values are replaced, you can analyze the data set by using standard methods, as if the data set were complete.
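
The following minimal Python sketch illustrates the idea with mean imputation, which is only one of several possible estimation strategies. The function name impute_mean and the sample values are hypothetical and are used for illustration only.

```python
from statistics import mean

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries.

    Mean imputation is an assumption chosen for this sketch; other
    estimates (median, most frequent value, model-based) are also common.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: the column has no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical column with two missing values.
ages = [34, None, 29, 41, None]
imputed_ages = impute_mean(ages)
```

After the call, imputed_ages has the same length as ages and contains no missing values, so it can be passed to methods that require complete data.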

Data splitting

Data splitting is useful for building and testing the prediction quality of a classification model or a regression model. The data is divided into a training data set and a test data set. You then use one data set to train the prediction model and the other data set to test the prediction model.
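
As a rough illustration, a random split into a training data set and a test data set might look like the following Python sketch. The function name train_test_split, the 25% test fraction, and the fixed seed are assumptions made for this example and are not prescribed by the text above.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Randomly divide rows into a training list and a test list.

    test_fraction and seed are illustrative defaults, not recommendations.
    """
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                  # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical data set of 20 record identifiers.
rows = list(range(20))
train, test = train_test_split(rows)
```

Every record lands in exactly one of the two data sets, so the model is never tested on a record that it was trained on.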

Data can also be split by the value of a column. Suppose, for example, that you have a large data set that contains demographic information about clients. Because you want to create a separate model for the clients in each city, you split the input data by the value of the cities column.
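
Splitting by the value of a column can be sketched in Python as follows. The function name split_by_column, the sample client records, and the column name city are hypothetical.

```python
from collections import defaultdict

def split_by_column(rows, column):
    """Group rows (dictionaries) into one list per distinct column value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    return dict(groups)

# Hypothetical client records.
clients = [
    {"name": "Ana", "city": "Lisbon", "age": 34},
    {"name": "Ben", "city": "Berlin", "age": 29},
    {"name": "Cleo", "city": "Lisbon", "age": 41},
]
by_city = split_by_column(clients, "city")
```

Each entry of by_city can then be used as the input data for a separate model.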

Standardization and normalization

Standardization and normalization are used during data preprocessing, when the data is prepared for later processing in data mining and machine learning. Both methods rescale the data set by modifying continuous attributes so that they have the desired distribution properties.

For these transformations, the original continuous attribute a is used to generate a new continuous attribute a'. The new attribute has a different range or distribution than the original attribute. Common transformations either modify the range to fit the [-1, 1] interval (normalization) or modify the distribution to have a mean of 0 and a standard deviation of 1 (standardization).
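
Both transformations can be sketched in a few lines of Python. The function names normalize and standardize and the sample values are hypothetical. The use of min-max scaling for normalization and of the population standard deviation for standardization are assumptions of this sketch, because the description above does not specify them.

```python
from statistics import mean, pstdev

def normalize(values, low=-1.0, high=1.0):
    """Linearly rescale values so that they span the interval [low, high]."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("cannot normalize a constant attribute")
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]

def standardize(values):
    """Shift and scale values to a mean of 0 and a standard deviation of 1.

    Uses the population standard deviation (an assumption of this sketch).
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant attribute")
    return [(v - mu) / sigma for v in values]

# Hypothetical continuous attribute a and its transformed versions a'.
a = [2.0, 4.0, 6.0, 8.0]
a_norm = normalize(a)
a_std = standardize(a)
```

In this sketch, the smallest and largest values of a_norm are the endpoints of the target interval, and a_std has a mean of 0 and a population standard deviation of 1, up to floating-point rounding.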