Feature engineering preprocesses raw data into a machine-readable format. It optimizes ML model performance by transforming and selecting relevant features.
Feature engineering is the process of transforming raw data into relevant information for use by machine learning models. In other words, feature engineering is the process of creating predictive model features. A feature—also called a dimension—is an input variable used to generate model predictions. Because model performance largely rests on the quality of data used during training, feature engineering is a crucial preprocessing technique that requires selecting the most relevant aspects of raw training data for both the predictive task and model type under consideration.1
Before proceeding, a quick note on terminology. Many sources use feature engineering and feature extraction interchangeably to denote the process of creating model variables.2 At times, sources also use feature extraction to refer to remapping an original feature space onto a lower-dimensional feature space.3 Feature selection, by contrast, is a form of dimensionality reduction. Specifically, it is the process of selecting a subset of variables in order to create a new model, with the aim of reducing multicollinearity and thereby maximizing model generalizability and optimization.
Given that a model is only as good as the data on which it is based, data scientists spend a large portion of their time on data preparation and feature creation in order to build high-quality models. Depending on the complexity of the raw data and the desired predictive model, feature engineering can require much trial and error.
A handful of sources and online tutorials break feature engineering down into discrete steps, the number and names of which vary. These steps can include feature understanding, structuring or construction, transformation, evaluation, optimization and so on.4 While such stratification can be useful for providing a general overview of the tasks involved in feature engineering, it suggests that feature engineering is a linear process. In fact, feature engineering is an iterative process.
Feature engineering is context-dependent. It requires substantial data analysis and domain knowledge, because the most effective encoding for a feature can depend on the type of model used, the relationship between predictors and output, and the problem a model is intended to address.5 This is compounded by the fact that different kinds of datasets—for example text versus images—may be better suited to different feature engineering techniques.6 Thus, it can be difficult to make specific recommendations on how best to implement feature engineering for a given machine learning algorithm.
Although there is no universally preferred feature engineering method or pipeline, there are a handful of common tasks used to create features from different data types for different models. Before implementing any of these techniques, however, one must conduct a thorough data analysis to determine both the relevant features and the appropriate number of features for addressing a given problem. It is also best to apply data cleaning and preprocessing techniques, such as imputation for missing values, while addressing outliers that can negatively impact model predictions.
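As one hedged illustration of this preprocessing step, the following minimal sketch imputes missing values with scikit-learn's SimpleImputer; the sample array and the choice of the mean strategy are illustrative assumptions, not a prescribed recipe.

```python
# A minimal sketch of imputing missing values before feature engineering;
# the sample data and the mean strategy are illustrative choices.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[25.0, 50000.0],
              [np.nan, 62000.0],
              [40.0, np.nan]])

# Replace each missing entry with the mean of its column
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```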
Feature transformation is the process of converting one feature type into another, more readable form for a particular model. This consists of transforming continuous data into categorical data, or vice versa.
Binning. This technique essentially transforms continuous, numerical values into categorical features. Specifically, binning compares each value to the neighborhood of values surrounding it and then sorts data points into a number of bins. A rudimentary example of binning is age demographics, in which continuous ages are divided into age groups, for example 18-25, 26-30, and so on. Once values have been placed into bins, one can further smooth the bins by means, medians or boundaries. Smoothing bins replaces a bin’s contained values with bin-derived values. For instance, if we smooth a bin containing age values between 18-25 by the mean, we replace each value in that bin with the mean of that bin’s values. Binning creates categorical values from continuous ones; smoothing bins is a form of local smoothing meant to reduce noise in input data.7
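To make this concrete, here is a minimal sketch of binning and bin-mean smoothing using pandas; the column name, bin edges and sample ages are illustrative assumptions.

```python
# A minimal sketch of binning continuous ages and smoothing by bin means.
import pandas as pd

df = pd.DataFrame({"age": [19, 22, 24, 27, 29, 34, 41, 58]})

# Sort continuous ages into categorical bins (age groups)
df["age_group"] = pd.cut(df["age"],
                         bins=[18, 25, 30, 45, 65],
                         labels=["18-25", "26-30", "31-45", "46-65"])

# Smooth by bin means: replace each value with the mean of its bin
df["age_smoothed"] = df.groupby("age_group", observed=True)["age"].transform("mean")
print(df)
```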
One-hot encoding. This is the inverse of binning; it creates numerical features from categorical variables. One-hot encoding maps categorical features to binary representations, which are used to map the feature in a matrix or vector space. Literature often refers to this binary representation as a dummy variable. Because one-hot encoding ignores order, it is best used for nominal categories. Bag of words models are an example of one-hot encoding frequently used in natural language processing tasks. Another example of one-hot encoding is spam filtering classification in which the categories spam and not spam are converted to 1 and 0 respectively.8
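A minimal sketch of one-hot encoding with pandas follows; the spam/not-spam labels mirror the example above, and get_dummies is just one of several ways to produce dummy variables.

```python
# A minimal sketch of one-hot encoding a nominal categorical feature.
import pandas as pd

df = pd.DataFrame({"label": ["spam", "not_spam", "spam", "not_spam"]})

# Each category becomes a binary (dummy) column of 0s and 1s
dummies = pd.get_dummies(df["label"], prefix="label", dtype=int)
print(dummies)
```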
Feature extraction is a technique for creating a new dimensional space for a model by combining variables into new, surrogate variables, in order to reduce the dimensions of the model’s feature space.9 By comparison, feature selection denotes techniques for selecting a subset of the most relevant features to represent a model. Both feature extraction and selection are forms of dimensionality reduction, and so are suitable for regression problems with a large number of features and limited available data samples.
Principal component analysis. Principal component analysis (PCA) is a common feature extraction method that combines and transforms a dataset’s original features to produce new features, called principal components. PCA selects a subset of variables from a model that together comprise the majority or all of the variance present in the model’s original set of variables. PCA then projects data onto a new space defined by this subset of variables.10
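As an illustration, the following sketch applies scikit-learn's PCA to synthetic data; the number of components and the random data are illustrative assumptions.

```python
# A minimal sketch of PCA: project 5 original features onto 2 principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 samples, 5 original features

pca = PCA(n_components=2)              # keep the 2 components with the most variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # variance captured by each component
```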
Linear discriminant analysis. Linear discriminant analysis (LDA) is ostensibly similar to PCA in that it projects model data onto a new, lower dimensional space. As in PCA, this model space’s dimensions (or features) are derived from the initial model’s features. LDA differs from PCA, however, in its concern for retaining classification labels in the original dataset. While PCA produces new component variables meant to maximize data variance, LDA produces component variables primarily intended to maximize class difference in the data.11
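The following sketch contrasts LDA with PCA using scikit-learn on the bundled iris dataset; projecting onto two components is an illustrative choice. Note that, unlike PCA, LDA is fitted with the class labels.

```python
# A minimal sketch of LDA: project features onto components that maximize class separation.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# LDA uses the class labels y, whereas PCA ignores them
lda = LinearDiscriminantAnalysis(n_components=2)
X_projected = lda.fit_transform(X, y)

print(X_projected.shape)   # (150, 2)
```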
Certain features have intrinsic upper and lower bounds that limit possible feature values, such as time-series data or age. But in many cases, model features may not have a limitation on possible values, and large feature scales (the difference between a feature’s lowest and highest values) can negatively affect certain models. Feature scaling (sometimes called feature normalization) is a standardization technique to rescale features and limit the impact of large scales on models.12 While feature transformation converts data from one type to another, feature scaling transforms data in terms of range and distribution, maintaining its original data type.13
Min-max scaling. Min-max scaling rescales all values for a given feature so that they fall between specified minimum and maximum values, often 0 and 1. Each data point’s value for the selected feature (represented by x) is computed against the feature’s minimum and maximum values, min(x) and max(x) respectively, which produces the new feature value for that data point (represented by x̃). Min-max scaling is calculated using the formula:14

x̃ = (x − min(x)) / (max(x) − min(x))
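A minimal sketch of min-max scaling with scikit-learn's MinMaxScaler follows; the sample values are illustrative.

```python
# A minimal sketch of min-max scaling to the range [0, 1].
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[2.0], [5.0], [10.0], [20.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
x_scaled = scaler.fit_transform(x)     # (x - min(x)) / (max(x) - min(x))
print(x_scaled.ravel())                # [0.    0.167 0.444 1.   ]
```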
Z-score scaling. Literature also refers to this as standardization and variance scaling. Whereas min-max scaling scales feature values to fit within designated minimum and maximum values, z-score scaling rescales features so that they share a standard deviation of 1 with a mean of 0. Z-score scaling is represented by the formula:

x̃ = (x − mean(x)) / sqrt(var(x))
Here, a given feature value (x) has the feature’s mean subtracted from it and is then divided by the feature’s standard deviation (represented as sqrt(var(x))). Z-score scaling can be useful when implementing feature extraction methods like PCA and LDA, as these two methods require features to share the same scale.15
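A corresponding sketch of z-score scaling uses scikit-learn's StandardScaler; again, the sample values are illustrative.

```python
# A minimal sketch of z-score scaling (standardization) to mean 0, standard deviation 1.
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[2.0], [5.0], [10.0], [20.0]])

scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)       # (x - mean(x)) / sqrt(var(x))

print(x_scaled.mean(), x_scaled.std())   # approximately 0.0 and 1.0
```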
Automation. Automated feature engineering has been an ongoing area of research for a few decades.16 Python libraries such as tsflex and featuretools help automate feature extraction and transformation for time series data. Developers continue to release new packages and algorithms that automate feature engineering for linear regression models and other data types in order to increase model accuracy.17 More recently, automated feature engineering has figured as one part of larger endeavors to build automated machine learning (AutoML) systems, which aim to make machine learning more accessible to non-experts.18
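As one hedged illustration, the sketch below uses featuretools to derive simple date-based features from a single table via deep feature synthesis; the calls assume a recent (1.x) featuretools release, and the column names and chosen primitives are illustrative assumptions.

```python
# A minimal sketch of automated feature generation with featuretools (assumes featuretools 1.x).
import pandas as pd
import featuretools as ft

df = pd.DataFrame({
    "tid": [1, 2, 3],
    "amount": [12.5, 7.0, 31.2],
    "time": pd.to_datetime(["2024-01-03", "2024-02-14", "2024-03-09"]),
})

es = ft.EntitySet(id="example")
es = es.add_dataframe(dataframe_name="transactions", dataframe=df,
                      index="tid", time_index="time")

# Deep feature synthesis derives new features from the raw columns automatically
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_dataframe_name="transactions",
                                      trans_primitives=["month", "weekday"])
print(feature_defs)
```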
Deep learning. Feature engineering can be a laborious and time-consuming process, involving a significant amount of trial and error. Deep learning allows the user to specify a small set of basic features that the neural network architecture aggregates into higher-level features, also called representations.19 One such example is computer vision image processing and pattern recognition, in which a model learns to identify semantically meaningful objects (for example cars, people, and so on) in terms of simpler concepts (for example edges, contours, and so on) by concatenating feature maps.20 Recent studies, however, have combined feature engineering with neural networks and other deep learning techniques for classification tasks, such as fraud detection, with promising results.21
1 Alice Zheng and Amanda Casari, Feature Engineering for Machine Learning, O’Reilly, 2018. Sinan Ozdemir and Divya Susarla, Feature Engineering Made Easy, Packt, 2018.
2 Yoav Goldberg, Neural Network Methods for Natural Language Processing, Springer, 2022.
3 Suhang Wang, Jiliang Tang, and Huan Liu, “Feature Selection,” Encyclopedia of Machine Learning and Data Mining, Springer, 2017.
4 Sinan Ozdemir, Feature Engineering Bookcamp, Manning Publications, 2022. Sinan Ozdemir and Divya Susarla, Feature Engineering Made Easy, Packt, 2018.
5 Max Kuhn and Kjell Johnson, Applied Predictive Modeling, Springer, 2016.
6 Alice Zheng and Amanda Casari, Feature Engineering for Machine Learning, O’Reilly, 2018.
7 Jiawei Han, Data Mining: Concepts and Techniques, 3rd edition, Morgan Kaufmann, 2012.
8 Kevin Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012. Soledad Galli, Python Feature Engineering Cookbook, 2nd edition, Packt, 2022.
9 Max Kuhn and Kjell Johnson, Applied Predictive Modeling, Springer, 2016.
10 I.T. Jolliffe, Principal Component Analysis, Springer, 2002.
11 Chris Albon, Machine Learning with Python Cookbook, O’Reilly, 2018.
12 Alice Zheng and Amanda Casari, Feature Engineering for Machine Learning, O’Reilly, 2018.
13 Zahraa Abdallah, Lan Du, and Geoffrey Webb, “Data preparation,” Encyclopedia of Machine Learning and Data Mining, Springer, 2017.
14 Alice Zheng and Amanda Casari, Feature Engineering for Machine Learning, O’Reilly, 2018.
15 Zahraa Abdallah, Lan Du, and Geoffrey Webb, “Data preparation,” Encyclopedia of Machine Learning and Data Mining, Springer, 2017. Alice Zheng and Amanda Casari, Feature Engineering for Machine Learning, O’Reilly, 2018.
16 James Kanter and Kalyan Veeramachaneni, "Deep feature synthesis: Towards automating data science endeavors," IEEE International Conference on Data Science and Advanced Analytics, 2015, https://ieeexplore.ieee.org/document/7344858.
17 Udayan Khurana, Deepak Turaga, Horst Samulowitz, and Srinivasan Parthasarathy, "Cognito: Automated Feature Engineering for Supervised Learning," IEEE 16th International Conference on Data Mining Workshops, 2016, pp. 1304-130, https://ieeexplore.ieee.org/abstract/document/7836821. Franziska Horn, Robert Pack, and Michael Rieger, "The autofeat Python Library for Automated Feature Engineering and Selection," Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2019, pp. 111-120, https://link.springer.com/chapter/10.1007/978-3-030-43823-4_10.
18 Ahmad Alsharef, Karan Aggarwal, Sonia, Manoj Kumar, and Ashutosh Mishra, "Review of ML and AutoML Solutions to Forecast Time‑Series Data," Archives of Computational Methods in Engineering, Vol. 29, 2022, pp. 5297–5311, https://link.springer.com/article/10.1007/s11831-022-09765-0. Sjoerd Boeschoten, Cagatay Catal, Bedir Tekinerdogan, Arjen Lommen, and Marco Blokland, "The automation of the development of classification models and improvement of model quality using feature engineering techniques," Expert Systems with Applications, Vol. 213, 2023, https://www.sciencedirect.com/science/article/pii/S0957417422019303. Shubhra Kanti Karmaker, Mahadi Hassan, Micah Smith, Lei Xu, Chengxiang Zhai, and Kalyan Veeramachaneni, "AutoML to Date and Beyond: Challenges and Opportunities," ACM Computing Surveys, Vol. 54, No. 8, 2022, pp. 1-36, https://dl.acm.org/doi/abs/10.1145/3470918.
19 Yoav Goldberg, Neural Network Methods for Natural Language Processing, Springer, 2022.
20 Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016, https://www.deeplearningbook.org/
21 Xinwei Zhang, Yaoci Han, Wei Xu, and Qili Wang, "HOBA: A novel feature engineering methodology for credit card fraud detection with a deep learning architecture," Information Sciences, Vol. 557, 2021, pp. 302-316, https://www.sciencedirect.com/science/article/abs/pii/S002002551930427X. Daniel Gibert, Jordi Planes, Carles Mateu, and Quan Le, "Fusing feature engineering and deep learning: A case study for malware classification," Expert Systems with Applications, Vol. 207, 2022, https://www.sciencedirect.com/science/article/pii/S0957417422011927. Ebenezer Esenogho, Ibomoiye Domor Mienye, Theo Swart, Kehinde Aruleba, and George Obaido, "A Neural Network Ensemble With Feature Engineering for Improved Credit Card Fraud Detection," IEEE Access, Vol. 10, 2020, pp. 16400-16407, https://ieeexplore.ieee.org/abstract/document/9698195.