Feature engineering preprocesses raw data into a machine-readable format. It optimizes ML model performance by transforming and selecting relevant features.
Feature engineering is the process of transforming raw data into relevant information for use by machine learning models. In other words, feature engineering is the process of creating predictive model features. A feature—also called a dimension—is an input variable used to generate model predictions. Because model performance largely rests on the quality of data used during training, feature engineering is a crucial preprocessing technique that requires selecting the most relevant aspects of raw training data for both the predictive task and model type under consideration.1
Before proceeding, a quick note on terminology. Many sources use feature engineering and feature extraction interchangeably to denote the process of creating model variables.2 At times, sources also use feature extraction to refer to remapping an original feature space onto a lower-dimensional feature space.3 Feature selection, by contrast, is a form of dimensionality reduction. Specifically, it is the process of selecting a subset of variables in order to create a new model, with the aim of reducing multicollinearity and thereby maximizing model generalizability and optimization.
Given that a model is only as good as the data on which it is based, data scientists spend a large portion of their time on data preparation and feature creation in order to build high-quality models. Depending on the complexity of the raw data and the desired predictive model, feature engineering can require much trial and error.
A handful of sources and online tutorials break feature engineering down into discrete steps, the number and names of which vary. These steps can include feature understanding, structuring or construction, transformation, evaluation, optimization and so on.4 While such stratification can be useful for providing a general overview of the tasks involved in feature engineering, it suggests that feature engineering is a linear process. In fact, feature engineering is an iterative process.
Feature engineering is context-dependent. It requires substantial data analysis and domain knowledge, because the most effective encoding for a feature can depend on the type of model used, the relationship between predictors and output, and the problem a model is intended to address.5 This is compounded by the fact that different kinds of datasets—for example text versus images—may be better suited to different feature engineering techniques.6 Thus, it can be difficult to make specific recommendations on how best to implement feature engineering for a given machine learning algorithm.
Although there is no universally preferred feature engineering method or pipeline, there are a handful of common tasks used to create features from different data types for different models. Before implementing any of these techniques, however, one must conduct a thorough data analysis to determine both the relevant features and the appropriate number of features for addressing a given problem. It is also best to apply data cleaning and preprocessing techniques, such as imputation for missing values, while addressing outliers that can negatively impact model predictions.
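As one hedged illustration of this preprocessing step, the following minimal sketch imputes missing values with scikit-learn's SimpleImputer; the sample array and the choice of the mean strategy are illustrative assumptions, not a prescribed recipe.

```python
# A minimal sketch of imputing missing values before feature engineering;
# the sample data and the mean strategy are illustrative choices.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[25.0, 50000.0],
              [np.nan, 62000.0],
              [40.0, np.nan]])

# Replace each missing entry with the mean of its column
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```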
Feature transformation is the process of converting one feature type into another, more readable form for a particular model. This consists of transforming continuous data into categorical data, or vice versa.
Binning. This technique essentially transforms continuous, numerical values into categorical features. Specifically, binning compares each value to the neighborhood of values surrounding it and then sorts data points into a number of bins. A rudimentary example of binning is age demographics, in which continuous ages are divided into age groups, for example 18-25, 26-30, and so on. Once values have been placed into bins, one can further smooth the bins by means, medians or boundaries. Smoothing bins replaces a bin’s contained values with bin-derived values. For instance, if we smooth a bin containing age values between 18-25 by the mean, we replace each value in that bin with the mean of that bin’s values. Binning creates categorical values from continuous ones; smoothing bins is a form of local smoothing meant to reduce noise in input data.7
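To make this concrete, here is a minimal sketch of binning and bin-mean smoothing using pandas; the column name, bin edges and sample ages are illustrative assumptions.

```python
# A minimal sketch of binning continuous ages and smoothing by bin means.
import pandas as pd

df = pd.DataFrame({"age": [19, 22, 24, 27, 29, 34, 41, 58]})

# Sort continuous ages into categorical bins (age groups)
df["age_group"] = pd.cut(df["age"],
                         bins=[18, 25, 30, 45, 65],
                         labels=["18-25", "26-30", "31-45", "46-65"])

# Smooth by bin means: replace each value with the mean of its bin
df["age_smoothed"] = df.groupby("age_group", observed=True)["age"].transform("mean")
print(df)
```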
One-hot encoding. This is the inverse of binning; it creates numerical features from categorical variables. One-hot encoding maps categorical features to binary representations, which are used to map the feature in a matrix or vector space. Literature often refers to this binary representation as a dummy variable. Because one-hot encoding ignores order, it is best used for nominal categories. Bag of words models are an example of one-hot encoding frequently used in natural language processing tasks. Another example of one-hot encoding is spam filtering classification in which the categories spam and not spam are converted to 1 and 0 respectively.8
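A minimal sketch of one-hot encoding with pandas follows; the spam/not-spam labels mirror the example above, and get_dummies is just one of several ways to produce dummy variables.

```python
# A minimal sketch of one-hot encoding a nominal categorical feature.
import pandas as pd

df = pd.DataFrame({"label": ["spam", "not_spam", "spam", "not_spam"]})

# Each category becomes a binary (dummy) column of 0s and 1s
dummies = pd.get_dummies(df["label"], prefix="label", dtype=int)
print(dummies)
```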
Feature extraction is a technique for creating a new dimensional space for a model by combining variables into new, surrogate variables, in order to reduce the dimensions of the model’s feature space.9 By comparison, feature selection denotes techniques for selecting a subset of the most relevant features to represent a model. Both feature extraction and selection are forms of dimensionality reduction, and so are suitable for regression problems with a large number of features and limited available data samples.
Principal component analysis. Principal component analysis (PCA) is a common feature extraction method that combines and transforms a dataset’s original features to produce new features, called principal components. PCA selects a subset of variables from a model that together comprise the majority or all of the variance present in the model’s original set of variables. PCA then projects data onto a new space defined by this subset of variables.10
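As an illustration, the following sketch applies scikit-learn's PCA to synthetic data; the number of components and the random data are illustrative assumptions.

```python
# A minimal sketch of PCA: project 5 original features onto 2 principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 samples, 5 original features

pca = PCA(n_components=2)              # keep the 2 components with the most variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # variance captured by each component
```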
Linear discriminant analysis. Linear discriminant analysis (LDA) is ostensibly similar to PCA in that it projects model data onto a new, lower dimensional space. As in PCA, this model space’s dimensions (or features) are derived from the initial model’s features. LDA differs from PCA, however, in its concern for retaining classification labels in the original dataset. While PCA produces new component variables meant to maximize data variance, LDA produces component variables primarily intended to maximize class difference in the data.11
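The following sketch contrasts LDA with PCA using scikit-learn on the bundled iris dataset; projecting onto two components is an illustrative choice. Note that, unlike PCA, LDA is fitted with the class labels.

```python
# A minimal sketch of LDA: project features onto components that maximize class separation.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# LDA uses the class labels y, whereas PCA ignores them
lda = LinearDiscriminantAnalysis(n_components=2)
X_projected = lda.fit_transform(X, y)

print(X_projected.shape)   # (150, 2)
```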
Certain features have intrinsic upper and lower bounds that limit possible feature values, such as time-series data or age. But in many cases, model features may not have a limitation on possible values, and large feature scales (the difference between a feature’s lowest and highest values) can negatively affect certain models. Feature scaling (sometimes called feature normalization) is a standardization technique to rescale features and limit the impact of large scales on models.12 While feature transformation converts data from one type to another, feature scaling transforms data in terms of range and distribution, maintaining its original data type.13
Min-max scaling. Min-max scaling rescales all values for a given feature so that they fall between specified minimum and maximum values, often 0 and 1. Each data point’s value for the selected feature (represented by x) is computed against the feature’s minimum and maximum values, min(x) and max(x) respectively, which produces the new feature value for that data point (represented by x̃). Min-max scaling is calculated using the formula:14

x̃ = (x − min(x)) / (max(x) − min(x))
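A minimal sketch of min-max scaling with scikit-learn's MinMaxScaler follows; the sample values are illustrative.

```python
# A minimal sketch of min-max scaling to the range [0, 1].
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[2.0], [5.0], [10.0], [20.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
x_scaled = scaler.fit_transform(x)     # (x - min(x)) / (max(x) - min(x))
print(x_scaled.ravel())                # [0.    0.167 0.444 1.   ]
```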
Z-score scaling. Literature also refers to this as standardization and variance scaling. Whereas min-max scaling scales feature values to fit within designated minimum and maximum values, z-score scaling rescales features so that they share a standard deviation of 1 with a mean of 0. Z-score scaling is represented by the formula:

x̃ = (x − mean(x)) / sqrt(var(x))
Here, a given feature value (x) has the feature’s mean subtracted from it and is then divided by the feature’s standard deviation (represented as sqrt(var(x))). Z-score scaling can be useful when implementing feature extraction methods like PCA and LDA, as these two methods require features to share the same scale.15
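A corresponding sketch of z-score scaling uses scikit-learn's StandardScaler; again, the sample values are illustrative.

```python
# A minimal sketch of z-score scaling (standardization) to mean 0, standard deviation 1.
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[2.0], [5.0], [10.0], [20.0]])

scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)       # (x - mean(x)) / sqrt(var(x))

print(x_scaled.mean(), x_scaled.std())   # approximately 0.0 and 1.0
```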
Automation. Automated feature engineering has been an ongoing area of research for a few decades.16 Python libraries such as tsflex and featuretools help automate feature extraction and transformation for time series data. Developers continue to release new packages and algorithms that automate feature engineering for linear regression models and other data types in order to increase model accuracy.17 More recently, automated feature engineering has figured as one part of larger endeavors to build automated machine learning (AutoML) systems, which aim to make machine learning more accessible to non-experts.18
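As one hedged illustration, the sketch below uses featuretools to derive simple date-based features from a single table via deep feature synthesis; the calls assume a recent (1.x) featuretools release, and the column names and chosen primitives are illustrative assumptions.

```python
# A minimal sketch of automated feature generation with featuretools (assumes featuretools 1.x).
import pandas as pd
import featuretools as ft

df = pd.DataFrame({
    "tid": [1, 2, 3],
    "amount": [12.5, 7.0, 31.2],
    "time": pd.to_datetime(["2024-01-03", "2024-02-14", "2024-03-09"]),
})

es = ft.EntitySet(id="example")
es = es.add_dataframe(dataframe_name="transactions", dataframe=df,
                      index="tid", time_index="time")

# Deep feature synthesis derives new features from the raw columns automatically
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_dataframe_name="transactions",
                                      trans_primitives=["month", "weekday"])
print(feature_defs)
```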
Deep learning. Feature engineering can be a laborious and time-consuming process, involving a significant amount of trial and error. Deep learning allows the user to specify a small set of basic features that the neural network architecture aggregates into higher-level features, also called representations.19 One such example is computer vision image processing and pattern recognition, in which a model learns to identify semantically meaningful objects (for example cars, people, and so on) in terms of simpler concepts (for example edges, contours, and so on) by concatenating feature maps.20 Recent studies, however, have combined feature engineering with neural networks and other deep learning techniques for classification tasks, such as fraud detection, with promising results.21
1 Alice Zheng and Amanda Casari, Feature Engineering for Machine Learning, O’Reilly, 2018. Sinan Ozdemir and Divya Susarla, Feature Engineering Made Easy, Packt, 2018.
2 Yoav Goldberg, Neural Network Methods for Natural Language Processing, Springer, 2022.
3 Suhang Wang, Jiliang Tang, and Huan Liu, “Feature Selection,” Encyclopedia of Machine Learning and Data Mining, Springer, 2017.
4 Sinan Ozdemir, Feature Engineering Bookcamp, Manning Publications, 2022. Sinan Ozdemir and Divya Susarla, Feature Engineering Made Easy, Packt, 2018.
5 Max Kuhn and Kjell Johnson, Applied Predictive Modeling, Springer, 2016.
6 Alice Zheng and Amanda Casari, Feature Engineering for Machine Learning, O’Reilly, 2018.
7 Jiawei Han, Data Mining: Concepts and Techniques, 3rd edition, Morgan Kaufmann, 2012.
8 Kevin Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012. Soledad Galli, Python Feature Engineering Cookbook, 2nd edition, Packt, 2022.
9 Max Kuhn and Kjell Johnson, Applied Predictive Modeling, Springer, 2016.
10 I.T. Jolliffe, Principal Component Analysis, Springer, 2002.
11 Chris Albon, Machine Learning with Python Cookbook, O’Reilly, 2018.
12 Alice Zheng and Amanda Casari, Feature Engineering for Machine Learning, O’Reilly, 2018.
13 Zahraa Abdallah, Lan Du, and Geoffrey Webb, “Data preparation,” Encyclopedia of Machine Learning and Data Mining, Springer, 2017.
14 Alice Zheng and Amanda Casari, Feature Engineering for Machine Learning, O’Reilly, 2018.
15 Zahraa Abdallah, Lan Du, and Geoffrey Webb, “Data preparation,” Encyclopedia of Machine Learning and Data Mining, Springer, 2017. Alice Zheng and Amanda Casari, Feature Engineering for Machine Learning, O’Reilly, 2018.
16 James Kanter and Kalyan Veeramachaneni, "Deep feature synthesis: Towards automating data science endeavors," IEEE International Conference on Data Science and Advanced Analytics, 2015, https://ieeexplore.ieee.org/document/7344858.
17 Udayan Khurana, Deepak Turaga, Horst Samulowitz, and Srinivasan Parthasarathy, "Cognito: Automated Feature Engineering for Supervised Learning," IEEE 16th International Conference on Data Mining Workshops, 2016, pp. 1304-130, https://ieeexplore.ieee.org/abstract/document/7836821. Franziska Horn, Robert Pack, and Michael Rieger, "The autofeat Python Library for Automated Feature Engineering and Selection," Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2019, pp. 111-120, https://link.springer.com/chapter/10.1007/978-3-030-43823-4_10.
18 Ahmad Alsharef, Karan Aggarwal, Sonia, Manoj Kumar, and Ashutosh Mishra, "Review of ML and AutoML Solutions to Forecast Time‑Series Data," Archives of Computational Methods in Engineering, Vol. 29, 2022, pp. 5297–5311, https://link.springer.com/article/10.1007/s11831-022-09765-0. Sjoerd Boeschoten, Cagatay Catal, Bedir Tekinerdogan, Arjen Lommen, and Marco Blokland, "The automation of the development of classification models and improvement of model quality using feature engineering techniques," Expert Systems with Applications, Vol. 213, 2023, https://www.sciencedirect.com/science/article/pii/S0957417422019303. Shubhra Kanti Karmaker, Mahadi Hassan, Micah Smith, Lei Xu, Chengxiang Zhai, and Kalyan Veeramachaneni, "AutoML to Date and Beyond: Challenges and Opportunities," ACM Computing Surveys, Vol. 54, No. 8, 2022, pp. 1-36, https://dl.acm.org/doi/abs/10.1145/3470918.
19 Yoav Goldberg, Neural Network Methods for Natural Language Processing, Springer, 2022.
20 Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016, https://www.deeplearningbook.org/
21 Xinwei Zhang, Yaoci Han, Wei Xu, and Qili Wang, "HOBA: A novel feature engineering methodology for credit card fraud detection with a deep learning architecture," Information Sciences, Vol. 557, 2021, pp. 302-316, https://www.sciencedirect.com/science/article/abs/pii/S002002551930427X. Daniel Gibert, Jordi Planes, Carles Mateu, and Quan Le, "Fusing feature engineering and deep learning: A case study for malware classification," Expert Systems with Applications, Vol. 207, 2022, https://www.sciencedirect.com/science/article/pii/S0957417422011927. Ebenezer Esenogho, Ibomoiye Domor Mienye, Theo Swart, Kehinde Aruleba, and George Obaido, "A Neural Network Ensemble With Feature Engineering for Improved Credit Card Fraud Detection," IEEE Access, Vol. 10, 2020, pp. 16400-16407, https://ieeexplore.ieee.org/abstract/document/9698195.