What is feature engineering?

Published: 20 January 2024
Contributors: Jacob Murel Ph.D., Eda Kavlakoglu

Feature engineering preprocesses raw data into a machine-readable format. It optimizes ML model performance by transforming and selecting relevant features.

Feature engineering is the process of transforming raw data into relevant information for use by machine learning models. In other words, feature engineering is the process of creating predictive model features. A feature—also called a dimension—is an input variable used to generate model predictions. Because model performance largely rests on the quality of data used during training, feature engineering is a crucial preprocessing technique that requires selecting the most relevant aspects of raw training data for both the predictive task and model type under consideration.¹

Before proceeding, a quick note on terminology. Many sources use feature engineering and feature extraction interchangeably to denote the process of creating model variables.² At times, sources also use feature extraction to refer to remapping an original feature space onto a lower-dimensional feature space.³ Feature selection, by contrast, is a form of dimensionality reduction. Specifically, it is the process of selecting a subset of variables to build a new model, with the aim of reducing multicollinearity and thereby improving model generalizability and optimization.

Feature engineering process

Given that a model is only as good as the data on which it is based, data scientists spend a large portion of their time on data preparation and feature creation in order to build high-quality models. Depending on the complexity of one’s raw data and the desired predictive model, feature engineering can require much trial and error.

A handful of sources and online tutorials break feature engineering down into discrete steps, the number and names of which typically vary. These steps can include feature understanding, structuring or construction, transformation, evaluation, and optimization, among others.⁴ While such stratification can be useful for providing a general overview of the tasks involved in feature engineering, it suggests that feature engineering is a linear process. In fact, feature engineering is an iterative process.

Feature engineering is context-dependent. It requires substantial data analysis and domain knowledge. This is because the effective encoding for features can be determined by the type of model used, the relationship between predictors and output, and the problem a model is intended to address.⁵ This is compounded by the fact that different kinds of datasets (for example, text versus images) may be better suited to different feature engineering techniques.⁶ Thus, it can be difficult to make specific remarks on how best to implement feature engineering within a given machine learning algorithm.

Feature engineering techniques

Although there is no universally preferred feature engineering method or pipeline, there are a handful of common tasks used to create features from different data types for different models. Before implementing any of these techniques, however, one must remember to conduct a thorough data analysis to determine both the relevant features and the appropriate number of features for addressing a given problem. Additionally, it is best to implement various data cleaning and preprocessing techniques, such as imputation for missing values, while also addressing outliers that can negatively impact model predictions.
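As a rough illustration of the cleaning steps mentioned above, the following sketch fills missing values with the median and then clips outliers. The sensor readings, the choice of median imputation, and the 1.5 × IQR clipping rule are all assumptions made for this example; other imputation and outlier strategies may suit a given dataset better.

```python
from statistics import median, quantiles

# Hypothetical readings; None marks a missing value and 250.0 is an outlier.
readings = [12.0, 15.0, None, 14.0, 13.0, 250.0, None, 16.0]

# Imputation: replace each missing value with the median of the observed values.
observed = [r for r in readings if r is not None]
fill_value = median(observed)
imputed = [fill_value if r is None else r for r in readings]

# Outlier handling (assumed rule): clip values that lie more than
# 1.5 * IQR below the first quartile or above the third quartile.
q1, _, q3 = quantiles(imputed, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
clipped = [min(max(value, lower), upper) for value in imputed]

print(imputed)
print(clipped)
```

The median is used here rather than the mean because it is less sensitive to the very outliers the second step addresses.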

Feature transformation

Feature transformation is the process of converting one feature type into another, more readable form for a particular model. This consists of transforming continuous into categorical data, or vice-versa.

Binning. This technique essentially transforms continuous, numerical values into categorical features. Specifically, binning compares each value to the neighborhood of values surrounding it and then sorts data points into a number of bins. A rudimentary example of binning is age demographics, in which continuous ages are divided into age groups, e.g. 18-25, 26-30, etc. Once values have been placed into bins, one can further smooth the bins by means, medians, or boundaries. Smoothing bins replaces a bin’s contained values with bin-derived values. For instance, if we smooth a bin containing age values between 18 and 25 by the mean, we replace each value in that bin with the mean of that bin’s values. Smoothing bins is a form of local smoothing meant to reduce noise in input data.⁷
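The binning and mean-smoothing steps described above can be sketched in a few lines of plain Python. The ages, bin edges, and labels below are hypothetical, and `assign_bin` is a helper written for this example rather than a library function.

```python
from statistics import mean

# Hypothetical ages; the bin edges and labels are illustrative assumptions.
ages = [18, 21, 24, 27, 29, 33, 38, 41]
bin_edges = [18, 26, 31, 45]  # half-open bins: [18, 26), [26, 31), [31, 45)
bin_labels = ["18-25", "26-30", "31-44"]


def assign_bin(value, edges):
    """Return the index of the half-open bin [edges[i], edges[i + 1]) holding value."""
    for i in range(len(edges) - 1):
        if edges[i] <= value < edges[i + 1]:
            return i
    raise ValueError(f"{value} falls outside the bin edges")


# Binning: each continuous age becomes a categorical label.
bin_indices = [assign_bin(age, bin_edges) for age in ages]
binned = [bin_labels[i] for i in bin_indices]

# Smoothing by bin means: replace each value with the mean of its bin.
bin_means = {
    i: mean(age for age, b in zip(ages, bin_indices) if b == i)
    for i in set(bin_indices)
}
smoothed = [bin_means[i] for i in bin_indices]

print(binned)
print(smoothed)
```

Smoothing by medians follows the same pattern with `statistics.median` in place of `mean`.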

One-hot encoding. This is the inverse of binning; it creates numerical features from categorical variables. One-hot encoding maps categorical features to binary representations, which are used to map the feature in a matrix or vector space. Literature often refers to this binary representation as a dummy variable. Because one-hot encoding ignores order, it is best used for nominal categories. Bag of words models are an example of one-hot encoding frequently used in natural language processing tasks. Another example of one-hot encoding is spam filtering classification in which the categories spam and not spam are converted to 1 and 0 respectively.⁸
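A minimal sketch of one-hot encoding follows, assuming a hypothetical nominal feature of color names. The `one_hot` helper is written for this example; libraries such as pandas and scikit-learn provide their own encoders.

```python
# Hypothetical nominal feature; the category names are illustrative only.
colors = ["red", "green", "blue", "green"]

# Fix a column order so every row is encoded against the same categories.
categories = sorted(set(colors))  # ['blue', 'green', 'red']


def one_hot(value, categories):
    """Return a binary vector with a single 1 in the position of value."""
    return [1 if value == category else 0 for category in categories]


encoded = [one_hot(color, categories) for color in colors]

for color, row in zip(colors, encoded):
    print(color, row)
```

Each row contains exactly one 1, so no ordering among the categories is implied, which is why the technique suits nominal data.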

Feature extraction and selection

Feature extraction is a technique for creating a new dimensional space for a model by combining variables into new, surrogate variables, often in order to reduce the dimensions of the model’s feature space.⁹ By comparison, feature selection denotes techniques for selecting a subset of the most relevant features to represent a model. Both feature extraction and selection are forms of dimensionality reduction, and so are suitable for regression problems with a large number of features and limited available data samples.

Principal component analysis. Principal component analysis (PCA) is a common feature extraction method that combines and transforms a dataset’s original features to produce new features, called principal components. PCA retains the subset of principal components that together account for most or all of the variance present in the original set of variables. PCA then projects the data onto a new space defined by this subset of components.¹⁰
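The idea can be sketched with NumPy's singular value decomposition. The data here is synthetic, and the 95% variance threshold is an assumption chosen for illustration, not a recommended default. Features are only centered in this sketch; in practice they are often standardized first.

```python
import numpy as np

# Synthetic data (assumption): 100 samples, 3 features, where the second
# feature is nearly a multiple of the first.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
X = np.hstack([
    base,
    2.0 * base + 0.1 * rng.normal(size=(100, 1)),
    rng.normal(size=(100, 1)),
])

# Center each feature so the components describe variance around the mean.
X_centered = X - X.mean(axis=0)

# The rows of Vt are the principal directions, ordered by decreasing variance.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained_variance = S**2 / (X.shape[0] - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Keep the fewest components whose cumulative share of variance reaches
# the (assumed) 95% threshold.
n_components = int(np.searchsorted(np.cumsum(explained_ratio), 0.95) + 1)

# Project the data onto the retained components.
X_reduced = X_centered @ Vt[:n_components].T

print(explained_ratio)
print(X_reduced.shape)
```

Because two of the three synthetic features are almost redundant, fewer than three components are needed to reach the threshold.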

Linear discriminant analysis. Linear discriminant analysis (LDA) is ostensibly similar to PCA in that it projects model data onto a new, lower dimensional space. As in PCA, this model space’s dimensions (or features) are derived from the initial model’s features. LDA differs from PCA, however, in its concern for retaining classification labels in the original dataset. While PCA produces new component variables meant to maximize data variance, LDA produces component variables primarily intended to maximize class difference in the data.¹¹
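To make the contrast with PCA concrete, the sketch below computes the two-class case of LDA (Fisher's linear discriminant) with NumPy. The two synthetic classes and their means are assumptions for illustration; multi-class LDA generalizes this by also using a between-class scatter matrix.

```python
import numpy as np

# Two synthetic classes in 2-D (assumption): same spread, different means.
rng = np.random.default_rng(0)
class_0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
class_1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(50, 2))

mean_0 = class_0.mean(axis=0)
mean_1 = class_1.mean(axis=0)

# Within-class scatter: how much each class spreads around its own mean.
scatter_within = (
    (class_0 - mean_0).T @ (class_0 - mean_0)
    + (class_1 - mean_1).T @ (class_1 - mean_1)
)

# Fisher's direction is proportional to S_w^{-1} (mean_1 - mean_0); it
# maximizes between-class separation relative to within-class spread.
direction = np.linalg.solve(scatter_within, mean_1 - mean_0)
direction /= np.linalg.norm(direction)

# Project both classes onto the single discriminant axis.
projected_0 = class_0 @ direction
projected_1 = class_1 @ direction

print(projected_0.mean(), projected_1.mean())
```

Unlike the PCA projection, this direction depends on the class labels: it is chosen to pull the class means apart rather than to capture overall variance.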

Feature scaling

Certain features have upper and lower bounds intrinsic to the data that limit possible feature values, such as time-series data or age. But in many cases, model features may not have a limitation on possible values, and such large feature scales (a scale being the difference between a feature’s lowest and highest values) can negatively affect certain models. Feature scaling (sometimes called feature normalization) is a standardization technique to rescale features and limit the impact of large scales on models.¹² While feature transformation transforms data from one type to another, feature scaling transforms data in terms of range and distribution, maintaining its original data type.¹³

Min-max scaling. Min-max scaling rescales all values for a given feature so that they fall between specified minimum and maximum values, often 0 and 1. Each data point’s value for the selected feature (represented by x) is rescaled relative to the feature’s minimum and maximum values, min(x) and max(x) respectively, which produces the new feature value for that data point (represented by x̃). For a target range of 0 to 1, min-max scaling is calculated using the formula:¹⁴

$$\tilde{x} = \frac{x - \min(x)}{\max(x) - \min(x)}$$
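The min-max rescaling described above can be sketched as a small function. The income values are hypothetical, and the `new_min` and `new_max` parameters are an assumed generalization for target ranges other than 0 to 1.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so they span [new_min, new_max]."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("cannot min-max scale a constant feature")
    scale = (new_max - new_min) / (high - low)
    return [new_min + (value - low) * scale for value in values]


# Hypothetical feature values.
incomes = [20_000, 35_000, 50_000, 120_000]
scaled = min_max_scale(incomes)

print(scaled)
```

Note that a single extreme value, such as the largest income here, compresses the remaining values toward the lower end of the range.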

Z-score scaling. Literature also refers to this as standardization and variance scaling. Whereas min-max scaling scales feature values to fit within designated minimum and maximum values, z-score scaling rescales features so that they share a mean of 0 and a standard deviation of 1. Z-score scaling is represented by the formula:

$$\tilde{x} = \frac{x - \mathrm{mean}(x)}{\sqrt{\mathrm{var}(x)}}$$

Here, the feature’s mean is subtracted from a given feature value (x), and the result is divided by the feature’s standard deviation (represented as sqrt(var(x))). Z-score scaling can be useful when implementing feature extraction methods like PCA and LDA, as these two methods require features to share the same scale.¹⁵
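A minimal sketch of z-score scaling follows, using Python's standard `statistics` module. The height values are hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption of this example.

```python
from statistics import fmean, pstdev


def z_score_scale(values):
    """Rescale values to have a mean of 0 and a standard deviation of 1."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(value - mu) / sigma for value in values]


# Hypothetical feature values.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
standardized = z_score_scale(heights_cm)

print(standardized)
```

Unlike min-max scaling, the result is not confined to a fixed range; values simply express how many standard deviations each point lies from the mean.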

Recent research

Automation. Automated feature engineering has been an ongoing area of research for a few decades.¹⁶ Python libraries such as "tsflex" and "featuretools" help automate feature extraction and transformation for time series data. Developers continue to release new packages and algorithms that automate feature engineering for linear regression models and for other data types, with the aim of increasing model accuracy.¹⁷ More recently, automated feature engineering has figured as one part of larger endeavors to build automated machine learning (AutoML) systems, which aim to make machine learning more accessible to non-experts.¹⁸

Deep learning. Feature engineering can be a laborious and time-consuming process, involving a significant amount of trial and error. Deep learning allows the user to specify a small set of basic features that the neural network architecture aggregates into higher-level features, also called representations.¹⁹ One such example is image processing and pattern recognition in computer vision, in which a model learns to identify semantically meaningful objects (e.g. cars, people, etc.) in terms of simple concepts (e.g. edges, contours, etc.) by concatenating feature maps.²⁰ Recent studies, however, have combined feature engineering with neural networks and other deep learning techniques for classification tasks, such as fraud detection, with promising results.²¹

Footnotes

¹ Alice Zheng and Amanda Casari, Feature Engineering for Machine Learning, O’Reilly, 2018. Sinan Ozdemir and Divya Susarla, Feature Engineering Made Easy, Packt, 2018.

² Yoav Goldberg, Neural Network Methods for Natural Language Processing, Springer, 2022.

³ Suhang Wang, Jiliang Tang, and Huan Liu, “Feature Selection,” Encyclopedia of Machine Learning and Data Mining, Springer, 2017.

⁴ Sinan Ozdemir, Feature Engineering Bookcamp, Manning Publications, 2022. Sinan Ozdemir and Divya Susarla, Feature Engineering Made Easy, Packt, 2018.

⁵ Max Kuhn and Kjell Johnson, Applied Predictive Modeling, Springer, 2016.

⁶ Alice Zheng and Amanda Casari, Feature Engineering for Machine Learning, O’Reilly, 2018.

⁷ Jiawei Han, Data Mining: Concepts and Techniques, 3rd edition, 2012.

⁸ Kevin Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012. Soledad Galli, Python Feature Engineering Cookbook, 2nd edition, Packt, 2022.

⁹ Max Kuhn and Kjell Johnson, Applied Predictive Modeling, Springer, 2016.

¹⁰ I.T. Jolliffe, Principal Component Analysis, Springer, 2002.

¹¹ Chris Albon, Machine Learning with Python Cookbook, O’Reilly, 2018.

¹² Alice Zheng and Amanda Casari, Feature Engineering for Machine Learning, O’Reilly, 2018.

¹³ Zahraa Abdallah, Lan Du, and Geoffrey Webb, “Data preparation,” Encyclopedia of Machine Learning and Data Mining, Springer, 2017.

¹⁴ Alice Zheng and Amanda Casari, Feature Engineering for Machine Learning, O’Reilly, 2018.

¹⁵ Zahraa Abdallah, Lan Du, and Geoffrey Webb, “Data preparation,” Encyclopedia of Machine Learning and Data Mining, Springer, 2017. Alice Zheng and Amanda Casari, Feature Engineering for Machine Learning, O’Reilly, 2018.

¹⁶ James Kanter and Kalyan Veeramachaneni, "Deep feature synthesis: Towards automating data science endeavors," IEEE International Conference on Data Science and Advanced Analytics, 2015, https://ieeexplore.ieee.org/document/7344858 (link resides outside of ibm.com).

¹⁷ Udayan Khurana, Deepak Turaga, Horst Samulowitz, and Srinivasan Parthasrathy, "Cognito: Automated Feature Engineering for Supervised Learning," IEEE 16th International Conference on Data Mining Workshops, 2016, pp. 1304-130, https://ieeexplore.ieee.org/abstract/document/7836821 (link resides outside of ibm.com). Franziska Horn, Robert Pack, and Michael Rieger, "The autofeat Python Library for Automated Feature Engineering and Selection," Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2019, pp. 111-120, https://link.springer.com/chapter/10.1007/978-3-030-43823-4_10 (link resides outside of ibm.com).

¹⁸ Ahmad Alsharef, Karan Aggarwal, Sonia, Manoj Kumar, and Ashutosh Mishra, "Review of ML and AutoML Solutions to Forecast Time‑Series Data," Archives of Computational Methods in Engineering, Vol. 29, 2022, pp. 5297–5311, https://link.springer.com/article/10.1007/s11831-022-09765-0 (link resides outside of ibm.com). Sjoerd Boeschoten, Cagatay Catal, Bedir Tekinerdogan, Arjen Lommen, and Marco Blokland, "The automation of the development of classification models and improvement of model quality using feature engineering techniques," Expert Systems with Applications, Vol. 213, 2023, https://www.sciencedirect.com/science/article/pii/S0957417422019303 (link resides outside of ibm.com). Shubhra Kanti Karmaker, Mahadi Hassan, Micah Smith, Lei Xu, Chengxiang Zhai, and Kalyan Veeramachaneni, "AutoML to Date and Beyond: Challenges and Opportunities," ACM Computing Surveys, Vol. 54, No. 8, 2022, pp. 1-36, https://dl.acm.org/doi/abs/10.1145/3470918 (link resides outside of ibm.com).

¹⁹ Yoav Goldberg, Neural Network Methods for Natural Language Processing, Springer, 2022.

²⁰ Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016, https://www.deeplearningbook.org/ (link resides outside of ibm.com).

²¹ Xinwei Zhang, Yaoci Han, Wei Xu, and Qili Wang, "HOBA: A novel feature engineering methodology for credit card fraud detection with a deep learning architecture," Information Sciences, Vol. 557, 2021, pp. 302-316, https://www.sciencedirect.com/science/article/abs/pii/S002002551930427X (link resides outside of ibm.com). Daniel Gibert, Jordi Planes, Carles Mateu, and Quan Le, "Fusing feature engineering and deep learning: A case study for malware classification," Expert Systems with Applications, Vol. 207, 2022, https://www.sciencedirect.com/science/article/pii/S0957417422011927 (link resides outside of ibm.com). Ebenezer Esenogho, Ibomoiye Domor Mienye, Theo Swart, Kehinde Aruleba, and George Obaido, "A Neural Network Ensemble With Feature Engineering for Improved Credit Card Fraud Detection," IEEE Access, Vol. 10, 2020, pp. 16400-16407, https://ieeexplore.ieee.org/abstract/document/9698195 (link resides outside of ibm.com).