Given that a model is only as good as the data on which it is based, data scientists spend a large portion of their time on data preparation and feature creation in order to build high-quality models. Depending on the complexity of the raw data and the desired predictive model, feature engineering can require much trial and error.
A handful of sources and online tutorials break feature engineering down into discrete steps, the number and names of which typically vary. These steps can include feature understanding, structuring or construction, transformation, evaluation, optimization and so on.4 While such stratification can be useful for providing a general overview of the tasks involved in feature engineering, it suggests that feature engineering is a linear process. In fact, feature engineering is an iterative process.
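The iterative character of these steps can be sketched in code. The following is a minimal, hypothetical illustration (every function name and threshold here is a placeholder, not a prescribed workflow): candidate features are constructed, transformed and evaluated in a loop, with evaluation feeding back into further rounds rather than concluding a one-pass pipeline.

```python
# Illustrative sketch: feature engineering as an iterative loop of
# construction, transformation and evaluation, not a linear sequence.
# All names (construct, transform, evaluate, engineer) are hypothetical.

def construct(raw):
    """Feature construction: derive a candidate feature from raw columns,
    here a simple ratio of two raw measurements."""
    return [row["a"] / row["b"] for row in raw]

def transform(feature):
    """Feature transformation: min-max scale the candidate feature to [0, 1]."""
    lo, hi = min(feature), max(feature)
    return [(v - lo) / (hi - lo) for v in feature]

def evaluate(feature, target):
    """Feature evaluation: absolute Pearson correlation with the target,
    standing in for a full model-validation score."""
    n = len(feature)
    mx, my = sum(feature) / n, sum(target) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(feature, target))
    sx = sum((x - mx) ** 2 for x in feature) ** 0.5
    sy = sum((y - my) ** 2 for y in target) ** 0.5
    return abs(cov / (sx * sy)) if sx and sy else 0.0

def engineer(raw, target, threshold=0.5, max_rounds=3):
    """Iterate construct -> transform -> evaluate until the candidate
    feature scores well enough or the iteration budget runs out; in
    practice a poor score sends the data scientist back to construction."""
    feature = construct(raw)
    for _ in range(max_rounds):
        feature = transform(feature)
        if evaluate(feature, target) >= threshold:
            break  # feature deemed useful for this round
    return feature
```

In a real project the loop closes over the whole process: a weak evaluation score typically prompts revisiting construction or transformation choices, which is exactly what makes the process iterative rather than linear.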
Feature engineering is context-dependent. It requires substantial data analysis and domain knowledge. This is because effective encoding for features can be determined by the type of model used, the relationship between predictors and output, as well as the problem a model is intended to address.5 This is compounded by the fact that different kinds of datasets—for example text versus images—may be better suited to different feature engineering techniques.6 Thus, it can be difficult to make specific recommendations on how best to implement feature engineering for a given machine learning algorithm.
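A small, hypothetical example of this context-dependence: the same categorical predictor can call for different encodings depending on the downstream model. Linear models generally need one-hot (dummy) columns so that categories are not treated as ordered magnitudes, while tree-based models can often consume a single integer-coded column. The helper names below are illustrative, not a library API.

```python
# Two encodings of the same categorical column; which one is "effective"
# depends on the model that will consume it. Function names are illustrative.

def one_hot(values):
    """One-hot encode: one binary column per category.
    Typically preferred for linear models, which would otherwise
    interpret integer codes as ordered quantities."""
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values]

def ordinal(values):
    """Integer-code: a single column of category indices.
    Often sufficient for tree-based models, which split on thresholds
    and do not assume a linear relationship with the code."""
    index = {c: i for i, c in enumerate(sorted(set(values)))}
    return [index[v] for v in values]

colors = ["red", "green", "red", "blue"]
```

Neither encoding is universally better; the choice follows from the model and the predictor-output relationship, which is why feature engineering resists one-size-fits-all advice.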