Removing the hunch in data science with AI-based automated feature engineering


For data scientists, predictive modeling is the practice of predicting future outcomes using statistical models. Its applications in AI are growing and include diagnosing cancer, predicting hurricanes, and optimizing supply chains. However, the value of predictive modeling comes at a cost. Feature engineering, the cornerstone of successful predictive modeling, is also the most time-consuming step in the data science pipeline: it involves a great deal of trial and error and relies on data scientists’ hunches. Our team has developed a more efficient and effective way of doing it. We will present our work this week at the International Joint Conference on Artificial Intelligence (IJCAI).

Figure: The predictive modeling pipeline depends heavily on feature engineering.

“At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.” – Pedro Domingos, A few useful things to know about machine learning.

What is feature engineering?

Features are the observations or characteristics on which a model is built. For instance, a model for assessing the risk of heart disease in patients might be based on their height, weight, and age, among other features. However, using the available features directly is not always very effective. A more informative feature is often the ratio of weight to squared height, known as the Body Mass Index (BMI). The process of deriving a new, more abstract feature from the given data is broadly referred to as feature engineering. It is typically done by applying mathematical or statistical functions called transformations.
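As a minimal sketch of the BMI example, the snippet below derives a new feature from two existing ones. The patient records, field names, and the helper add_bmi are hypothetical and chosen only for illustration; they are not taken from our work.

```python
# Hypothetical patient records; field names and values are illustrative only.
patients = [
    {"height_m": 1.60, "weight_kg": 80.0, "age": 50},
    {"height_m": 1.80, "weight_kg": 72.0, "age": 35},
    {"height_m": 1.75, "weight_kg": 95.0, "age": 61},
]


def add_bmi(record):
    """Return a copy of the record with a derived BMI feature (kg / m^2)."""
    enriched = dict(record)
    enriched["bmi"] = record["weight_kg"] / record["height_m"] ** 2
    return enriched


# The engineered dataset keeps the raw features and adds the derived one.
engineered = [add_bmi(p) for p in patients]

for row in engineered:
    print(f"age={row['age']}  bmi={row['bmi']:.1f}")
```

Here the transformation is a simple ratio of two columns. A model that would otherwise have to discover the interaction between height and weight on its own receives it as a single, ready-made input.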

What makes effective feature engineering hard is that the number of transformations a data scientist could apply is practically unlimited. Moreover, working through those options by trial and error, applying each transformation and assessing its impact, is very time consuming and often infeasible to do thoroughly. At the same time, feature engineering is central to producing good models, which leaves a data scientist with a dilemma over how much time to devote to it. Learning Feature Engineering (LFE) presents a novel solution to this problem.

Figure: Feature engineering is done through transformations, which are essentially mathematical or statistical functions. For instance, the logarithm is a useful tool for dealing with skewed distributions.
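The logarithm example can be illustrated with a short sketch. The data below is made up, and the skewness helper is a plain textbook formula written for this illustration; neither comes from LFE itself.

```python
import math


def skewness(values):
    """Population skewness: third central moment divided by std cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed feature (for example, transaction amounts).
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]

# The logarithm is only defined for strictly positive values.
logged = [math.log(v) for v in raw]

print(f"skewness before: {skewness(raw):.2f}")
print(f"skewness after:  {skewness(logged):.2f}")
```

On data like this, the log transformation compresses the long right tail, so the transformed feature is noticeably less skewed than the original.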

What is LFE?

LFE is a new, radically different way of performing feature engineering. Instead of relying on trial and error, LFE automatically learns which transformations are effective on a feature in the context of a given problem. It does this by examining thousands of predictive learning problems and associating the efficacy of each transformation with the situations in which it was applied. When a new problem is presented, LFE draws on that experience to recommend the most effective feature choices almost instantly.

Typically, learning a phenomenon across multiple different datasets is considered impractical. Different datasets represent unrelated problems and contain varying numbers of data points, which makes it hard for machine learning algorithms to generalize insights from one dataset to another. LFE breaks this barrier thanks to a novel representation of feature vectors, called the Quantile Sketch Array (QSA), which is a canonical form that captures the essence of a feature’s distribution in a fixed-size array. LFE uses a multilayer perceptron over QSAs to learn which transformations are the most relevant for a target problem. Our results show that LFE is more accurate than the other approaches we compared it with and is also the fastest, delivering results in seconds rather than the hours or days typically expected.
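To make the idea of a fixed-size representation concrete, here is a simplified sketch. It is not the QSA as defined in the paper: the function name fixed_size_sketch, the min-max scaling, and the choice of ten equal-width bins are all assumptions made for illustration. It only shows how features of very different lengths can be mapped to arrays of the same size.

```python
def fixed_size_sketch(values, num_bins=10):
    """Summarize a feature of any length as num_bins normalized bin counts.

    Simplified stand-in for a distribution sketch: values are min-max scaled
    to [0, 1], dropped into equal-width bins, and counts are normalized so
    that the result sums to 1 regardless of how many values were given.
    """
    if not values:
        raise ValueError("values must not be empty")
    lo, hi = min(values), max(values)
    span = hi - lo
    counts = [0] * num_bins
    for v in values:
        # A constant feature has zero span; put all of its mass in bin 0.
        pos = (v - lo) / span if span > 0 else 0.0
        # The maximum value has pos == 1.0; clamp it into the last bin.
        idx = min(int(pos * num_bins), num_bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


# Two hypothetical features with very different numbers of data points.
small_feature = [1.0, 2.0, 3.0, 4.0]
large_feature = [float(i * i) for i in range(1000)]

small_sketch = fixed_size_sketch(small_feature)
large_sketch = fixed_size_sketch(large_feature)

print(len(small_sketch), len(large_sketch))
```

Because both sketches have the same length, they could be fed to the same fixed-input model, such as a multilayer perceptron, even though the underlying features come from datasets of different sizes.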

What is the significance of LFE?

Besides turning feature engineering from a costly manual process into automated, near-instant predictions, LFE has significance beyond its original application. The QSA demonstrated successful generalization of a complex task across datasets of widely varying size, shape, and content. It has thereby opened an interesting topic of research in the general area of learning to learn, also known as meta-learning. Finally, there is considerable recent interest in the automation of data science and predictive analytics, an industry expected to be worth $200 billion by 2020.

We will be integrating LFE into IBM’s Data Science Experience in the coming months, making it available to the thousands of users of IBM’s premier data science toolkit.

LFE started as the internship project of Fatemeh Nargesian from the University of Toronto with the Automated Machine Learning and Data Science team at IBM Research. The publication is:

Learning Feature Engineering for Classification by F. Nargesian, H. Samulowitz, U. Khurana, E. Khalil, and D. Turaga, in the Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI), 2017.



