Data scientists are among the most sought-after professionals in the IT industry. In 2012, Harvard Business Review called data scientist “the sexiest job of the 21st century.” These professionals turn raw data into actionable insights that help drive business value or, sometimes, even disrupt industries or create entirely new ones. Surprisingly, however, data scientists spend the majority of their time on low-level tasks such as collecting, cleaning and organizing data. According to Forbes, such tasks take up to 70 percent of the time in a typical data science project.
Figure 1: Typical workflow in a Data Science project
Figure 1 shows the typical workflow of a Data Science project. The initial step is the formulation of the business problem to be solved. The next two steps involve acquiring, cleaning and curating relevant data. Feature engineering transforms raw data into numerical or categorical values (so-called features) that can be used as inputs for machine learning models. The machine learning models themselves are selected and fine-tuned in the last step.
The feature engineering step is particularly time-consuming and tedious. In our experience, it can take days or even weeks, even in short-term projects. Often the raw data are spread across many tables in a relational database and need to be joined and aggregated in different ways. Once a rich set of features is available, a variety of methods exist to select the best subset of features and optimize the machine learning models accordingly, so feature engineering remains the main manual bottleneck. Hence, if we could automate the feature engineering process (to a large extent, at least), this would dramatically speed up the creation of machine learning models on new data sets and in new application domains.
Cognitive Data Science Team (L-R): Francesco Vigliaturo, Thanh Lam Hoang, Ambrish Rawat, Francesco Fusco, Valentina Zantedeschi, Maria-Irina Nicolae, Minh Tran, Vincent Lonij and Mathieu Sinn.
A team of IBM researchers in Ireland has completed the first phase of a project that helps automate the feature engineering step at the push of a button. Called the “One Button Machine” project, it computes aggregate features that can be used as input for machine learning models.
The team has successfully applied the One Button Machine in various data science competitions, where it outperformed most human teams and ranked among the top 16 to 24 percent of participants. In a client project with a social service provider from the U.S., it helped improve the accuracy of a complex classification task (involving a database with more than 20 tables) from 57 percent to 64 percent. One Button Machine produced these results within a few hours of effort. Engineering the features manually would have taken days or even weeks to reach the same level of accuracy.
One Button Machine works by traversing the graph defined by the entities (tables) and relations (primary/foreign keys) of a relational database. The aggregation functions used to compute features can be specified by the user or chosen generically for certain data types. To deal with the combinatorial explosion of related entities, the One Button Machine uses heuristics and sub-sampling strategies. It scales to large databases through dynamic caching of intermediate results and a parallelizable implementation in Apache Spark, a distributed computing framework for analyzing massive amounts of data.
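To make the traversal idea concrete, here is a minimal sketch in plain Python. It is not the One Button Machine's implementation: the toy tables, the `relations` list, the `numeric_columns` mapping and all function names are hypothetical, and the heuristics, sub-sampling, caching and Spark parallelism described above are left out. It only assumes one-to-many relations that are followed from a root table down to its child tables, with a depth limit standing in for the real pruning strategies.

```python
from statistics import mean

# Hypothetical toy database: each table is a list of row dicts.
tables = {
    "customers": [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 3}],
    "orders": [
        {"order_id": 10, "customer_id": 1, "amount": 20.0},
        {"order_id": 11, "customer_id": 1, "amount": 30.0},
        {"order_id": 12, "customer_id": 2, "amount": 5.0},
    ],
    "items": [
        {"item_id": 100, "order_id": 10, "price": 12.0},
        {"item_id": 101, "order_id": 10, "price": 8.0},
        {"item_id": 102, "order_id": 12, "price": 5.0},
    ],
}

# Edges of the entity graph: (parent table, parent key, child table, foreign key).
relations = [
    ("customers", "customer_id", "orders", "customer_id"),
    ("orders", "order_id", "items", "order_id"),
]

# User-specified columns to aggregate, and the aggregation functions to apply.
numeric_columns = {"orders": ["amount"], "items": ["price"]}
AGGREGATES = {"sum": sum, "mean": mean, "max": max}


def child_paths(table, max_depth):
    """Enumerate join paths that start at `table`, up to `max_depth` relations long."""
    paths, frontier = [], [(table, [])]
    for _ in range(max_depth):
        next_frontier = []
        for current, path in frontier:
            for rel in relations:
                if rel[0] == current:
                    paths.append(path + [rel])
                    next_frontier.append((rel[2], path + [rel]))
        frontier = next_frontier
    return paths


def rows_along_path(root_row, path):
    """Follow the relations in `path` and return the rows reached in the last table."""
    rows = [root_row]
    for _parent, parent_key, child, foreign_key in path:
        keys = {row[parent_key] for row in rows}
        rows = [row for row in tables[child] if row[foreign_key] in keys]
    return rows


def aggregate_features(root_table, root_key, max_depth=2):
    """Compute one feature dict per root row, keyed by the root row's key value."""
    paths = child_paths(root_table, max_depth)
    features = {}
    for root_row in tables[root_table]:
        feats = {}
        for path in paths:
            rows = rows_along_path(root_row, path)
            name = "_".join(rel[2] for rel in path)
            feats[f"count_{name}"] = len(rows)
            for column in numeric_columns.get(path[-1][2], []):
                values = [row[column] for row in rows]
                for agg_name, agg in AGGREGATES.items():
                    # No related rows: leave the feature undefined.
                    feats[f"{agg_name}_{name}_{column}"] = agg(values) if values else None
        features[root_row[root_key]] = feats
    return features


if __name__ == "__main__":
    for key, feats in aggregate_features("customers", "customer_id").items():
        print(key, feats)
```

Each join path yields one group of features, for example `sum_orders_amount` or `mean_orders_items_price`, so the number of features grows quickly with the number of tables and the path length. That growth is the combinatorial explosion mentioned above; in this sketch only the `max_depth` argument keeps it in check.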
The team is now working on improving feature detection for unstructured data, as well as on integrating the One Button Machine with algorithms for feature selection and for optimizing the machine learning models.
Our vision for the future is to build cognitive agents that serve as autonomous assistants for data scientists. Such agents will take over the most tedious and time-consuming tasks. A key capability of those agents will be to understand and reason about the application domain, and to automatically detect, diagnose and resolve inconsistencies in the data. If the agents encounter inconsistencies that cannot be resolved, they will ask data scientists for specific feedback and update their domain knowledge accordingly. This will give data scientists more time to think about actual business challenges, develop creative solutions, and communicate actionable insights to stakeholders at the right time.