AutoAI wins AIconics Intelligent Automation Award: Meet a key inventor
AutoAI, a powerful, automated AI development capability in IBM Watson Studio, won the Best Innovation in Intelligent Automation Award at the AIconics AI Summit in San Francisco. Chosen by a panel of 13 independent judges, the AIconics awards recognize breakthroughs in AI for business.
To share what went into the development of AutoAI and how it accelerates time to value for data science projects, I interviewed one of our principal inventors: Jean-Francois Puget, PhD, a distinguished engineer for machine learning and optimization at IBM and a two-time Kaggle Grandmaster.
What challenge led you to start developing AutoAI?
Jean-Francois Puget: As data scientists, our work is a mix of applying general-purpose recipes and creating domain-specific insights. The recipe portion involves repetitive and tedious tasks that beg to be automated. We always end up trying the same feature engineering tricks and running the same set of algorithms to test them. These are time-consuming steps, and they reduce a data scientist’s opportunity to do higher value work such as looking for domain-specific approaches.
The objective of AutoAI is to free data scientists from mundane tasks so that they start from a solid pipeline of models ranked on efficiency—and they can add more value by focusing on the specifics of the problem to be solved.
How does AutoAI work? How is AI “used” to build AI?
AutoAI starts with a dataset. The dataset is like a table, where each row represents a sample to learn from, and each column is a feature. For instance, the rows of data might represent individuals, and the columns could include gender, age, occupation, revenue or loan amount.
We designate a column as the target that we want to predict using data in the other columns. For instance, the target could be the approval of a loan application for each individual. Creating a machine learning model that predicts the target accurately from the other columns can be time-consuming and can require very experienced data scientists.
AutoAI simplifies the challenge. You just provide the dataset, and indicate which column is the target. Then AutoAI automatically performs data preparation, feature engineering, machine learning algorithm selection, and hyper-parameter optimization to find the best possible machine learning model. This series of steps is guided by an AI system that drives towards the most promising steps at each stage of the process. This is really AI that builds AI.
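To make the idea of an automated search concrete, here is a toy sketch in plain Python. It is not AutoAI's algorithm or API: the loan rows, the candidate names, and the helper functions are all hypothetical, and the "search" is simply fitting every candidate and ranking the results on held-out rows. It only illustrates the general pattern of trying several models, tuning a parameter, and returning a ranked list.

```python
from collections import Counter

# Hypothetical loan rows: ((age, income in thousands), approved).
train = [((25, 20), 0), ((30, 35), 0), ((45, 80), 1), ((50, 95), 1),
         ((35, 60), 1), ((28, 25), 0), ((60, 40), 0), ((40, 75), 1)]
valid = [((33, 30), 0), ((48, 90), 1), ((55, 45), 0), ((38, 70), 1)]

def accuracy(model, rows):
    """Fraction of rows where the model predicts the target correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)

def fit_majority(rows):
    """Baseline: always predict the most common target value."""
    label = Counter(y for _, y in rows).most_common(1)[0][0]
    return lambda x: label

def make_stump_trainer(feature):
    """Build a trainer for a one-feature threshold rule."""
    def fit(rows):
        # Tiny parameter search: try every observed value as the threshold
        # and keep the one with the best training accuracy.
        thresholds = sorted({x[feature] for x, _ in rows})
        best = max(thresholds,
                   key=lambda t: accuracy(lambda x: int(x[feature] >= t), rows))
        return lambda x: int(x[feature] >= best)
    return fit

# Candidate "pipelines" (hypothetical names), each a function that fits a model.
candidates = {
    "majority baseline": fit_majority,
    "age threshold": make_stump_trainer(0),
    "income threshold": make_stump_trainer(1),
}

# Fit each candidate on the training rows, score it on held-out rows,
# and rank the candidates from best to worst.
leaderboard = sorted(
    ((name, accuracy(fit(train), valid)) for name, fit in candidates.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, score in leaderboard:
    print(f"{name}: {score:.2f}")
```

A real system would search a far larger space and, as described above, would steer toward promising candidates rather than trying all of them exhaustively.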
What kind of feedback are you getting from the marketplace about AutoAI?
One data scientist wrote us that he felt as if it saved his job. He used AutoAI to run 80 experiments in about 10 hours over a weekend, and he was able to provide a global pharmaceutical company with insights on their financial expenses that Monday.
“I presented to a senior finance director, and she was very happy,” he wrote. “It would not have been possible without AutoAI. If I had to run all the possible models manually to narrow down the results and get a sound model, there simply would not have been enough time.”
In general, AutoAI is seen as a game changer. I think it is the next frontier in data science and machine learning.
Can you characterize the amount of time that AutoAI saves?
AutoAI builds a strong baseline model in minutes or hours, depending on the size of your dataset. I often say that in a couple of hours you get what a seasoned data scientist would produce in a few weeks. This does not mean that a seasoned data scientist cannot do better than AutoAI with more time. Rather, it means that data scientists have two extra weeks to refine and improve the models. To make this next stage even smoother, AutoAI can generate Python code that the data scientist can use to jump-start the work.
How long did it take you to develop AutoAI?
I did not develop this alone. I have been working with a small IBM Research team for a few years now. We combined our experience from similar past projects with my experience as a machine learning (ML) practitioner, which I gained from customer projects and machine learning competitions. We commercialized what we developed as part of Watson Studio.
What was the biggest challenge in AutoAI development? Roughly how many approaches did you test?
The main challenge is to avoid what is known as ‘overfitting’. Overfit models look good on the data you have (the data input to AutoAI), but they perform badly on new, unseen data. That is what makes machine learning an art. You want to create models that will perform well on data you haven’t seen.
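A small, self-contained Python sketch can make overfitting visible. The data, the noise, and the two toy models below are assumptions chosen for illustration; they say nothing about how AutoAI itself guards against overfitting. A nearest-neighbor "memorizer" reproduces every training label, including two deliberately wrong ones, while a single fitted threshold ignores the noise.

```python
# Training data: the true rule is "label = 1 when x >= 5", but the labels
# at x = 2 and x = 7 are flipped to simulate noise.
train = [(0, 0), (1, 0), (2, 1), (3, 0), (4, 0),
         (5, 1), (6, 1), (7, 0), (8, 1), (9, 1)]
# Unseen data, labelled with the true rule.
holdout = [(0.4, 0), (1.6, 0), (2.4, 0), (3.6, 0),
           (5.4, 1), (6.6, 1), (7.4, 1), (8.6, 1)]

def accuracy(model, rows):
    """Fraction of rows where the model predicts the label correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)

def memorizer(x):
    """1-nearest-neighbor: copy the label of the closest training point."""
    return min(train, key=lambda row: abs(row[0] - x))[1]

def fit_threshold(rows):
    """Pick the single cut-off with the best training accuracy."""
    candidates = sorted(px for px, _ in rows)
    best = max(candidates,
               key=lambda t: accuracy(lambda x: int(x >= t), rows))
    return lambda x: int(x >= best)

simple_rule = fit_threshold(train)

results = {
    "memorizer": (accuracy(memorizer, train), accuracy(memorizer, holdout)),
    "simple rule": (accuracy(simple_rule, train), accuracy(simple_rule, holdout)),
}

for name, (train_acc, holdout_acc) in results.items():
    print(f"{name}: train={train_acc:.2f} holdout={holdout_acc:.2f}")
```

The memorizer scores perfectly on the data it has seen and only 50 percent on the holdout rows, while the simpler rule scores lower on training data and higher on unseen data. That gap between the two scores is the signature of overfitting.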
How did you feel when you had an approach that worked?
We used AutoAI in a top Kaggle competition, and it finished in the top 10 percent, which is an amazing result. When I saw that, I told all my colleagues that we had something we could sell.
Please describe the Kaggle competition to someone who doesn’t know about it.
A Kaggle competition is a bit like an AutoAI challenge: Kaggle provides a dataset to participants and asks them to create models that predict a target from other data. Then it asks participants to use their model on another dataset, the test data, and submit their predictions to the Kaggle site. An automated scoring service evaluates how good the predictions are and ranks participants according to their score. There are additional bells and whistles to detect overfitting, like the private test dataset, but the main principle is just that: submit predictions on test data and get a score computed automatically.
There is no human in the loop for evaluating participant results. That’s what I like: numbers don’t lie. If your model is better, it is because it scores better on test data than other models.
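The scoring principle can be sketched in a few lines of Python. This is a simplified illustration, not Kaggle's actual scoring service: the participant names, labels, and the use of plain accuracy are assumptions, and real competitions each define their own metric. The hidden test labels are split into a public part, scored during the competition, and a private part, used for the final ranking.

```python
# Hidden test labels, known only to the (hypothetical) scoring service.
hidden_labels = [1, 0, 1, 1, 0, 0, 1, 0]
public_idx = [0, 1, 2, 3]    # scored during the competition
private_idx = [4, 5, 6, 7]   # scored only for the final ranking

# Hypothetical submissions: one prediction per test row.
submissions = {
    "alice": [1, 0, 1, 1, 0, 1, 1, 0],
    "bob":   [1, 0, 0, 1, 0, 0, 1, 0],
    "carol": [0, 0, 1, 1, 1, 0, 1, 1],
}

def score(predictions, indices):
    """Accuracy of the predictions on the given subset of test rows."""
    hits = sum(predictions[i] == hidden_labels[i] for i in indices)
    return hits / len(indices)

def leaderboard(indices):
    """Rank participants by score, best first, with no human judgment."""
    scored = [(name, score(preds, indices)) for name, preds in submissions.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

public_board = leaderboard(public_idx)
private_board = leaderboard(private_idx)

print("public: ", public_board)
print("private:", private_board)
```

In this made-up example the leader on the public rows is not the leader on the private rows, which is the kind of overfitting to visible test data that a private test set is meant to expose.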
That said, Kaggle competitions only assess your skills on data preparation, modeling, and model selection. Kaggle does not assess other data scientist skills like data collection, data cleaning, communicating with stakeholders to understand the problem they want you to solve, and communicating machine learning results back to them in ways they can relate to their business problem.
In short, Kaggle assesses only a portion of data scientist skills, but every good data scientist must master the skills it tests.
Kaggle is not the only platform hosting machine learning challenges, but it is by far the most popular, and the true leader. Some countries, especially China, have a number of Kaggle-like platforms, but they mostly involve in-country participants.
What is a Kaggle Grandmaster and what did you learn achieving that title twice?
Kaggle is a community of over three million. It is by far the largest data scientist community worldwide. There are only 153 competition grandmasters among the three million. The grandmasters are the very best, and they need to show sustained performance by winning a gold medal in five different competitions. I am one of these competition grandmasters and I have 10 gold medals, currently ranking me about 20th globally.
Kaggle has created two other grandmaster categories. One is for notebooks, that is, code that people can up-vote if they like it. The other is discussion grandmaster. When you post on Kaggle competition forums you can get up-votes or down-votes, and your most popular posts can earn points and medals. It just so happens that my answers are the most popular on Kaggle forums, making me the first, and at this point the top, discussion grandmaster. I am better at discussions, where I am first overall, than in competitions, where I am 20th overall.
What interested you about machine learning? Why did you choose it for a PhD?
I first became interested when I was playing Othello (also known as Reversi), a board game similar to chess or Go. I was actually quite good at Othello, and I even became team world champion one year. As a hobby, I started developing software that plays the game. This led me to read research papers on the topic, and I soon found that the first-ever use of machine learning was for an algorithm that learned how to play checkers. It was developed by an IBMer named Arthur Samuel in 1959. From there, I started reading about machine learning and was quickly hooked.
What is your advice to others about developing their machine learning skills?
Take some good courses and then practice, either on a customer or personal project, or on challenges such as Kaggle competitions. Kaggle competitions are a great learning resource, as you can learn from what others do, and also get direct feedback on your own performance.
How would you sum up IBM Watson Studio?
IBM Watson Studio and related solutions such as Watson Machine Learning and Watson OpenScale tackle AI lifecycle management. Watson Studio with AutoAI helps you ingest and prepare data, then create and evaluate models using state-of-the-art open source tools such as TensorFlow, PyTorch, and XGBoost. Watson Machine Learning provides the engine to run your models in production. And Watson OpenScale monitors the models you have in production to help you debias them, control drift in accuracy, and explain individual outcomes, enhancing compliance.
Learn more about AutoAI and try a guided AutoAI tutorial to see how it helps you build a binary classification model. And view a short video on how analyst firm ESG validated that Watson Studio and Watson Machine Learning can collect data and analyze insights at scale, speeding the monetization of AI.