
Data Science Is Not Dead
Is data science really dead? One can wonder after reading Jeroen ter Heerdt's Data Science is dead. If you haven't read it, you probably should. Jeroen's point is that many business use cases for data science are now served by cloud services that are very simple to use for a non-data scientist. This is true. Lots of companies now provide APIs that rely on machine learning models internally. IBM is no exception with its Watson Developer Cloud APIs. ... [More]
|
Kaggle Master
Do you have spare time on evenings and weekends? Here is a great way to use it: enter machine learning competitions. That's what I have been doing for a year, as often as I can. The latest competition I entered, the Quora competition on Kaggle, went quite well for me: my team finished with a gold medal, ranking 12th among more than 3,300 teams. Here is how Quora describes the problem: Over 100 million people visit Quora every month, so it's no surprise that many people ask similarly... [More]
|
RuleML Keynote
My colleague Eric Mazeran gave a keynote, ML, Optimization and Rules: time for agility and convergences, at the RuleML conference. I co-authored the material with him, and the slides can be found here. It was well received, as it gives a global view of how these three technologies can be used together. I'd like to comment on one slide of his presentation (slide 15 if you download the deck). Here is a slightly modified version of it: It captures what I believe is the ideal data science project. It starts with... [More]
|
What Is Artificial Intelligence?
Here is a question I was asked to discuss at a conference last month: what is Artificial Intelligence (AI)? Instead of trying to answer it, which could take days, I decided to focus on how AI has been defined over the years. Nowadays, most people probably equate AI with deep learning. This has not always been the case, as we shall see. Most people say that AI was first defined as a research field at a 1956 workshop at Dartmouth College. In reality, it had been defined six years earlier, by Alan Turing in 1950. Let... [More]
|
Just label data!
Machine learning and deep learning are very promising technologies. Every week brings new hyped successes. Yet when it comes to applying machine learning and deep learning, many people keep making the same mistakes. Here is one that is particularly troublesome: people often miss that you need to provide examples to learn from. They expect systems to learn from raw data without any supervision or feedback. I can't blame them, as many proponents of machine learning, deep learning, or artificial... [More]
|
Fast Computation of AUC-ROC score
Area under the ROC curve (AUC-ROC) is one of the most common evaluation metrics for binary classification problems. We show here a simple and very efficient way to compute it with Python. Before showing the code, let's briefly describe what an evaluation metric is, and what AUC-ROC is in particular. An evaluation metric is a way to assess how good a machine learning model is. It is used to compute one or more numbers that summarize how the machine learning model's predictions compare to reality. In order to use... [More]
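As a rough illustration of the metric described in this excerpt, here is a minimal sketch that computes AUC-ROC from ranks, using the Mann-Whitney U identity. It is not necessarily the method used in the full post; the function name `auc_roc` and its arguments are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def auc_roc(y_true, y_score):
    """AUC-ROC via the rank-sum (Mann-Whitney U) identity (illustrative sketch)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    n_pos = int(y_true.sum())
    n_neg = y_true.size - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC-ROC needs at least one positive and one negative label")
    # Average 1-based ranks, so that tied scores count as half a win.
    _, inverse, counts = np.unique(y_score, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)              # rank of the last member of each tie group
    avg_rank = last_rank - (counts - 1) / 2.0  # mean rank within each tie group
    ranks = avg_rank[inverse]
    # Rank sum of the positives, minus its smallest possible value, normalized.
    u = ranks[y_true].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)
```

The result is the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, which is why a sort-based computation is enough and no explicit curve needs to be built.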
|
2nd Prize Winning Solution to Web Traffic Forecasting competition on Kaggle
I'm very proud to have finished 2nd in the latest Kaggle competition, organized by Google Research. Pardon my team name, but the joke was too tempting given this was a Web Traffic Forecasting competition. The competition was about predicting the number of visits to Wikipedia pages. Here is a short description of the competition, from the Kaggle site. The training dataset consists of approximately 145k time series. Each of these time series represent a number of daily views of a different Wikipedia article,... [More]
|
Python List Comprehensions
Do you write loops over array indices in Python? This would not be surprising if you have a strong background in languages such as C, C++, or Java. The issue is that loops over array indices really hurt performance in Python unless you compile your code with Cython or Numba. Alright, you have heard this before: one should not loop over array indices in Python. What can you use then? Depending on the task to be performed, Python offers a variety of approaches. I would like to describe one that is underused... [More]
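Going by the post title, the approach in question is the list comprehension. Here is a small sketch contrasting an index-based loop with the equivalent comprehension; the data and variable names are made up for illustration and are not taken from the full post.

```python
values = [1.5, -2.0, 3.0, -0.5]

# Index-based loop, in the style one would write in C or Java.
squares_loop = []
for i in range(len(values)):
    if values[i] > 0:
        squares_loop.append(values[i] ** 2)

# Equivalent list comprehension: iterate over the items directly,
# with no index bookkeeping.
squares_comp = [v ** 2 for v in values if v > 0]
```

Both versions build the same list; the comprehension states the filter and the transformation in a single expression.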
|
Gold Medal Winning Solution to Sales Forecasting Kaggle Competition
I had the pleasure of teaming up with Kaggle grandmaster Giba, aka Gilberto Titericz Junior, currently ranked 1st on Kaggle. We teamed up for a sales forecasting competition, namely the Corporación Favorita competition. Corporación Favorita is a retailer from Ecuador. The problem was to forecast sales for all stores and a large selection of products for the next 16 days. We were given past sales figures, as well as additional data on stores, products, and holidays in... [More]
|
Implementing libFM in Keras
I just won a gold medal in the Talking Data competition on Kaggle, finishing 6th. My approach and solution are described here. The part that triggered the most interest from readers is where I used matrix factorization techniques to generate additional features. I'll explain it here. Before that, let me briefly explain what this competition was about. Here is how the problem is described on the Kaggle site: you’re challenged to build an algorithm that predicts whether a user will download an app... [More]
|