2nd Prize Winning Solution to Web Traffic Forecasting competition on Kaggle
JeanFrancoisPuget
I'm very proud to have finished 2nd in the latest Kaggle competition, organized by Google Research. Pardon my team name, but the joke was too tempting given this was a Web Traffic Forecasting competition.
The competition was about predicting the number of visits for Wikipedia pages. Here is a short description of the competition, from the Kaggle site.
The training dataset consists of approximately 145k time series. Each of these time series represent a number of daily views of a different Wikipedia article, starting from July, 1st, 2015 up until December 31st, 2016. The leaderboard during the training stage is based on traffic from January, 1st, 2017 up until March 1st, 2017.
The second stage will use training data up until September 1st, 2017. The final ranking of the competition will be based on predictions of daily views between September 13th, 2017 and November 13th, 2017 for each article in the dataset. You will submit your forecasts for these dates by September 12th.
For each time series, you are provided the name of the article as well as the type of traffic that this time series represent (all, mobile, desktop, spider). You may use this metadata and any other publicly available data to make predictions. Unfortunately, the data source for this dataset does not distinguish between traffic values of zero and missing values. A missing value may mean the traffic was zero or that the data is not available for that day.
To reduce the submission file size, each page and date combination has been given a shorter Id. The mapping between page names and the submission Id is given in the key files.
My solution is a combination of deep learning and XGBoost. I described it here. What made this competition quite challenging is that some participants quickly found that the median of the visits over the previous 7 weeks was a very good predictor of future visits. Doing better than that was hard, and only 50 participants or so out of 1,000 managed to beat the median benchmark.
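To make the median benchmark concrete, here is a minimal sketch in Python. It is an illustration, not the code used in the competition. The function name `median_baseline`, the `horizon` parameter, and the choice to encode missing days as NaN are assumptions made for the example; only the 7-week (49-day) window comes from the text above.

```python
import numpy as np

def median_baseline(views, window_days=49, horizon=60):
    """Forecast a flat line equal to the median of the most recent days.

    views: 1-D sequence of daily view counts for one page, oldest first.
           Missing days are assumed to be encoded as NaN (an assumption).
    window_days: how many trailing days to use (49 days = 7 weeks).
    horizon: number of future days to forecast (hypothetical parameter).
    """
    views = np.asarray(views, dtype=float)
    recent = views[-window_days:]
    if recent.size == 0 or np.all(np.isnan(recent)):
        # No usable history in the window: fall back to zero visits.
        level = 0.0
    else:
        # nanmedian ignores the missing days inside the window.
        level = float(np.nanmedian(recent))
    # The same value is predicted for every future day.
    return np.full(horizon, level)

# Example: a page whose traffic grows by one visit per day.
history = np.arange(100)
forecast = median_baseline(history, horizon=3)
```

The median is robust to the short traffic spikes that are common on Wikipedia pages, which is one plausible reason such a simple forecast is hard to beat.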
Another difficulty was the metric used to evaluate submissions. It is the SMAPE metric, which is a rather weird metric. What makes it weird is that it is discontinuous at 0, and it is non-convex. I discuss its weirdness at length in this notebook.
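Both properties are easy to check numerically. The sketch below assumes the usual definition of SMAPE, the mean of 2|F - A| / (|A| + |F|) expressed as a percentage, with the convention that a term counts as 0 when the actual and forecast values are both 0. The function name `smape` and the sample values are chosen for illustration only.

```python
import numpy as np

def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent (0 to 200)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    denom = np.abs(actual) + np.abs(forecast)
    diff = 2.0 * np.abs(forecast - actual)
    # Assumed convention: a term is 0 when actual and forecast are both 0.
    terms = np.divide(diff, denom, out=np.zeros_like(denom), where=denom != 0)
    return 100.0 * terms.mean()

# Discontinuity at 0: with an actual value of 0, a forecast of exactly 0
# scores 0, but any positive forecast, however tiny, scores the maximum.
at_zero = smape([0.0], [0.0])
near_zero = smape([0.0], [1e-9])

# Non-convexity: with an actual value of 1, the error at the midpoint of
# two forecasts (1 and 9) exceeds the average of the errors at those forecasts.
mid_error = smape([1.0], [5.0])
avg_error = 0.5 * (smape([1.0], [1.0]) + smape([1.0], [9.0]))
```

Here `at_zero` is 0 while `near_zero` is 200, and `mid_error` is larger than `avg_error`, which a convex function would not allow.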
A third difficulty was that we had only about 24 hours between the release of the final training data and the competition deadline. It meant that we had to do all our data processing, all our model training, and all prediction computation within a short span of time, and had to share that time with mundane activities like sleeping, eating, a day job, etc. I carefully planned for that, and only had to run predefined code. My fear was introducing a last-minute bug during that final 24-hour period. Thankfully I didn't, but other strong competitors lost their chance to win a prize because of last-minute bugs.
I learned a lot during this competition. In particular, it was the first time I used deep learning seriously. I will share my code on GitHub very soon, but don't expect very sophisticated deep learning models in it. My model is rather clumsy and full of beginner shortcomings. I compensated with feature engineering and a carefully designed cross-validation setting.
Edited on Nov 17, 2017. The code for this solution is available on GitHub.