JeanFrancoisPuget
Do you have spare time on evenings and weekends? Here is a great way to use it: enter machine learning competitions. That's what I have been doing for a year, as often as I can.
Over 100 million people visit Quora every month, so it's no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question. Quora values canonical questions because they provide a better experience to active seekers and writers, and offer more value to both of these groups in the long term.
Currently, Quora uses a Random Forest model to identify duplicate questions. In this competition, Kagglers are challenged to tackle this natural language processing problem by applying advanced techniques to classify whether question pairs are duplicates or not. Doing so will make it easier to find high quality answers to questions resulting in an improved experience for Quora writers, seekers, and readers.
The problem was to predict, given a pair of questions, if the two questions are about the same topic or not.
We were given approximately 400,000 question pairs to train our models. For each pair we knew the target value: 1 if the two questions are duplicates, 0 if they are not. The test set was made of about 2,400,000 question pairs. We needed to build models that predict which of these 2,400,000 pairs are duplicates.
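To make the task concrete, here is a minimal sketch of what a labeled pair and a prediction look like. The word-overlap rule, the function names, and the 0.5 threshold are all assumptions made for illustration, and the example questions are made up; this is a naive baseline, not the approach my team used.

```python
import re


def tokenize(question):
    """Lowercase a question and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", question.lower()))


def jaccard_overlap(q1, q2):
    """Share of distinct words the two questions have in common (0.0 to 1.0)."""
    a, b = tokenize(q1), tokenize(q2)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def predict_duplicate(q1, q2, threshold=0.5):
    """Return 1 (duplicate) when word overlap reaches the assumed threshold, else 0."""
    return 1 if jaccard_overlap(q1, q2) >= threshold else 0


# Made-up training rows in the (question1, question2, target) shape described above.
pairs = [
    ("How do I learn Python?", "How do I learn Python quickly?", 1),
    ("How do I learn Python?", "What is the capital of France?", 0),
]

predictions = [predict_duplicate(q1, q2) for q1, q2, _ in pairs]
targets = [target for _, _, target in pairs]
```

A rule like this only looks at shared words, so it will miss duplicates that are phrased differently and flag non-duplicates that happen to share vocabulary. That is exactly why the problem calls for real NLP models.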
Competition duration was 3 months. I entered that competition to learn about natural language processing (NLP), a domain entirely unknown to me at the start of the competition.
In order to get to a good result I teamed up with 3 other great data scientists, with varying skil
That competition was one of the most popular ever on Kaggle. It was also a tough one, with the top 9 Kaggle competitors engaged. We had to wait until the very last day, when we made significant progress, to get into the top 16 teams and earn a gold medal. As a result I got promoted to Kagg
This alone is not worth a blog entry but I'm sharing it because Kaggle enforces a rather sound machine learning methodology, similar to what I described in Be Brave In Machine Learning.
Let me expand a bit on it. For each competition, we are given a training data set, i.e. a set of examples, and for each training example a target value. This is typical of supervised machine learning. A second data set is provided, called the test set. The test set is a set of examples similar to the training set, except that the target is kept hidden from participants. The goal of competitors is to create machine learning models and use these models to predict the target on the test set.
The test set is split into a public test set and a private test set. The split isn't disclosed to participants. When one submits predictions for the test set, the public test set target is used as the benchmark. The participant gets feedback on how his/her predictions fare compared to the true target on the public test set. This can be used to select among various models. Competitors need to select 2 of their models before the competition ends.
After the competition ends, the private test set is used for the final evaluation of the selected models. At no point do participants get any feedback on how their models behave on the private test set.
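The mechanics can be sketched in a few lines of Python. This is a toy simulation under stated assumptions, not Kaggle's actual implementation: the `Leaderboard` class, the 30% public fraction, and the use of accuracy as the metric are all hypothetical choices made for illustration.

```python
import random


def accuracy(y_true, y_pred, indices):
    """Fraction of the rows in `indices` where the prediction matches the target."""
    hits = sum(1 for i in indices if y_true[i] == y_pred[i])
    return hits / len(indices)


class Leaderboard:
    """Toy leaderboard: one hidden target vector, split into public and private rows."""

    def __init__(self, y_true, public_fraction=0.3, seed=0):
        self._y_true = y_true
        indices = list(range(len(y_true)))
        random.Random(seed).shuffle(indices)  # the split is hidden from participants
        cut = round(len(y_true) * public_fraction)
        self._public, self._private = indices[:cut], indices[cut:]
        self._submissions = {}

    def submit(self, name, y_pred):
        """Store a submission and return feedback computed on the public rows only."""
        self._submissions[name] = y_pred
        return accuracy(self._y_true, y_pred, self._public)

    def final_scores(self, selected):
        """After the deadline: score the selected submissions on the private rows."""
        return {
            name: accuracy(self._y_true, self._submissions[name], self._private)
            for name in selected
        }


rng = random.Random(42)
y_true = [rng.randint(0, 1) for _ in range(1000)]  # hidden test targets

board = Leaderboard(y_true)
public_score = board.submit("perfect", list(y_true))
board.submit("inverted", [1 - y for y in y_true])
final = board.final_scores(["perfect", "inverted"])
```

The point of the design is that `submit` never touches the private rows, so nothing a participant sees during the competition can leak information about the final evaluation.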
This process is very close to recommended ML practice. See for instance Ele
If we are in a data-rich situation, the best approach for both problems is to randomly divide the dataset into three parts: a training set, a validation set, and a test set. The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model. Ideally, the test set should be kept in a “vault,” and be brought out only at the end of the data analysis. Suppose instead that we use the test-set repeatedly, choosing the model with smallest test-set error. Then the test set error of the final chosen model will underestimate the true test error, sometimes substantially.
The Kaggle public test set plays the role of the validation set, while the Kaggle private test set plays the role of the test set. Entering one of their competitions (or competitions hosted by other sites) is a good way to practice the right machine learning methodology.
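The warning about reusing one set for selection can be seen in a small simulation. This is a sketch under assumptions of my own choosing: the targets are coin flips, the 200 candidate "models" are pure random guessers (so the training rows go unused here), and the split sizes are arbitrary. Since no candidate has any real skill, any score well above 0.5 is an artifact of selection.

```python
import random


def three_way_split(n_rows, n_train, n_valid, seed=0):
    """Shuffle row indices and cut them into train, validation, and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test


def accuracy(y_true, y_pred, indices):
    """Fraction of the rows in `indices` where the prediction matches the target."""
    return sum(1 for i in indices if y_true[i] == y_pred[i]) / len(indices)


rng = random.Random(1)
n_rows = 2000
y_true = [rng.randint(0, 1) for _ in range(n_rows)]

# 200 candidates that guess at random: none is truly better than 50% accurate.
candidates = [[rng.randint(0, 1) for _ in range(n_rows)] for _ in range(200)]

train_idx, valid_idx, test_idx = three_way_split(n_rows, n_train=1200, n_valid=100, seed=1)

# Model selection: keep the candidate with the best validation score.
best = max(candidates, key=lambda pred: accuracy(y_true, pred, valid_idx))
valid_score = accuracy(y_true, best, valid_idx)  # optimistic: used for selection
test_score = accuracy(y_true, best, test_idx)    # honest: never used for selection

print(f"validation accuracy of selected model: {valid_score:.3f}")
print(f"test accuracy of selected model:       {test_score:.3f}")
```

The selected model's validation score looks better than chance only because it was the luckiest of 200 tries, while its score on the untouched test rows should stay close to 0.5. The same effect applies to a public leaderboard when one submits many times and keeps the best score.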
All in all, this competition has been a great experience. I learned a lot about NLP, and I learned a lot about teaming up for Kaggle. The outcome was great. It was also very time consuming, eating most of my evenings and weekends during the last month. I thank my wife, who has been extremely patient and supportive during that period. Last but not least, now that it is over I have more time to blog again.