I just won a gold medal for my 9th-place finish in the TrackML challenge hosted on Kaggle. That challenge was proposed by CERN (the European Organization for Nuclear Research). The problem was to reconstruct the trajectories of high-energy physics particles from the hits they leave in the detectors used at CERN. The data we were given was actually a simulation of forthcoming detectors at CERN. Here is how the challenge is described by CERN:
To explore what our universe is made of, scientists at CERN are colliding protons, essentially recreating mini big bangs, and meticulously observing these collisions with intricate silicon detectors.
While orchestrating the collisions and observations is already a massive scientific accomplishment, analyzing the enormous amounts of data produced from the experiments is becoming an overwhelming challenge.
Event rates have already reached hundreds of millions of collisions per second, meaning physicists must sift through tens of petabytes of data per year. And, as the resolution of detectors improves, ever better software is needed for real-time pre-processing and filtering of the most promising events, producing even more data.
To help address this problem, a team of Machine Learning experts and physics scientists working at CERN (the world's largest high energy physics laboratory) has partnered with Kaggle and prestigious sponsors to answer the question: can machine learning assist high energy physics in discovering and characterizing new particles?
Specifically, in this competition, you’re challenged to build an algorithm that quickly reconstructs particle tracks from 3D points left in the silicon detectors.
I used an unsupervised machine learning technique known as clustering (see my detailed writeup). The key was to preprocess the data so that the clustering algorithm could find the particle tracks more easily, which is very similar to feature engineering for supervised machine learning. The main issue was the computational resources my solution required: about 20 CPU-hours per event, and we needed to predict tracks for 125 events. This would not have been possible without parallelism on multicore machines. I used an IBM Power 9 machine with 40 cores in order to compute a submission in less than 3 days.
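To illustrate the idea (this is a toy sketch, not my actual features or parameters): the preprocessing maps each hit into a feature space where hits belonging to the same track land close together, after which a generic clustering algorithm such as DBSCAN can group them. Here the synthetic "tracks" are straight lines through the origin, so the azimuthal angle and the polar direction of each hit are constant along a track:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def preprocess(hits):
    """Map raw (x, y, z) hit coordinates into a feature space where
    hits from the same track are close together (toy version)."""
    x, y, z = hits[:, 0], hits[:, 1], hits[:, 2]
    r = np.sqrt(x**2 + y**2)
    phi = np.arctan2(y, x)  # azimuthal angle, constant along a straight track
    # cos/sin of phi avoid the -pi/+pi wrap-around; z over the distance
    # to the origin captures the polar direction of the track
    return np.column_stack([np.cos(phi), np.sin(phi), z / np.sqrt(r**2 + z**2)])

# two synthetic straight-line tracks, 20 hits each
t = np.linspace(1.0, 10.0, 20)
track1 = np.column_stack([t, 0.1 * t, 2.0 * t])
track2 = np.column_stack([-t, t, -t])
hits = np.vstack([track1, track2])

# in feature space each track collapses to a tight cluster
labels = DBSCAN(eps=0.05, min_samples=3).fit_predict(preprocess(hits))
```

Real detector hits follow helices in a magnetic field rather than straight lines, so the actual transformation is considerably more involved, but the principle is the same: good features make the clustering step almost trivial.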
Many participants also had significant running times; for instance, the second-place finisher says that some events took him up to 3 CPU-days. This makes the achievement of the first-place winner extremely impressive: not only did they achieve an amazing detection rate, but their code runs in 8 minutes per event! I'm not in the same league as them.
Anyway, I'm still happy with my result. And the cherry on the cake: this gold medal earned me the Kaggle Competitions Grandmaster title. I'm the second Kaggler to become a Grandmaster in two categories.