Predicting Cyclist Speed
I have been the 'data scientist' on the IBM team that helped Dave Haase run the Race Across America (RAAM) this year. This project exemplified quite a few of the classic data science tips documented in The
If you fully agree with tip #7, then this post is not for you and you can move on. For those interested in how we made our predictions, here is the story. It illustrates some of the above tips.
RAAM is a 3,000+ mile cycling race from Oceanside, CA, to Annapolis, MD. Total elevation gain is more than 110,000 feet. There are no planned stops; racers stop when they want. The first to cross the finish line wins. In practice, racers sleep about 2 hours a day and race the rest of the time. This is completely insane. Yet people like Dave do it.
[Figure: RAAM route. Source: RAAM]
[Figure: RAAM elevation profile. Source: RAAM]
I will focus here on one of the things we did for Dave, namely predicting the speed at which he moves along the race route. At first sight this seemed quite simple. As explained in Mode
Indeed, predicting the trajectory is straightforward when the power used by the cyclist is known. It can be achieved with these steps:
As straightforward as it seems, we made a number of assumptions (in brackets). Let us evaluate these.
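The heart of such a model is a power-balance equation: the rider's power must cover rolling resistance, gravity, and aerodynamic drag. Here is a minimal sketch of solving that equation for speed; all rider parameters (mass, rolling resistance, drag area) are illustrative guesses, not Dave's actual numbers, and steep descents (where braking matters) are out of scope:

```python
def speed_from_power(power_w, grade, mass_kg=80.0, crr=0.004,
                     cda=0.32, rho=1.225, g=9.81):
    """Solve the power-balance equation for speed (m/s):

        P = m*g*(crr + grade)*v + 0.5*rho*cda*v^3

    i.e. rolling resistance + gravity + aerodynamic drag.
    Solved by bisection; parameter values are illustrative only.
    """
    def demand(v):
        # Power needed to hold speed v on a road with the given grade.
        return mass_kg * g * (crr + grade) * v + 0.5 * rho * cda * v ** 3

    lo, hi = 0.0, 30.0  # search bracket: 0 to 108 km/h
    for _ in range(60):
        mid = (lo + hi) / 2
        if demand(mid) < power_w:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Given power and grade at each point of the route, one can integrate the resulting speed over the route to get a predicted trajectory.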
As a result of these assumptions, our model was quite wrong, as in tip #10. The interesting question was whether it was nonetheless useful. There was only one way to know: run the model with real data and compare its predictions to the way Dave actually moved along the route. This is what we did. As soon as the race started we ran the model and stored the times at which Dave was predicted to pass the time stations of the race. That was on a Monday evening for me. When I woke up the next day, I compared the predictions with actuals. They were different. Significantly. I did expect some discrepancy, as we had not calibrated the physics model.

This is where additional data came in. We had access to a great data source: the values reported by Dave's Garmin device were stored, one record per second. I remember how thrilled I was when I got the first dataset, covering the first 5h30 of the race. It was a Tuesday morning, and the race had started the evening before. I needed to hurry a bit, as I had to start making predictions for Dave that same day. All I had were the time series for speed, power, latitude, longitude, and elevation. Each series had about 20,000 values, one per second.

The physics model is explained in detail elsewhere. On a flat road at constant speed, the power W and the speed S are related by

W = Cx * S^3

where Cx is the air-drag coefficient. If you know W and S on a flat road section, then you get Cx easily. In order to compute Cx I looked for a flat section of the road. That was not as simple as it seemed. The issue was, as always, with the data. There were a number of outliers in the position data, either because some value was missing or because Dave had stopped at that location. In the latter case, we got a series of points with zero speed and constant elevation, which looked exactly like a perfectly flat road section. Of course, nothing could be derived from these, and they had to be discarded. It took a while to detect and understand all the outlier cases; see tip #1.
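The stop case above is simple to filter once you know to look for it. A sketch of such a filter, assuming per-second arrays of speed, power, and elevation (the threshold value is an illustrative guess):

```python
import numpy as np

def drop_stops(speed, power, elevation, min_speed=0.5):
    """Remove samples recorded while stopped.

    A stop shows up as a run of near-zero speed with constant
    elevation, which would otherwise masquerade as a perfectly flat
    road section. min_speed is in m/s and is an illustrative threshold.
    """
    speed = np.asarray(speed, float)
    keep = speed > min_speed  # boolean mask of "actually moving" samples
    return (speed[keep],
            np.asarray(power, float)[keep],
            np.asarray(elevation, float)[keep])
```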
Spending time there is a good idea: you do not want to remove data unless you really understand why it isn't useful, see tip #2.

Once outliers were removed, elevation was almost never constant from one location to the next. Indeed, GPS data is only accurate to a few meters. A variation of a few feet from one point to the next could very well be due to the limited accuracy of GPS positioning. A simple cure for this kind of noise is to smooth the data. I smoothed the elevation data by taking a moving average over 60 seconds. I also smoothed speed and power, which were quite noisy as well. Once smoothed, it was rather easy to spot a flat road section with almost constant speed and power. From that I derived the air-drag coefficient.

One thing I noticed when looking at the smoothed data was that power wasn't constant at all! Here is a plot of power during that 5.5-hour period.

[Figure: Power as a function of time (raw data)]

And here is what we got with a moving average of 60 seconds.

[Figure: Power as a function of time (smoothed)]

More sophisticated smoothing techniques led to similar plots. There is a downward trend, but the curve still looks pretty erratic. Our assumption #3 regarding power was clearly wrong. Given that we had promised our first predictions to Dave's team that same day, I had one day to build a predictive model for power. And I had no clue where to start! A bit scary. This may be where my job differs from an academic one, see tip #8.
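The smoothing and flat-section search described above can be sketched as follows; the window length and flatness tolerance are illustrative values, not the ones actually used:

```python
import numpy as np

def smooth(x, window=60):
    """60-second moving average. GPS elevation is only accurate to a
    few meters, so per-second differences are mostly noise."""
    return np.convolve(np.asarray(x, float),
                       np.ones(window) / window, mode="same")

def cx_from_flat_section(speed, power, elevation, window=60, flat_tol=1.0):
    """Scan the smoothed series for a window where elevation is nearly
    constant and the rider is moving, then apply W = Cx * S^3 there.

    speed in m/s, power in W, elevation in m; thresholds are guesses.
    Returns None when no flat section is found.
    """
    s, w, e = smooth(speed), smooth(power), smooth(elevation)
    for i in range(len(e) - window):
        seg = slice(i, i + window)
        if (e[seg].max() - e[seg].min() < flat_tol
                and s[seg].min() > 1.0):
            return w[seg].mean() / s[seg].mean() ** 3
    return None
```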
I was facing what is known as a supervised learning problem. I knew what I had to predict, namely power, and I had about 20k samples. I started looking at possible correlations between power and the other variables, namely latitude, longitude, and elevation. Note that I could not use speed: if I predicted power from speed, then I would simply have moved my problem, since I would still need to predict speed in the first place. I did all the usual data exploration tricks, including scatter plots of the target variable against each of the other variables. No correlation appeared.

The next logical step is known as feature engineering. Could I derive new variables that would make hidden correlations more visible? I tried several. In particular, I plotted power as a function of slope. Here is what I got.

[Figure: Power as a function of slope (raw data)]

Not very conclusive. However, it shows some interesting outliers. There is a point with slope close to 2, i.e. 200%. This is clearly wrong: no road has that slope. Rather than arbitrarily cutting off outliers, I decided to use smoothed versions of the data, i.e. a rolling average over 60 seconds for power and all the other variables. This time I did find an interesting correlation. Here is the most significant one:
[Figure: Power as a function of slope (smoothed)]

There was clearly a linear correlation! I then applied tip #4, i.e. started with a linear regression. Here is what it gave:

[Figure: Linear regression of power as a function of slope (red line)]

Pretty interesting, isn't it? All I had to do was use the equation of the red line to compute power as a function of slope. Here is my revised trajectory prediction algorithm:
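In essence, the new ingredient is fitting power as a linear function of slope and feeding the predicted power into the physics model instead of assuming a constant. A sketch of the fitting step, using synthetic data as a stand-in for the smoothed Garmin series (the coefficients 2000 and 180 below are made up for illustration):

```python
import numpy as np

# Synthetic stand-in for the smoothed Garmin series: power responds
# roughly linearly to slope, plus noise. All numbers are made up.
rng = np.random.default_rng(0)
slope = rng.uniform(-0.05, 0.08, 20_000)                   # rise over run
power = 2000.0 * slope + 180.0 + rng.normal(0.0, 5.0, 20_000)  # watts

# Tip #4: start with a linear regression (degree-1 polynomial fit).
a, b = np.polyfit(slope, power, 1)

def predict_power(s):
    """Predicted power (W) for a road slope s, from the fitted line."""
    return a * s + b
```

In the real pipeline this fit would be retrained on each new Garmin dump, and the predicted power would then be converted to speed via the physics model.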
This time, the predictions were pretty good, with at most 3% deviation from reality. It means that over 24 hours, our predictions were never more than about 40 minutes off. That was a pretty cool result IMHO. It essentially shows that Dave used his energy in a very consistent way: he adapted his power to the slope of the road. Note that we retrained our predictive model (step 2 of the revised algorithm) each time we got a new Garmin dump. This let us take into account Dave's increasing fatigue as the race went on.

Our model was deemed useful, even if wrong, as in tip #10. One way we used it was to help Dave decide where to rest. Indeed, the location where Dave rests impacts the time to completion of the race. Why? Because of changing weather conditions. Let us use a very simple example for the sake of clarity. Suppose that Dave wants to decide whether to rest now or four hours from now. Furthermore, assume that there is currently no wind, but that a storm two hours from now will bring strong tailwinds for four hours. Let's look at two possibilities:
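The arithmetic behind this toy example is simple enough to write down. The speeds and the two-hour rest duration below are made-up figures for illustration:

```python
# Toy rest-timing example: a 2-hour rest, and a tailwind arriving in
# two hours that lasts four hours. All speeds are illustrative.
CALM_KMH = 25.0      # riding speed with no wind
TAILWIND_KMH = 35.0  # riding speed with a strong tailwind

def distance_rest_now():
    # Rest during hours 0-2 (calm), ride hours 2-6 entirely in tailwind.
    return 4 * TAILWIND_KMH

def distance_rest_later():
    # Ride hours 0-2 (calm), rest hours 2-4, ride hours 4-6 (tailwind).
    return 2 * CALM_KMH + 2 * TAILWIND_KMH
```

Resting now covers 140 km over the six-hour window versus 120 km for resting later, so resting before the tailwind arrives wins.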
The blue curve is Dave's speed as a function of time. In the first scenario, Dave rides four hours with a tailwind, all at high speed. In the second, he rides two hours without wind, at a lower speed, before riding two hours with a tailwind. The first scenario clearly yields a better average speed.

Although it is easy to choose when to rest in this simple situation, picking the right time during the race was much harder. For that we used our trajectory prediction algorithm as a simulator: for each possible rest location we computed the time to finish the race, and the location for which that time was minimal was the recommended rest location. Note that this optimizes a single rest location, while there are more rests, roughly one per day. However, one can show that optimizing each rest in sequence loses no optimality, so we focused on optimizing the next rest. We did provide Dave and his crew additional information beyond the best pred

We also took tip #9 into account, i.e. the need for a nice presentation. We plotted the time to completion as a function of where the rest occurs, then rendered this curve on a dashboard that Dave's crew could access via an iPad. Here is a snapshot.
[Figure: Rest recommendation dashboard. Blue curve: time to destination depending on rest location. Red curve: elevation along the route.]

In this case the best option is to rest at 12 am or a little after.

That concludes my little story. I hope you enjoyed it. The above is just one piece of the overall project we did with Dave. You can find more on how Dave and IBM collaborated in Analytics For The Perfect Rac
