“It’s the most wonderful time of the year.”
That song is supposed to refer to the holiday season in December, but for college basketball lovers, it also means March. Why? Because in the US, it’s men’s college basketball tournament time. There are several of these going on right now, but my interest is in the tournament that determines the men’s college basketball Division I national champion. Sixty-eight teams are invited, but only one gets to be the winner. My team, the University of North Carolina Tar Heels, has won this tournament five times, so I come by my enthusiasm naturally.
As part of what is now an American tradition, I’ve been completing tournament challenge brackets for fun for 19 years now. Sadly, I’ve only won a bracket challenge once, and that was in 1997. I won it because, back then, I had time and lived in a viewing area for three college basketball conferences, so I could watch a whole lot of games and judge with my own eyes how teams played. I’ve been watching basketball games since the 1970s, so I actually am knowledgeable about the sport.
I’ve gotten busier since then, so for the last 10 years or so, I’ve relied on analysts’ opinions. That has been a mistake. Analysts’ opinions often aren’t based on intense analysis of a season full of games but are fueled by unintentional bias, history and hype. This year, I decided to try something else: using statistics and analytics to pick my teams.
Picking the stats is easy; interpreting them is hard
There are three major sets of college basketball statistics: the RPI, Sagarin, and Pomeroy. Each has its merits, but I used Pomeroy as my source because the other two put a lot of weight on things like strength of schedule and subjective rankings. Those rankings assume that, because a team has historically been a basketball powerhouse or has an experienced winning coach, beating that team or even losing to it counts for more than beating or losing to a team that has neither. Pomeroy is also somewhat subjective, but his statistics include ratings of parts of the game, such as offensive efficiency, defensive efficiency, and tempo. Because the NCAA tournament is a “win or go home” tourney, how a team played over a long period against teams ranked higher or lower than it is, in my opinion, not as relevant as the mechanics of the game.
Watson Analytics to the rescue
A Pomeroy spreadsheet isn’t really that bad, but anything more than a few rows and columns makes my eyes glaze over. Rather than decipher the stats, I decided to download Pomeroy’s data from his website (I subscribe) and upload the spreadsheet into Watson Analytics. I knew that the explore and predict features of Watson Analytics would probably provide me with some quick insights that would help me pick my teams.
I uploaded the Pomeroy stats for the 68 teams in the tournament, clicked Predict and chose wins as my target. I soon realized that was a problem, because Watson Analytics told me that stats related to losses were the strongest predictors of wins. So, I went back and adjusted my data set by removing the win and loss columns and adding a “W-Lratio” (win-loss ratio) column in their place. I took the time to calculate that ratio, and I’m glad I did, because with that adjustment, I hit pay dirt. I selected “W-Lratio” as my target, asked for a combination of factors and this is what I saw:
What Watson Analytics is telling me here is that offensive efficiency (OE) and defensive efficiency (DE) are the two statistics most likely to influence my outcome, which is win-loss ratio. That made a lot of sense, because pairing the most efficient offense with the most efficient defense is going to lead to more wins and fewer losses. Just to make sure I didn’t miss a better set of predictors, I clicked “View all” in the right corner of the screen above the first visualization for “What influences W-Lratio?” and got a screen full of other influencers. These were the top contenders:
The best offensive efficiency isn’t the best defensive efficiency?
There were other influencers with predictive strengths of 73%; however, my familiarity with the Pomeroy rating system told me that the pure statistics of OE and DE, as opposed to subjective statistics such as rank and the adjusted versions of OE and DE, were probably going to serve my purposes best. For a better picture of the prediction, I clicked the top graphic for details and got this chart:
As you can see, the darker the blue, the better the win-loss ratio based on OE and DE. Because this is an interactive graphic, I can click one of the squares to see the numbers for OE and DE and the resulting win-loss ratio. I got a good idea of the combinations that worked best, but decided to explore relationships a little better with Watson Analytics. I exited my prediction, clicked my data set and clicked Explore. Watson Analytics showed me some questions based on my data, but I knew what I wanted to ask, which was “What is the relationship of OE and DE by TeamName?” After I asked that question, I got this visualization from Watson Analytics:
This interactive visualization enabled me to click on a dot to see each team. I soon realized that the teams with the better W-L records were most likely to fall in the 114-118 range for OE and 94-98 range for DE. This was an interesting insight because I thought that the team with the best OE and best DE would have the highest W-L percentage, but this was not the case. However, it was close. So, after all that investigating, I began the process of picking my teams. I used the Predict detail graphic to compare the teams, but when I got stuck, I’d return to the data visualization to see where each team fell in the ranges for determining the best combination of OE and DE.
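For readers who would rather check this kind of range in a script than by clicking dots, here is a minimal sketch in Python. It is not how Watson Analytics works internally. The team names and numbers are invented for illustration, the `Team` class and `in_sweet_spot` helper are hypothetical, and the ratio is assumed to be wins divided by losses. The default ranges are the ones described above.

```python
from dataclasses import dataclass


@dataclass
class Team:
    name: str
    wins: int
    losses: int
    oe: float  # offensive efficiency
    de: float  # defensive efficiency (a lower number is a better defense)

    @property
    def wl_ratio(self) -> float:
        # Assumption: "win-loss ratio" means wins divided by losses.
        # An undefeated team would divide by zero, so treat it as infinite.
        return self.wins / self.losses if self.losses else float("inf")


def in_sweet_spot(team, oe_range=(114.0, 118.0), de_range=(94.0, 98.0)):
    """Return True if the team's OE and DE both fall inside the given ranges."""
    return (oe_range[0] <= team.oe <= oe_range[1]
            and de_range[0] <= team.de <= de_range[1])


# Invented sample rows, NOT real Pomeroy data.
teams = [
    Team("Team A", 30, 4, 116.2, 95.1),
    Team("Team B", 24, 10, 109.8, 91.5),
    Team("Team C", 27, 6, 117.5, 97.3),
    Team("Team D", 20, 13, 104.0, 99.9),
]

# Keep only the teams inside both ranges, best win-loss ratio first.
sweet_spot = sorted(
    (t for t in teams if in_sweet_spot(t)),
    key=lambda t: t.wl_ratio,
    reverse=True,
)

for t in sweet_spot:
    print(f"{t.name}: W-L ratio {t.wl_ratio:.2f}, OE {t.oe}, DE {t.de}")
```

Computing the ratio as a property keeps the raw win and loss counts in the data while still exposing a single number to sort on, which mirrors the “W-Lratio” column I added to my spreadsheet.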
Another interesting tidbit: I’ve always heard that “defense wins games” (usually from fans of teams from the Big 10 Conference, because those teams are renowned for their defenses), but my data exploration and prediction proved that wrong. When compared head to head, teams with a high OE and average to fair DE had better win-loss ratios, overall, than teams with average to slightly better than average OEs but really good DEs (in the low 90s).
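The same comparison could be roughed out in a few lines of Python. This is only a sketch under stated assumptions: the rows are invented, the `classify` function is hypothetical, and the cutoffs (`high_oe=115.0`, `strong_de=93.0`) are my guesses at what “high OE” and “really good DE” might mean, not values from Pomeroy or Watson Analytics. The made-up numbers show the mechanics only and prove nothing about real teams.

```python
from statistics import mean

# Invented rows: (team name, OE, DE, win-loss ratio). NOT real data.
rows = [
    ("Team A", 118.0, 99.0, 5.0),
    ("Team B", 116.5, 100.5, 4.0),
    ("Team C", 108.0, 91.0, 3.0),
    ("Team D", 109.5, 92.5, 3.5),
    ("Team E", 112.0, 96.0, 4.2),
]


def classify(oe, de, high_oe=115.0, strong_de=93.0):
    """Label a team by its profile. The thresholds are assumptions."""
    if oe >= high_oe and de > strong_de:
        return "offense-first"   # high OE, ordinary DE
    if oe < high_oe and de <= strong_de:
        return "defense-first"   # ordinary OE, very good (low) DE
    return "other"


# Collect the win-loss ratios for each profile.
groups = {}
for name, oe, de, ratio in rows:
    groups.setdefault(classify(oe, de), []).append(ratio)

# Average win-loss ratio per profile.
averages = {label: mean(ratios) for label, ratios in groups.items()}

for label, avg in sorted(averages.items()):
    print(f"{label}: average W-L ratio {avg:.2f}")
```

With real data, moving the two cutoffs up or down is a quick way to see how sensitive the “offense beats defense” conclusion is to where the lines are drawn.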
Sometimes it takes a little luck
With all that in my head, I got to work. In the spirit of honesty and full disclosure, it took me much longer than in previous years, when I used my gut, my bias and analyst opinions. Also, as I evaluated and chose my teams, there were times when the opponents’ numbers were so close that I could not rely completely on just OE and DE. In those cases, I used another Pomeroy statistic: luck, which indicates, in positive and negative numbers, the likelihood that a game will tip in a team’s favor when playing an evenly matched opponent. And finally, in one case and one case only, I went with my gut rather than statistics, and that’s when I picked Xavier over Stephen F. Austin even though the stats said otherwise. Their stats were fairly close, but because I saw both teams play recently, I think Xavier has more talent and more tournament experience. After all, there are human beings playing these games, and they’re young ones at that.
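The picking routine I followed by hand could be sketched roughly like this in Python. Treat it as an illustration, not my exact method: I compared teams in the Watson Analytics graphics, whereas this sketch substitutes a simple OE minus DE margin as a stand-in. The `close_margin` cutoff, the `TeamStats` type, and all the numbers are hypothetical.

```python
from typing import NamedTuple


class TeamStats(NamedTuple):
    name: str
    oe: float    # offensive efficiency
    de: float    # defensive efficiency (lower is better)
    luck: float  # Pomeroy-style luck value; can be positive or negative


def efficiency_margin(team: TeamStats) -> float:
    # Assumption: OE minus DE is used here as a stand-in for comparing teams.
    return team.oe - team.de


def pick_winner(a: TeamStats, b: TeamStats, close_margin: float = 1.0) -> TeamStats:
    """Pick by efficiency margin, falling back on luck when it is too close."""
    gap = efficiency_margin(a) - efficiency_margin(b)
    if abs(gap) >= close_margin:
        return a if gap > 0 else b
    # The numbers are too close to call, so let luck break the tie.
    return a if a.luck >= b.luck else b


# Invented teams, NOT real tournament data.
team_x = TeamStats("Team X", 115.0, 96.0, 0.031)
team_y = TeamStats("Team Y", 114.6, 95.2, -0.012)
team_z = TeamStats("Team Z", 108.0, 97.0, 0.080)

print(pick_winner(team_x, team_y).name)  # margins nearly equal, luck decides
print(pick_winner(team_x, team_z).name)  # clear margin, luck is ignored
```

Note that nothing in this sketch covers the one gut pick. That override stays a human decision.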
The big reveal
With all that said, this is my bracket, which has been adjusted to show the winners of the two games played last night and two the night before.
Where you can get stats
For those who are interested in using stats or a combination of stats, here are the links to the three I’ve mentioned:
- Pomeroy (Pomeroy provides HTML stats on his website that include the “Luck” stat I mentioned, but to get his full set of statistics, you have to subscribe to his site for a fee.)
- Sagarin (Sagarin’s statistics are only available in HTML format, it seems.)
- RPI (This is the NCAA’s RPI site and is the one most free of ads. Again, it appears that HTML is the only format.)
Also, my printable bracket was downloaded from this website.
Where you can get Watson Analytics
Watson Analytics is available for free right from this website. Click here to register.