Green dice are loaded (welcome to p-hacking)

Did you know green dice are loaded? I am not making this up: I designed a rigorous scientific study that proves it. Let me repeat slowly: painting dice green makes them loaded. You don't believe me? Let me describe the study, and you'll be able to reproduce my scientific results.

The study

I made the hypothesis that green dice are loaded and that they yield a six more often than ordinary dice do. I then took 100 green dice and rolled each of them 10 times, counting how many times a six appeared. A six appeared 188 times, which is rather high. Indeed, one would expect the number of sixes to be closer to 1,000/6 ≈ 167. It seems that my hypothesis about green dice may have some truth in it.

Scientists have devised ways to quantify what I called 'some truth' in the previous sentence. The most common one is called the p-value. It is the probability of getting a result at least as deviant from the norm as the one we got, if we assume nothing special is going on. There are other approaches than p-values for assessing experimental results, but discussing them is beyond the scope of this post.

For the sake of clarity, let's compute the p-value for my experiment. The p-value is the probability of getting a six at least 188 times out of 1,000 dice rolls if the dice do not favor any of the possible outcomes. This probability is about 4% (see the math section below for details), i.e. odds of 1 in 25. A low p-value such as 4% tells us something interesting: if we make no hypothesis about our dice other than that they do not favor any outcome, then the result is rather unlikely. This is deemed sufficient by many scientific communities to conclude that the hypothesis (green dice are loaded) is true. I should therefore publish my result.

Is it that simple? Are low p-values good enough?

A p-value of less than 5% is indeed strong enough to get published in many scientific journals.
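Before going further, the 4% figure itself is easy to double-check with a quick Monte Carlo simulation. This sketch is not part of the original study; it simply draws the number of sixes in 1,000 fair rolls many times (a binomial(1000, 1/6) variable) and measures how often the count reaches 188:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 100,000 experiments of 1,000 fair-die rolls each;
# the number of sixes in one experiment follows binomial(1000, 1/6).
sixes = rng.binomial(n=1000, p=1/6, size=100_000)

p_value = (sixes >= 188).mean()
print(round(p_value, 3))  # lands near 0.04
```

With a different seed the estimate moves a little, but it stays close to the 4% computed exactly in the math section below.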
It seems, however, that this has led some researchers to tweak their data and their analyses until they get a small enough p-value, whatever their data. This is called p-hacking. It is illustrated by the xkcd comic below.
Examples of p-hacking are not that hard to find. For instance, the science journalist John Bohannon managed to publish a study claiming that chocolate accelerates weight loss, built on deliberately p-hacked data, precisely to expose how easily such results get published.

When are p-values good enough, and when are they misleading?

Correlation isn't causation

There is a trap that many fall into when it comes to p-values. Let me illustrate it with the dice experiment. The p-value does not say anything about the hypothesis (that green dice are loaded). It just says that a result this far from the expected mean of 167 sixes per 1,000 rolls is unlikely if the dice do not favor any of the possible outcomes. That's it. Don't read more into it. Said differently, the 4% p-value does not validate the hypothesis that the color of dice has an influence on roll outcomes. Sure, we got an unlikely result, but there may be plenty of reasons for it. It can still be due to chance, however unlikely. Maybe the assembly line had some issues when producing this lot of dice. Maybe they were all stored the wrong way and got warped. Maybe we did not interpret the experiment correctly. Who knows?

It is a bit disturbing not to be able to imagine how dice color could make a six more likely. That should be the warning signal: if we cannot figure out a plausible reason why the color of a die influences the way it rolls, then we don't have much to publish. Requiring evidence of a causal relationship between the hypothesis and the observed result should be mandatory. This idea is gaining traction: for instance, some journals like Basic and Applied Social Psychology won't publish findings based solely on p-values anymore.

Still, I did get an unlikely result in my experiment. If it can't be explained by dice color, then what explains it? Let's investigate further. For that we need to look under the hood and check all the experiment settings.
The catch

When looking at what I did, I must confess I actually forgot to tell you about a seemingly minor detail: I tested 20 different groups of dice of various colors, not just green. For each color, I gathered 100 dice and rolled each of them 10 times. Here are the results I got for each dice color:

(Figure: number of sixes per dice color. Only the green dice stand out, with 188 sixes.)
We see that only the green dice produced that many sixes. The probability of getting such a high count for the green dice is low, isn't it? Remember, I computed it to be 4%. As we are about to see, my computation of the p-value is unfortunately plain wrong. Indeed, running 20 dice experiments is different from running one. What I have done is this:

1. Take 100 dice of each of 20 colors.
2. Roll each die 10 times and count the sixes for each color.
3. Notice that the green dice yielded 188 sixes, a deviant result.
4. Compute the probability that a given color yields at least 188 sixes: 4%.
The last step is wrong. Indeed, a p-value of 4% corresponds to this other experiment:

1. Select one color, say green, before rolling anything.
2. Take 100 dice of that color and roll each of them 10 times.
3. Compute the probability that these dice yield at least 188 sixes: 4%.
Isn't the second experiment the same as the first one? No, it isn't. There is a major difference: in the first experiment, I selected the dice color after the experiment was run; in the second experiment, I selected the dice color before running the experiment. As a matter of fact, the first experiment should be described this way:

1. Take 100 dice of each of 20 colors.
2. Roll each die 10 times and count the sixes for each color.
3. Compute the probability that at least one of the 20 colors yields at least 188 sixes.
It so happens that the p-value is about 56% in this case, more than 11 times the 5% threshold generally used for p-values. Nothing surprising here, therefore: the result was quite likely after all. Gone is my scientific discovery!

P-hacking

I was guilty of p-value hacking (p-hacking) when I first reported my experiment results. Indeed, I reported the results for one color without mentioning that I had tried 20 colors in all. I did exactly what is depicted in the xkcd comic below.
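The jump from 4% to 56% is just the arithmetic of multiple comparisons. If each of the 20 colors independently has a 4% chance of showing a deviant result, the chance that at least one of them does is the complement of none of them doing so. A quick check (assuming the 20 colors are independent):

```python
p_one_color = 0.04  # chance that a given color reaches 188 sixes
n_colors = 20

# P(at least one color deviates) = 1 - P(no color deviates)
p_any_color = 1 - (1 - p_one_color) ** n_colors
print(round(p_any_color, 3))  # 0.558, i.e. about 56%
```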
Credit: xkcd

The fact is, the more colors we try, the more likely we are to see a result that deviates from the norm. In general, the more features you measure when performing an experiment, the more likely you are to get some values far from the norm. This is now a well-established fact; see for instance John D. Cook's writing on the topic on his blog, The Endeavour.

I described one way of hacking p-values, but there are many more. Should we avoid using p-values, given that they can be hacked in so many ways?

The cure

Banning the use of p-values may be a bit extreme. After all, low p-values tell us something interesting. It is just that a low p-value alone isn't strong enough to warrant a scientific discovery. And we must make sure p-values are computed correctly, i.e. using all the data that was collected and all the experiment results that were produced. Issues with p-values have led the American Statistical Association (ASA) to recently issue six principles about p-values:

1. P-values can indicate how incompatible the data are with a specified statistical model.
2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
I violated several of them in my account of the dice experiment. I jumped to the conclusion that dice color was relevant based solely on a p-value, violating principle 3. I did not report the exact experiment I conducted, violating principle 4. I did not find any evidence that color was relevant, yet I jumped to a conclusion based on a p-value, violating principle 6. Understanding and following these six principles should help us use p-values for what they are worth, and no more. My take is that p-values are useful if they are computed correctly, and if one does not read more into them than what they are. For more on these principles, see this interview of Ron Wasserstein, the ASA's executive director. I also recommend Arthur Charpentier's post on p-hacking.

In the remainder of this post I provide Python code for running the experiment, then I give details on how to compute the various p-values used so far.

Experiment details

Here is the Python code that we used to run the dice experiment. The code used in this post is available on github and nbviewer. (The function bodies were truncated in this copy of the post; they are reconstructions of what the surrounding text describes.)

```python
import numpy as np
import pandas as pd
from numpy.random import randint, seed  # random_integers is deprecated in favor of randint

dice = ['Purple', 'Brown', 'Pink', 'Blue', 'Teal', 'Salmon', 'Red',
        'Turquoise', 'Magenta', 'Yellow', 'Grey', 'Tan', 'Cyan', 'Green',
        'Mauve', 'Beige', 'Lilac', 'Black', 'Peach', 'Orange']

def dice_experiment(dice, n_dice=100, n_rolls=10, random_seed=0):
    """For each color, roll n_dice dice n_rolls times each and count the sixes."""
    seed(random_seed)
    counts = {}
    for color in dice:
        rolls = randint(1, 7, size=(n_dice, n_rolls))  # upper bound is exclusive
        counts[color] = int((rolls == 6).sum())
    return pd.Series(counts)
```

We fixed the random seed for reproducibility, but we could have used other seeds. Remember that we have more than a 50% chance of getting at least 188 sixes out of 1,000 rolls for at least one color in this experiment. It means that if we do not set the random seed and run the above code, we are likely to get what we are looking for. Let's confirm this by running the experiment 1,000 times:

```python
def get_proba_any_color(n_experiments=1000, threshold=188):
    """Fraction of experiments where at least one color gets >= threshold sixes."""
    hits = 0
    for i in range(n_experiments):
        counts = dice_experiment(dice, random_seed=i)
        if (counts >= threshold).any():
            hits += 1
    return hits / n_experiments
```

Running the code yields 0.556, i.e. 55.6%: we get the result we want in 556 of the 1,000 experiments. The odds of getting the result for the green dice alone are way lower. Let's confirm this experimentally as well, by running the experiment 1,000 times again.
```python
def get_proba_green(n_experiments=1000, n_dice=100, n_rolls=10, threshold=188):
    """Fraction of experiments where 100 green dice get >= threshold sixes."""
    from numpy.random import randint, seed
    hits = 0
    for i in range(n_experiments):
        seed(i)
        rolls = randint(1, 7, size=(n_dice, n_rolls))
        if (rolls == 6).sum() >= threshold:
            hits += 1
    return hits / n_experiments
```

This yields 0.041, i.e. 4.1%: we got the desired result only 41 times among the 1,000 experiments.

For the fun of it, I wanted to use the exact same colors as in the xkcd comic, hence I looked for random seeds that led to it. The code is provided below (a reconstruction of the truncated original; the original also used a `rounding` parameter whose exact role was lost):

```python
def find_seed(dice, k, n, repeat, rounding):
    """Scan random seeds until an experiment shows at least k sixes out of n
    rolls for the green dice, and for no other color."""
    from numpy.random import randint, seed
    for s in range(repeat):
        seed(s)
        rolls = randint(1, 7, size=(len(dice), n))
        counts = (rolls == 6).sum(axis=1)
        deviant = [color for color, c in zip(dice, counts) if c >= k]
        if deviant == ['Green']:
            return s
    return None
```

Note that this is yet another instance of p-hacking: I selected the random seed based on the result I wanted to show. This is really bad p-hacking, as I first decide on the p-value, then look for experiment settings that produce it. This dice experiment design is really not a model to follow!

Some math

Let's now compute exact p-values for our two experiments. In order to get exactly k sixes among n rolls, we must first decide where those sixes appear in the sequence of n rolls. There are exactly n choose k = n!/(k!(n-k)!) ways to do it. Then, for each of these sequences, the probability that all of the k chosen rolls are a six is (1/6)^k, and the probability that the remaining n-k rolls are not a six is (5/6)^(n-k). The probability of getting exactly k sixes out of n dice rolls is therefore:

P(exactly k among n) = (n choose k) * (1/6)^k * (5/6)^(n-k)

```python
from scipy.special import comb  # scipy.misc.comb was moved to scipy.special

def proba_k_among_n(k, n):
    p = 1 / 6
    q = 1 - p
    return p**k * q**(n - k) * comb(n, k)
```

From it we can easily compute the probability that at least k out of n rolls are a six, by summing the probabilities of each possible count:

```python
def proba_one_color(k, n):
    proba = 0.0
    for i in range(k, n + 1):
        proba += proba_k_among_n(i, n)
    return proba

print("%0.3f" % proba_one_color(188, 1000))
```

Running it yields 0.040, i.e. 4%. This is close to the experimental result we got. The probability that at least one of the 20 colors gets at least 188 sixes is the complement of the probability that no color gets at least 188 sixes:

```python
def proba_any_color(k, n, n_colors=20):
    return 1 - (1 - proba_one_color(k, n)) ** n_colors

print("%0.3f" % proba_any_color(188, 1000))
```

Running it yields 0.559, i.e. 55.9%, which is close to what we got experimentally. The code used in this post is available on github and nbviewer.

Update on March 23, 2016.
A redditor pointed out that these probabilities can be obtained directly from scipy's built-in binomial distribution. Probability that a given color has at least 188 sixes (the original snippet was truncated; this is the call described below):

```python
from scipy import stats

print("%0.3f" % stats.binom.sf(187, 1000, 1/6))  # about 0.040
```
Probability that one color among 20 has at least 188 sixes:

```python
print("%0.3f" % (1 - stats.binom.cdf(187, 1000, 1/6) ** 20))  # about 0.559
```
This uses the built-in binomial distribution from the scipy package. It provides the probability mass function:

binom.pmf(k) = choose(n, k) * p**k * (1 - p)**(n - k)

It also provides additional methods: the call to sf(187) gives the probability that a given color gets more than 187 sixes, and the call to cdf(187) gives the probability of getting at most 187 sixes.
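As a quick sanity check on these methods (a sketch using only documented scipy.stats.binom calls): sf and cdf are complements, and the cdf is the running sum of the pmf:

```python
from scipy import stats

binom = stats.binom(1000, 1/6)  # frozen distribution: n=1000 rolls, p=1/6

# sf(k) = P(X > k) and cdf(k) = P(X <= k) are complements
assert abs(binom.sf(187) + binom.cdf(187) - 1) < 1e-12

# the cdf is the cumulative sum of the pmf
assert abs(sum(binom.pmf(k) for k in range(188)) - binom.cdf(187)) < 1e-9

print("%0.3f" % binom.sf(187))  # the exact 4% p-value again
```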
