# Filling in the blanks: The math behind Nate Silver's "The Signal and the Noise"

How probability theory and Bayesian analysis apply to predicting on-time delivery of a software project

In 2012, after his triumph of predicting the outcome of the last two presidential elections and selling his "fivethirtyeight" blog to the New York Times, Nate Silver accomplished what is almost impossible. In his recent book The Signal and the Noise, he correctly describes the discipline of making predictions, without explicitly invoking the math. He accomplishes this feat even though the prediction methods he describes require more than one kind of mathematics. By leaving out the math, he has reached a broad audience with a compelling book with lots of examples.

Those who studied statistics and probability in college are perhaps left wishing for more. I've written this set of notes to address that much smaller set of readers: Those who have the ability to understand the math concepts and how the math provides the foundation for Silver's work.

To that end, I identified these major mathematical topics underlying Silver's analysis:

• Applied probability theory including Bayesian analysis and inductive reasoning
• Long-tailed distributions
• Signal analysis (hardly a surprise, given the title)
• Nonlinear dynamical systems and chaos theory

These topics are generally taught in more advanced undergraduate or maybe first year graduate courses. Each topic has its own mathematical literature. Nevertheless, each topic has just a few elementary core principles that are relevant to Silver's book. My intent is to delineate these principles for those readers who want the blanks filled in but have not necessarily had the math.

In particular, I find that many in my domain, the management of software and systems development, are in the target audience: they have enough math to understand the ideas in the paper but lack specific training in predictive methods. These articles are aimed explicitly at that audience, so I will focus in that direction. However, I hope and expect that the articles will have general appeal.

This is the first of four articles, each of which will cover one of the topics listed above. This article covers the essentials of the underlying probability theory and inductive reasoning. I will assume that the reader has had first-year calculus and has some recollection of sets and high school-level probability theory.

My challenge is to explain the math well enough to build intuitions but not distract you with the formalism or drown you in the extensions and details of the concepts. To that end, I have chosen examples designed to help build the intuition.

My goal here is to achieve a level of explanation suitable for the target audience. I strongly encourage you to tell me how successful this article is in achieving that goal and to make suggestions.

## Samples, events, and probabilities

This quote might be more famous than its author, and I can't discover who said "Predicting the future is impossible, but that is our job." I've long remembered this, because so much of what we do involves prediction, making a sort of bet. If we agree to produce something our bosses need when they need it, we are betting that we will succeed. A candidate who invests her fortune and emotional energy in a campaign is betting she will win. Stock market investors make lots of bets. The disciplined part of prediction is measuring how good the bet is.

A prediction generally is a statement about some future event. Examples can be very far ranging and include the measurement from a controlled experiment, the completion date of a development effort, the behavior of an electorate, the rise in cancer rates due to smoking, stock prices, and the number of earthquakes that will occur over the next five years. In some cases, the techniques of prediction have also been more broadly applied to include the likelihood that fraud has occurred, or that an answer to a question is correct. For example, IBM's Watson computer that beat the Jeopardy champions used the ideas in this article by predicting the answer. In fact, the Watson display showed the likelihood of the answer being right.

To move prediction from the realm of the supernatural and into the realm of science, we need a broad framework to reason about our confidence in the ability to predict. For example, I can be so confident that the sun will rise and set as predicted by the solar tables that I would bet all I own with anyone foolish enough to take the bet. I am less confident that, because of global warming, storms in my region will continue to grow in intensity, but I am confident enough to invest in a generator.

So predictions should be accompanied by some statement of how probable they are, which expresses our degree of confidence in the prediction. The mathematical reasoning that governs this expression lies within probability theory. As in any kind of mathematics, we make a few definitions or axioms about the key concepts and take it on faith that the concepts correspond to real-world phenomena. That leap from the real world to probability theory took centuries to work out. The concepts are not obvious, but, like all good math, they work remarkably well. In this article, I explain just enough so you can see how probability theory applies to real-world problems.

This might not be an easy journey, but it's well worth the effort. Take it slowly, absorb the concepts, think of your own examples, and ask questions of anyone who you predict will know the answer.

## Fundamentals of applied probability

To get started, as with any mathematical theory, we need elemental concepts upon which to build. Geometry works by postulating things called points that behave according to certain "self-evident" axioms. Similarly, in probability, samples are the primitive elements. Samples can be picked from some set: for instance, an individual selected from some target population for a survey, an item picked from a product assembly line for quality testing, or a temperature measurement taken at a given location.

All we really need to know about samples is that they can belong to sets. A set of samples is called an event.

We call the set of all samples U, the universe. It is not really the Universe, but rather all of the samples relevant to the problem. The events are subsets of U. Calling a subset an "event" might be surprising, but get comfortable with that usage. Using mathematical notation helps us be concise and precise, and that builds your intuitive ability to internalize the concepts.

Repeat this often:

In probability, an "event" is not an occurrence in time or space. An event is some set of samples that describe, say, different outcomes from a measurement.

A probability measure is a function that assigns to each event A in U a real number P(A). There is a more precise mathematical definition of a probability measure within the field of measure theory, which specifies which subsets of U have a probability. In this article, we will be as precise as necessary, but no more precise. For all practical purposes, any subset that you will ever come across can be assigned a probability measure. Additional properties of probability measures are listed below.

More intuitively, the probability P(A) of event A is the likelihood that a sample chosen randomly from U is also in A. For that reason, P(A) is sometimes written as P(y ∈ A). Recall that y ∈ A means y is a member of the set A.

Also, the terminology might be confusing to some, because the distinction between samples and events might be glossed over in introductory statistics courses. These courses discuss discrete universes, such as the rolls of a die. There are six samples for the roll of a die, so the universe is the finite set U = {1,2,3,4,5,6}. You (and some introductory texts I have seen) might be tempted to say that P(6) = 1/6, but strictly speaking, this is not correct, because "6" is a sample, not an event. So, instead, we should write P({6}) = 1/6, expressing the probability that the roll will be in the event "rolling a six." That distinction might be too fussy for high school, but the formalism allows for the probability definitions and their usages found below.

Let's take a simple example. Suppose that you have a tub of a thousand red and blue balls, and your job is to determine the probability of picking either a red ball or a blue ball. To set up the problem, we define U to be the set of balls. Each ball is a sample. Let R be the set of red balls and B the set of blue balls. R and B are events, corresponding to drawing a red ball or a blue ball. We would like to know P(R) and P(B), the probability that the drawn sample will be in the event R or B. If we know the proportion of red and blue balls in the tub, the computation is easy:

$$P(R) = \frac{|R|}{|U|}, \qquad P(B) = \frac{|B|}{|U|}$$

Here |R|, |B|, and |U| are the numbers of red balls, blue balls, and all balls in the tub. Notice that P(R) + P(B) = 1.

If you knew the probabilities, you would know how to predict what ball would be drawn next well enough to set odds and make a bet. For example, if you knew there were 900 red balls and 100 blue balls, a fair bet on drawing a red ball would be nine to one. How to infer those proportions from taking a number of samples is discussed later.

Because P acts on sets, let's review definitions from set theory.

##### Figure 1. Basic set theory concepts used in probability

In a universe U, let A and B be subsets or events. Then:

• A ∪ B, the union of A and B, contains all of the samples in A or B.
• A ∩ B, the intersection of A and B, contains the set of samples that are both in set A and B.

This is shown in Figure 1. To continue:

• We say A is a subset of B, A ⊆ B, if every sample of A is a sample of B.
• The empty set is denoted ∅; it has no members.
• Finally, for A ⊆ U, the complement of A, denoted ~A or A^c, is the set of samples of U that are not in A.

The assignment of probabilities to events follows certain fairly self-evident rules, much like areas of planar regions. Here are some rules:

• For any A, 0 ≤ P(A) ≤ 1. (Probabilities fall between 0 and 1.)
• If A ⊆ B, then P(A) ≤ P(B).
• P(∅) = 0. (The probability of nothing happening is zero.)
• P(U) = 1. (Something in our universe is sure to happen.)
• P(A^c) = 1 – P(A).
• P(A ∪ B) = P(A) + P(B) – P(A ∩ B). (See Figure 1 to see why.)

Also, we define the joint probability of A and B, P(A, B), to be P(A ∩ B), the probability of a sample being in both event A and event B. Note P(A, B) = P(B, A).

Finally, we say A and B are independent if P(A ∩ B) = P(A)P(B). An example of two independent events comes from throwing two dice. Here A might be the event of throwing a six on the first throw and B might be the event of throwing a three on the second. These are independent events with a joint probability of 1/36.
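The dice example can be checked by brute-force enumeration. The following is a minimal sketch, assuming two fair dice so that all 36 ordered pairs are equally likely; the helper `P` and the variable names are illustrative and do not come from Silver's book.

```python
from fractions import Fraction
from itertools import product

# Universe: all ordered pairs (first throw, second throw) of two fair dice.
U = set(product(range(1, 7), repeat=2))

def P(event):
    """Probability of an event (a subset of U) when every sample is equally likely."""
    return Fraction(len(event), len(U))

A = {s for s in U if s[0] == 6}   # a six on the first throw
B = {s for s in U if s[1] == 3}   # a three on the second throw

# In Python, `&` is set intersection and `|` is set union
# (not the conditional-probability bar used in the text).
print(P(A), P(B), P(A & B))                  # 1/6 1/6 1/36
print(P(A & B) == P(A) * P(B))               # True: A and B are independent
print(P(A | B) == P(A) + P(B) - P(A & B))    # True: the union rule
```

Exact fractions are used rather than floating-point numbers so that the equalities hold exactly.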

### Discrete and continuous events

There are two kinds of events:

Discrete events
In these, the universe is a finite set of samples. (Those who are more mathematically aware would define a discrete probability as one defined on a countable set, such as the integers. To build intuitions, we will restrict ourselves to finite sets.)

Continuous events
These arise when the samples take a value in some range (possibly infinite) of real values.

Using the continuum allows us to describe measurement at any level of granularity. You might say that a predicted delivery date is discrete, because days are discrete and you don't care about the nanosecond of delivery. My answer would be that leaving "the time to deliver" as a continuous set of samples often makes the math easier and leaves open the choice of granularity for discretization to cover all contingencies.

You probably studied the probabilities regarding discrete events in high school. (If not, there is lots of good online material available, such as the Khan Academy site.) You did this by counting things like the number of ways to roll seven on two dice and defining the probabilities as ratios.

Probabilities on continuous events work much like weight and physical density. So let's start by imagining that we have a planar sheet of metal. Imagine further that we can isolate areas on the plane and weigh each of the areas separately. On our sheet, the weight might not be evenly distributed. Weighing a square inch on one part of the sheet might give a different measurement than weighing a square inch on another part. So even though two regions A and B have the same area, they might have different weights.

Notice that weight functions do not act on points of the plane, but on subsets. The analogy to keep in mind is this:

• Samples are like points on our plane
• Events are like regions on our plane
• Probability of events is like the weight of regions

We can carry the analogy further with densities.

The density of an area A on the sheet is the ratio: d(A) = weight(A)/area(A).

We now borrow ideas from integral calculus, including limits. For example, note that in physics, the weight of a subset consisting of a single point would be 0. However, the density of a point is not zero. It is found by taking the limits of the density of a set of regions converging to the point. In this case, both the weight and area go to zero, but the limit of the quotient might not.

In this way, we can define the density of a point as the limit of the density of the surrounding areas. From calculus, we know, under very broad and reasonable assumptions, that the density of a point is well-defined. This limit defines a density (or mass) function d(x) that assigns the limit to each point on the sheet.

In physics, using integral calculus, we say that the weight of a region A is the integral of the density over A:

$$\text{weight}(A) = \int_A d(x)\,dx$$

In probability, rather than reason about sets of physical points, areas, their weight, and the density function, we reason about sets of samples, events, their probability, and the distribution function (sometimes, reflecting the analogy with masses from physics, the distribution is called the probability density or even the probability mass function).

So, just as we integrate the mass density over a region on our sheet, we define for a sample s its probability density d(s), whose integral over an event (a set of samples) is the probability of that event. By definition, then:

$$P(A) = \int_A d(s)\,ds$$

Let's continue our example about predicting when a product might ship. We imagine a range of ship times and assign a density to the times. We do not really want to reason about the ship time to the second; even though we may use a continuous density, setting the granularity to the day is good enough. Note that

d(s) ≥ 0 for all s

and

$$\int_U d(s)\,ds = 1$$

The advantage of the density formalism is that it works for both the continuous and the discrete case. If U has a finite set of sample values, we can say that we have a discrete distribution. Again, to aid in building your intuition, consider the sum that we get from throwing two dice. In this case, U = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}. For each sample s ∈ U, d(s) = P({s}). So by counting the number of ways of rolling each value, we get:

d(2) = 1/36
d(3) = 2/36 = 1/18
d(4) = 3/36 = 1/12
d(5) = 4/36 = 1/9
d(6) = 5/36
d(7) = 6/36 = 1/6
d(8) = 5/36
d(9) = 4/36 = 1/9
d(10) = 3/36 = 1/12
d(11) = 2/36 = 1/18
d(12) = 1/36

In the discrete case, the integral is simple summation over the samples. For example, consider the probability of rolling a perfect square. So the event is S = {4, 9}. Then:

P(S) = P({4}) + P({9}) = d(4) + d(9) = 7/36

Notice that the sum of the densities over all of the samples is indeed 1.
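The density table for two dice can be generated by counting rather than by hand. This is a small sketch under the assumption of two fair dice; the names `counts`, `d`, and `P` are chosen here for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count the number of ways to roll each total with two fair dice.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Discrete density d(s) = P({s}) for each sample s in U = {2, ..., 12}.
d = {s: Fraction(n, 36) for s, n in counts.items()}

def P(event):
    """Probability of an event: the sum of the density over its samples."""
    return sum((d[s] for s in event), Fraction(0))

S = {4, 9}           # the event "rolling a perfect square"
print(P(S))          # 7/36
print(P(set(d)))     # 1, the total over the whole universe
```

In the discrete case the "integral" over an event is just this sum, which is why the same function `P` serves for any subset of U.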

So far, we have quickly reviewed probability theory, the underpinnings of making predictions. I expect that for most readers, the material is a review of ideas that you last saw in high school or as an undergraduate. Perhaps seeing it again gives you new appreciation of how well the formalism matches our intuitions about chance. In any case, we have introduced formalism that works, guides calculations, and helps build our intuitions for understanding predictions.

### Random variables

For many prediction problems, you reason about the value of a variable in the presence of uncertainty. The value to be predicted might involve a future quantity, say upcoming election results, or an unknown quantity about which you can only take imprecise or incomplete measurements, such as a measure of some attribute of a population based on measuring a sample. In these cases, even though we might be interested in a single value, we cannot specify that value with complete certainty. Instead, we treat the quantity by describing the possible values that it might take, along with the probability density for each possible value. This is what we mean by a random variable.

More precisely, a random variable b consists of a sample space U of all of the possible values that b might take and a distribution on U for finding the probability of b falling in a subset of U. Just as events can be discrete or continuous, so can random variables.

For example, the random variable that describes the number of sixes expected in four throws of a single die has sample space {0,1,2,3,4}. Its density, d(n), is the probability of getting n sixes for n = 0, 1, 2, 3, 4.
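As a sketch of what this density looks like, the values d(n) can be computed with the binomial formula, assuming a fair die and independent throws (assumptions the text does not state explicitly at this point). The variable names are illustrative.

```python
from fractions import Fraction
from math import comb

p = Fraction(1, 6)   # assumed probability of a six on one throw of a fair die
N = 4                # number of throws

# d(n) = C(N, n) * p^n * (1 - p)^(N - n): the probability of exactly n sixes.
d = {n: comb(N, n) * p**n * (1 - p)**(N - n) for n in range(N + 1)}

for n, prob in d.items():
    print(n, prob, f"{float(prob):.4f}")

print(sum(d.values()))   # 1: the density sums to one over the sample space
```

The most likely outcome is no sixes at all, with probability 625/1296, a little under one half.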

For continuous random variables, U is the real numbers or some interval of real numbers, and the probability distribution is some nonnegative function that behaves well enough that it has an integral (for a more precise definition, see Taboga) and whose integral over U is 1. A familiar example is a Gaussian (also known as normal) random variable: one whose universe is the real numbers and whose distribution is the familiar bell-shaped curve (Montgomery & Runger, 2003). By the way, this is the formula for the Gaussian distribution, where μ is the mean and σ is the standard deviation:

$$d(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

##### Figure 2. A normal distribution

In lots of interesting cases, the prediction problem involves finding a useful distribution for one or more random variables. For example, the time until the next magnitude 6 earthquake in Japan is a continuous random variable. The challenge is to figure out which distribution it has. It is not really known, but it is certainly some form of a long-tail distribution as opposed to a normal one. The distribution in Figure 3 might be more descriptive. In this case, the density describes the probability that an earthquake won't occur in the time period given by the X axis in, say, tens of years.

##### Figure 3. An exponential distribution

## Multidimensional sample spaces and marginal probability

Unlike the outcome of dice throwing, most interesting real-world random variables depend on more than one parameter. For example, the random variable describing an individual's life expectancy depends on age, percent body fat, cholesterol level, family history, participation in risky behaviors, and so on. In this case, the distribution of the random variable will be a function of several parameters, either real or discrete.

Let's work through a simple discrete example: Suppose that the designers of a new diagnostic test want to predict the likelihood of a patient having a certain disease. They run the following experiment:

They choose a large population of individuals to make up the universe of patients, give each individual (sample) the test, note the result, and follow each sample to determine if the disease is present, whether or not it was detected by the test. Each sample has two dimensions: 1) Test result and 2) actual Health. Then our random variable consists of the universe of test subjects and a density function of two parameters, Test Result and Health, where Test Result is either positive or negative, and Health is either afflicted or disease-free. We have four events: Afflicted, Disease-free, Tested positive, Tested negative.

Imagine that from measuring the ratios of observations, they found the following results, summarized as the discrete probability distribution in the following table.

##### Table 1. Likelihood of a patient having a certain disease
| Test result \ Health | Afflicted | Disease-free | Margin |
|---|---|---|---|
| Positive | 0.10 | 0.01 | 0.11 |
| Negative | 0.02 | 0.87 | 0.89 |
| Margin | 0.12 | 0.88 | 1.00 |

That is, they found in the sample population that 10% had the disease and tested positive, 1% were healthy and still tested positive, and so forth.

In this table, two related random variables emerge in the margins, one for Health and one for Test result. They have the same universe, but the distributions are found by adding the rows and columns. Not surprisingly, these are called the marginal random variables, with related marginal distributions. In our example, we see that 88% of the population is free of the disease no matter how they test.
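The margins of Table 1 can be recomputed by summing the joint distribution over one dimension at a time. The sketch below is illustrative only: the dictionary layout and the helper `marginal` are assumptions made here, not something taken from the original study design.

```python
from fractions import Fraction

# Joint distribution from Table 1, keyed by (test result, health).
joint = {
    ("positive", "afflicted"): Fraction("0.10"),
    ("positive", "disease-free"): Fraction("0.01"),
    ("negative", "afflicted"): Fraction("0.02"),
    ("negative", "disease-free"): Fraction("0.87"),
}

def marginal(joint, axis):
    """Sum the joint distribution over every dimension except `axis`."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], Fraction(0)) + p
    return out

test_margin = marginal(joint, 0)     # row sums: distribution of Test result
health_margin = marginal(joint, 1)   # column sums: distribution of Health

print({k: float(v) for k, v in test_margin.items()})
# {'positive': 0.11, 'negative': 0.89}
print({k: float(v) for k, v in health_margin.items()})
# {'afflicted': 0.12, 'disease-free': 0.88}
```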

Generally, marginal random variables have the same set of samples as the original random variable, but they have different distributions, depending on one less of the dimensions. This is the formal definition of the marginal distribution:

Suppose that there is a multidimensional density function f: R^n → [0,1].

Then for each i, 1 ≤ i ≤ n, the marginal distribution f_i is defined by integrating f over the i-th dimension:

$$f_i(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) = \int f(x_1, \ldots, x_n)\,dx_i$$

As we will see in the next section, these marginal distributions are useful in finding conditional probabilities, which are at the heart of predictions.

### Conditional probability

Let's continue the medical test example. If someone tested positive, she would like to know how worried she should be — that is, what are the odds of actually being sick. Conversely, the test designer wants to know how likely it is that a sick person would actually test positive. Both of these are examples of conditional probabilities. Start by creating two events in our universe:

A = {set of samples whose Health = Afflicted}
B = {set of samples whose Test results = Positive}

A positive-testing sample is in event B. Someone in event B wants to know the likelihood that she is also in event A. You might think that is P(A ∩ B), but P(A ∩ B) does not account for the fact that we already know she is in B, so we can restrict our attention there. In that case, we can treat B as the universe, so we want to know the probability measure of the overlap of A and B inside of B. Going back to Figure 1, this is the ratio of the probability of the event A ∩ B to the probability of the event B, leading to the definition of the conditional probability of A given B, P(A|B), as:

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

This definition of conditional probability lies at the heart of prediction. It is worth taking some time to convince yourself that this definition matches your intuition about how conditionality should work. For example, if B were a subset of A, then A ∩ B = B, therefore P(A|B) = P(A ∩ B)/P(B) = 1. This meets our intuition that if all Bs were also As, and a sample is in B, then we can be certain it is also in A. If A and B were disjoint, then P(A|B) = 0, again matching our intuition.

For our test example, going to the table, we have the joint probability P(A, B) = P(A ∩ B) = .1 and the marginal probability P(B) = .11, so the conditional probability P(A|B) = .1/.11 ≈ 91%. Therefore, a person with positive results should be very concerned and get more testing, but it is too soon to panic, because she still has a 9% chance of being healthy.

The reader can check that P(B|A) = .1/.12 = 83%. Thus, the test designer can conclude that his test will detect 83% of those with the disease.
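Both conditional probabilities can be checked directly from the definition P(A|B) = P(A ∩ B)/P(B). This is a minimal sketch using the Table 1 values; the functions `P` and `conditional` are hypothetical helpers introduced for the illustration.

```python
from fractions import Fraction

# Joint distribution from Table 1, keyed by (test result, health).
joint = {
    ("positive", "afflicted"): Fraction("0.10"),
    ("positive", "disease-free"): Fraction("0.01"),
    ("negative", "afflicted"): Fraction("0.02"),
    ("negative", "disease-free"): Fraction("0.87"),
}

def P(event):
    """Probability of an event, given as a set of (test result, health) samples."""
    return sum((p for sample, p in joint.items() if sample in event), Fraction(0))

def conditional(X, Y):
    """P(X|Y) = P(X ∩ Y) / P(Y)."""
    return P(X & Y) / P(Y)

A = {s for s in joint if s[1] == "afflicted"}   # Health = Afflicted
B = {s for s in joint if s[0] == "positive"}    # Test result = Positive

print(conditional(A, B), f"{float(conditional(A, B)):.1%}")   # 10/11 90.9%
print(conditional(B, A), f"{float(conditional(B, A)):.1%}")   # 5/6 83.3%
```

The two numbers answer different questions: the first is what the patient cares about, the second is what the test designer cares about.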

As mentioned above, the ideas of conditional probability and marginal probability are related. Often, as in the example, A and B are based on different dimensions of a multidimensional distribution.

This definition also hides an important rule:
P(A, B) = P(A ∩ B) = P(A|B)P(B)

Some authors define the independence of A and B to mean P(A|B) = P(A). This is logically the same as our earlier definition and helps intuition. It says if A and B are independent, then knowing something about B tells us nothing about A, as I will illustrate next.

### The Monty Hall problem

To learn more about conditional probability, let's try the famous (in some circles) Monty Hall problem: Suppose you are playing "Let's Make a Deal" with Monty and are shown three doors. Monty tells you that behind one of the doors is a car (something you want) and behind the other two doors are goats (something you don't want). The doors are labeled 1, 2, and 3. You are given the opportunity to choose a door, so you make a choice. Monty then opens one of the two doors that you did not choose to reveal a goat. He then asks you to make a second choice: Stay with your original selection or switch to the other door? What should you do? Does it matter? This is a prototype of a kind of prediction model. You are being asked, based on all that you have seen and believe, which door has the car? Which door is the better prediction?

Before we go into the answer, note that most of us get it wrong at first. It is hard to see how Monty opening a door changes the one-in-three odds of the car being behind any door. It turns out that a key insight is that Monty knows which door has the car and chooses to show you a goat. That important observation affects the odds.

The problem has symmetry; it is invariant with respect to whatever door you initially choose. Assume that you chose Door 1. We need to define the universe of samples. In this case, there are two dimensions to a sample. A sample consists of the pair (DC, DM) where

DC is the door with the car

DM is the door that Monty chooses

However, you believe that Monty is following some strict logic:

• He never opens the door that you choose.
• He never opens the door with the car.

So we define U to be {(1, 2), (1, 3), (2, 3), (3, 2)}.

Let DCi be the event that car is behind Door i, for i = 1, 2, 3. Then:

DC1 = {(1, 2), (1, 3)}
DC2 = {(2, 3)}
DC3 = {(3, 2)}

Also, let DMj, for j = 2, 3, be the event that Monty chooses to show Door j. Then:

DM2 ={(1, 2), (3, 2)}
DM3 = {(1, 3), (2, 3)}

As far as we know, each door is equally likely, so we initially set:

P(DC1) = P(DC2) = P(DC3) = 1/3

Let's assume that Monty opens Door 2. What we need, then, is the conditional probability that the car is behind Door 1, given that Monty opens Door 2, P(DC1|DM2), and also the conditional probability that the car is behind Door 3, P(DC3|DM2).

From the definition of conditional probabilities,

P(DC1|DM2) = P(DC1,DM2)/P(DM2).

So from simple algebra:

P(DC1|DM2)P(DM2) = P(DC1,DM2) = P(DM2,DC1) = P(DM2|DC1)P(DC1)

We get:

P(DC1|DM2) = (P(DM2|DC1)P(DC1))/P(DM2)

And similarly:

P(DC3|DM2) = (P(DM2|DC3)P(DC3))/P(DM2)

If the car is behind Door 1, Monty can choose either Door 2 or Door 3. As far as we know, each is equally likely, so we set:

P(DM2|DC1) = 1/2

And by substitution, we get:

P(DC1|DM2) = (1/2)(1/3)/P(DM2) =(1/6)/P(DM2)

If the car is behind Door 3, Monty must choose Door 2. So P(DM2|DC3) = 1 and

P(DC3|DM2) = (1/3)/P(DM2)

Which means that:

P(DC3|DM2) = 2P(DC1|DM2)

You double your chances by switching doors! The car being behind Door 3 is the better prediction. You can try this at home using cards with friends.
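If you have no cards or friends at hand, a simulation serves the same purpose. The sketch below assumes, as in the argument above, that you always pick Door 1, that the car is equally likely to be behind each door, and that Monty picks at random when he has a choice. It keeps only the games in which Monty opens Door 2 and estimates P(DC1|DM2) and P(DC3|DM2). The seed and trial count are arbitrary choices.

```python
import random
from collections import Counter

rng = random.Random(0)   # fixed seed so that the run is repeatable
trials = 100_000
tally = Counter()        # car location, counted only when Monty opens Door 2

for _ in range(trials):
    car = rng.choice([1, 2, 3])
    # You chose Door 1. Monty opens Door 2 or Door 3, never the one with the car.
    opened = rng.choice([d for d in (2, 3) if d != car])
    if opened == 2:
        tally[car] += 1

total = sum(tally.values())
p_dc1 = tally[1] / total   # estimate of P(DC1|DM2): staying wins
p_dc3 = tally[3] / total   # estimate of P(DC3|DM2): switching wins
print(f"stay: {p_dc1:.3f}, switch: {p_dc3:.3f}")
```

The estimates should come out near 1/3 and 2/3. That agrees with the algebra: since P(DM2) = (1/2)(1/3) + 0 + (1)(1/3) = 1/2, the exact values are P(DC1|DM2) = (1/6)/(1/2) = 1/3 and P(DC3|DM2) = (1/3)/(1/2) = 2/3.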

This example has some very important features.

• The argument uses all of the implicit beliefs and explicit information of the problem. It precisely captures the narrative.
• In particular, we need to make an initial assumption about the probability of the car being behind a given door. In this case, we assume that P(DC1) = P(DC2)= P(DC3) = 1/3.

Lots of prediction problems follow this pattern: We have some initial beliefs about the probabilities of events, and then some new information comes along (such as Monty showing us the goat). Using that information, we update the probabilities. Using the argument given in the Monty Hall problem, we get the famous and critically important Bayes' theorem, which is what Nate Silver means when he refers to "Bayes reasoning."

We will get back to this. This example is well worth taking the time to fully understand.

## Bayes' theorem

Here is the famous theorem:

Given two events A and B in some universe:

$$P(B|A) = \frac{P(A|B)\,P(B)}{P(A)}$$

I'll leave the proof to the reader (you can find it in the Monty Hall example). It is a trivial consequence of the definition of conditional probabilities.

This theorem is fundamental because it gives the relationship between the elements of inductive reasoning: We are interested in the probability of B given the evidence A. In applying the theorem, P(B) is called the prior probability (or just prior), and P(B|A) is called the posterior probability. The prior is the probability before we have the evidence A, and the posterior is the probability after accounting for the evidence. So this theorem tells us how to update our beliefs based on the evidence.

It is truly remarkable that such a simple theorem can raise such a fuss. In fact, this theorem has led to hundreds of years of controversy (McGrayne, 2012). Understanding the controversy might help you appreciate the art of prediction. The reason for the controversy has nothing to do with the math, but with how the theorem might be applied and its philosophical implications for reasoning about predictions and knowledge.

It seems that the prior P(B) term gave some influential academics fits. They rejected the idea that belief should be based on anything more than the evidence. If P(B) is made up, why should I have any confidence in P(B|A) using the formula? One answer is that we might start with what seems to be a best choice of priors that reflects what we know or believe. In the Monty Hall example, we had no reason to expect any bias regarding which door hid the car, so we simply assumed that the priors were equal. If we had known that the folks who set up the game choose Door 2 twice as often as the others, we might have set our priors as P(DC1) = .25, P(DC2) = .5, P(DC3) = .25. Some readers might want to redo the Monty Hall problem with these priors.
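For readers who want to check their own answer, here is one way the Monty Hall update could be coded with arbitrary priors. It is a sketch under the same assumptions as before (Monty never opens your door or the car's door, and chooses uniformly when he has two options); the function name `monty_posterior` and its parameters are invented for this illustration.

```python
from fractions import Fraction

def monty_posterior(prior, opened, pick=1):
    """Posterior P(DCi | Monty opens `opened`) for a contestant who chose `pick`."""
    doors = [1, 2, 3]
    numer = {}
    for car in doors:
        # Doors Monty is allowed to open if the car is behind `car`.
        allowed = [d for d in doors if d != pick and d != car]
        # Likelihood P(DM_opened | DC_car), assuming a uniform choice among allowed doors.
        like = Fraction(1, len(allowed)) if opened in allowed else Fraction(0)
        numer[car] = like * prior[car]
    evidence = sum(numer.values())   # P(DM_opened), the denominator in Bayes' theorem
    return {car: n / evidence for car, n in numer.items()}

uniform = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}
skewed = {1: Fraction(1, 4), 2: Fraction(1, 2), 3: Fraction(1, 4)}

print(monty_posterior(uniform, opened=2))   # Door 1: 1/3, Door 3: 2/3
print(monty_posterior(skewed, opened=2))    # Door 1: 1/3, Door 3: 2/3
print(monty_posterior(skewed, opened=3))    # Door 1: 1/5, Door 2: 4/5
```

With the skewed priors, the evidence matters more when Monty opens Door 3: the door that was already favored becomes even more likely to hide the car.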

More to the point, our priors might reflect some sort of expert knowledge that lives outside the data. For example, in software development project estimation, the various parameterized models give you priors. Bayes' theorem gives you the opportunity to capture that expert knowledge and feed it into the process.

Notice that we applied Bayes' theorem in the Monty Hall problem. Can you find where? Often in reasoning about conditional probabilities, the theorem is used implicitly. In practice, you do not need to identify where you used the theorem to do Bayesian analysis. You can simply adopt the mindset about how the probabilities change with the evidence.

Finally, when we can gather enough evidence, eventually the impact of the evidence overcomes the erroneous prior assumptions. Silver calls this phenomenon "Bayesian convergence." It means that you and I might have very different prejudices about how the climate works, but these are overcome as the evidence for climate change mounts (Silver, 2012).

Coming up with good Bayesian estimates is tricky. There are lots of technical books about it, such as Doing Bayesian Data Analysis (Kruschke, 2011) or Data Analysis: A Bayesian Tutorial (Sivia and Skilling, 2006).

Here is a simple example that shows how both random variables and Bayes' theorem might be used.

### The biased coin example

Suppose we are given a coin and told that it might be weighted to favor heads or tails. More specifically, the coin has a bias 0 ≤ b ≤ 1, with b as the expected percentage of heads. A fair coin has b = 0.5. We are told to flip the coin and track the number of heads and tails and report b. The naïve approach might be to flip the coin 100 times and report b as the percentage of heads. If we get 90 heads, we might report b = 0.9. However, with the evidence of 90 heads in 100 flips, we are not sure. Even with a fair coin, there is some low probability of getting 90 heads. Applying Bayesian reasoning, the probability that the coin is fair after getting 90 heads is also low, but not zero.

To estimate b, then, we will treat it as a random variable. We want to find its distribution, b(x). In this case, the sample space is the interval [0, 1], and an event is a subset of the interval. We are interested in the probability that r < b < s. We apply the formalism of conditional probability and Bayes' theorem. To get started, we need to extend our universe to include flips. We will let U consist of triplets {(k, N, b), 0 ≤ k ≤ N and 0 ≤ b ≤ 1} with N the number of flips, k the number of heads, and b the bias.

Using binomial distributions, we know (Montgomery & Runger, 2003) that the probability of getting k heads out of N throws with a coin with bias b is:

$$P((k, N) \mid b) = \binom{N}{k}\, b^{k} (1 - b)^{N - k} \tag{1}$$

If we know b, we can restrict the universe to the subset of samples with that b; holding b constant gives the conditional probability P({(k, N)} | b).
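As a quick check on the binomial probability described above, here is a minimal Python sketch. The function name `binomial_pmf` is my own choice for illustration; the only library call is `math.comb` from the standard library (Python 3.8 or later).

```python
import math

def binomial_pmf(k: int, n_flips: int, b: float) -> float:
    """Probability of exactly k heads in n_flips flips of a coin with bias b."""
    return math.comb(n_flips, k) * b**k * (1.0 - b) ** (n_flips - k)

# Exactly 90 heads in 100 flips of a fair coin: tiny, but not zero.
p_fair = binomial_pmf(90, 100, 0.5)

# The same evidence is far more probable if the bias really is 0.9.
p_biased = binomial_pmf(90, 100, 0.9)

print(f"P(90 heads in 100 | b = 0.5) = {p_fair:.3e}")
print(f"P(90 heads in 100 | b = 0.9) = {p_biased:.3e}")
```

The fair-coin probability comes out on the order of 10^-17, which is the "low but not zero" probability mentioned earlier.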

Let's now suppose that we flip the coin N times and get k heads. Note that the evidence (k, N) does not uniquely determine b, as the pair might occur with some probability for any b between 0 and 1. We are looking for each value of b, and its probability given the evidence of k heads in N flips.

To do that, we can use Bayes' theorem to find P(B|(k, N)), where (k, N) is the event of throwing k heads in N throws and B is some event in [0,1] to get:

$$P(B \mid (k, N)) = \frac{P((k, N) \mid B)\, P(B)}{P((k, N))} \tag{2}$$

This formula can be programmed easily by using, say, Microsoft Excel spreadsheet software. To use Excel, we turn this into a discrete problem by setting $b_n = n \cdot 10^{-m}$ for n = 1 to $10^m$ and calculate 2) using 1) and some choice of prior distribution for b.

To get started, we need some choice of prior distribution. A standard approach is to reason that we know nothing; therefore, all guesses are equally valid. That gives the uniform distribution and $P(\{b_i\}) = 10^{-m}$ for all i (see Figure 4). Or we could reason that most coins are fair, so our coin is likely to be fair, but we are not sure. In that case, we could set the prior distribution of b to be a normal distribution with a wide enough standard deviation to account for our uncertainty. As we will see, both priors converge to the same answer with enough evidence.

##### Figure 4. Uniform prior

Finally, notice that the denominator P((k, N)) is the same for each $b_i$, so we can ignore it, looking at relative values (say, the shape of a graph). Or we can use the following fact to find the denominator:

$$P((k, N)) = \sum_i P((k, N) \mid b_i)\, P(b_i)$$

In this case, we first calculate the numerator of Bayes' formula for each $b_i$, and we add the values to get the denominator.
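The article does this calculation in a spreadsheet; the same discrete recipe can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names `make_grid`, `log_likelihood`, and `posterior` are hypothetical, the prior is assumed to be strictly positive on the grid, and the work is done with logarithms (my choice, not the article's) so that large numbers of flips do not underflow.

```python
import math

def make_grid(m):
    """Discretize the bias as b_n = n / 10^m for n = 1 .. 10^m."""
    steps = 10**m
    return [n / steps for n in range(1, steps + 1)]

def log_likelihood(k, n_flips, b):
    """Log of b^k (1 - b)^(N - k). The binomial coefficient is omitted
    because it is the same for every b and cancels on normalization."""
    if b == 1.0 and k < n_flips:
        return float("-inf")  # a coin that always lands heads cannot show a tail
    result = k * math.log(b)
    if k < n_flips:
        result += (n_flips - k) * math.log(1.0 - b)
    return result

def posterior(k, n_flips, grid, prior):
    """Numerators are likelihood times prior; dividing by their sum
    plays the role of the denominator P((k, N))."""
    log_w = [log_likelihood(k, n_flips, b) + math.log(p)
             for b, p in zip(grid, prior)]
    peak = max(log_w)
    weights = [math.exp(v - peak) for v in log_w]
    total = sum(weights)
    return [w / total for w in weights]

grid = make_grid(2)                      # b = 0.01, 0.02, ..., 1.00
uniform_prior = [1.0 / len(grid)] * len(grid)
post = posterior(9, 10, grid, uniform_prior)

most_probable_b = grid[post.index(max(post))]
mass_below_04 = sum(p for b, p in zip(grid, post) if b < 0.4)
print(f"most probable b: {most_probable_b}")
print(f"P(b < 0.4 | 9 heads in 10): {mass_below_04:.4f}")
```

Plotting `post` against `grid` should give the kind of curve shown in the posterior figures that follow.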

Let's see how this works in practice. Following are some examples of finding the distribution for b for different values of k, N. In each case, 90% of the throws were heads. In the first case, we assume that the prior estimate is uniform.

Notice what happens when the evidence mounts. Suppose that we flip the coin 10 times and get 9 heads. Figure 5 shows the posterior distribution of b. Notice that although it peaks at .9, there is still a significant spread. It could be as low as .45. On the other hand, we are now reasonably certain that the bias is not less than .4.

##### Figure 5. A posterior estimate for the biased coin problem with 9 heads out of 10 flips

Now suppose we flip the coin 100 times and get 90 heads. Figure 6 shows the resulting posterior distribution.

##### Figure 6. A posterior estimate for the biased coin problem with 90 heads out of 100 flips

Figure 7 and Figure 8 show that the distribution narrows further with more evidence. With 900 heads out of 1,000 throws, we are justified in believing that .87 < b < .93.

##### Figure 7. A posterior estimate for the biased coin problem with 450 heads out of 500 flips

##### Figure 8. A posterior estimate for the biased coin problem with 900 heads out of 1,000 flips

Notice that there is not much difference in the distributions going from 500 to 1,000 flips. For fun, we could look at what happens at 10,000 flips.
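The narrowing can also be checked numerically. The following sketch, again with hypothetical names and a uniform prior, adds up the posterior probability that .87 < b < .93 for each of the experiments discussed here. The grid resolution (m = 3) and the use of logarithms are my assumptions, not something the article specifies.

```python
import math

def uniform_posterior(k, n_flips, m=3):
    """Posterior over b_n = n / 10^m under a uniform prior, in log space."""
    steps = 10**m
    # b = 1 is left out: it has zero likelihood once any tail has been seen.
    grid = [n / steps for n in range(1, steps)]
    log_w = [k * math.log(b) + (n_flips - k) * math.log(1.0 - b) for b in grid]
    peak = max(log_w)
    weights = [math.exp(v - peak) for v in log_w]
    total = sum(weights)
    return grid, [w / total for w in weights]

def interval_mass(grid, post, low, high):
    """Posterior probability that low < b < high."""
    return sum(p for b, p in zip(grid, post) if low < b < high)

masses = {}
for k, n_flips in [(9, 10), (90, 100), (450, 500), (900, 1000), (9000, 10000)]:
    grid, post = uniform_posterior(k, n_flips)
    masses[n_flips] = interval_mass(grid, post, 0.87, 0.93)
    print(f"{k:>5} heads in {n_flips:>6} flips: "
          f"P(.87 < b < .93) = {masses[n_flips]:.4f}")
```

The printed probabilities should climb toward 1 as the number of flips grows, which is the numerical counterpart of the curves getting narrower.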

##### Figure 9. A posterior estimate for the biased coin problem with 9,000 heads out of 10,000 flips

With 9,000 heads out of 10,000 flips, we can be very certain that the bias is 0.9.

Suppose instead that our prior belief was that the coin was likely to be fair, b = .5. The key word is likely. We are not certain, so we start by capturing our belief, including our uncertainty, with a normal distribution (also known as a Gaussian distribution). We will use a normal distribution with a mean equal to .5, our best guess, and a standard deviation of .1 to reflect our uncertainty. In this case, we will use the distribution in Figure 2.

Suppose that we get 9 heads in 10 throws as before. Then, as Figure 10 shows, we find that our prior belief — although within the realm of possibility — is looking improbable, because we find P((.45, .55)|(9,10)) is less than .25.

##### Figure 10. A posterior estimate for the biased coin problem with 9 heads out of 10 flips, using a normal prior

Maintaining this ratio with more throws (Figures 11-13), the distribution converges to the same distribution that we got with the uniform prior.

##### Figure 11. A posterior estimate for the biased coin problem with 90 heads out of 100 flips, using a normal prior

##### Figure 12. A posterior estimate for the biased coin problem with 450 heads out of 500 flips, using a normal prior

##### Figure 13. A posterior estimate for the biased coin problem with 900 heads out of 1,000 flips, using a normal prior

The convergence of the distributions as the evidence mounts is, as I noted earlier, what Silver calls "Bayesian convergence." This is a very real-world phenomenon. The idea is that no matter what your initial prejudice might be, with enough evidence, the truth becomes evident.
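To make the convergence concrete, here is a hedged sketch that runs the same grid calculation under both priors and compares where the posteriors peak. The function names are hypothetical, and the normal prior (mean .5, standard deviation .1) is simply evaluated on the grid and renormalized, which quietly truncates it to the interval between 0 and 1; that truncation is my simplification.

```python
import math

def log_uniform_prior(b):
    return 0.0  # constant: every grid point is equally likely

def log_normal_prior(b, mean=0.5, sd=0.1):
    # Unnormalized log density; the constant cancels on normalization.
    return -0.5 * ((b - mean) / sd) ** 2

def grid_posterior(k, n_flips, log_prior, m=3):
    steps = 10**m
    # b = 1 is left out: it has zero likelihood once any tail has been seen.
    grid = [n / steps for n in range(1, steps)]
    log_w = [k * math.log(b) + (n_flips - k) * math.log(1.0 - b) + log_prior(b)
             for b in grid]
    peak = max(log_w)
    weights = [math.exp(v - peak) for v in log_w]
    total = sum(weights)
    return grid, [w / total for w in weights]

def mode(grid, post):
    """Grid value with the highest posterior probability."""
    return grid[max(range(len(post)), key=post.__getitem__)]

# How much belief in a roughly fair coin survives 9 heads in 10 flips?
grid, post = grid_posterior(9, 10, log_normal_prior)
fair_mass = sum(p for b, p in zip(grid, post) if 0.45 < b < 0.55)
print(f"P(.45 < b < .55 | 9 heads in 10, normal prior) = {fair_mass:.3f}")

# Distance between the peaks of the two posteriors as evidence mounts.
gaps = {}
for k, n_flips in [(9, 10), (90, 100), (450, 500), (900, 1000)]:
    g, flat = grid_posterior(k, n_flips, log_uniform_prior)
    _, bell = grid_posterior(k, n_flips, log_normal_prior)
    gaps[n_flips] = abs(mode(g, flat) - mode(g, bell))
    print(f"{n_flips:>5} flips: peaks differ by {gaps[n_flips]:.3f}")
```

The gap between the two peaks should shrink steadily as flips are added, which is Bayesian convergence in miniature.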

### Inductive reasoning, Bayesian reasoning

There is a definite Bayesian perspective that underlies the ability to make good predictions. Silver makes this point and, in fact, mentions Bayes exactly 100 times in the book.

It starts with the philosophical question of "What do we mean by a distribution, anyhow?" There are two common answers:

Frequentist
A distribution is what you get by taking repeated measurements from a controlled experiment. In practice, what you get from the measurements is a histogram that gives, for each measured value, the number of samples (the frequencies) that had that measurement. Then for each value, dividing the frequency by the total number of samples gives you the proportion of the measurements. That set of proportions is the distribution. The Frequentists go on to apply various statistics to find quantities, such as the most likely value of what is being measured.

Bayesian
There are some quantities about which we are uncertain, and there is no way to take repeated measures. Bayesians allow that such quantities might also be reasoned about by using random variables. Some examples include the likelihood of an earthquake of a given magnitude in a given timeframe, whether two planes will hit each other, or when a software development project will end.

The Bayesian answer allows us to reason about all sorts of things that the Frequentists would reject as being unsuitable for study. Equally important is that Bayesian methods allow us to draw conclusions for smaller amounts of data or when the data is hard to come by. For example, consider the problem of predicting the frequency of floods in a given region for setting insurance rates. The Frequentist would review years of historical records and count the number of storms by magnitude and build a table of "10-year storms" and "20-year storms," etc., which would be used to predict the frequency of storms in the future. A Bayesian approach might start by predicting the distribution of the frequency of storms of different magnitudes, based on the expected rise of temperatures. This distribution could be refined by studying historical data along with new data that comes year by year.

The Frequentist approach would give a poorer prediction than the Bayesian, although the Bayesian approach is much harder to do. Even so, I believe that the insurance companies are adopting the Bayesian approach in setting property damage policy fees. The same idea works for earthquakes: tectonic plates move, making the historical data less than predictive. The same ideas also apply to nuclear power plant failures (see my blog entry titled Estimating Nuclear Safety Is Really Hard).

Even though there are significant practical reasons for taking the Bayesian perspective (McGrayne, 2012), the Frequentists dominated the 20th century. Your stat class almost certainly consisted of Frequentist techniques, but the 21st century is seeing a resurgence of Bayesian methods (Kruschke, 2011 and McGrayne, 2012). This is a good thing because, as I discuss in the next section, Bayesian reasoning is the basis of good science and predictions in general (Jaynes, 2003 and Silver, 2012).

Let's get back to the problem of predicting on-time delivery of a software project. A Frequentist approach would be to treat our project as one of a population of development projects with similar attributes (number of story points, lines of code, complexity, number of staff members, and so on) and use the past data on how long it took similar projects to deliver. In the end, the Frequentist comes up with a "best estimate" — a point value — and then decides whether the project is a go or no go. The Bayesian starts with the perspective that the delivery date will be a random variable until the project ships. We then can use the team's experts to input the prior assumptions, using triangular distributions based on the best case, worst case, and expected case assumptions. (See Hubbard, 2010, and my blog entry, Normal and Triangular Distributions.) Then we can use the evidence of the project velocity and work item completion to get the posteriors.

Note that there are real differences between the approaches. As an example, in software development, we often talk of the "cone of uncertainty" (McConnell, 2006), which notionally shows that various software project parameters start out vague and should become more certain throughout the lifecycle.

• In the Frequentist approach, the goal of the estimator is to get the actual date correct at the beginning of the project. The cone of uncertainty, then, is the set of the differences of the estimated date and the real date, found at the end of the effort.
• In the Bayesian approach, there is no actual date to be estimated at the beginning of the effort, just the distribution of the time to complete the project. In this perspective, the cone of uncertainty is the width of the distribution of the random variable of the time to complete.

Generally, to be a good Bayesian, you must put the concept of "actual" out of your mind until the completion, and see all values as random variables.

### Inductive reasoning and predictions

In deductive reasoning, we learn that if A implies B, and A is true, then B is true. As we all know, mathematics is testimony to the power of deduction. Mathematicians write down a set of axioms for a branch of mathematics and see where deduction takes them. In deductive reasoning, it is also true that if A implies B, and B is false, then so is A. So, as long as B is true, there is some possibility that A is true. The question for scientists is to determine how likely that possibility is. Now suppose that A implies B1, A implies B2, …, A implies Bi, and B1, …, Bi are true or at least highly probable. This is equivalent to saying that P(B1, …, Bi|A) is high. Using Bayes' theorem, we can find how probable A is given B1, …, Bi. In other words, we can find P(A|B1, …, Bi). This is inductive reasoning.
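A small numerical sketch may help show how confirmed consequences raise the probability of A. Everything specific here is hypothetical: the prior of 0.1, the 50% chance that each consequence would have turned out true even if A were false, and the assumption that the consequences are independent of one another when A is false. Because A implies each B, P(B|A) = 1, and Bayes' theorem reduces to the one-line update below.

```python
def update_on_confirmation(p_a: float, p_b_given_not_a: float) -> float:
    """Posterior P(A | B) when A implies B, so that P(B | A) = 1.

    Bayes' theorem: P(A|B) = P(B|A) P(A) / (P(B|A) P(A) + P(B|not A) P(not A)).
    """
    return p_a / (p_a + p_b_given_not_a * (1.0 - p_a))

p_a = 0.1                 # hypothetical prior belief in theory A
history = [p_a]
for _ in range(5):        # five predicted consequences B1..B5 all come true
    # Hypothetical: each B had a 50% chance of being true even if A were false.
    p_a = update_on_confirmation(p_a, 0.5)
    history.append(p_a)

print(" -> ".join(f"{p:.3f}" for p in history))
```

With these made-up numbers, each confirmation doubles the odds in favor of A, so five confirmations take the probability from 0.1 to 32/41, or about 0.78. No single confirmation proves A, but together they make it much more credible.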

Inductive reasoning is built on Bayes' theorem, which is why the theorem is so important. In the world of the Bayesian perspective (what Silver calls "Bayesland"), there are no complete certainties outside of the formalism of pure math, just more or less uncertainty measured by the statistics on some random variable.

This is in essence how prediction works. We use inductive reasoning to develop a theory based on evidence. When we have the theory, we use deductive reasoning to make predictions based on the theory. That is, if we determine that A is highly likely, and if A implies C, then we should also have confidence in C. For example, think about the biased coin example. Based on the evidence, we develop a theory of the coin bias. With that theory, we can make predictions going forward as to the ratio of heads and tails. This is the fundamental idea behind the various machine-learning algorithms.
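For the coin, the step from theory to prediction can be sketched as follows. Once we have a posterior for b, the probability that the next flip is heads is the posterior-weighted average of b. The function name `predictive_heads`, the uniform prior, and the grid resolution are my assumptions for illustration.

```python
import math

def predictive_heads(k, n_flips, m=3):
    """P(next flip is heads | k heads in n_flips), uniform prior on a grid."""
    steps = 10**m
    # b = 1 is left out: it has zero likelihood once any tail has been seen.
    grid = [n / steps for n in range(1, steps)]
    log_w = [k * math.log(b) + (n_flips - k) * math.log(1.0 - b) for b in grid]
    peak = max(log_w)
    weights = [math.exp(v - peak) for v in log_w]
    total = sum(weights)
    # Average the bias over the posterior distribution.
    return sum(b * w for b, w in zip(grid, weights)) / total

p_next = predictive_heads(90, 100)
print(f"P(next flip is heads | 90 heads in 100) = {p_next:.4f}")
```

For a uniform prior, the exact answer is known in closed form as (k + 1)/(N + 2), Laplace's rule of succession, which gives 91/102 here; the grid calculation should land very close to that value.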

Note that inductive reasoning is the core of the scientific method (Jaynes, 2003). Take climate change, for example. Based on models of atmospheric chemistry, scientists developed climate models (that is, theories) that use deductive reasoning to predict changes in the weather: If climate change were occurring, we would expect certain trends in global temperature readings, more variation in local temperatures, and more severe weather events. As each of these events has occurred, they have provided evidence for climate change that raised confidence in the theory.

Actually, the same works for Newton's theory of gravity (for human scale physics). The evidence for it was the precise measurements of the planetary orbits and a whole bunch of experiments. It is used now to derive how to steer probes to Mars. There is so much evidence that we simply take it as fact. Yes, gravity and climate change are both theories. One just has more evidence than the other, so we are more confident in one than the other. The difference is a matter of degree, not kind.

## Summary

Let's summarize all of this:

• First and foremost, predictions are based on understanding the likelihood of what might occur in the future. Hence, making predictions is an exercise in applied probability.
• How to legitimately apply probability theory has been a source of bitter argument for centuries between the Frequentist and Bayesian schools. Today, it is widely recognized that the Frequentist approach is too limited and the Bayesian approach is more useful.
• The Bayesian perspective on how to apply probability is much broader than applying Bayes' theorem, which is in fact a trivial consequence of the definition of conditional probability. The Bayesian perspective allows you to reason about quantities about which you are uncertain and that do not necessarily arise from repeated sampling of a defined population. Most modern predictive techniques, such as machine learning, are Bayesian in spirit.
• The practice of prediction starts with some initial theory of how a system might perform. The theory will have some uncertain parameters that are described by random variables. The theory could be scientific, raw expert opinion, or the output of some neutral exploration of data. The prediction is some sort of measurement of the behavior of the system. We use inductive reasoning to determine to what extent we should believe the theory. If the theory implies something that turns out to be false, then the theory is invalid and is discarded. If the theory implies a bunch of facts that are true, then we use inductive reasoning, based on conditional probabilities, to gain confidence in the theory. In the course of this inductive reasoning, we use actual measurements of the system behavior to update the parameters as random variables. As the theory becomes more probable and the parameters have narrower variances, it can be used to make high-confidence predictions.

The biased coin example is a very simple instance. The system involves flipping a coin to see how many heads and tails result. The theory is that the system behaves according to a binomial distribution with a parameter, b. We treated b as a random variable and applied explicit Bayesian refinement methods (going from prior to posterior distributions) to confirm that the theory applies and to get tight ranges on b. With this understanding of the behavior of flipping the coin, we can with confidence predict how the coin will behave in the future and maybe win lots of bar bets.

We can do the same with software project management. As discussed above, we might want to predict when a project is likely to be complete. In this case, the random variable would be time-to-complete. There are two kinds of evidence that we can use:

• Duration of previous similar projects
• The duration, dependencies, and status of project tasks

Although Bayesian techniques can be applied to either, the second class of evidence arguably accounts better for the performance of the given team working on the specific software content. So we start with a process model (agile, for example) of how the team will execute. This process will have a set of input parameters, including the number and initial (prior) estimates of the effort of the planned features. These can be used with simulation techniques to give us the distribution of time-to-complete. This is the deductive step. As the work becomes complete and the tasks are better understood, this information serves as evidence to carry out Bayesian refinement and update the distribution to get, hopefully, a narrower prediction. This is the inductive step.
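As a rough illustration of the simulation step, here is a minimal Monte Carlo sketch. The task names and the best-case, expected, and worst-case figures are invented for the example, the tasks are assumed to run one after another and to be independent, and `simulate_total` and `percentile` are hypothetical helpers. Replacing an estimate with an observed actual, as done at the end, is only a crude stand-in for the Bayesian refinement described above, not a full implementation of it.

```python
import random

# Hypothetical remaining work: (best case, expected case, worst case) in days.
TASKS = {"login": (2, 4, 8), "search": (1, 2, 5), "reports": (3, 5, 10)}

def simulate_total(tasks, actuals=None, runs=10_000, seed=1):
    """Sample time-to-complete by drawing each unfinished task from a
    triangular distribution; finished tasks contribute their actual duration."""
    rng = random.Random(seed)
    actuals = actuals or {}
    totals = []
    for _ in range(runs):
        total = 0.0
        for name, (best, expected, worst) in tasks.items():
            if name in actuals:
                total += actuals[name]
            else:
                # random.triangular takes (low, high, mode).
                total += rng.triangular(best, worst, expected)
        totals.append(total)
    return sorted(totals)

def percentile(sorted_values, q):
    index = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[index]

before = simulate_total(TASKS)
after = simulate_total(TASKS, actuals={"reports": 6.0})  # one task now done

for label, totals in [("at kickoff", before), ("after 'reports' ships", after)]:
    p10, p50, p90 = (percentile(totals, q) for q in (0.1, 0.5, 0.9))
    print(f"{label}: 10% {p10:.1f}, median {p50:.1f}, 90% {p90:.1f} days")
```

The spread between the 10th and 90th percentiles plays the role of the cone of uncertainty: it should shrink as tasks are completed and their durations stop being random variables.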

Therefore, predictive methods require both deductive and inductive reasoning. And inductive reasoning requires probability and Bayesian reasoning. No wonder people generally reason poorly about predictions. These ideas are far from obvious and took centuries to evolve.

In the end, the methods described above underlie how we understand the real world. They are the basis of scientific reasoning, of drawing conclusions from market and operational data, and of decision making in the face of uncertainty, and they should be the basis for public policy.

## Acknowledgements

I found explaining this material with the desired balance of rigor and intuition to be a challenge. I would not have gotten to this point in that quest without the help of patient readers of the previous drafts. In particular, I would like to thank Tim Klinger, Evelyn Duesterwald, Peri Tarr, and Walker Royce for their patient reading of the early drafts. I especially want to thank Michael Perrow for his patient and careful editing of this article.

## References

Hubbard, D. W. How to Measure Anything: Finding the Value of Intangibles in Business. J. W. Wiley, 2010.

Jaynes, E. T. Probability Theory: The Logic of Science. Cambridge University Press, 2003.

Kruschke, J. K. Doing Bayesian Data Analysis. Academic Press, 2011.

McConnell, S. Software Estimation: Demystifying the Black Art. Microsoft Press, 2006.

McGrayne, S. B. The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy. Yale University Press, 2012.

Montgomery, D. C., and Runger, G. C. Applied Statistics and Probability for Engineers. J. W. Wiley, 2003.

Press, S. J., and Tanur, J. M. The Subjectivity of Scientists and the Bayesian Approach. J. W. Wiley, 2001.

Silver, N. The Signal and the Noise: Why So Many Predictions Fail — but Some Don't. The Penguin Press, 2012.

Sivia, D. and Skilling, J. Data Analysis: A Bayesian Tutorial. Oxford University Press, 2006.

Taboga, M. Lectures on Probability Theory and Mathematical Statistics (2nd Edition). CreateSpace Independent Publishing Platform, 2012.