In my previous couple of blog entries, I used triangular distributions for examples. For many who suffered through (or maybe enjoyed) their stat classes (what are the odds?), this might be a surprising choice. They were taught the default choice would be a Gaussian distribution. For those more attuned with modern business analytics, they are likely to be familiar with triangular distributions. In this entry, I'll briefly the reasoning beyond each of them.
First, as you hopefully recall, both are distributions associate with random variables (Those who don't recall migh benefit from the series of tutorials at The Khan Academy
site). Each are non-negative functions with integral (area under the curve) one. (There are fancier mathematical definitions, but no matter.) Each describes the likelihood of each of set of possible outcomes of some random variable. The difference in shape between Gaussian (aka Normal) and triangular distributions reflects the nature and use of the random variables.
Briefly, normal distributions
are often arise as the histogram
of a set of measurements. They have some central value (called the mean) and some dispersion (called standard deviation) around the mean. Anyone who took a stat class studied these distributions. They show up in a many contexts:
- The distribution resulting from tabulating the histogram of repeated, but imprecise measures of some quantity and then divided the entries by the sum of the measures is often assumed to be normal. The mean of the distribution is the estimator of the actual measure.
Statisticians like the normal distribution for several reasons. First, it is easy to parameterize. If you know the mean. mu (μ), and the standard deviation, sigma (σ), you have completely characterized the distribution. For example, the likelihood of a measurement occurring is often characterized as being within some number of σ's from the mean. Figure 1 shows how this works.
The likelihood of a value falling in a range is given by the area under the curve. For example, the probability of a value of the normally distributed random variable falling within one standard deviation of the mean is 68.2%.
Normal distributions have one really cool feature called the Central Limit Theorem
, which states that under remarkably general conditions, the sum of a set of random variables will be close to normal. Notice, in the previous blog entry, when we added two triangular random variables, the sum appeared smooth and in fact started to look normal.
All that said, I do have have a pet peeve. Normal distributions are overused. Most things in nature and economics are not normally distributed. For example, as as documented in Wikipedia
, these phenomena are nowhere near normal, but are closer to a Pareto distribution:
- The sizes of human settlements (few cities, many hamlets/villages)
- File size distribution of Internet traffic which uses the TCP protocol (many smaller files, few larger ones)
- Hard disk drive error rates
- The values of oil reserves in oil fields (a few large fields, many small fields)
- The length distribution in jobs assigned supercomputers (a few large ones, many small ones)
- The standardized price returns on individual stocks
- Fitted cumulative Pareto distribution to maximum one-day rainfalls
- Sizes of sand particles
- Sizes of meteorites
- Areas burnt in forest fires
- Severity of large casualty losses for certain lines of business such as general liability, commercial auto, and workers compensation.
Getting back to our topic, let's turn to triangular distributions. They are not used to describe a set of measured outcomes from an experiment. They are used to describe what we know or believe about some unknown random variable.
For example, the sales of a new product one year after delivery generally can not be determined by measuring the sales of a bunch of new products. As pointed out by Douglas Hubbard
, treating the future sales as a single fixed variable is unreasonable (although all too common). What is more reasonable is setting the low (L), high (H) , and most likely (E) values of the future sales. As I wrote in an earlier entry
these are the values that specify a triangular distribution. I.e. triangular distributions are set to zero below a given low value, L, and above the high value, H, and peaks at the expected value E. The distribution is then a describe be a triangular curve so that the total area is 1. Here is the distribution for L = 1, E=6, and H=7.
Some would argue there is a 'real' distribution of the future sales random variable and it is unlikely to be triangular. My response is for all practical purposes, it does not matter. The triangular distribution is a good-enough approximation to whatever the real distribution might be. By 'good enough' I mean they may be used to support decision making: they are a big improvement over using single values. They are also practical as they easy to specify and there is no assumption of symmetry, No wonder they are common in business analytics.
To wrap up, normal distributions are occasionally useful to describe outcomes of measurements while triangular distributions are useful for giving rough estimates of one's belief of the liklihood of outcomes based on the evidence on hand. More generally, normal distributions are useful in frequentist
statistics and triangular in Bayesian
statistics. See this Wikepedia article for a discussion of the kinds of statistics.
Much of what we do in development analytics is more Bayesian than frequentist. I hope to write more about that in the near future.