Glossary
A
asymptotic significance
There are some situations in which the distribution of a test statistic is not well
defined. Often, however, as the number of observations used to compute the statistic increases, its
distribution begins to approximate a known distribution. This well-defined distribution is then used
to calculate the significance value of the test statistic.
B
Bernoulli
The Bernoulli distribution takes values of 0 and 1. A Bernoulli variate takes the value
of 1 with probability equal to the specified probability parameter.
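A Bernoulli draw can be sketched with Python's standard library; the `bernoulli` helper name here is illustrative, not from any particular package:

```python
import random

def bernoulli(p, rng=random):
    """Return 1 with probability p, else 0 (illustrative helper)."""
    return 1 if rng.random() < p else 0

random.seed(0)  # fixed seed so the sketch is reproducible
draws = [bernoulli(0.3) for _ in range(10_000)]
# Over many trials, the proportion of 1s converges on the
# probability parameter (here 0.3).
```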
boxplot
Boxplots characterize the distribution of a variable, displaying its median and
quartiles. Special symbols identify the position of outliers, if any.
C
categorical
A variable with a discrete number of values; an ordinal or nominal variable.
Categorical variables are often used as grouping variables or factors.
cell
A cell is the cross-classification of levels from one or more factors. For example, if
you have customer factors for geographic region, marital status, and educational level, then married
college graduates in your northern sales territory constitute a cell.
central tendency
An attribute of a distribution concerning where the values of the distribution tend to
"congregate". Measures of central tendency include the mean, median, and mode.
correlation
Two variables are correlated if the values of one tend to vary systematically with the
values of the other. The most common measure of correlation is the Pearson correlation, which
measures the degree to which the relationship between two variables can be described by a straight
line.
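The Pearson correlation can be computed directly from its definition; this is a minimal sketch rather than a library implementation:

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# A perfect linear relationship gives r = 1 (or -1 if the slope is negative).
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
```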
covariate
A scale variable that has been added to a model. In a predictive model, changes in the
value of a covariate should be associated with changes in the value of the target (dependent)
variable.
cut point
A cut point is used to separate cases into two groups, based upon whether values of a
numeric variable fall above or below the cut point.
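Splitting cases at a cut point is a simple filter; this sketch assigns values equal to the cut point to the upper group, a convention that varies between packages:

```python
def split_at(values, cut_point):
    """Separate cases into two groups at a cut point.
    Values equal to the cut point go to the upper group (one convention)."""
    below = [v for v in values if v < cut_point]
    at_or_above = [v for v in values if v >= cut_point]
    return below, at_or_above
```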
D
dispersion
An attribute of a frequency distribution concerning the spread of the values. Measures
of dispersion include the variance, standard deviation, and interquartile range.
E
Euclidean distance
The straight-line distance between two points; that is, the length of the line
segment that joins them.
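In Python 3.8+, the standard library's `math.dist` computes this straight-line distance directly:

```python
import math

p, q = (0.0, 0.0), (3.0, 4.0)
d = math.dist(p, q)  # sqrt(3**2 + 4**2) = 5.0, the classic 3-4-5 triangle
```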
F
factor
An independent variable defining groups of cases.
K
kurtosis
A measure of the extent to which a distribution produces outliers relative to a
normal distribution. For a normal distribution, the value of the kurtosis statistic is zero.
Positive kurtosis indicates that the data exhibit more extreme outliers than a normal distribution;
negative kurtosis indicates fewer and less extreme outliers than a normal distribution. The
definition used here, in which the value is 0 for a normal distribution, is sometimes referred to as
excess kurtosis. Some software reports kurtosis such that the value is 3 for a normal distribution.
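Excess kurtosis can be sketched from its moment definition; note that statistical packages usually apply an additional small-sample correction, so their reported values differ slightly from this population formula:

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: 0 for a normal distribution.
    (Packages often add a small-sample bias correction not shown here.)"""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3
```

A flat run of values has negative excess kurtosis, while a single extreme outlier pushes it positive.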
L
layer
If rows give a table height and columns give it width, then layers give it depth.
level
The values of a factor are referred to as levels of the factor, or factor levels.
log transformation
In order to correct for non-normality, the natural logarithm can be applied to a
positive-valued variable. This technique is most effective when the non-normality is due to positive
skewness.
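The effect on a positively skewed variable can be seen with `math.log`: the transformation compresses the long right tail far more than the lower values (the income figures below are made up for illustration):

```python
import math

incomes = [20, 25, 30, 40, 60, 120, 400]   # positively skewed
logged = [math.log(v) for v in incomes]    # natural logarithm

# The gap between the two largest values shrinks relative to the gap
# between the two smallest, pulling in the long right tail.
raw_ratio = (incomes[-1] - incomes[-2]) / (incomes[1] - incomes[0])
log_ratio = (logged[-1] - logged[-2]) / (logged[1] - logged[0])
```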
M
mean
A measure of central tendency. The arithmetic average, the sum divided by the number of
cases.
median
The value above and below which half of the cases fall, the 50th percentile. If there
is an even number of cases, the median is the average of the two middle cases when they are sorted
in ascending or descending order. The median is a measure of central tendency not sensitive to
outlying values (unlike the mean, which can be affected by a few extremely high or low values).
N
nominal
A variable can be treated as nominal when its values represent categories with no
intrinsic ranking (for example, the department of the company in which an employee works). Examples
of nominal variables include region, postal code, and religious affiliation.
normal distribution
The normal, or Gaussian, distribution is defined by its location (mean) and scale
(standard deviation) parameters. Its density function has a bell shape which is symmetric about its
mean. About 68% of the values of a normal variate will fall within 1 standard deviation of the mean,
95% within 2 standard deviations, and 99.7% within 3.
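The 68%/95%/99.7% figures can be verified with `statistics.NormalDist` (Python 3.8+):

```python
from statistics import NormalDist

nd = NormalDist(mu=0, sigma=1)
within_1sd = nd.cdf(1) - nd.cdf(-1)   # about 0.683
within_2sd = nd.cdf(2) - nd.cdf(-2)   # about 0.954
within_3sd = nd.cdf(3) - nd.cdf(-3)   # about 0.997
```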
O
OLAP cube
An OLAP cube is a table of results summarized across several grouping variables, which
can then be manipulated or rearranged interactively. For example, you may have sales figures
summarized by geographic region, product type, customer type, month, and sales indicator (units
ordered, revenue, profit, and so on).
ordinal
A variable can be treated as ordinal when its values represent categories with some
intrinsic ranking (for example, levels of service satisfaction from highly dissatisfied to highly
satisfied). Examples of ordinal variables include attitude scores representing degree of
satisfaction or confidence and preference rating scores.
outlier
An outlier is an observation whose value is distant from the values of the majority of
observations. It is sometimes more technically defined as a value whose distance from the nearest
quartile is greater than 1.5 times the interquartile range. Outliers pull the mean in their
direction, and should always be carefully examined.
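The 1.5 × IQR rule can be sketched with the standard library; quartile definitions vary between packages, and this uses the `statistics` module's default (exclusive) method:

```python
import statistics

def iqr_outliers(values):
    """Flag values more than 1.5 * IQR beyond the nearest quartile.
    (Quartile definitions differ across packages; this uses the
    statistics module's default method.)"""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]
```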
P
pairwise
When computing a measure of association between two variables in a larger set, cases
are included in the computation when the two variables have nonmissing values, irrespective of the
values of the other variables in the set.
practical significance
A statistical test will answer the question, "Is there a difference between two
groups?" but not the follow-up "Is that difference large enough for me to care?" It is up to you to
determine whether test results are useful to your situation.
S
scale
A variable can be treated as scale (continuous) when its values represent ordered
categories with a meaningful metric, so that distance comparisons between values are appropriate.
Examples of scale variables include age in years and income in thousands of dollars.
sensitivity
A measure of the usefulness of a classification scheme. Sensitivity is the probability
that a "positive" case is correctly classified, and is plotted on the y-axis in an ROC curve.
1 − sensitivity is the false negative rate.
skewness
A measure of the asymmetry of a distribution. The normal distribution is symmetric and
has a skewness value of 0. A distribution with a significant positive skewness has a long right
tail. A distribution with a significant negative skewness has a long left tail. As a guideline, a
skewness value more than twice its standard error is taken to indicate a departure from
symmetry.
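Skewness can be sketched from its moment definition; as with kurtosis, statistical packages usually apply a small-sample correction, so their values differ slightly from this population formula:

```python
def skewness(values):
    """Moment-based skewness: 0 for a symmetric distribution.
    (Packages often add a small-sample bias correction not shown here.)"""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5
```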
specificity
A measure of the usefulness of a classification scheme. Specificity is the probability
that a "negative" case is correctly classified. 1 − specificity is the false positive rate, and is
plotted on the x-axis in an ROC curve.
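Sensitivity and specificity can both be computed from paired actual/predicted labels; this sketch codes "positive" as 1 and "negative" as 0:

```python
def sensitivity_specificity(actual, predicted):
    """Return (sensitivity, specificity) from paired 0/1 labels."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# 4 positives (3 caught, 1 missed) and 4 negatives (2 caught, 2 missed).
sens, spec = sensitivity_specificity([1, 1, 1, 1, 0, 0, 0, 0],
                                     [1, 1, 1, 0, 0, 0, 1, 1])
```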
standard deviation
A measure of dispersion around the mean, equal to the square root of the variance. The
standard deviation is measured in the same units as the original variable.
T
trimmed mean
The arithmetic mean calculated after the largest n% and the smallest n% of the cases
have been removed. Eliminating extreme cases from the computation can yield a better estimate of
central tendency, especially when the data are nonnormal.
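A trimmed mean can be sketched in a few lines; how fractional trim counts are rounded varies between packages, and this version simply truncates:

```python
def trimmed_mean(values, percent=5):
    """Mean after dropping the smallest and largest `percent`% of cases.
    (The trim count is truncated to a whole number of cases here;
    packages handle fractional counts differently.)"""
    s = sorted(values)
    k = int(len(s) * percent / 100)
    trimmed = s[k:len(s) - k] if k else s
    return sum(trimmed) / len(trimmed)

# One extreme case (100) is dropped from each tail at 20% trimming,
# so the result reflects the bulk of the data.
```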
t test
A statistical test that compares the mean values of two groups. When the result is
statistically significant, the observed difference between the means is unlikely to be due to
chance alone; whether the difference is large enough to matter is a separate question of practical
significance.
Z
z score
Also known as a standardized value. To obtain the z score for a case, subtract the
variable's mean from the case's value and divide by the standard deviation. Z scores have a mean of
0 and a standard deviation of 1, which makes them useful for finding outliers and for comparing
variables that are measured on different scales.
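Standardization is a one-line transformation with the `statistics` module (this uses the sample standard deviation, `statistics.stdev`):

```python
import statistics

values = [50, 60, 70, 80, 90]
mean = statistics.mean(values)   # 70
sd = statistics.stdev(values)    # sample standard deviation
z_scores = [(v - mean) / sd for v in values]

# A case at the mean has z = 0; cases equidistant from the mean
# get z scores of equal magnitude and opposite sign.
```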