Glossary

A

asymptotic significance

There are some situations in which the distribution of a test statistic is not well defined. Often, however, as the number of observations used to compute the statistic increases, its distribution begins to approximate a known distribution. This well-defined distribution is then used to calculate the significance value of the test statistic.

B

Bernoulli

The Bernoulli distribution takes values of 0 and 1. A Bernoulli variate takes the value of 1 with probability equal to the specified probability parameter.
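A Bernoulli variate can be simulated by comparing a uniform random number with the probability parameter. A minimal sketch in Python (the helper name is hypothetical, not a built-in):

```python
import random

def bernoulli(p, rng=random.random):
    """Return 1 with probability p, else 0 (hypothetical helper)."""
    return 1 if rng() < p else 0

# With p = 1.0 the variate is always 1; with p = 0.0 it is always 0.
```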

boxplot

Boxplots characterize the distribution of a variable, displaying its median and quartiles. Special symbols identify the position of outliers, if any.

C

categorical

A variable with a limited number of distinct values; an ordinal or nominal variable. Categorical variables are often used as grouping variables or factors.

cell

A cell is the cross-classification of levels from one or more factors. For example, if you have customer factors for geographic region, marital status, and educational level, then married college graduates in your northern sales territory constitute a cell.

central tendency

An attribute of a distribution concerning where the values of the distribution tend to "congregate". Measures of central tendency include the mean, median, and mode.

correlation

Two variables are correlated if a change in the value of one signifies a change in the other. The most common measure of correlation is the Pearson correlation, which measures the degree to which the relationship between two variables can be described by a straight line.
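The Pearson correlation can be computed directly from its definition: the covariance of the two variables divided by the product of their standard deviations. A sketch in Python:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A perfect straight-line relationship gives r = 1 (or -1 if decreasing).
```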

covariate

A scale variable that has been added to a model. In a predictive model, changes in the value of a covariate should be associated with changes in the value of the target (dependent) variable.

cut point

A cut point is used to separate cases into two groups, based upon whether values of a numeric variable fall above or below the cut point.

D

dispersion

An attribute of a frequency distribution concerning the spread of the values. Measures of dispersion include the variance, standard deviation, and interquartile range.

E

Euclidean distance

The straight-line distance between two points; equivalently, the square root of the sum of the squared differences between their coordinates.
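In coordinates, the straight-line distance works out to the square root of the summed squared differences:

```python
import math

def euclidean_distance(p, q):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# The classic 3-4-5 right triangle: distance from (0, 0) to (3, 4) is 5.
```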

F

factor

An independent variable defining groups of cases.

K

kurtosis

A measure of the heaviness of a distribution's tails, that is, of its tendency to produce outliers. For a normal distribution, the value of the kurtosis statistic is zero. Positive kurtosis indicates that the data exhibit more extreme outliers than a normal distribution; negative kurtosis indicates that the data exhibit less extreme outliers than a normal distribution. This definition, under which the value is 0 for a normal distribution, is sometimes referred to as excess kurtosis. Some software may report kurtosis such that the value is 3 for a normal distribution.
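Excess kurtosis can be sketched from the raw moments: the fourth central moment divided by the squared variance, minus 3 (statistics packages often add a small-sample bias correction on top of this):

```python
def excess_kurtosis(xs):
    """Moment-based excess kurtosis: m4 / m2**2 - 3.
    Zero for a normal distribution; no small-sample correction applied."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n   # variance (population form)
    m4 = sum((x - m) ** 4 for x in xs) / n   # fourth central moment
    return m4 / m2 ** 2 - 3
```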

L

layer

If rows give a table height and columns give it width, then layers give it depth.

level

The values of a factor are referred to as levels of the factor, or factor levels.

log transformation

In order to correct for non-normality, the natural logarithm can be applied to a positive-valued variable. This technique is most effective when the non-normality is due to positive skewness.
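For example, data that grow multiplicatively become evenly spaced after the natural log is applied (illustrative values only):

```python
import math

values = [1.0, 3.0, 9.0, 27.0, 81.0]     # positively skewed, hypothetical data
logged = [math.log(v) for v in values]   # natural log compresses the long right tail
# The transformed values are evenly spaced: 0, 1.099, 2.197, 3.296, 4.394
```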

M

mean

A measure of central tendency. The arithmetic average, the sum divided by the number of cases.

median

The value above and below which half of the cases fall, the 50th percentile. If there is an even number of cases, the median is the average of the two middle cases when they are sorted in ascending or descending order. The median is a measure of central tendency not sensitive to outlying values (unlike the mean, which can be affected by a few extremely high or low values).
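The even/odd rule above can be written directly as a short sketch:

```python
def median(xs):
    """Middle value after sorting; average of the two middle values if n is even."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

# median([1, 2, 3, 1000]) is 2.5: the extreme value 1000 barely matters,
# while the mean of the same data is 251.5.
```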

N

nominal

A variable can be treated as nominal when its values represent categories with no intrinsic ranking (for example, the department of the company in which an employee works). Examples of nominal variables include region, postal code, and religious affiliation.

normal distribution

The normal, or Gaussian, distribution is defined by its location (mean) and scale (standard deviation) parameters. Its density function has a bell shape which is symmetric about its mean. About 68% of the values of a normal variate will fall within 1 standard deviation of the mean, 95% within 2 standard deviations, and 99.7% within 3.
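The 68–95–99.7 figures follow from the normal cumulative distribution; the probability of falling within k standard deviations of the mean is erf(k/√2):

```python
import math

def within_k_sd(k):
    """Probability that a normal variate falls within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

# within_k_sd(1) ≈ 0.683, within_k_sd(2) ≈ 0.954, within_k_sd(3) ≈ 0.997
```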

O

OLAP

An OLAP (online analytical processing) cube is a table of results summarized across several grouping variables, which can then be manipulated or rearranged interactively. For example, you may have sales figures summarized by geographic region, product type, customer type, month, and sales indicator (units ordered, revenue, profit, and so on).

ordinal

A variable can be treated as ordinal when its values represent categories with some intrinsic ranking (for example, levels of service satisfaction from highly dissatisfied to highly satisfied). Examples of ordinal variables include attitude scores representing degree of satisfaction or confidence and preference rating scores.

outlier

An outlier is an observation whose value is distant from the values of the majority of observations. It is sometimes more technically defined as a value whose distance from the nearest quartile is greater than 1.5 times the interquartile range. Outliers pull the mean in their direction, and should always be carefully examined.
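The 1.5 × interquartile-range rule can be sketched with the standard library (note that quartile conventions vary across packages, so borderline cases may differ from other software):

```python
import statistics

def iqr_outliers(xs):
    """Flag values more than 1.5 interquartile ranges beyond the nearest quartile."""
    q1, _, q3 = statistics.quantiles(xs, n=4)  # default 'exclusive' method
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < lo or x > hi]
```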

P

pairwise

When computing a measure of association between two variables in a larger set, cases are included in the computation when the two variables have nonmissing values, irrespective of the values of the other variables in the set.

practical significance

A statistical test will answer the question, "Is there a difference between two groups?" but not the follow-up "Is that difference large enough for me to care?" It is up to you to determine whether test results are useful to your situation.

S

scale

A variable can be treated as scale (continuous) when its values represent ordered categories with a meaningful metric, so that distance comparisons between values are appropriate. Examples of scale variables include age in years and income in thousands of dollars.

sensitivity

A measure of the usefulness of a classification scheme. Sensitivity is the probability that a "positive" case is correctly classified, and is plotted on the y-axis in an ROC curve. 1 − sensitivity is the false negative rate.

skewness

A measure of the asymmetry of a distribution. The normal distribution is symmetric and has a skewness value of 0. A distribution with a significant positive skewness has a long right tail. A distribution with a significant negative skewness has a long left tail. As a guideline, a skewness value more than twice its standard error is taken to indicate a departure from symmetry.
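Skewness can be sketched from the raw moments as the third central moment over the variance to the 3/2 power (again, packages often apply a small-sample correction):

```python
def skewness(xs):
    """Moment-based skewness: m3 / m2**1.5.
    Positive for a long right tail; no small-sample correction applied."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n   # variance (population form)
    m3 = sum((x - m) ** 3 for x in xs) / n   # third central moment
    return m3 / m2 ** 1.5
```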

specificity

A measure of the usefulness of a classification scheme. Specificity is the probability that a "negative" case is correctly classified. 1 − specificity is the false positive rate, and is plotted on the x-axis in an ROC curve.

standard deviation

A measure of dispersion around the mean, equal to the square root of the variance. The standard deviation is measured in the same units as the original variable.

T

trimmed mean

The arithmetic mean calculated when the largest n% and the smallest n% of the cases have been eliminated. Eliminating extreme cases from the computation of the mean can give a better estimate of central tendency, especially when the data are nonnormal or contain outliers.
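A sketch of the trimming procedure (5% from each end is a common choice; the helper is illustrative, not a library function):

```python
def trimmed_mean(xs, percent=5):
    """Mean after dropping the smallest and largest `percent` of cases."""
    s = sorted(xs)
    k = int(len(s) * percent / 100)      # number of cases to drop at each end
    kept = s[k:len(s) - k] if k else s
    return sum(kept) / len(kept)

# For 1..20 plus an extreme value of 1000, a 5% trim drops one case at
# each end, leaving the trimmed mean at 11.0 while the raw mean is 57.6.
```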

t test

A statistical test that compares the mean values of two groups. A statistically significant result indicates that the observed difference between the means is unlikely to have arisen by chance alone.
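The test statistic itself can be sketched in its pooled (equal-variance) form; this is only one variant, and it omits the p-value lookup that completes the test:

```python
import statistics

def two_sample_t(xs, ys):
    """Pooled two-sample t statistic: mean difference over its standard error."""
    nx, ny = len(xs), len(ys)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((nx - 1) * statistics.variance(xs)
           + (ny - 1) * statistics.variance(ys)) / (nx + ny - 2)
    return (mx - my) / (sp2 * (1 / nx + 1 / ny)) ** 0.5
```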

Z

z score

Also known as a standardized value. To obtain z-scores for a variable, for each case subtract the variable's mean value and divide by the standard deviation. Z-scores are useful for finding outliers and comparing values of variables that are measured on different scales.
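The standardization recipe in one short sketch (using the sample standard deviation; some packages standardize with the population form instead):

```python
import statistics

def z_scores(xs):
    """Standardize: subtract the mean, divide by the sample standard deviation."""
    m, sd = statistics.mean(xs), statistics.stdev(xs)
    return [(x - m) / sd for x in xs]

# The resulting scores have mean 0 and standard deviation 1, so variables
# on different scales become directly comparable.
```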