# Missing Value Analysis

The Missing Value Analysis procedure performs three primary functions:

- Describes the pattern of missing data. Where are the missing values located? How extensive are they? Do pairs of variables tend to have values missing in multiple cases? Are data values extreme? Are values missing randomly?
- Estimates means, standard deviations, covariances, and correlations for different missing value methods: listwise, pairwise, regression, or EM (expectation-maximization). The pairwise method also displays counts of pairwise complete cases.
- Fills in (imputes) missing values with estimated values using regression or EM methods; however, multiple imputation is generally considered to provide more accurate results.
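
To illustrate how listwise and pairwise handling differ, here is a minimal Python sketch. It is not the procedure's own implementation. The toy data, the variable names, and the helper functions `listwise_means` and `pairwise_counts` are all hypothetical and exist only for this example.

```python
from statistics import mean

# Hypothetical toy data; None marks a missing value.
rows = [
    {"age": 30, "income": 40.0, "score": 7.0},
    {"age": 45, "income": None, "score": 5.0},
    {"age": 52, "income": 62.0, "score": None},
    {"age": 38, "income": 51.0, "score": 6.0},
    {"age": 61, "income": None, "score": None},
]
variables = ["age", "income", "score"]


def listwise_means(rows, variables):
    """Means computed only from cases that are complete on every variable."""
    complete = [r for r in rows if all(r[v] is not None for v in variables)]
    return {v: mean(r[v] for r in complete) for v in variables}, len(complete)


def pairwise_counts(rows, variables):
    """For each pair of variables, count the cases where both are present."""
    return {
        (a, b): sum(1 for r in rows if r[a] is not None and r[b] is not None)
        for i, a in enumerate(variables)
        for b in variables[i + 1:]
    }


means, n_complete = listwise_means(rows, variables)
counts = pairwise_counts(rows, variables)
print("complete cases:", n_complete, "listwise means:", means)
print("pairwise complete counts:", counts)
```

Listwise handling keeps only the fully complete cases, while pairwise handling can use more cases for some variable pairs than for others, which is why the pairwise counts are worth inspecting.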

Missing value analysis helps address several concerns caused by incomplete data. If cases with missing values are systematically different from cases without missing values, the results can be misleading. Also, missing data may reduce the precision of calculated statistics because there is less information than originally planned. Another concern is that the assumptions behind many statistical procedures are based on complete cases, and missing values can complicate the theory required.

**Example.** In evaluating a treatment for leukemia, several
variables are measured. However, not all measurements are available
for every patient. The patterns of missing data are displayed, tabulated,
and found to be random. An EM analysis is used to estimate the means,
correlations, and covariances. It is also used to determine that the
data are missing completely at random. Missing values are then replaced
by imputed values and saved into a new data file for further analysis.

**Statistics.** Univariate statistics, including number of
nonmissing values, mean, standard deviation, number of missing values,
and number of extreme values. Estimated means, covariance matrix,
and correlation matrix, using listwise, pairwise, EM, or regression
methods. Little's MCAR test with EM results. Summary of means by various
methods. For groups defined by missing versus nonmissing values: *t* tests.
For all variables: missing value patterns displayed cases-by-variables.
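
As a rough illustration of two of these outputs, the Python sketch below tabulates missing value patterns and computes a *t* statistic for groups defined by whether an indicator variable is missing. The data, the names `pattern` and `indicator_t`, and the choice of an unequal-variance form of the statistic are assumptions made for this example, not a description of the procedure's internals.

```python
from collections import Counter
from math import sqrt
from statistics import mean, variance

# Hypothetical toy data; None marks a missing value.
rows = [
    {"age": 30, "income": 40.0, "score": 7.0},
    {"age": 45, "income": None, "score": 5.0},
    {"age": 52, "income": 62.0, "score": None},
    {"age": 38, "income": 51.0, "score": 6.0},
    {"age": 61, "income": None, "score": None},
]
variables = ["age", "income", "score"]


def pattern(row):
    """One character per variable: 'X' if missing, '.' if present."""
    return "".join("X" if row[v] is None else "." for v in variables)


# How many cases share each missing value pattern.
pattern_counts = Counter(pattern(r) for r in rows)


def indicator_t(rows, indicator_var, test_var):
    """Compare test_var between cases where indicator_var is present vs. missing.

    Assumption: an unequal-variance t statistic is used for this sketch.
    """
    present = [r[test_var] for r in rows
               if r[indicator_var] is not None and r[test_var] is not None]
    missing = [r[test_var] for r in rows
               if r[indicator_var] is None and r[test_var] is not None]
    se = sqrt(variance(present) / len(present) + variance(missing) / len(missing))
    return (mean(present) - mean(missing)) / se


t_age_by_income = indicator_t(rows, "income", "age")
print(dict(pattern_counts))
print("t for age, grouped by income present vs. missing:", t_age_by_income)
```

A large absolute *t* value would suggest that cases missing the indicator variable differ on the test variable, which is evidence against values being missing completely at random.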

## Data Considerations

**Data.** Data can be categorical or quantitative (scale or
continuous). However, you can estimate statistics and impute missing
data only for the quantitative variables. For each variable, missing
values that are not coded as system-missing must be defined as user-missing.
For example, if a questionnaire item has the response *Don't know* coded
as 5 and you want to treat it as missing, the item should have 5 coded
as a user-missing value. See the topic Missing values for
more information.
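
The effect of declaring a code as user-missing can be sketched in a few lines of Python. This is only an analogy for the concept. The function `apply_user_missing` and the sample responses are hypothetical, and `None` stands in for a missing value.

```python
from statistics import mean


def apply_user_missing(values, user_missing):
    """Return a copy in which user-missing codes are replaced by None."""
    return [None if v in user_missing else v for v in values]


# Hypothetical questionnaire item: 1-4 are real answers, 5 means "Don't know".
responses = [1, 5, 3, 2, 5, None, 4]

cleaned = apply_user_missing(responses, user_missing={5})
valid = [v for v in cleaned if v is not None]

# Treating 5 as a real value inflates the mean; treating it as missing does not.
naive_mean = mean(v for v in responses if v is not None)
valid_mean = mean(valid)
print(cleaned, naive_mean, valid_mean)
```

If the code 5 were left as a valid value, it would be counted as a substantive response and would distort every statistic computed for the item.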

**Frequency weights.** Frequency (replication) weights are
honored by this procedure. Cases with a negative or zero replication
weight are ignored. Noninteger weights are truncated.
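
A minimal Python sketch of this weighting rule follows. The helper names `replication_count` and `weighted_mean` are hypothetical. The sketch assumes that truncation is applied first, so that a weight between 0 and 1 truncates to 0 and the case is dropped; the text above does not state the order explicitly.

```python
import math


def replication_count(weight):
    """Truncate a noninteger weight; zero or negative results drop the case.

    Assumption: truncation happens before the zero/negative check.
    """
    w = math.trunc(weight)
    return w if w > 0 else 0


def weighted_mean(values, weights):
    """Mean of values where each case counts replication_count(weight) times."""
    total = 0.0
    n = 0
    for v, w in zip(values, weights):
        k = replication_count(w)
        if v is None or k == 0:
            continue  # skip missing values and ignored cases
        total += v * k
        n += k
    return total / n, n


values = [10.0, 20.0, 30.0, 40.0]
weights = [2.9, 1, 0, -3]  # 2.9 is truncated to 2; the last two cases are ignored
wmean, effective_n = weighted_mean(values, weights)
print(wmean, effective_n)
```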

**Assumptions.** Listwise, pairwise, and regression estimation
depend on the assumption that the pattern of missing values does not
depend on the data values. (This condition is known as **missing
completely at random**, or MCAR.) When the data are MCAR, all
estimation methods, including the EM method, give consistent and
unbiased estimates of the correlations and covariances. When the
MCAR assumption is violated, the listwise, pairwise, and regression
methods can produce biased estimates. If the data are not MCAR,
you need to use EM estimation.

EM estimation depends on the assumption that the pattern of missing
data is related to the observed data only. (This condition is called **missing
at random**, or MAR.) This assumption allows estimates to be adjusted
using available information. For example, in a study of education
and income, the subjects with low education may have more missing
income values. In this case, the data are MAR, not MCAR. In other
words, for MAR, the probability that income is recorded depends on
the subject's level of education. The probability may vary by education
but not by income *within that level of education*. If the probability
that income is recorded also varies by the value of income within
each level of education (for example, people with high incomes don't
report them), then the data are neither MCAR nor MAR. This is not
an uncommon situation, and, if it applies, none of the methods is
appropriate.
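
The education and income example can be explored with a small simulation. The Python sketch below is illustrative only. All numbers, names, and missingness probabilities are invented, and the adjustment shown is a simple stratified mean that uses the observed education level, not the EM algorithm. It is meant only to show why using observed information helps under MAR and does not help when missingness depends on income itself.

```python
import random
from statistics import mean

random.seed(0)
N = 20000
LEVEL_MEAN = {"low": 30.0, "high": 60.0}  # hypothetical mean income by education

# Fully observed population: (education, income) for each subject.
people = []
for _ in range(N):
    edu = random.choice(["low", "high"])
    people.append((edu, random.gauss(LEVEL_MEAN[edu], 5.0)))


def p_missing(mechanism, edu, income):
    """Probability that income is missing under each hypothetical mechanism."""
    if mechanism == "MCAR":
        return 0.3                               # unrelated to any data value
    if mechanism == "MAR":
        return 0.6 if edu == "low" else 0.1      # depends on education only
    # Neither MCAR nor MAR: depends on income within the education level.
    return 0.7 if income > LEVEL_MEAN[edu] else 0.1


def observe(mechanism):
    """Copy of the data with income set to None where it goes missing."""
    return [
        (edu, inc if random.random() >= p_missing(mechanism, edu, inc) else None)
        for edu, inc in people
    ]


def complete_case_mean(data):
    """Mean income using only the cases where income was recorded."""
    return mean(inc for _, inc in data if inc is not None)


def stratified_mean(data):
    """Observed mean per education level, weighted by each level's full share."""
    total = 0.0
    for level in LEVEL_MEAN:
        group = [inc for edu, inc in data if edu == level]
        observed = [inc for inc in group if inc is not None]
        total += mean(observed) * len(group) / len(data)
    return total


true_mean = mean(inc for _, inc in people)
results = {}
for mechanism in ("MCAR", "MAR", "MNAR"):
    data = observe(mechanism)
    results[mechanism] = (complete_case_mean(data), stratified_mean(data))
    print(mechanism, "complete-case: %.2f  stratified: %.2f  full data: %.2f"
          % (results[mechanism] + (true_mean,)))
```

In this setup the complete-case mean should track the full-data mean under MCAR and drift away from it under MAR, where the education-based adjustment recovers it. In the third scenario both estimates are biased, because no observed variable explains why income is missing.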

**Related procedures.** Many procedures allow you to use listwise
or pairwise estimation. Linear Regression and Factor Analysis allow
replacement of missing values by the mean values. In the Forecasting
add-on module, several methods are available to replace missing values
in time series.

## To Obtain Missing Value Analysis

This feature requires the Missing Values option.

- From the menus choose:
- Select at least one quantitative (scale) variable for estimating statistics and optionally imputing missing values.

Optionally, you can:

- Select categorical variables (numeric or string) and enter a limit on the number of categories (Maximum Categories).
- Click Patterns to tabulate patterns of missing data. See the topic Displaying Patterns of Missing Values for more information.
- Click Descriptives to display descriptive statistics of missing values. See the topic Displaying Descriptive Statistics for Missing Values for more information.
- Select a method for estimating statistics (means, covariances, and correlations) and possibly imputing missing values. See the topic Estimating Statistics and Imputing Missing Values for more information.
- If you select EM or Regression, click Variables to specify a subset to be used for the estimation. See the topic Predicted and Predictor Variables for more information.
- Select a case label variable. This variable is used to label cases in patterns tables that display individual cases.