Missing Value Analysis

The Missing Value Analysis procedure performs three primary functions:

  • Describes the pattern of missing data. Where are the missing values located? How extensive are they? Do pairs of variables tend to have values missing in multiple cases? Are data values extreme? Are values missing randomly?
  • Estimates means, standard deviations, covariances, and correlations for different missing value methods: listwise, pairwise, regression, or EM (expectation-maximization). The pairwise method also displays counts of pairwise complete cases.
  • Fills in (imputes) missing values with estimated values using regression or EM methods; however, multiple imputation is generally considered to provide more accurate results.
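The difference between listwise and pairwise handling can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure's own computation: the toy data, the variable names, and the helper functions listwise_means, pairwise_counts, and available_case_means are all hypothetical, and None stands in for a missing value.

```python
from statistics import mean

# Hypothetical toy data: None marks a missing value.
rows = [
    {"age": 34.0, "income": 52.0, "score": 7.0},
    {"age": 41.0, "income": None, "score": 6.0},
    {"age": None, "income": 48.0, "score": 9.0},
    {"age": 29.0, "income": 61.0, "score": None},
    {"age": 50.0, "income": 75.0, "score": 8.0},
]
variables = ["age", "income", "score"]


def listwise_means(rows, variables):
    """Means computed only from cases with no missing values at all."""
    complete = [r for r in rows if all(r[v] is not None for v in variables)]
    return {v: mean(r[v] for r in complete) for v in variables}, len(complete)


def pairwise_counts(rows, variables):
    """Number of cases in which both variables of a pair are present."""
    counts = {}
    for i, a in enumerate(variables):
        for b in variables[i + 1:]:
            counts[(a, b)] = sum(
                1 for r in rows if r[a] is not None and r[b] is not None
            )
    return counts


def available_case_means(rows, variables):
    """Mean of each variable over every case where that variable is present."""
    return {v: mean(r[v] for r in rows if r[v] is not None) for v in variables}


lw_means, n_complete = listwise_means(rows, variables)
pw_counts = pairwise_counts(rows, variables)
ac_means = available_case_means(rows, variables)

print("complete cases:", n_complete)
print("listwise means:", lw_means)
print("pairwise complete counts:", pw_counts)
print("available-case means:", ac_means)
```

In this toy data only two of the five cases are complete, while every pair of variables is jointly observed in three cases, which is why the two approaches can give different estimates.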

Missing value analysis helps address several concerns caused by incomplete data. If cases with missing values are systematically different from cases without missing values, the results can be misleading. Also, missing data may reduce the precision of calculated statistics because there is less information than originally planned. Another concern is that the assumptions behind many statistical procedures are based on complete cases, and missing values can complicate the theory required.

Example. In evaluating a treatment for leukemia, several variables are measured. However, not all measurements are available for every patient. The patterns of missing data are displayed, tabulated, and found to be random. An EM analysis is used to estimate the means, correlations, and covariances. Little's MCAR test, reported with the EM results, is used to determine that the data are missing completely at random. Missing values are then replaced by imputed values and saved into a new data file for further analysis.

Statistics. Univariate statistics, including number of nonmissing values, mean, standard deviation, number of missing values, and number of extreme values. Estimated means, covariance matrix, and correlation matrix, using listwise, pairwise, EM, or regression methods. Little's MCAR test with EM results. Summary of means by various methods. For groups defined by missing versus nonmissing values: t tests. For all variables: missing value patterns displayed cases-by-variables.
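The t tests for groups defined by missing versus nonmissing values can be illustrated with a small sketch. It assumes a separate-variance (Welch) t statistic, which is a choice made for this example rather than something stated in this topic. The data and the function name welch_t are hypothetical. The idea is to split one variable (age) by whether another variable (income) is present and compare the group means.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical cases: (age, income); None means income is missing.
cases = [
    (30.0, 41.0),
    (34.0, 45.0),
    (38.0, 52.0),
    (50.0, None),
    (54.0, None),
]


def welch_t(group_a, group_b):
    """Separate-variance t statistic for the difference in two means."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


# Split age by the missing-value indicator of income.
age_income_present = [age for age, inc in cases if inc is not None]
age_income_missing = [age for age, inc in cases if inc is None]

t_stat = welch_t(age_income_present, age_income_missing)
print("mean age, income present:", mean(age_income_present))
print("mean age, income missing:", mean(age_income_missing))
print("t statistic:", round(t_stat, 3))
```

A t statistic far from zero, as in this toy data, suggests that missingness on income is related to age, which would count as evidence against MCAR.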

Data Considerations

Data. Data can be categorical or quantitative (scale or continuous). However, you can estimate statistics and impute missing data only for the quantitative variables. For each variable, missing values that are not coded as system-missing must be defined as user-missing. For example, if a questionnaire item has the response Don't know coded as 5 and you want to treat it as missing, the item should have 5 coded as a user-missing value. See the topic Missing values for more information.
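As an illustration of the user-missing idea, the following sketch treats the code 5 on a questionnaire item as missing before any statistics are computed. The item name q1, the USER_MISSING table, and the to_missing helper are hypothetical names introduced for this example; None plays the role of a system-missing value.

```python
# Hypothetical declaration: on item q1, code 5 ("Don't know") is user-missing.
USER_MISSING = {"q1": {5}}


def to_missing(variable, value, user_missing=USER_MISSING):
    """Return None for system-missing values and for user-missing codes."""
    if value is None:
        return None
    if value in user_missing.get(variable, set()):
        return None
    return value


responses = [1, 5, 3, None, 2, 5]
cleaned = [to_missing("q1", v) for v in responses]
n_missing = sum(v is None for v in cleaned)

print(cleaned)
print("missing:", n_missing)
```

Without the declaration, the two responses coded 5 would be counted as valid data and only the one system-missing value would be reported as missing.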

Frequency weights. Frequency (replication) weights are honored by this procedure. Cases with negative or zero replication weight value are ignored. Noninteger weights are truncated.
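The weighting rules above can be sketched as follows. This is a minimal illustration, not the procedure's implementation: the function names are hypothetical, and two details are assumptions made for the example, namely that truncation is applied after the sign check (so a weight such as 0.5 truncates to 0 and drops the case) and that a missing weight drops the case.

```python
import math


def effective_weight(w):
    """Truncate a noninteger weight; non-positive weights drop the case."""
    if w is None or w <= 0:  # treating a missing weight as 0 is an assumption
        return 0
    return math.trunc(w)


def weighted_mean(values, weights):
    """Mean of values where each case counts effective_weight(w) times."""
    total = 0.0
    n = 0
    for v, w in zip(values, weights):
        k = effective_weight(w)
        if k == 0 or v is None:
            continue
        total += v * k
        n += k
    return (total / n if n else None), n


values = [10.0, 20.0, 30.0, 40.0]
weights = [2.7, -1, 0, 1]

wmean, n_weighted = weighted_mean(values, weights)
print("weighted mean:", wmean, "weighted n:", n_weighted)
```

Here the weight 2.7 is truncated to 2, the cases with weights -1 and 0 are ignored, and the weighted number of cases is 3.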

Assumptions. Listwise, pairwise, and regression estimation depend on the assumption that the pattern of missing values does not depend on the data values. (This condition is known as missing completely at random, or MCAR.) When the data are MCAR, all estimation methods (including the EM method) give consistent and unbiased estimates of the correlations and covariances. Violation of the MCAR assumption can bias the estimates produced by the listwise, pairwise, and regression methods. If the data are not MCAR, you need to use EM estimation.

EM estimation depends on the assumption that the pattern of missing data is related to the observed data only. (This condition is called missing at random, or MAR.) This assumption allows estimates to be adjusted using available information. For example, in a study of education and income, the subjects with low education may have more missing income values. In this case, the data are MAR, not MCAR. In other words, for MAR, the probability that income is recorded depends on the subject's level of education. The probability may vary by education but not by income within that level of education. If the probability that income is recorded also varies by the value of income within each level of education (for example, people with high incomes don't report them), then the data are neither MCAR nor MAR. This situation is not uncommon, and, if it applies, none of the methods is appropriate.
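The education and income example can be made concrete with a small sketch. It is not EM estimation. It swaps in a much simpler stratified adjustment purely to show how, under MAR, the observed education values can be used to correct an estimate. The records and function names are hypothetical, and the sketch assumes that the unrecorded low-education incomes resemble the recorded ones in the same group, which is what MAR implies.

```python
from statistics import mean

# Hypothetical records: (education level, income); None = income not recorded.
# Income is missing only in the low-education group, so missingness depends
# on education (observed), not on income within an education level.
records = [
    ("low", None), ("low", 22.0), ("low", 24.0), ("low", None),
    ("high", 60.0), ("high", 62.0), ("high", 64.0), ("high", 66.0),
]


def complete_case_mean(records):
    """Mean income over cases where income was recorded."""
    return mean(inc for _, inc in records if inc is not None)


def education_adjusted_mean(records):
    """Weight each education group's observed mean by the group's full size."""
    groups = {}
    for edu, inc in records:
        groups.setdefault(edu, []).append(inc)
    total = 0.0
    for incomes in groups.values():
        observed = [x for x in incomes if x is not None]
        total += mean(observed) * len(incomes)
    return total / len(records)


naive = complete_case_mean(records)
adjusted = education_adjusted_mean(records)
print("complete-case mean:", round(naive, 2))
print("education-adjusted mean:", round(adjusted, 2))
```

The complete-case mean is pulled upward because the low-education group is underrepresented among the recorded incomes. The adjusted mean restores each group to its full share. If missingness also depended on income within an education level, neither figure could be trusted.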

Related procedures. Many procedures allow you to use listwise or pairwise estimation. Linear Regression and Factor Analysis allow replacement of missing values by the mean values. In the Forecasting add-on module, several methods are available to replace missing values in time series.

To Obtain Missing Value Analysis

This feature requires the Missing Values option.

  1. From the menus choose:

    Analyze > Missing Value Analysis...

  2. Select at least one quantitative (scale) variable for estimating statistics and optionally imputing missing values.

Optionally, you can:

  • Select categorical variables (numeric or string) and enter a limit on the number of categories (Maximum Categories).
  • Click Patterns to tabulate patterns of missing data. See the topic Displaying Patterns of Missing Values for more information.
  • Click Descriptives to display descriptive statistics of missing values. See the topic Displaying Descriptive Statistics for Missing Values for more information.
  • Select a method for estimating statistics (means, covariances, and correlations) and possibly imputing missing values. See the topic Estimating Statistics and Imputing Missing Values for more information.
  • If you select EM or Regression, click Variables to specify a subset to be used for the estimation. See the topic Predicted and Predictor Variables for more information.
  • Select a case label variable. This variable is used to label cases in patterns tables that display individual cases.