Overview (MULTIPLE IMPUTATION command)

The MULTIPLE IMPUTATION procedure performs multiple imputation of missing data values. Given a dataset containing missing values, it outputs one or more datasets in which missing values are replaced with plausible estimates. The procedure also summarizes missing values in the working dataset.

Datasets produced by the MULTIPLE IMPUTATION procedure can be analyzed using supported analysis procedures to obtain final (combined) parameter estimates that take into account the inherent uncertainty among the various sets of imputed values.

Options

Variables. You can specify which variables to impute and specify constraints on the imputed values such as minimum and maximum values. You can also specify which variables are used as predictors when imputing missing values of other variables.
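
As a hedged illustration of these variable options, the sketch below constrains one variable and restricts another to a predictor-only role. The variable names (age, income, region) and the dataset name imputedData are hypothetical, and the keyword spellings (MIN, MAX, RND, ROLE) are taken from the CONSTRAINTS subcommand as we understand it, so check them against the syntax chart before relying on them.

```spss
* Hypothetical variables: age and income are scale, region is categorical.
DATASET DECLARE imputedData.
MULTIPLE IMPUTATION age income region
  /CONSTRAINTS age (MIN=18 MAX=90 RND=1)
  /CONSTRAINTS region (ROLE=IND)
  /OUTFILE IMPUTATIONS=imputedData.
```

Here age is imputed within the range 18 to 90 and rounded to whole numbers, while region serves only as a predictor and its own missing values are not imputed.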

Methods. Three imputation methods are offered. The monotone method is an efficient method for data that have a monotone pattern of missingness. Fully conditional specification (FCS) is an iterative Markov chain Monte Carlo (MCMC) method that is appropriate when the data have an arbitrary (monotone or nonmonotone) missing pattern. The default method (AUTO) scans the data to determine the best imputation method (monotone or FCS). For each method, you can control the number of imputations.
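
A minimal sketch of choosing a method explicitly follows. The variables x, y, and z and the dataset name imputedData are placeholders; METHOD and NIMPUTATIONS are keywords of the IMPUTE subcommand, and the value 10 is an arbitrary choice for illustration.

```spss
* Request FCS explicitly and ten imputations instead of the default five.
DATASET DECLARE imputedData.
MULTIPLE IMPUTATION x y z
  /IMPUTE METHOD=FCS NIMPUTATIONS=10
  /OUTFILE IMPUTATIONS=imputedData.
```

Omitting METHOD leaves the choice between monotone and FCS to the AUTO scan described above.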

Output. By default, the procedure displays an overall summary of missingness in your data, an imputation summary, and the imputation model for each variable whose values are imputed. You can also obtain an analysis of missing values by variable and tabulated patterns of missing values. If you request imputation, you can obtain descriptive statistics for imputed values.
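
The optional output described here could be requested roughly as follows. This is a sketch with placeholder variables x, y, and z and a placeholder dataset name; the keywords shown belong to the MISSINGSUMMARIES and IMPUTATIONSUMMARIES subcommands.

```spss
* Add by-variable and pattern summaries plus descriptives for imputed values.
DATASET DECLARE imputedData.
MULTIPLE IMPUTATION x y z
  /MISSINGSUMMARIES OVERALL VARIABLES PATTERNS
  /IMPUTATIONSUMMARIES MODELS DESCRIPTIVES
  /OUTFILE IMPUTATIONS=imputedData.
```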

Basic Specification

The basic specification is two or more variables and a file specification where the imputed data will be written.

  • By default, the procedure imputes missing values using the AUTO method. Five imputations are performed.
  • When imputing, the default model type depends on the measurement level of the variable: for categorical variables, logistic regression is used; for scale variables, linear regression is used.
  • The procedure generates output that summarizes missingness in the data and summarizes how values were imputed.
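
A basic specification might look like the sketch below. The variable names x, y, and z and the dataset name imputedData are placeholders chosen for illustration.

```spss
* Declare a dataset name to receive the stacked original and imputed data.
DATASET DECLARE imputedData.
* Impute x, y, and z with all defaults: AUTO method, five imputations.
MULTIPLE IMPUTATION x y z
  /OUTFILE IMPUTATIONS=imputedData.
```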

Operations

  • The output dataset contains the original (nonmissing) data and data for one or more imputations. Each imputation includes all of the observed data and imputed data values. The original and imputed data are stacked in the output dataset. A special variable, Imputation_, identifies whether a case represents original data (Imputation_ = 0) or imputed data (Imputation_ = 1…m).
  • Multiple imputation datasets can be analyzed using supported analysis procedures to obtain final (combined) parameter estimates that take into account the inherent uncertainty in the various sets of imputed values. Variable Imputation_ must be defined as a split variable in order to obtain pooled parameter estimates.
  • The procedure honors the random number generator and seed specified via the global SET command. Specify the same seed across invocations of the procedure if you want to be able to reproduce your imputed values.
  • The procedure honors the WEIGHT variable. It is treated as a replication weight when summarizing missing values and estimating imputation models. Cases with a negative or zero replication weight are ignored. Noninteger weights are rounded to the nearest integer. The procedure also accepts analysis weights (see the ANALYSISWEIGHT subcommand).
  • The procedure honors SPLIT FILE. A separate missing value analysis and set of imputations is produced for each combination of values of the split variables. An error occurs if imputation is requested and the input dataset has eight split variables.
  • The procedure honors the FILTER command. Cases filtered out are ignored by the procedure.
  • The procedure accepts string variables and treats them as categorical. Completely blank string values are treated as valid values, i.e., they are not replaced.
  • The procedure treats both user- and system-missing values as invalid values. Both types of missing values are replaced when values are imputed and both are treated as invalid values of variables used as predictors in imputation models. User- and system-missing values are also treated as missing in analyses of missingness (counts of missing values, etc.).
  • Cases that have a missing value for each analysis variable are included in analyses of missingness but are excluded from imputation. Specifically, values of such cases are not imputed and are excluded when building imputation models. The determination of which cases are completely missing is made after any variables are filtered out of the imputation model by the MAXPCTMISSING keyword.
  • An error occurs if imputation is requested and the input dataset contains a variable named Imputation_.
  • An error occurs if imputation and iteration history are requested and the input dataset contains a variable named Iteration_ or SummaryStatistic_.

Note: Measurement level can affect the results. If any variables (fields) have an unknown measurement level, a data pass is performed to determine the measurement level before the analysis begins. For information on the determination criteria, see SET SCALEMIN.
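
The seed and split-variable behavior described in the list above can be sketched as follows. The variable names, the dataset name, and the seed value are placeholders, and DESCRIPTIVES stands in for any analysis procedure that supports pooling; this illustrates the workflow under those assumptions rather than prescribing one.

```spss
* Fix the random number generator and seed so imputed values can be reproduced.
SET RNG=MT MTINDEX=20240101.
DATASET DECLARE imputedData.
MULTIPLE IMPUTATION x y z
  /OUTFILE IMPUTATIONS=imputedData.
* Analyze the stacked data with Imputation_ defined as a split variable.
DATASET ACTIVATE imputedData.
SPLIT FILE LAYERED BY Imputation_.
DESCRIPTIVES VARIABLES=x y z.
SPLIT FILE OFF.
```

Rerunning the same commands with the same SET values on the same input data should yield the same imputed values.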

Syntax Rules

  • Two or more analysis variables are required.
  • The OUTFILE subcommand is required unless imputation is turned off (/IMPUTE METHOD=NONE). All other subcommands are optional.
  • Only a single instance of each subcommand is allowed except for the CONSTRAINTS subcommand, which can be repeated.
  • An error occurs if an attribute or keyword is specified more than once within a subcommand.
  • Equals signs shown in the syntax chart are required.
  • Subcommand names and keywords must be spelled in full.
  • Empty subcommands are not allowed.
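
Two of these rules can be illustrated with a short sketch, again using placeholder variable and dataset names. The first command turns imputation off, so OUTFILE is omitted; the second repeats CONSTRAINTS, the only subcommand that may appear more than once.

```spss
* Missing value analysis only: no OUTFILE needed when METHOD=NONE.
MULTIPLE IMPUTATION x y z
  /IMPUTE METHOD=NONE
  /MISSINGSUMMARIES VARIABLES.

* CONSTRAINTS repeated; all other subcommands appear once.
DATASET DECLARE imputedData.
MULTIPLE IMPUTATION x y z
  /CONSTRAINTS x (MIN=0)
  /CONSTRAINTS y (MAX=100)
  /OUTFILE IMPUTATIONS=imputedData.
```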