EM Subcommand (MVA command)

The EM subcommand uses an EM (expectation-maximization) algorithm to estimate the means, covariances, and Pearson correlations of quantitative variables. The algorithm is iterative, and each iteration has two steps. The E step computes expected values conditional on the observed data and the current parameter estimates. The M step calculates maximum-likelihood estimates of the parameters from the values computed in the E step.
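
The two steps can be illustrated with a short Python sketch for data assumed to be multivariate normal, with missing values coded as NaN. This is not the MVA implementation. The names `em_mvn` and `to_correlations` are hypothetical, every variable is assumed to have at least one observed value, and the stopping rule (largest change in any parameter) is a simplification chosen to keep the example short.

```python
import numpy as np

def em_mvn(data, iterations=25, convergence=1e-4):
    """Illustrative EM for the mean and covariance of data with NaN gaps.

    Assumes multivariate normal data; not the MVA implementation.
    """
    x = np.asarray(data, dtype=float)
    n, p = x.shape
    missing = np.isnan(x)
    # Starting values: available-case means and a diagonal covariance.
    mu = np.nanmean(x, axis=0)
    sigma = np.diag(np.nanvar(x, axis=0))
    for _ in range(iterations):
        # E step: expected values of the missing entries given the observed
        # entries and the current estimates, plus the conditional covariance
        # that the missing entries contribute to the cross-products.
        filled = np.where(missing, 0.0, x)
        extra = np.zeros((p, p))
        for i in range(n):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():
                # Nothing observed in this case: fall back on the current estimates.
                filled[i] = mu
                extra += sigma
                continue
            s_oo = sigma[np.ix_(o, o)]
            s_mo = sigma[np.ix_(m, o)]
            coef = np.linalg.solve(s_oo, s_mo.T).T  # regression of missing on observed
            filled[i, m] = mu[m] + coef @ (x[i, o] - mu[o])
            extra[np.ix_(m, m)] += sigma[np.ix_(m, m)] - coef @ s_mo.T
        # M step: maximum-likelihood estimates from the expected statistics.
        new_mu = filled.mean(axis=0)
        centered = filled - new_mu
        new_sigma = (centered.T @ centered + extra) / n
        # Simplified stopping rule (an assumption): largest parameter change.
        change = max(np.abs(new_mu - mu).max(), np.abs(new_sigma - sigma).max())
        mu, sigma = new_mu, new_sigma
        if change < convergence:
            break
    return mu, sigma

def to_correlations(sigma):
    """Convert a covariance matrix to Pearson correlations."""
    sd = np.sqrt(np.diag(sigma))
    return sigma / np.outer(sd, sd)

# Small demonstration on simulated data with gaps in the second variable.
rng = np.random.default_rng(0)
sample = rng.multivariate_normal([0.0, 1.0], [[1.0, 0.5], [0.5, 2.0]], size=200)
sample[rng.random(200) < 0.3, 1] = np.nan
mu_hat, sigma_hat = em_mvn(sample)
print(mu_hat)
print(to_correlations(sigma_hat))
```

With complete data, the sketch returns the ordinary sample means and the maximum-likelihood covariance matrix (divisor n).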

  • If no variables are listed in the EM subcommand, estimates are computed for all quantitative variables in the variables list.
  • To limit the estimation to a subset of the variables in the list, specify the quantitative variables to be estimated after the subcommand name EM. After the keyword WITH, you can also list the quantitative variables to be used as predictors in the estimation.
  • The output includes tables of means, correlations, and covariances.
  • The estimation, by default, assumes that the data are normally distributed. However, you can specify a multivariate t distribution with a specified number of degrees of freedom or a mixed normal distribution with any mixture proportion (PROPORTION) and any standard deviation ratio (LAMBDA).
  • You can save a data file with the missing values filled in. Specify the filename, including a path if the file should not be written to the working directory, in single or double quotation marks.
  • Criteria keywords and OUTFILE specifications must be enclosed in a single pair of parentheses.

The criteria for the EM subcommand are as follows:

TOLERANCE=value. Numerical accuracy control. Helps eliminate predictor variables that are highly correlated with other predictor variables and would reduce the accuracy of the matrix inversions that are involved in the calculations. The smaller the tolerance, the more inaccuracy is tolerated. The default value is 0.001.

CONVERGENCE=value. Convergence criterion. Determines when iteration ceases. If the relative change in the likelihood function is less than this value, convergence is assumed. The value of this ratio must be between 0 and 1. The default value is 0.0001.

ITERATIONS=n. Maximum number of iterations. Limits the number of iterations in the EM algorithm. Iteration stops after this many iterations even if the convergence criterion is not satisfied. The default value is 25.
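
A minimal Python sketch of how the CONVERGENCE and ITERATIONS criteria could interact. The function name `should_stop` and the exact form of the relative change (absolute change divided by the absolute previous log-likelihood) are assumptions made for illustration. The descriptions above state only that iteration ends when the relative change in the likelihood function falls below the CONVERGENCE value or when the ITERATIONS limit is reached.

```python
def should_stop(prev_loglik, new_loglik, iteration,
                convergence=0.0001, iterations=25):
    """Return (stop, reason) for one pass of an iterative estimator.

    Hypothetical helper; the relative-change formula is an assumption.
    """
    change = abs(new_loglik - prev_loglik)
    # Scale by the previous value when possible to get a relative change.
    relative = change / abs(prev_loglik) if prev_loglik != 0 else change
    if relative < convergence:
        return True, "converged"
    if iteration >= iterations:
        # The limit applies even though the criterion is not satisfied.
        return True, "iteration limit"
    return False, "continue"
```

In this sketch the convergence test is checked first, so a run that converges exactly on the last permitted iteration is reported as converged.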

Possible distribution assumptions are as follows:

TDF=n. Student’s t distribution with n degrees of freedom. If you use this keyword, you must specify the degrees of freedom as an integer greater than or equal to 2.

LAMBDA=a. Ratio of standard deviations of a mixed normal distribution. Any positive real number can be specified.

PROPORTION=b. Mixture proportion of two normal distributions. Any real number between 0 and 1 can be specified.
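
To make the mixed normal assumption concrete, the following Python sketch draws values from a mixture of two normal distributions that share a mean. It rests on an assumption that is not stated above: a fraction b of the cases comes from the component whose standard deviation is a times that of the other component. The function names and the `mean`, `sd`, and `seed` parameters are hypothetical.

```python
import numpy as np

def sample_mixed_normal(n, proportion, lam, mean=0.0, sd=1.0, seed=0):
    """Draw n values from a two-component normal mixture (illustrative).

    Assumption: a fraction `proportion` of cases has standard deviation
    lam * sd; the remaining cases have standard deviation sd.
    """
    rng = np.random.default_rng(seed)
    wide = rng.random(n) < proportion
    scale = np.where(wide, lam * sd, sd)
    return mean + scale * rng.standard_normal(n)

def mixed_normal_variance(proportion, lam, sd=1.0):
    """Variance of the mixture under the same assumption."""
    return sd ** 2 * ((1.0 - proportion) + proportion * lam ** 2)

draws = sample_mixed_normal(200_000, proportion=0.1, lam=3.0)
print(draws.var(), mixed_normal_variance(0.1, 3.0))
```

A small proportion combined with a large ratio gives a distribution with heavier tails than a single normal distribution, which is the usual reason for choosing this kind of model.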

The following keyword produces a new data file:

OUTFILE='file'. Specify a filename or previously declared dataset name. Filenames should be enclosed in quotation marks and are stored in the working directory unless a path is included as part of the file specification. Datasets are available during the current session but are not available in subsequent sessions unless you explicitly save them as data files. Missing values for predicted variables in the file are filled in by using the EM algorithm. (Note that the data that are completed with EM-based imputations will not in general reproduce the EM estimates from MVA.)
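
The note about completed data can be illustrated with a small Python simulation. It assumes that each missing value is filled in with its conditional expectation given the observed values, and it uses a hypothetical two-variable setup in which x is complete and y is missing at random. Under those assumptions, the variance computed from the filled-in data is smaller than the likelihood-based estimate, because the likelihood-based estimate also adds the residual variance for every filled-in cell.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.standard_normal(n)
y = 0.6 * x + 0.8 * rng.standard_normal(n)   # population variance of y is 1
missing = rng.random(n) < 0.4                # y missing in roughly 40% of cases

# Regression of y on x, estimated from the complete cases.
slope, intercept = np.polyfit(x[~missing], y[~missing], 1)
residual_var = np.var(y[~missing] - (intercept + slope * x[~missing]))

# Fill each missing y with its conditional expectation given x.
y_filled = np.where(missing, intercept + slope * x, y)

# Variance recomputed naively from the completed data.
var_from_filled_data = y_filled.var()
# Likelihood-based variance: adds the residual variance for the filled cells.
var_with_correction = var_from_filled_data + missing.mean() * residual_var

print(var_from_filled_data, var_with_correction)
```

The gap between the two values grows with the share of missing cases and with the residual variance of the predicted variable.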

Examples

MVA VARIABLES=males to tuition
 /EM (OUTFILE='/colleges/emdata.sav').
  • All variables on the variables list are included in the estimations.
  • The output includes the means of the listed variables, a correlation matrix, and a covariance matrix.
  • A new data file named emdata.sav with imputed values is saved in the /colleges directory.
MVA VARIABLES=all
 /EM males msport WITH males msport gradrate facratio.
  • For males and msport, the output includes a vector of means, a correlation matrix, and a covariance matrix.
  • The values in the tables are calculated by using imputed values for males and msport. Existing observations for males, msport, gradrate, and facratio are used to impute the values that are used to estimate the means, correlations, and covariances.
MVA VARIABLES=males to tuition
 /EM verbal math WITH males msport gradrate facratio
  (TDF=3 OUTFILE='/colleges/emdata.sav').
  • The analysis uses a t distribution with three degrees of freedom.
  • A new data file named emdata.sav with imputed values is saved in the /colleges directory.