EM Subcommand (MVA command)
The EM subcommand uses an EM (expectation-maximization) algorithm to estimate the means,
the covariances, and the Pearson correlations of quantitative variables.
The algorithm is iterative, and each iteration has two steps.
The E step computes expected values conditional on the
observed data and the current estimates of the parameters. The M step
calculates maximum-likelihood estimates of the parameters based on
the values that are computed in the E step.
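The two steps can be illustrated with a minimal Python sketch. It is not the MVA implementation: it assumes only two variables, bivariate normal data, a fully observed `x`, and missing `y` values coded as `None`. The function name `em_bivariate`, the parameters `max_iter` and `tol`, and the rule of stopping on parameter change are all choices made for this sketch.

```python
import math

def em_bivariate(xs, ys, max_iter=1000, tol=1e-10):
    """Illustrative EM for (x, y) pairs where some y values are None.

    Assumes x is fully observed and the data are bivariate normal.
    Returns ML-type estimates (divisor n) of means, variances, covariance.
    """
    n = len(xs)
    observed = [y for y in ys if y is not None]
    # x is complete, so its mean and variance need no iteration.
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs) / n
    # Starting values for y come from the available cases.
    my = sum(observed) / len(observed)
    syy = sum((y - my) ** 2 for y in observed) / len(observed)
    sxy = 0.0
    for _ in range(max_iter):
        # E step: expected y and y^2 for each case, conditional on x
        # and on the current parameter estimates.
        beta = sxy / sxx
        resid_var = syy - beta * sxy
        ey, ey2 = [], []
        for x, y in zip(xs, ys):
            if y is None:
                pred = my + beta * (x - mx)
                ey.append(pred)
                ey2.append(pred ** 2 + resid_var)
            else:
                ey.append(y)
                ey2.append(y ** 2)
        # M step: maximum-likelihood estimates computed from the
        # expected sums produced by the E step.
        new_my = sum(ey) / n
        new_syy = sum(ey2) / n - new_my ** 2
        new_sxy = sum(x * e for x, e in zip(xs, ey)) / n - mx * new_my
        change = max(abs(new_my - my), abs(new_syy - syy), abs(new_sxy - sxy))
        my, syy, sxy = new_my, new_syy, new_sxy
        if change < tol:  # sketch-only stopping rule
            break
    return {
        "mean_x": mx, "mean_y": my,
        "var_x": sxx, "var_y": syy,
        "cov_xy": sxy,
        "corr": sxy / math.sqrt(sxx * syy),
    }

estimates = em_bivariate([1, 2, 3, 4, 5], [2, 4, 6, None, None])
```

Note that the E step adds the residual variance to the expected square of each missing value. Filling in only the predicted value would understate the variance of `y`.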
- If no variables are listed in the EM subcommand, estimates are performed for all quantitative variables in the variables list.
- If you want to limit the estimation to a subset of the variables in the list, specify a subset of quantitative variables to be estimated after the subcommand name EM. You can also list, after the keyword WITH, the quantitative variables to be used in estimating.
- The output includes tables of means, correlations, and covariances.
- The estimation, by default, assumes that the data are normally distributed. However, you can specify a multivariate t distribution with a specified number of degrees of freedom or a mixed normal distribution with any mixture proportion (PROPORTION) and any standard deviation ratio (LAMBDA).
- You can save a data file with the missing values filled in. You must specify a filename and its complete path in single or double quotation marks.
- Criteria keywords and OUTFILE specifications must be enclosed in a single pair of parentheses.
The criteria for the EM subcommand are as follows:
TOLERANCE=value. Numerical accuracy control. Helps eliminate predictor variables that are highly correlated with other predictor variables and would reduce the accuracy of the matrix inversions that are involved in the calculations. The smaller the tolerance, the more inaccuracy is tolerated. The default value is 0.001.
CONVERGENCE=value. Convergence criterion. Determines when iteration ceases. If the relative change in the likelihood function is less than this value, convergence is assumed. The value of this ratio must be between 0 and 1. The default value is 0.0001.
ITERATIONS=n. Maximum number of iterations. Limits the number of iterations in the EM algorithm. Iteration stops after this many iterations even if the convergence criterion is not satisfied. The default value is 25.
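How CONVERGENCE and ITERATIONS interact can be sketched as a generic stopping rule. This is a sketch under stated assumptions rather than the MVA code: the helper `run_until_converged`, its arguments `step` and `loglik`, and the exact formula for the relative change are hypothetical. Only the default values 0.0001 and 25 are taken from the descriptions above.

```python
def run_until_converged(step, loglik, state, convergence=1e-4, iterations=25):
    """Apply `step` until the relative change in `loglik` is small.

    Returns (final_state, iterations_used, converged_flag).
    """
    previous = loglik(state)
    for i in range(1, iterations + 1):
        state = step(state)  # one full E step plus M step
        current = loglik(state)
        # Assumed definition of "relative change in the likelihood function".
        rel_change = abs(current - previous) / max(abs(previous), 1e-300)
        if rel_change < convergence:
            return state, i, True
        previous = current
    # The iteration limit was reached before the criterion was satisfied.
    return state, iterations, False

# Toy stand-ins for an EM update and its log-likelihood.
toy_step = lambda s: s / 2 + 1
toy_loglik = lambda s: -(s - 2) ** 2 - 1

state, used, converged = run_until_converged(toy_step, toy_loglik, 10.0)
```

With a small iteration limit, the same call returns with the flag set to `False`, which corresponds to iteration stopping even though the convergence criterion is not satisfied.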
Possible distribution assumptions are as follows:
TDF=n. Student’s t distribution with n degrees of freedom. The degrees of freedom must be specified if you use this keyword. The degrees of freedom must be an integer that is greater than or equal to 2.
LAMBDA=a. Ratio of standard deviations of a mixed normal distribution. Any positive real number can be specified.
PROPORTION=b. Mixture proportion of two normal distributions. Any real number between 0 and 1 can be specified.
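To give a feel for what LAMBDA and PROPORTION control, the sketch below evaluates a univariate mixed (contaminated) normal density. It rests on assumptions not stated in this section: both components share the same mean, the second component has standard deviation `lam * sd`, and `proportion` is the weight of that second component. The function names are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixed_normal_pdf(x, mean, sd, lam, proportion):
    """Density of a two-component normal mixture with a common mean.

    Assumed form: weight (1 - proportion) on N(mean, sd^2) and
    weight proportion on N(mean, (lam * sd)^2).
    """
    return ((1.0 - proportion) * normal_pdf(x, mean, sd)
            + proportion * normal_pdf(x, mean, lam * sd))

def mixed_normal_variance(sd, lam, proportion):
    """Variance of the mixture under the same assumed form."""
    return sd ** 2 * ((1.0 - proportion) + proportion * lam ** 2)

# With lam > 1, the mixture has heavier tails than the plain normal.
tail_plain = normal_pdf(4.0, 0.0, 1.0)
tail_mixed = mixed_normal_pdf(4.0, 0.0, 1.0, lam=3.0, proportion=0.1)
```

Under this assumed form, setting `proportion` to 0 or `lam` to 1 reduces the mixture to an ordinary normal distribution.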
The following keyword produces a new data file:
OUTFILE='file'. Specify a filename
or previously declared dataset name. Filenames should
be enclosed in quotation marks and are stored in the working directory
unless a path is included as part of the file specification. Datasets
are available during the current session but are not available in
subsequent sessions unless you explicitly save them as data files.
Missing values for predicted variables in the file are filled in by
using the EM algorithm. (Note that the data that are completed with
EM-based imputations will not in general reproduce the EM estimates
from MVA.)
Examples
MVA VARIABLES=males to tuition
/EM (OUTFILE='/colleges/emdata.sav').
- All variables on the variables list are included in the estimations.
- The output includes the means of the listed variables, a correlation matrix, and a covariance matrix.
- A new data file named emdata.sav with imputed values is saved in the /colleges directory.
MVA VARIABLES=all /EM males msport WITH males msport gradrate facratio.
- For males and msport, the output includes a vector of means, a correlation matrix, and a covariance matrix.
- The values in the tables are calculated by using imputed values for males and msport. Existing observations for males, msport, gradrate, and facratio are used to impute the values that are used to estimate the means, correlations, and covariances.
MVA VARIABLES=males to tuition /EM verbal math WITH males msport gradrate facratio (TDF=3 OUTFILE='/colleges/emdata.sav').
- The analysis uses a t distribution with three degrees of freedom.
- A new data file named emdata.sav with imputed values is saved in the /colleges directory.