Evaluating the Descriptive Statistics

For this example, the output includes:

  • Univariate statistics
  • Table of separate-variance t tests, including subgroup means when another variable is present or missing
  • Tables for each categorical variable showing frequencies of missing data for each category by each quantitative (scale) variable

Figure 1. Univariate statistics table
The Univariate Statistics table produced by Missing Value Analysis.

The univariate statistics provide your first look, variable by variable, at the extent of missing data. The number of nonmissing values for each variable appears in the N column, and the number of missing values appears in the Missing Count column. The Missing Percent column displays the percentage of cases with missing values and provides a good measure for comparing the extent of missing data among variables. income (Household income in thousands) has the highest percentage of cases with missing values (17.9%), while age (Age in years) has the lowest (2.5%). income also has the greatest number of extreme values.
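The N, Missing Count, and Missing Percent columns can be reproduced by hand. The following is a minimal Python sketch, not the procedure's own implementation. The function name `missing_summary` and the toy data are hypothetical, and missing values are assumed to be coded as `None`.

```python
from typing import Dict, List, Optional


def missing_summary(values: List[Optional[float]]) -> Dict[str, float]:
    """Return N, missing count, and missing percent for one variable."""
    total = len(values)
    missing = sum(1 for v in values if v is None)
    # Percent of all cases (not of nonmissing cases) that are missing.
    percent = 100.0 * missing / total if total else 0.0
    return {"N": total - missing, "missing_count": missing, "missing_percent": percent}


# Hypothetical toy data, not the example dataset; None marks a missing value.
toy_data = {
    "age": [25, 41, None, 58, 33, 47, 62, 29],
    "income": [30.0, None, 55.0, None, 42.0, 61.0, None, 38.0],
}

summary = {name: missing_summary(column) for name, column in toy_data.items()}
for name, row in summary.items():
    print(f"{name:8s} N={row['N']:3d}  Missing={row['missing_count']:3d}  "
          f"Percent={row['missing_percent']:.1f}")
```

Because the percentage is taken over all cases, it stays comparable across variables even when their N values differ.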

Figure 2. Separate-variance t tests table
The separate-variance t tests table produced by Missing Value Analysis.

The separate-variance t tests table can help to identify variables whose pattern of missing values may be influencing the quantitative (scale) variables. The t test is computed using an indicator variable that specifies whether a variable is present or missing for an individual case. The subgroup means for the indicator variable are also tabulated. Note that an indicator variable is created only if a variable has missing values in at least 5% of the cases.
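The indicator-variable approach can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure's actual code: the function name `separate_variance_t`, the `None` coding for missing values, the toy data, and the sign convention (present minus missing) are all hypothetical. The statistic itself is the standard separate-variance (Welch) t with Welch-Satterthwaite degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance


def separate_variance_t(target, other, min_missing_fraction=0.05):
    """Compare `target` between cases where `other` is present and missing.

    Returns None when `other` is missing in fewer than 5% of cases
    (no indicator variable is created) or when a subgroup is too small.
    """
    if len(target) != len(other):
        raise ValueError("both variables must cover the same cases")
    missing_fraction = sum(v is None for v in other) / len(other)
    if missing_fraction < min_missing_fraction:
        return None

    # Split the nonmissing target values by the indicator for `other`.
    present = [t for t, o in zip(target, other) if o is not None and t is not None]
    missing = [t for t, o in zip(target, other) if o is None and t is not None]
    if len(present) < 2 or len(missing) < 2:
        return None

    # Each group contributes its own variance; nothing is pooled.
    v_present = variance(present) / len(present)
    v_missing = variance(missing) / len(missing)
    se_squared = v_present + v_missing
    if se_squared == 0:
        return None

    t = (mean(present) - mean(missing)) / sqrt(se_squared)
    df = se_squared ** 2 / (
        v_present ** 2 / (len(present) - 1) + v_missing ** 2 / (len(missing) - 1)
    )
    return {
        "t": t,
        "df": df,
        "mean_present": mean(present),
        "mean_missing": mean(missing),
    }


# Hypothetical toy data, not the example dataset.
age = [30, 35, 40, 45, 60, 65, 70, 50]
income = [20.0, 25.0, 30.0, 35.0, None, None, None, 40.0]
result = separate_variance_t(age, income)
print(result)
```

A large absolute t value together with clearly different subgroup means suggests that missingness in one variable is related to the values of the other.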

It appears that older respondents are less likely to report income levels. When income is missing, the mean age is 49.73, compared to 40.01 when income is nonmissing. In fact, the missingness of income seems to affect the means of several of the quantitative (scale) variables. This is one indication that the data may not be missing completely at random.

Figure 3. Crosstabulation for Marital status [marital]
Crosstabulation of marital status categories versus indicator variables.

The crosstabulations of categorical variables versus indicator variables show information similar to that found in the separate-variance t test table. Indicator variables are once again created, except this time they are used to calculate frequencies in every category for each categorical variable. The values can help you determine whether there are differences in missing values among categories.
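A crosstabulation of this kind reduces to counting present and missing cases within each category. The sketch below is a simplified illustration, not the procedure's implementation. The function name `percent_present_by_category`, the `None` coding for missing values, and the toy data are hypothetical, and cases with a missing category are simply skipped.

```python
from collections import defaultdict


def percent_present_by_category(categories, values):
    """Percent of cases with a nonmissing value, within each category."""
    counts = defaultdict(lambda: [0, 0])  # category -> [present, total]
    for category, value in zip(categories, values):
        if category is None:
            continue  # assumption: ignore cases whose category is missing
        counts[category][1] += 1
        if value is not None:
            counts[category][0] += 1
    return {c: 100.0 * present / total for c, (present, total) in counts.items()}


# Hypothetical toy data, not the example dataset.
retire = ["no", "no", "no", "no", "yes", "yes"]
income = [10.0, 20.0, None, 30.0, None, 40.0]
present_pct = percent_present_by_category(retire, income)
for category, pct in present_pct.items():
    print(f"retire={category}: income present in {pct:.1f}% of cases")
```

Percentages that are close across categories suggest that missingness is unrelated to the categorical variable, while large gaps point the other way.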

Looking at the table for marital (Marital status), the number of missing values in the indicator variables does not appear to vary much between marital categories. Whether someone is married or unmarried does not seem to affect whether data are missing for any of the quantitative (scale) variables. For example, unmarried people reported address (Years at current address) 85.5% of the time, and married people reported the same variable 83.4% of the time. The difference is minimal and likely due to chance.

Figure 4. Crosstabulation for Level of education [ed]
Crosstabulation of level of education categories versus indicator variables.

Now consider the crosstabulation for ed (Level of education). If a respondent has at least some college education, a response for marital status is more likely to be missing. At least 98.5% of the respondents with no college education reported marital status. On the other hand, only 81.1% of those with a college degree reported marital status. The number is even lower for those with some college education but no degree.

Figure 5. Crosstabulation for Retired [retire]
Crosstabulation of retirement status categories versus indicator variables.

A more drastic difference can be seen in retire (Retired). Those who are retired are much less likely to report their income than those who are not retired. Only 46.3% of the retired customers reported income level, compared with 83.7% of those who are not retired.

Figure 6. Crosstabulation for Gender [gender]
Crosstabulation of Gender categories versus indicator variables.

Another discrepancy is apparent for gender (Gender). Address information is missing more often for males than for females. Although these discrepancies could be due to chance, it seems unlikely. The data do not appear to be missing completely at random.

We will look at the patterns of missing data to explore this further.
