# Category Codes

Some care should be taken when coding categorical variables because
some coding schemes may yield unwanted output or incomplete analyses.
Possible coding schemes for *job* are displayed in the following table.

Category | A | B | C | D |
---|---|---|---|---|
intern | 1 | 1 | 5 | 1 |
sales rep | 2 | 2 | 6 | 5 |
manager | 3 | 7 | 7 | 3 |

Some Categories procedures require that the range of every variable
used be defined. Any value outside this range is treated as a missing
value. The minimum category value is always 1. The maximum category
value is supplied by the user. This value is not the *number* of categories for a variable; it
is the *largest* category value.
For example, in the table, scheme A has a maximum category value of
3 and scheme B has a maximum category value of 7, yet both schemes
code the same three categories.

The variable range determines which categories are omitted from the
analysis: any category with a code outside the defined range is dropped.
This is a simple method for omitting categories, but it can result in
unwanted analyses. An incorrectly
defined maximum category can omit *valid* categories from the analysis. For example, for scheme B, defining
the maximum category value to be 3 indicates that *job* has categories coded from 1 to 3; the *manager* category is treated as missing. Because
no category has actually been coded 3, the third category in the analysis
contains no cases. If you wanted to omit all managers, this
analysis would be appropriate. However, if managers are to be included,
the maximum category must be defined as 7, and missing values must
be coded with values above 7 or below 1.
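The effect of the defined range can be sketched in a few lines of Python. This is only an illustration of the rule described above, not how any Categories procedure is implemented; the function name `apply_range` and the use of `None` to stand for a missing value are assumptions made for the example.

```python
from typing import Dict, Optional


def apply_range(codes: Dict[str, int], max_value: int,
                min_value: int = 1) -> Dict[str, Optional[int]]:
    """Replace any code outside [min_value, max_value] with None (missing)."""
    return {
        label: (code if min_value <= code <= max_value else None)
        for label, code in codes.items()
    }


# Scheme B from the job table.
scheme_b = {"intern": 1, "sales rep": 2, "manager": 7}

# Maximum defined as 3: the manager code (7) falls outside the range.
too_small = apply_range(scheme_b, max_value=3)
print(too_small)   # {'intern': 1, 'sales rep': 2, 'manager': None}

# Maximum defined as 7: all three categories are kept.
correct = apply_range(scheme_b, max_value=7)
print(correct)     # {'intern': 1, 'sales rep': 2, 'manager': 7}
```

With the maximum set to 3, category 3 is still a valid code, but no case carries it, which is why the third category in the analysis is empty.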

For variables treated as nominal or ordinal, the range
of the categories does not affect the results. For nominal variables,
only the label and not the value associated with that label is important.
For ordinal variables, the order of the categories is preserved in
the quantifications; the category values themselves are not important.
All coding schemes resulting in the same category ordering will have
identical results. For example, the first three schemes in the table
are functionally equivalent if *job* is analyzed at an ordinal level. The order of the categories is identical
in these schemes. Scheme D, on the other hand, inverts the second
and third categories and will yield different results from the other
schemes.
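One way to check whether two coding schemes are equivalent at the ordinal level is to compare the category order each one implies. The following Python sketch does this for the four schemes in the table; the helper name `category_order` is hypothetical and the check is an illustration, not part of any procedure.

```python
def category_order(codes):
    """Return the labels sorted by code: the only information ordinal treatment uses."""
    return [label for label, _ in sorted(codes.items(), key=lambda item: item[1])]


schemes = {
    "A": {"intern": 1, "sales rep": 2, "manager": 3},
    "B": {"intern": 1, "sales rep": 2, "manager": 7},
    "C": {"intern": 5, "sales rep": 6, "manager": 7},
    "D": {"intern": 1, "sales rep": 5, "manager": 3},
}

orders = {name: category_order(codes) for name, codes in schemes.items()}
for name, order in orders.items():
    print(name, order)
# A, B, and C give ['intern', 'sales rep', 'manager'];
# D gives ['intern', 'manager', 'sales rep'].
```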

Although many coding schemes for a variable are functionally
equivalent, schemes with small differences between codes are preferred
because the codes have an impact on the amount of output produced
by a procedure. All categories coded with values between 1 and the
user-defined maximum are valid. If any of these categories are empty,
the corresponding quantifications will be either system-missing or
0, depending on the procedure. Although neither of these assignments
affects the analyses, output is produced for these categories. Thus,
for scheme B, *job* has four categories
that receive system-missing values. For scheme C, there are also four
categories receiving system-missing indicators. In contrast, for scheme
A there are no system-missing quantifications. Using consecutive integers
as codes for variables treated as nominal or ordinal results in much
less output without affecting the results.
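The counts quoted above can be reproduced by listing the codes between 1 and the maximum that no category uses. The Python sketch below assumes the maximum is defined as the largest code in each scheme; `empty_categories` is a hypothetical helper written for this example.

```python
def empty_categories(codes, max_value=None):
    """List the values in 1..max_value that are not used as a category code."""
    used = set(codes)
    top = max(used) if max_value is None else max_value
    return [value for value in range(1, top + 1) if value not in used]


empty_a = empty_categories([1, 2, 3])   # scheme A
empty_b = empty_categories([1, 2, 7])   # scheme B
empty_c = empty_categories([5, 6, 7])   # scheme C

print(empty_a)  # []
print(empty_b)  # [3, 4, 5, 6]
print(empty_c)  # [1, 2, 3, 4]
```

Schemes B and C each leave four empty categories that still appear in the output, while scheme A leaves none.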

Coding schemes for variables treated as numerical are
more restricted than those for ordinal variables. For these variables, the differences
between consecutive categories are important. The following table
displays three coding schemes for *age*.

Category | A | B | C |
---|---|---|---|
20 | 20 | 1 | 1 |
22 | 22 | 3 | 2 |
25 | 25 | 6 | 3 |
27 | 27 | 8 | 4 |

Any recoding of numerical variables must preserve the differences between the categories. Using the original values is one method for ensuring preservation of differences. However, this can result in many categories having system-missing indicators. For example, scheme A employs the original observed values. For all Categories procedures except for Correspondence Analysis, the maximum category value is 27 and the minimum category value is set to 1. The first 19 categories are empty and receive system-missing indicators. The output can quickly become rather cumbersome if the maximum category is much greater than 1 and there are many empty categories between 1 and the maximum.

Recoding can reduce the amount of output.
However, for variables treated as numerical, the Automatic Recode
facility should not be used. Coding to consecutive integers results in differences
of 1 between all consecutive categories, and, as a result, all quantifications
will be equally spaced. The metric characteristics deemed important
when treating a variable as numerical are destroyed by recoding to
consecutive integers. For example, scheme C in the table corresponds
to automatically recoding *age*.
The difference between categories 22 and 25 has changed from three
to one, and the quantifications will reflect the latter difference.
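The loss of spacing is easy to see by comparing the gaps between consecutive categories before and after a recode to consecutive integers. The Python sketch below uses a simple rank-based recode as a stand-in for that kind of recoding; it is not the Automatic Recode facility itself, and both function names are hypothetical.

```python
def consecutive_recode(values):
    """Map each distinct value to its rank: 1, 2, 3, ..."""
    ranks = {value: rank for rank, value in enumerate(sorted(set(values)), start=1)}
    return [ranks[value] for value in values]


def gaps(values):
    """Differences between consecutive distinct values, in ascending order."""
    ordered = sorted(set(values))
    return [high - low for low, high in zip(ordered, ordered[1:])]


ages = [20, 22, 25, 27]                 # scheme A: original values
recoded = consecutive_recode(ages)      # scheme C

print(recoded)        # [1, 2, 3, 4]
print(gaps(ages))     # [2, 3, 2]
print(gaps(recoded))  # [1, 1, 1]
```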

An alternative recoding scheme that preserves the differences between categories is to subtract the smallest category value from every category and add 1 to each difference. Scheme B results from this transformation. The smallest category value, 20, has been subtracted from each category, and 1 was added to each result. The transformed codes have a minimum of 1, and all differences are identical to the original data. The maximum category value is now 8, and the zero quantifications before the first nonzero quantification are all eliminated. Yet, the nonzero quantifications corresponding to each category resulting from scheme B are identical to the quantifications from scheme A.
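The transformation amounts to computing `value - min + 1` for every category. A minimal Python sketch, with the hypothetical helper name `shift_recode`, shows that it reproduces scheme B and leaves the differences between categories unchanged:

```python
def shift_recode(values):
    """Subtract the smallest value from every value, then add 1."""
    smallest = min(values)
    return [value - smallest + 1 for value in values]


def gaps(values):
    """Differences between consecutive distinct values, in ascending order."""
    ordered = sorted(set(values))
    return [high - low for low, high in zip(ordered, ordered[1:])]


ages = [20, 22, 25, 27]          # scheme A: original values
shifted = shift_recode(ages)     # scheme B

print(shifted)                       # [1, 3, 6, 8]
print(gaps(ages) == gaps(shifted))   # True
print(min(shifted), max(shifted))    # 1 8
```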