# Identify Unusual Cases

The Anomaly Detection procedure searches for unusual cases based on deviations from the norms of their cluster groups. It is designed to quickly detect unusual cases for data-auditing purposes during exploratory data analysis, before any inferential data analysis. The algorithm performs generic anomaly detection: its definition of an anomalous case is not tailored to any particular application, such as detecting unusual payment patterns in the healthcare industry or money laundering in the finance industry, where an anomaly can be precisely defined.

**Example.** A data analyst hired to build predictive models for stroke treatment outcomes is
concerned about data quality because such models can be sensitive to unusual observations. Some of
these outlying observations represent truly unique cases and are thus unsuitable for prediction,
while other observations are caused by data entry errors in which the values are technically
"correct" and thus cannot be caught by data validation procedures. The Identify Unusual Cases
procedure finds and reports these outliers so that the analyst can decide how to handle them.

**Statistics.** The procedure produces peer groups, peer group norms for continuous and
categorical variables, anomaly indices based on deviations from peer group norms, and variable
impact values for variables that most contribute to a case being considered unusual.
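
As a rough illustration of how these outputs relate to one another, the following Python sketch computes norms, an anomaly index, and variable impacts for a single peer group. It is a simplified stand-in, not the procedure's actual algorithm: it assumes peer group membership is already known, handles continuous variables only, and uses squared z-scores as the deviation measure. The function names, variable names, and data are hypothetical.

```python
from statistics import mean, pstdev


def peer_group_norms(cases, variables):
    """Return {variable: (mean, standard deviation)} for one peer group."""
    norms = {}
    for var in variables:
        values = [case[var] for case in cases]
        norms[var] = (mean(values), pstdev(values))
    return norms


def score_case(case, norms):
    """Return (total deviation, {variable: share of that deviation})."""
    deviations = {}
    for var, (mu, sigma) in norms.items():
        # Assumption: squared z-score as the deviation from the norm.
        z = (case[var] - mu) / sigma if sigma > 0 else 0.0
        deviations[var] = z * z
    total = sum(deviations.values())
    impacts = {
        var: (dev / total if total > 0 else 0.0)
        for var, dev in deviations.items()
    }
    return total, impacts


def anomaly_indices(cases, variables):
    """Return one (anomaly index, variable impacts) pair per case.

    The index here is the case's deviation divided by the average
    deviation in its peer group, so values well above 1 stand out.
    """
    norms = peer_group_norms(cases, variables)
    scored = [score_case(case, norms) for case in cases]
    average = mean(total for total, _ in scored)
    return [
        (total / average if average > 0 else 0.0, impacts)
        for total, impacts in scored
    ]


# Hypothetical peer group; the last case has a suspicious length of stay.
peer_group = [
    {"age": 61, "stay_days": 4},
    {"age": 63, "stay_days": 5},
    {"age": 59, "stay_days": 4},
    {"age": 62, "stay_days": 5},
    {"age": 60, "stay_days": 30},
]
results = anomaly_indices(peer_group, ["age", "stay_days"])
```

In this toy data, the last case receives the largest index, and `stay_days` accounts for most of its deviation.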

## Data Considerations

**Data.** This procedure works with both continuous and categorical variables.
Each row represents a distinct observation, and each column represents
a distinct variable upon which the peer groups are based. A case
identification variable can be available in the data file for marking
output, but it will not be used in the analysis. Missing values are
allowed. The weight variable, if specified, is ignored.

The detection model can be applied to a new test data file. The elements of the test data must be the same as the elements of the training data. Depending on the algorithm settings, the missing value handling that was used to create the model may be applied to the test data file before scoring.
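
Before scoring, it can help to confirm that the test file matches the training file. The Python sketch below shows one way to do that, assuming that "elements" means variable names and their measurement levels; that reading, the function name, and the example schemas are assumptions for illustration.

```python
def schema_mismatches(training_schema, test_schema):
    """Compare {variable name: measurement level} mappings.

    Returns a list of human-readable problems; an empty list means
    the two files describe the same variables.
    """
    problems = []
    for name, level in training_schema.items():
        if name not in test_schema:
            problems.append(f"missing variable: {name}")
        elif test_schema[name] != level:
            problems.append(
                f"{name}: expected {level}, found {test_schema[name]}"
            )
    for name in test_schema:
        if name not in training_schema:
            problems.append(f"unexpected variable: {name}")
    return problems


# Hypothetical schemas for a training file and a test file.
training = {"age": "scale", "gender": "nominal", "stay_days": "scale"}
test = {"age": "scale", "gender": "ordinal", "region": "nominal"}
problems = schema_mismatches(training, test)
```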

**Case order.** Note that the solution may
depend on the order of cases. To minimize order effects, randomly
order the cases. To verify the stability of a given solution, you
may want to obtain several different solutions with cases sorted in
different random orders. In situations with extremely large file sizes,
multiple runs can be performed with a sample of cases sorted in different
random orders.
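
The following Python sketch illustrates one way to prepare such runs outside the procedure: it draws an optional sample, produces several random case orders, and then keeps the case IDs that were flagged in every run. The procedure itself is not called here; the helper names and the flagged IDs are hypothetical.

```python
import random


def random_orderings(cases, n_orderings, sample_size=None, seed=0):
    """Return n_orderings shuffled copies of the cases.

    If sample_size is given, one random sample is drawn first and
    every ordering is a shuffle of that same sample.
    """
    rng = random.Random(seed)  # seeded so the orderings are reproducible
    pool = list(cases)
    if sample_size is not None:
        pool = rng.sample(pool, sample_size)
    orderings = []
    for _ in range(n_orderings):
        ordering = list(pool)
        rng.shuffle(ordering)
        orderings.append(ordering)
    return orderings


def flagged_in_every_run(flagged_sets):
    """Return the case IDs that were flagged as unusual in all runs."""
    return set.intersection(*flagged_sets) if flagged_sets else set()


# Three random orders of a 10-case sample from 100 hypothetical case IDs.
orderings = random_orderings(range(100), n_orderings=3, sample_size=10)

# Hypothetical results: the IDs each run reported as unusual.
stable = flagged_in_every_run([{17, 42, 63}, {42, 63}, {42, 63, 88}])
```

Cases that appear in `stable` were flagged regardless of case order, which is some evidence that the solution does not hinge on the ordering.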

**Assumptions.** The algorithm assumes that
all variables are nonconstant and independent and that no case has
missing values for any of the input variables. Each continuous variable
is assumed to have a normal (Gaussian) distribution, and each categorical
variable is assumed to have a multinomial distribution. Empirical
internal testing indicates that the procedure is fairly robust to
violations of both the assumption of independence and the distributional
assumptions, but be aware of how well these assumptions are met.

Use the Bivariate Correlations procedure to test the independence of two continuous variables. Use the Crosstabs procedure to test the independence of two categorical variables. Use the Means procedure to test the independence between a continuous variable and a categorical variable. Use the Explore procedure to test the normality of a continuous variable. Use the Chi-Square Test procedure to test whether a categorical variable has a specified multinomial distribution.
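
For readers who want to see what two of these checks compute, here is a minimal Python sketch of a Pearson correlation for two continuous variables and a chi-square statistic for two categorical variables. It does not replace the procedures named above: it returns the statistics only, with no significance tests, and a correlation near zero rules out only a linear relationship. The function names and data are hypothetical.

```python
from collections import Counter
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Both variables must be nonconstant, otherwise the denominator is zero.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def chi_square_statistic(a, b):
    """Chi-square statistic for the cross-tabulation of two categorical variables."""
    n = len(a)
    observed = Counter(zip(a, b))  # cell counts
    row_totals = Counter(a)
    col_totals = Counter(b)
    statistic = 0.0
    for row, row_total in row_totals.items():
        for col, col_total in col_totals.items():
            # Expected count if the two variables were independent.
            expected = row_total * col_total / n
            statistic += (observed[(row, col)] - expected) ** 2 / expected
    return statistic


# Hypothetical data.
r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
chi2_related = chi_square_statistic(["x", "x", "y", "y"], ["p", "p", "q", "q"])
chi2_unrelated = chi_square_statistic(["x", "x", "y", "y"], ["p", "q", "p", "q"])
```

A statistic of zero, as in the second cross-tabulation, means the observed counts match the counts expected under independence exactly.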

## To identify unusual cases

- From the menus choose:
- Select at least one analysis variable.
- Optionally, choose a case identifier variable to use in labeling output.

## Fields with Unknown Measurement Level

The Measurement Level alert is displayed when the measurement level for one or more variables (fields) in the dataset is unknown. Since measurement level affects the computation of results for this procedure, all variables must have a defined measurement level.

**Scan Data.** Reads the data in the active dataset
and assigns a default measurement level to any fields whose measurement
level is currently unknown. If the dataset is large, this may take
some time.

**Assign Manually.** Opens a
dialog that lists all fields with an unknown measurement level. You
can use this dialog to assign a measurement level to those fields. You
can also assign measurement levels in Variable View of the Data Editor.

Since measurement level is important for this procedure, you cannot access the dialog to run this procedure until all fields have a defined measurement level.

This procedure pastes DETECTANOMALY command syntax.