Overview (DETECTANOMALY command)
The anomaly detection procedure searches for unusual cases based on deviations from the norms of their cluster groups. The procedure is designed to quickly detect unusual cases for data-auditing purposes in the exploratory data analysis step, prior to any inferential data analysis. This algorithm is designed for generic anomaly detection; that is, the definition of an anomalous case is not specific to any particular application, such as detection of unusual payment patterns in the healthcare industry or detection of money laundering in the finance industry, where an anomaly can be precisely defined.
Options
Methods. The DETECTANOMALY procedure clusters
cases into peer groups based on the similarities of a set of input
variables. An anomaly index is assigned to each case to reflect the
unusualness of a case with respect to its peer group. All cases are
sorted by the values of the anomaly index, and the top portion of
the cases is identified as the set of anomalies. For each variable,
an impact measure is assigned to each case that reflects the contribution
of the variable to the deviation of the case from its peer group.
For each case, the variables are sorted by the values of the variable
impact measure, and the top portion of variables is identified as
the set of reasons why the case is anomalous.
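The ranking step described above can be illustrated with a minimal sketch. It assumes that peer group membership is already known (the clustering itself is not shown) and uses a simplified deviation measure, the sum of squared within-group z-scores, divided by the peer group's average deviation. The function names and the deviation measure are assumptions made for illustration, not the procedure's actual computations.

```python
from collections import defaultdict
from statistics import mean, pstdev


def anomaly_indices(cases, peer_ids):
    """Return one illustrative anomaly index per case.

    cases: list of equal-length lists of numbers (one list per case).
    peer_ids: peer group label of each case (assumed to be given).
    """
    # Collect the cases of each peer group.
    groups = defaultdict(list)
    for row, gid in zip(cases, peer_ids):
        groups[gid].append(row)

    # Per-group norms: (mean, standard deviation) for every variable.
    norms = {}
    for gid, rows in groups.items():
        columns = list(zip(*rows))
        norms[gid] = [(mean(c), pstdev(c) or 1.0) for c in columns]

    # Simplified deviation of a case: sum of squared z-scores
    # relative to its own peer group (an assumption of this sketch).
    deviations = [
        sum(((x - m) / s) ** 2 for x, (m, s) in zip(row, norms[gid]))
        for row, gid in zip(cases, peer_ids)
    ]

    # Index = case deviation relative to the group's average deviation.
    per_group = defaultdict(list)
    for d, gid in zip(deviations, peer_ids):
        per_group[gid].append(d)
    group_avg = {gid: mean(ds) or 1.0 for gid, ds in per_group.items()}
    return [d / group_avg[gid] for d, gid in zip(deviations, peer_ids)]


def top_anomalies(indices, fraction):
    """Return positions of the top `fraction` of cases by anomaly index."""
    count = max(1, round(len(indices) * fraction))
    order = sorted(range(len(indices)), key=lambda i: indices[i], reverse=True)
    return order[:count]
```

In this sketch an index well above 1 means that a case deviates from its peer group much more than the group's typical member does.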
Data File. The DETECTANOMALY procedure assumes that the
input data is a flat file in which each row represents a distinct
case and each column represents a distinct variable. Moreover, it
is assumed that all input variables are non-constant and that no case
has missing values for all of the input variables.
Missing Values. The DETECTANOMALY procedure
allows missing values. By default, missing values of continuous variables
are substituted by their corresponding grand means, and missing categories
of categorical variables are grouped and treated as a valid category.
Moreover, an additional variable called the Missing Proportion Variable, which represents the proportion
of variables with missing values in each case, is created. The processed variables
are used to detect the anomalies in the data. Either option (missing value
handling or creation of the Missing Proportion Variable) can be turned off. If
missing value handling is turned off, cases with missing values are excluded from
the analysis, and creation of the Missing Proportion Variable is turned off automatically.
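The missing value treatment described above can be sketched as follows. This is an illustration only: the function name, the parameter names, the "(missing)" category label, and the "missing_proportion" key are hypothetical, and missing values are represented as None.

```python
from statistics import mean

MISSING_CATEGORY = "(missing)"  # hypothetical label for the grouped missing category


def handle_missing(cases, scale_vars, categorical_vars,
                   apply_handling=True, create_mpvar=True):
    """Return processed copies of `cases` (a list of dicts, None = missing)."""
    all_vars = list(scale_vars) + list(categorical_vars)

    if not apply_handling:
        # Handling turned off: drop cases with any missing value.
        # The missing proportion variable is then never created.
        return [dict(c) for c in cases
                if all(c[v] is not None for v in all_vars)]

    # Grand mean of each continuous variable over its non-missing values.
    grand_means = {
        v: mean(c[v] for c in cases if c[v] is not None) for v in scale_vars
    }

    processed = []
    for case in cases:
        row = dict(case)
        n_missing = sum(case[v] is None for v in all_vars)
        for v in scale_vars:
            if row[v] is None:
                row[v] = grand_means[v]  # substitute the grand mean
        for v in categorical_vars:
            if row[v] is None:
                row[v] = MISSING_CATEGORY  # treat missing as a valid category
        if create_mpvar:
            # Proportion of input variables that were missing in this case.
            row["missing_proportion"] = n_missing / len(all_vars)
        processed.append(row)
    return processed
```

The sketch assumes every continuous variable has at least one non-missing value; otherwise no grand mean can be computed.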
ID Variable. A variable that is the unique
identifier of the cases in the data can optionally be specified with
the ID keyword. If this keyword
is not specified, the case sequence number of the active dataset is
assumed to be the ID.
Weights. The DETECTANOMALY procedure ignores any specification on the WEIGHT command.
Output. The DETECTANOMALY procedure displays an anomaly list in pivot table output; this output
can be suppressed. The procedure can also save the anomaly
information to the active dataset as additional variables. Anomaly
information can be grouped into three sets of variables: anomaly,
peer, and reason. The anomaly set consists of the anomaly index of
each case. The peer set consists of the peer group ID of each case,
the size, and the percentage size of the peer group. The reason set
consists of a number of reasons. Each reason consists of information
such as the variable impact, the variable name for this reason, the
value of the variable, and the corresponding norm value of the peer
group.
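One way to picture the three sets of saved information is as a nested record per case. The sketch below is only a mental model: the class and field names are hypothetical and do not correspond to the variable names that the procedure writes to the active dataset.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Reason:
    """One entry of the reason set (hypothetical field names)."""
    variable_name: str
    variable_impact: float
    variable_value: Union[float, str]
    norm_value: Union[float, str]


@dataclass
class PeerInfo:
    """The peer set: peer group ID, size, and percentage size."""
    peer_id: int
    peer_size: int
    peer_pct_size: float


@dataclass
class AnomalyInfo:
    """All saved information for one case."""
    case_id: int
    anomaly_index: float  # the anomaly set
    peer: PeerInfo        # the peer set
    reasons: List[Reason] = field(default_factory=list)  # the reason set


# Example record for a single case, with made-up values.
example = AnomalyInfo(
    case_id=17,
    anomaly_index=3.2,
    peer=PeerInfo(peer_id=2, peer_size=40, peer_pct_size=25.0),
    reasons=[Reason("income", 0.7, 250000.0, 52000.0)],
)
```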
Basic Specification
The basic specification is the DETECTANOMALY
command. By default, all variables in the active dataset are used
in the procedure, with the dictionary setting of each variable in
the dataset determining its measurement level.
Syntax Rules
- All subcommands are optional.
- Only a single instance of each subcommand is allowed.
- An error occurs if an attribute or keyword is specified more than once within a subcommand.
- Parentheses, slashes, and equals signs shown in the syntax chart are required.
- Subcommand names and keywords must be spelled in full.
- Empty subcommands are not honored.
Operations
The DETECTANOMALY procedure begins by applying the missing value handling option and the create
missing proportion variable option to the data.
Then the procedure groups cases into their peer groups based on the similarities of the processed variables. An anomaly index is assigned to each case to measure the overall deviation of the case from its peer group. All cases are sorted by the values of the anomaly index, and the top portion of the cases is identified as the anomaly list.
For each anomalous case, the variables are sorted by their corresponding variable impact values. The top variables, their values, and the corresponding norm values are presented as the reasons why a case is identified as an anomaly.
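The reason selection for a single anomalous case can be sketched as below. The sketch assumes that each variable's contribution to the case's deviation has already been computed and defines the variable impact as that contribution's share of the total; this definition, the function name, and the dictionary keys are assumptions made for illustration.

```python
def top_reasons(case, norms, contributions, num_reasons=1):
    """Return the top reasons for one anomalous case.

    case: variable name -> value observed for the case.
    norms: variable name -> norm value of the case's peer group.
    contributions: variable name -> contribution of that variable to the
        case's deviation from its peer group (assumed to be precomputed).
    """
    total = sum(contributions.values())
    # Variable impact as a share of the total deviation (an assumption).
    impacts = {
        name: (value / total if total else 0.0)
        for name, value in contributions.items()
    }
    # Sort variables by impact, largest first, and keep the top ones.
    ranked = sorted(impacts, key=impacts.get, reverse=True)
    return [
        {
            "variable": name,
            "impact": impacts[name],
            "value": case[name],
            "norm": norms[name],
        }
        for name in ranked[:num_reasons]
    ]
```

Each returned entry pairs the variable's observed value with the peer group norm, which mirrors how the reasons are presented alongside the anomaly list.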
By default, the anomaly list is presented in a pivot table. Optionally, the anomaly information can be added to the active dataset as additional variables. The anomaly detection model may be written to an XML model file.
Limitations
WEIGHT and SPLIT FILE settings are ignored with a warning by the DETECTANOMALY procedure.