Overview (DETECTANOMALY command)

The anomaly detection procedure searches for unusual cases based on deviations from the norms of their cluster groups. The procedure is designed to quickly detect unusual cases for data-auditing purposes in the exploratory data analysis step, prior to any inferential data analysis. The algorithm is designed for generic anomaly detection; that is, the definition of an anomalous case is not tied to any particular application, such as detecting unusual payment patterns in the healthcare industry or money laundering in the finance industry, where an anomaly can be defined precisely.

Options

Methods. The DETECTANOMALY procedure clusters cases into peer groups based on the similarities of a set of input variables. An anomaly index is assigned to each case to reflect the unusualness of a case with respect to its peer group. All cases are sorted by the values of the anomaly index, and the top portion of the cases is identified as the set of anomalies. For each variable, an impact measure is assigned to each case that reflects the contribution of the variable to the deviation of the case from its peer group. For each case, the variables are sorted by the values of the variable impact measure, and the top portion of variables is identified as the set of reasons why the case is anomalous.
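
The relationship between peer groups, the anomaly index, and the variable impact measure can be illustrated with a minimal Python sketch. It is not the procedure's actual algorithm: it assumes the peer groups are already known, treats all variables as continuous, and uses a squared standardized distance from the peer group mean as a stand-in deviation measure. The function name anomaly_scores and its arguments are hypothetical.

```python
from collections import defaultdict


def anomaly_scores(rows, groups):
    """Return (anomaly_index, impacts) for numeric rows and their peer group labels."""
    members = defaultdict(list)
    for row, g in zip(rows, groups):
        members[g].append(row)

    # Peer group norms: per-variable mean and variance within each group.
    norms = {}
    for g, rs in members.items():
        n = len(rs)
        cols = list(zip(*rs))
        means = [sum(c) / n for c in cols]
        variances = []
        for c, m in zip(cols, means):
            v = sum((x - m) ** 2 for x in c) / n
            variances.append(v if v > 0 else 1.0)  # avoid dividing by zero spread
        norms[g] = (means, variances)

    # Per-variable deviation of each case from its own peer group norm
    # (assumed measure: squared standardized distance).
    var_dev = []
    for row, g in zip(rows, groups):
        means, variances = norms[g]
        var_dev.append([(x - m) ** 2 / v for x, m, v in zip(row, means, variances)])
    case_dev = [sum(d) for d in var_dev]

    # Average case deviation per group, used to express each case as a ratio.
    totals = defaultdict(float)
    for cd, g in zip(case_dev, groups):
        totals[g] += cd
    avg = {g: totals[g] / len(members[g]) for g in members}

    # Anomaly index: case deviation relative to the group average.
    index = [cd / avg[g] if avg[g] > 0 else 0.0 for cd, g in zip(case_dev, groups)]
    # Variable impact: each variable's share of the case's total deviation.
    impacts = [[d / cd if cd > 0 else 0.0 for d in devs]
               for devs, cd in zip(var_dev, case_dev)]
    return index, impacts
```

In this sketch, an index well above 1 marks a case that deviates more than is typical for its peer group, and the impacts of a case sum to 1 whenever the case deviates at all.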

Data File. The DETECTANOMALY procedure assumes that the input data is a flat file in which each row represents a distinct case and each column represents a distinct variable. Moreover, it is assumed that all input variables are non-constant and that no case has missing values for all of the input variables.
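
These assumptions can be checked before running the procedure. The following Python sketch shows one way to do so, assuming the data is held as a list of rows with None marking a missing value; the function name check_input is hypothetical and is not part of the procedure.

```python
def check_input(rows):
    """Return a list of messages describing violated data assumptions.

    rows: list of cases, each a list of values with None for missing.
    """
    problems = []
    n_vars = len(rows[0]) if rows else 0

    # A variable with at most one distinct observed value is constant
    # (or entirely missing) and carries no information for clustering.
    for j in range(n_vars):
        observed = {row[j] for row in rows if row[j] is not None}
        if len(observed) <= 1:
            problems.append(f"variable {j} is constant or entirely missing")

    # A case with no observed values cannot be compared with any peer group.
    for i, row in enumerate(rows):
        if all(v is None for v in row):
            problems.append(f"case {i} has missing values for all variables")

    return problems
```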

Missing Values. The DETECTANOMALY procedure allows missing values. By default, missing values of continuous variables are substituted by their corresponding grand means, and missing categories of categorical variables are grouped and treated as a valid category. Moreover, an additional variable called the Missing Proportion Variable, which represents the proportion of variables with missing values in each case, is created. The processed variables are used to detect the anomalies in the data. Either option can be turned off. If missing-value handling is turned off, cases with missing values are excluded from the analysis, and creation of the Missing Proportion Variable is turned off automatically.
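
The effect of the two options can be sketched in Python as follows. This is an illustration of the behavior described above rather than the procedure's implementation; the function name handle_missing, its parameters, and the "(missing)" category label are hypothetical, and None stands for a missing value.

```python
def handle_missing(rows, scale_cols, cat_cols, apply=True, create_mp=True):
    """Return processed copies of rows.

    scale_cols / cat_cols: column positions of continuous / categorical variables.
    apply: substitute missing values; create_mp: append the missing proportion.
    """
    if not apply:
        # Handling turned off: incomplete cases are excluded, and no
        # missing proportion variable is created regardless of create_mp.
        return [list(r) for r in rows if all(v is not None for v in r)]

    used = list(scale_cols) + list(cat_cols)

    # Grand mean of each continuous variable over its observed values
    # (assumes every variable has at least one observed value).
    means = {}
    for j in scale_cols:
        observed = [r[j] for r in rows if r[j] is not None]
        means[j] = sum(observed) / len(observed)

    out = []
    for r in rows:
        new = list(r)
        n_missing = sum(1 for j in used if r[j] is None)
        for j in scale_cols:
            if new[j] is None:
                new[j] = means[j]
        for j in cat_cols:
            if new[j] is None:
                new[j] = "(missing)"  # treated as one valid category
        if create_mp:
            new.append(n_missing / len(used))
        out.append(new)
    return out
```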

ID Variable. A variable that is the unique identifier of the cases in the data can optionally be specified in the ID keyword. If this keyword is not specified, the case sequence number of the active dataset is assumed to be the ID.

Weights. The DETECTANOMALY procedure ignores any specification on the WEIGHT command.

Output. By default, the DETECTANOMALY procedure displays an anomaly list in pivot table output; this display can be suppressed. The procedure can also save the anomaly information to the active dataset as additional variables. Anomaly information can be grouped into three sets of variables: anomaly, peer, and reason. The anomaly set consists of the anomaly index of each case. The peer set consists of the peer group ID of each case and the size and percentage size of that peer group. The reason set consists of a number of reasons. Each reason consists of information such as the variable impact, the variable name for this reason, the value of the variable, and the corresponding norm value of the peer group.

Basic Specification

The basic specification is the DETECTANOMALY command. By default, all variables in the active dataset are used in the procedure, with the dictionary setting of each variable in the dataset determining its measurement level.

Syntax Rules

  • All subcommands are optional.
  • Only a single instance of each subcommand is allowed.
  • An error occurs if an attribute or keyword is specified more than once within a subcommand.
  • Parentheses, slashes, and equals signs shown in the syntax chart are required.
  • Subcommand names and keywords must be spelled in full.
  • Empty subcommands are not honored.

Operations

The DETECTANOMALY procedure begins by applying the missing-value handling option and the option to create the missing proportion variable to the data.

Then the procedure groups cases into their peer groups based on the similarities of the processed variables. An anomaly index is assigned to each case to measure the overall deviation of the case from its peer group. All cases are sorted by the values of the anomaly index, and the top portion of the cases is identified as the anomaly list.
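
The sorting and selection step can be sketched in Python as follows, assuming the anomaly index of each case has already been computed. The function name top_anomalies and the percent parameter are hypothetical, and rounding the cutoff up to a whole number of cases is an assumption of this sketch, not documented behavior.

```python
import math


def top_anomalies(index_by_id, percent):
    """Return (case ID, anomaly index) pairs for the top percent of cases.

    index_by_id: dict mapping case ID to anomaly index.
    """
    # Sort cases from most to least unusual.
    ranked = sorted(index_by_id.items(), key=lambda kv: kv[1], reverse=True)
    # Number of cases to keep; rounding up is an assumption of this sketch.
    k = math.ceil(len(ranked) * percent / 100)
    return ranked[:k]
```

For example, with 20 cases and percent set to 10, the two cases with the largest anomaly index form the anomaly list.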

For each anomalous case, the variables are sorted by their corresponding variable impact values. The top variables, their values, and the corresponding norm values are presented as the reasons why a case is identified as an anomaly.
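
A minimal Python sketch of assembling the reasons for one anomalous case follows. It assumes the variable impact values and peer group norms are already available as dictionaries keyed by variable name; the function name top_reasons, the num_reasons parameter, and the dictionary keys are hypothetical.

```python
def top_reasons(case, impacts, norms, num_reasons=1):
    """Return the num_reasons variables that contribute most to a case's deviation.

    case: variable name -> observed value for this case.
    impacts: variable name -> variable impact for this case.
    norms: variable name -> norm value of the case's peer group.
    """
    # Sort variable names from largest to smallest impact.
    ranked = sorted(impacts, key=impacts.get, reverse=True)
    return [
        {
            "variable": name,
            "impact": impacts[name],
            "value": case[name],
            "norm": norms[name],
        }
        for name in ranked[:num_reasons]
    ]
```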

By default, the anomaly list is presented in a pivot table. Optionally, the anomaly information can be added to the active dataset as additional variables. The anomaly detection model can also be written to an XML model file.

Limitations

WEIGHT and SPLIT FILE settings are ignored with a warning by the DETECTANOMALY procedure.