Overview (VALIDATEDATA command)

The VALIDATEDATA procedure identifies suspicious and invalid cases, variables, and data values in the active dataset.

The procedure can summarize two types of validation rules. Single-variable rules consist of a fixed set of checks that are applied to individual data values, such as range checks. Cross-variable rules are user-specified rules that typically examine combinations of data values for two or more variables.

Options

Analysis Variables. The procedure can identify variables that have a high proportion of missing values as well as variables that are constant (or nearly so). You can set the maximum acceptable percentage of missing values as well as thresholds for considering categorical and scale variables as constants.

Case Identifiers. The procedure can identify incomplete and duplicate case identifiers. A single variable or a combination of variables can be treated as the identifier for a case.

Cases. The procedure can identify empty cases. A case can be regarded as empty either if all analysis variables are blank or missing, or if all non-ID variables are blank or missing.

Single-Variable Rules. Single-variable validation rules can be summarized by variable and rule. Single-variable rules are defined using the DATAFILE ATTRIBUTE command and are linked to analysis variables using the VARIABLE ATTRIBUTE command.

Cross-Variable Rules. The procedure can summarize violations of cross-variable validation rules. Cross-variable rules are defined using the DATAFILE ATTRIBUTE command.

Saved Variables. You can save variables that identify suspicious cases and values in the active dataset.
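
The variable, identifier, and case checks described above can be illustrated outside the procedure. The following Python sketch is not how VALIDATEDATA is implemented; it only shows the kind of computation each check implies. The toy data, the function names, and the choice to flag every case in a duplicated identifier group are assumptions made for illustration, and None stands in for a missing value.

```python
from collections import Counter

# Hypothetical toy data: one dict per case; None stands for a missing value.
cases = [
    {"id": 1, "age": 34, "region": "N"},
    {"id": 2, "age": None, "region": "N"},
    {"id": 2, "age": 51, "region": "N"},
    {"id": None, "age": None, "region": None},
]

def pct_missing(rows, var):
    """Percentage of cases in which var is missing."""
    return 100.0 * sum(r[var] is None for r in rows) / len(rows)

def pct_modal(rows, var):
    """Percentage of non-missing values equal to the most common value.
    A value near 100 suggests a constant or nearly constant variable."""
    values = [r[var] for r in rows if r[var] is not None]
    if not values:
        return 0.0
    return 100.0 * Counter(values).most_common(1)[0][1] / len(values)

def id_problems(rows, id_vars):
    """Return (incomplete, duplicate) lists of 0-based case indexes.
    Assumption: every case in a duplicated group is flagged."""
    keys = [tuple(r[v] for v in id_vars) for r in rows]
    counts = Counter(k for k in keys if None not in k)
    incomplete = [i for i, k in enumerate(keys) if None in k]
    duplicate = [i for i, k in enumerate(keys)
                 if None not in k and counts[k] > 1]
    return incomplete, duplicate

def empty_cases(rows, vars_to_check):
    """0-based indexes of cases where every checked variable is missing."""
    return [i for i, r in enumerate(rows)
            if all(r[v] is None for v in vars_to_check)]

print(pct_missing(cases, "age"))            # share of missing ages
print(pct_modal(cases, "region"))           # dominance of the modal region
print(id_problems(cases, ["id"]))           # incomplete and duplicate IDs
print(empty_cases(cases, ["age", "region"]))
```

Passing several names in id_vars treats the combination of those variables as the case identifier, which mirrors the single-variable or multi-variable identifier described under Case Identifiers.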

Basic Specification

  • The minimum specification is a list of analysis variables, case identifier variables, or cross-variable rules.
  • By default, if you specify analysis variables, the procedure reports all analysis variables that have a high proportion of missing values, as well as analysis variables that are constant or nearly so. If single-variable validation rules are defined for analysis variables, the procedure reports rule violations for each variable.
  • By default, if case identifier variables are specified, the procedure reports all incomplete and duplicate case identifiers.
  • If you specify cross-variable rules, the procedure reports the total number of cases that violated each rule.
  • If you specify single-variable or cross-variable rules, a casewise report is shown that lists the first 500 cases that violated at least one validation rule.
  • Empty cases are identified by default.
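
The per-rule totals and the capped casewise report described in the list above can be pictured with a short Python sketch. It is an illustration under stated assumptions, not the procedure's own logic: the rule names, the input layout (one dict of 1/0 flags per case), and the helper names are hypothetical, and the limit of 500 is taken from the default described above.

```python
from itertools import islice

def rule_totals(outcomes):
    """Number of cases that violated each rule (flag 1 = violation)."""
    totals = {}
    for row in outcomes:
        for rule, flag in row.items():
            totals[rule] = totals.get(rule, 0) + (1 if flag == 1 else 0)
    return totals

def casewise_report(outcomes, limit=500):
    """Return (case_number, violated_rules) for the first `limit` cases
    that violated at least one rule. Case numbers are 1-based."""
    violating = (
        (n, [rule for rule, flag in row.items() if flag == 1])
        for n, row in enumerate(outcomes, start=1)
        if any(flag == 1 for flag in row.values())
    )
    return list(islice(violating, limit))

# Hypothetical flags for four cases and two rules.
outcomes = [
    {"r1": 0, "r2": 0},
    {"r1": 1, "r2": 0},
    {"r1": 1, "r2": 1},
    {"r1": 0, "r2": 1},
]

print(rule_totals(outcomes))
print(casewise_report(outcomes, limit=2))
```

Because the report is built lazily and cut off with islice, cases beyond the limit are never collected, which matches the idea of listing only the first violating cases.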

Syntax Rules

  • Each subcommand is global and can be used only once.
  • Subcommands can be used in any order.
  • An error occurs if a keyword or attribute is specified more than once within a subcommand.
  • Equals signs and slashes shown in the syntax chart are required.
  • Subcommand names and keywords must be spelled out in full.
  • Empty subcommands generate a procedure error.

Operations

  • The procedure honors SPLIT FILE specifications. Split variables are filtered out of all variable lists.
  • The procedure treats user- and system-missing values as missing values.
  • An error occurs if no procedure output is requested.

Note: Measurement level can affect the results. If any variables (fields) have an unknown measurement level, an initial data pass is performed to determine their default measurement level. For information on the criteria used to determine default measurement level, see SET SCALEMIN.

Limitations

  • The weight variable, if specified, is ignored with a warning.

Validation Rules

  • VALIDATEDATA ignores invalid rules and invalid links, and issues a warning.
  • If an analysis variable is linked to multiple rules, VALIDATEDATA applies the rules independently; it does not check for conflicts or redundancies among the constituent rules.
  • Each rule must be associated with a rule outcome variable that indicates which cases violated the rule. Outcome variables are created outside the VALIDATEDATA procedure and should be coded such that 1 indicates an invalid value or combination of values and 0 indicates a valid value.
  • An error occurs if the same outcome variable is referenced by two or more rules.
  • Values of all outcome variables are assumed to be current with respect to the active dataset.
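
The outcome-variable conventions above (1 for invalid, 0 for valid, and one outcome variable per rule) can be sketched in Python as follows. This is a conceptual illustration only; in practice the outcome variables are computed in the active dataset before VALIDATEDATA runs. The rule names, the outcome variable names, the age range, and both helper functions are hypothetical.

```python
def build_outcome(rows, is_valid):
    """Code one outcome value per case: 1 = invalid, 0 = valid."""
    return [0 if is_valid(r) else 1 for r in rows]

def check_unique_outcomes(rule_to_outcome):
    """Raise ValueError if two rules reference the same outcome variable."""
    seen = {}
    for rule, outcome in rule_to_outcome.items():
        if outcome in seen:
            raise ValueError(
                f"outcome variable {outcome!r} is referenced by both "
                f"{seen[outcome]!r} and {rule!r}"
            )
        seen[outcome] = rule

# Hypothetical single-variable range rule: age must lie in 0..120.
rows = [{"age": 34}, {"age": 150}, {"age": -1}]
age_outcome = build_outcome(rows, lambda r: 0 <= r["age"] <= 120)
print(age_outcome)

# Each rule gets its own outcome variable, so this mapping passes the check.
check_unique_outcomes({"AgeRange": "age_ok", "SexByPregnancy": "sexpreg_ok"})
```

Recomputing the outcome values whenever the data change is what keeps them "current with respect to the active dataset"; the sketch does not detect stale values.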