Overview (ADP command)

Automated Data Preparation prepares data for analysis by automating tedious and repetitive data preparation tasks that would otherwise be done manually. The operations it performs improve analysis speed, predictive power, and robustness. A key capability of the component is feature space construction: the discovery of useful sets of predictors from the data through transformation and combination of existing fields. Feature selection on its own narrows the attribute space by screening out irrelevant fields. Automated Data Preparation pairs selection with construction, so that it both removes irrelevant fields that slow down or confuse algorithms and creates new fields that boost predictive power.

Note that the supported operations are performed without knowing what algorithms will be run on the data in further analyses. The procedure is not a generalized data cleaner, nor does it have an understanding of business rules. Basic cleaning and integrity checks can be done using the IBM® SPSS® Statistics Data Validation procedure.

Options

Date and Time Handling. The year, month, and day can be extracted from fields containing dates, and new fields containing the duration since a reference date can be computed. Likewise, the hour, minute, and second can be extracted from fields containing times, and new fields containing the time since a reference time can be computed.
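
The kind of derived fields described above can be illustrated with a minimal Python sketch. It is not the procedure's implementation. The function names date_features and time_features, the output field names, and the sign convention (field value minus reference) are assumptions made for the example.

```python
from datetime import date, datetime


def date_features(value: date, reference: date) -> dict:
    """Split a date into its parts and compute the days elapsed since a reference date.

    Assumption: the duration is (value - reference), so dates before the
    reference produce negative durations.
    """
    return {
        "year": value.year,
        "month": value.month,
        "day": value.day,
        "days_since_reference": (value - reference).days,
    }


def time_features(value: datetime, reference: datetime) -> dict:
    """Split a time into its parts and compute the seconds elapsed since a reference time."""
    return {
        "hour": value.hour,
        "minute": value.minute,
        "second": value.second,
        "seconds_since_reference": (value - reference).total_seconds(),
    }


if __name__ == "__main__":
    print(date_features(date(2024, 3, 15), date(2024, 1, 1)))
    print(time_features(datetime(2024, 1, 1, 14, 30, 15), datetime(2024, 1, 1, 12, 0, 0)))
```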

Screening. Fields with too many missing values, and categorical fields with too many unique values or with too many cases concentrated in a single category, can be screened out and excluded from further analysis.
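
The three screening rules can be sketched in Python as follows. This is an illustration only, not the procedure's implementation. The function name screen_field, the threshold parameters, and their default values are assumptions chosen for the example, and missing values are represented by None.

```python
from collections import Counter


def screen_field(values, categorical, max_pct_missing=50.0,
                 max_unique=100, max_pct_single=95.0):
    """Return the reason a field would be screened out, or None if it is kept.

    All thresholds are hypothetical defaults for this sketch.
    """
    n = len(values)
    if n == 0:
        return "no cases"

    present = [v for v in values if v is not None]

    # Rule 1: too many missing values (applies to every field).
    pct_missing = 100.0 * (n - len(present)) / n
    if pct_missing > max_pct_missing:
        return "too many missing values"

    if categorical and present:
        counts = Counter(present)

        # Rule 2: too many unique categories.
        if len(counts) > max_unique:
            return "too many unique categories"

        # Rule 3: too many cases concentrated in a single category.
        largest = counts.most_common(1)[0][1]
        if 100.0 * largest / len(present) > max_pct_single:
            return "too many cases in a single category"

    return None


if __name__ == "__main__":
    print(screen_field([None, None, None, "a"], categorical=True))
    print(screen_field(["a"] * 99 + ["b"], categorical=True))
    print(screen_field(["a", "b"] * 10, categorical=True))
```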

Rescaling. Continuous inputs can optionally be rescaled using a z score or min-max transformation. A continuous target can optionally be rescaled using a Box-Cox transformation.
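
The three transformations named above have standard textbook definitions, sketched here in Python. The function names, the use of the population standard deviation, the default target range for the min-max transformation, and the requirement that the Box-Cox parameter be supplied by the caller are all assumptions made for the example. How the procedure itself chooses these settings is not shown here.

```python
import math
from statistics import mean, pstdev


def z_score(values):
    """Rescale to mean 0 and standard deviation 1 (population standard deviation)."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        return [0.0 for _ in values]  # constant field: nothing to scale
    return [(v - m) / s for v in values]


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale linearly so the smallest value maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]  # constant field: map everything to new_min
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


def box_cox(values, lam):
    """Box-Cox transformation: (y**lam - 1) / lam, or ln(y) when lam is 0.

    Defined only for strictly positive values.
    """
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


if __name__ == "__main__":
    print(z_score([1.0, 2.0, 3.0]))
    print(min_max([2.0, 4.0, 6.0]))
    print(box_cox([4.0], 0.5))
```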

Transformations. The procedure can suggest transformations that merge similar categories of categorical inputs, bin values of continuous inputs, and construct and select new input fields from continuous inputs using principal components analysis.

Other Target and Input Handling. The procedure can apply rules for handling outliers, replace missing values, recode the categories of nominal fields, and adjust the measurement level of continuous and ordinal fields.
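
Two of these operations, outlier handling and missing value replacement, can be illustrated with a minimal Python sketch. The specific rules used here are assumptions chosen for the example: values beyond a cutoff number of standard deviations from the mean are pulled back to the cutoff value, and missing values (represented by None) are replaced with the mean. The function names and the cutoff default are likewise hypothetical.

```python
from statistics import mean, pstdev


def clip_outliers(values, cutoff=3.0):
    """Pull values beyond mean +/- cutoff standard deviations back to that boundary.

    Missing values (None) are ignored when computing the boundaries and left as None.
    """
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    m, s = mean(present), pstdev(present)
    lower, upper = m - cutoff * s, m + cutoff * s
    return [None if v is None else min(max(v, lower), upper) for v in values]


def impute_mean(values):
    """Replace missing values (None) with the mean of the non-missing values."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    m = mean(present)
    return [m if v is None else v for v in values]


if __name__ == "__main__":
    print(clip_outliers([0.0, 0.0, 0.0, 0.0, 10.0], cutoff=1.0))
    print(impute_mean([1.0, None, 3.0]))
```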

Output. The procedure creates an XML file containing suggested operations. This can be merged with a model XML file using the Merge Model XML dialog (Utilities>Merge Model XML) or transformed into command syntax using TMS IMPORT.

Basic Specification

The basic specification is the ADP command with a FIELDS subcommand specifying the inputs and optionally a target, and an OUTFILE subcommand specifying where the transformation rules should be saved.

Syntax Rules

  • The FIELDS and OUTFILE subcommands are required; all other subcommands are optional.
  • Subcommands may be specified in any order.
  • Only a single instance of each subcommand is allowed.
  • An error occurs if a keyword is specified more than once within a subcommand.
  • Parentheses, equals signs, and slashes shown in the syntax chart are required.
  • The command name, subcommand names, and keywords must be spelled in full.
  • Empty subcommands are not allowed.

Limitations

  • SPLIT FILE is ignored by this command.