Use the Validate Data dialog to check your data for invalid values and cases. The Variables tab lists the variables in your file. Start by selecting the variables you want to validate and moving them to the Analysis Variables list.
You can specify basic checks to apply to the variables and cases in your file. For example, you can obtain reports that identify variables with a high percentage of missing values, or cases that are empty.
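The two basic checks mentioned above can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure's own implementation: the sample data, the function names and the 70% threshold are all assumptions made for the example.

```python
# Hypothetical sample data: None marks a missing value.
cases = [
    {"age": 34, "income": 52000, "region": "north"},
    {"age": None, "income": None, "region": None},
    {"age": 51, "income": None, "region": "south"},
    {"age": 29, "income": None, "region": "east"},
]

def percent_missing(rows, variable):
    """Return the percentage of rows in which `variable` is missing."""
    missing = sum(1 for row in rows if row.get(variable) is None)
    return 100.0 * missing / len(rows)

def flag_high_missing(rows, variables, max_percent):
    """List the variables whose missing percentage exceeds `max_percent`."""
    return [v for v in variables if percent_missing(rows, v) > max_percent]

def empty_case_indexes(rows, variables):
    """Return the 0-based positions of rows in which every variable is missing."""
    return [
        i for i, row in enumerate(rows)
        if all(row.get(v) is None for v in variables)
    ]

variables = ["age", "income", "region"]
high_missing = flag_high_missing(cases, variables, max_percent=70.0)  # assumed threshold
empty_cases = empty_case_indexes(cases, variables)
```

With this sample, only the income variable crosses the assumed threshold, and the second case is reported as empty.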
Standard and custom rules
Apply rules to individual variables to identify invalid values, such as values outside a valid range or missing values. You can also create your own rules, including cross-variable rules, or apply predefined rules.
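As a rough illustration of the difference between the two kinds of rule, the Python sketch below defines a single-variable range rule and one cross-variable rule. The variable names, the valid ranges and the rule itself are hypothetical, chosen only for the example; they are not rules shipped with the product.

```python
def in_range(low, high):
    """Build a single-variable rule: the value must be present and within [low, high]."""
    def rule(value):
        return value is not None and low <= value <= high
    return rule

# Single-variable rules (assumed ranges), keyed by variable name.
single_variable_rules = {
    "age": in_range(0, 120),
    "years_employed": in_range(0, 80),
}

def employment_fits_age(case):
    """Cross-variable rule: years employed cannot exceed age."""
    age, years = case.get("age"), case.get("years_employed")
    if age is None or years is None:
        return True  # missing values are already reported by the single-variable rules
    return years <= age

cross_variable_rules = {"employment_fits_age": employment_fits_age}

def validate(case):
    """Return the names of every rule that the case violates."""
    violations = [
        name for name, rule in single_variable_rules.items()
        if not rule(case.get(name))
    ]
    violations += [
        name for name, rule in cross_variable_rules.items()
        if not rule(case)
    ]
    return violations
```

A case such as `{"age": 25, "years_employed": 40}` passes both range rules but fails the cross-variable rule, which is the kind of inconsistency a single-variable check cannot detect.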
Automated data preparation delivers recommendations and lets users drill in to examine them.
Prepare data in a single step – automatically
Manual data preparation is a complex and time-consuming process. When you need results quickly, the Automated Data Preparation (ADP) procedure helps you detect and correct data quality errors and impute missing values in one efficient step. ADP provides an easy-to-understand report with comprehensive recommendations and visualizations to help you determine the right data to use in your analysis.
Additional options for data preparation
Use the Validate Data procedure to perform automatic data checks and help eliminate time-consuming, tedious manual checks. This procedure enables you to apply rules that check data according to each variable's measurement level (categorical or continuous). You can then determine data validity and remove or correct suspicious cases at your discretion prior to analysis.
Bin or set cut points for scale variables
Optimal binning enables you to bin scale variables, that is, to set cut points for them. With the optimal binning procedure, you can more accurately use algorithms designed for nominal attributes (such as Naive Bayes and logit models).
Select from three types of optimal binning
Choose one of these types of optimal binning for preprocessing data prior to model building.
1) Unsupervised: Create bins with equal counts.
2) Supervised: Take the target variable into account to determine cut points. This method is more accurate than the unsupervised method; however, it is also more computationally intensive.
3) Hybrid approach: Combine the unsupervised and supervised approaches. This method is particularly useful if you have a large number of distinct values.
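To make the difference between the first two types concrete, here is a minimal Python sketch. It makes several assumptions: the function names are hypothetical, the unsupervised version simply places cut points midway between neighbouring sorted values, and the supervised version is reduced to finding one cut point that minimizes the weighted entropy of the target. It is a simplification for illustration, not the algorithm the optimal binning procedure uses, and the hybrid approach is not shown.

```python
import bisect
import math
from collections import Counter

def equal_count_cut_points(values, n_bins):
    """Unsupervised: cut points that give each bin roughly the same number of values.

    Assumes len(values) >= n_bins; heavy ties can produce duplicate cut points.
    """
    ordered = sorted(values)
    cuts = []
    for k in range(1, n_bins):
        idx = k * len(ordered) // n_bins
        cuts.append((ordered[idx - 1] + ordered[idx]) / 2)
    return cuts

def assign_bin(value, cuts):
    """Return the 0-based bin index of `value` for sorted cut points."""
    return bisect.bisect_right(cuts, value)

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_supervised_cut(values, targets):
    """Supervised (simplified): the single cut point with the lowest weighted target entropy."""
    pairs = sorted(zip(values, targets), key=lambda p: p[0])
    best_cut, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut point exists between equal values
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_cut, best_score = (pairs[i - 1][0] + pairs[i][0]) / 2, score
    return best_cut

# Hypothetical scale variable and binary target.
values = [1, 2, 3, 4, 10, 11, 12, 13]
targets = ["no", "no", "no", "no", "yes", "yes", "yes", "yes"]

unsupervised_cuts = equal_count_cut_points(values, n_bins=4)
supervised_cut = best_supervised_cut(values, targets)
```

The supervised search evaluates every candidate cut point against the target, which hints at why the supervised type is described as more computationally intensive than equal-count binning.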