Overview (NAIVEBAYES command)

The NAIVEBAYES procedure can be used in three ways:

  1. Predictor selection followed by model building. The procedure takes a set of candidate predictor variables and selects a smaller subset. It then classifies cases by using the Naïve Bayes model for the selected predictors.
  2. Predictor selection only. The procedure selects a subset of predictors for use in subsequent predictive modeling procedures but does not report classification results.
  3. Model building only. The procedure fits the Naïve Bayes classification model by using all input predictors.

NAIVEBAYES is available for categorical dependent variables only and is not intended for use with a very large number of predictors.

Options

Methods. The NAIVEBAYES procedure performs predictor selection followed by model building, predictor selection only, or model building only.

Training and test data. NAIVEBAYES optionally divides the dataset into training and test samples. Predictor selection uses the training data to compute the underlying model, and either the training or the test data can be used to determine the “best” subset of predictors. If the dataset is partitioned, classification results are given for both the training and test samples. Otherwise, results are given for the training sample only.

Binning. By default, the procedure distributes scale predictors into 10 bins, but the number of bins can be changed.

Memory allocation. By default, the NAIVEBAYES procedure allocates 128 MB of memory for storing training records when computing average log-likelihoods. The amount of memory that is allocated for this task can be modified.

Timer. By default, the procedure limits processing time to 5 minutes, but a different time limit can be specified.

Maximum or exact subset size. Either a maximum or an exact size can be specified for the subset of selected predictors. If a maximum size is used, the procedure creates a sequence of subsets, from an initial (smaller) subset to the maximum-size subset. The procedure then selects the “best” subset from this sequence.

Missing values. Cases with missing values for the dependent variable or for all predictors are excluded. The NAIVEBAYES procedure has an option for treating user-missing values of categorical variables as valid. User-missing values of scale variables are always treated as invalid.

Output. NAIVEBAYES displays pivot table output by default but offers an option for suppressing most such output. The procedure displays the lists of selected categorical and scale predictors in a text block. These lists can be copied for use in subsequent modeling procedures. The NAIVEBAYES procedure also optionally saves predicted values and probabilities based on the Naïve Bayes model.

Basic Specification

The basic specification is the NAIVEBAYES command followed by a dependent variable.

By default, NAIVEBAYES treats all variables (except the dependent variable and the weight variable, if one is defined) as predictors, with the dictionary setting of each predictor determining its measurement level. NAIVEBAYES selects the “best” subset of predictors (based on the Naïve Bayes model) and then classifies cases by using the selected predictors. User-missing values are excluded and pivot table output is displayed by default.

Syntax Rules

  • All subcommands are optional.
  • Subcommands may be specified in any order.
  • Only a single instance of each subcommand is allowed.
  • An error occurs if a keyword is specified more than once within a subcommand.
  • Parentheses, equal signs, and slashes that are shown in the syntax chart are required.
  • The command name, subcommand names, and keywords must be spelled in full.
  • Empty subcommands are not honored.

Operations

The NAIVEBAYES procedure automatically excludes cases and predictors with any of the following properties:

  • Cases with a missing value for the dependent variable.
  • Cases with missing values for all predictors.
  • Predictors with missing values for all cases.
  • Predictors with the same value for all cases.
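
The four exclusion rules above can be sketched in Python as follows. This is only an illustration, not the procedure's implementation: the function name, the dict-per-case data layout, and the use of None for a missing value are all assumptions, and so is the order of operations (cases are filtered first, then predictors are checked once against the remaining cases).

```python
def filter_cases_and_predictors(rows, target, predictors):
    """Apply the four exclusion rules to a list of dict rows (None = missing)."""
    # Rule 1: drop cases whose dependent variable is missing.
    kept = [r for r in rows if r[target] is not None]
    # Rule 2: drop cases that are missing on every predictor.
    kept = [r for r in kept if any(r[p] is not None for p in predictors)]

    usable = []
    for p in predictors:
        observed = {r[p] for r in kept if r[p] is not None}
        # Rules 3 and 4: an all-missing predictor has 0 observed values,
        # and a constant predictor has exactly 1, so both fail this check.
        if len(observed) > 1:
            usable.append(p)
    return kept, usable
```

For example, a predictor that equals 5 for every remaining case carries no information about the dependent variable and is dropped by the final check.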

The NAIVEBAYES procedure requires predictors to be categorical. Any scale predictors that are input to the procedure are temporarily binned into categorical variables for the duration of the procedure.
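
To make the idea of temporary binning concrete, here is a minimal Python sketch. The text does not state which binning rule the procedure applies, so equal-width intervals are an assumption here, as are the function name and its parameters.

```python
def equal_width_bins(values, n_bins=10):
    """Map each numeric value to a bin index in 0..n_bins-1 (equal-width, assumed)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable collapses into a single bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # min(...) keeps the maximum value in the last bin instead of opening an extra one.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

The returned indices play the role of category codes, so a scale predictor can be handled like any other categorical predictor while the original values stay unchanged.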

If predictor selection is used, the NAIVEBAYES procedure selects a subset of predictors that “best” predict the dependent variable, based on the training data. The procedure first creates a sequence of subsets, with an increasing number of predictors in each subset. The predictor that is added to each subsequent subset is the predictor that increases the average log-likelihood the most. The procedure uses simulated data to compute the average log-likelihood when the training dataset cannot fit into memory.
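
The greedy construction of the subset sequence can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure's algorithm: the average log-likelihood is taken here to be the mean log posterior probability of each case's observed category, the conditional probabilities use Laplace smoothing (the alpha parameter), and all function and parameter names are hypothetical. The simulated-data fallback for large training sets is not shown.

```python
import math
from collections import Counter


def avg_log_likelihood(rows, target, subset, alpha=1.0):
    """Mean log P(observed category | predictors in subset) under Naive Bayes."""
    classes = sorted({r[target] for r in rows})
    class_n = Counter(r[target] for r in rows)
    levels = {p: {r[p] for r in rows} for p in subset}
    joint = Counter((p, r[p], r[target]) for r in rows for p in subset)

    def log_score(r, c):
        # Log prior plus the sum of smoothed log conditional probabilities.
        s = math.log(class_n[c] / len(rows))
        for p in subset:
            s += math.log((joint[(p, r[p], c)] + alpha)
                          / (class_n[c] + alpha * len(levels[p])))
        return s

    total = 0.0
    for r in rows:
        scores = {c: log_score(r, c) for c in classes}
        m = max(scores.values())
        # Log-sum-exp normalization turns joint scores into a log posterior.
        log_norm = m + math.log(sum(math.exp(v - m) for v in scores.values()))
        total += scores[r[target]] - log_norm
    return total / len(rows)


def forward_sequence(rows, target, predictors, max_size):
    """Grow subsets one predictor at a time, adding the largest improvement."""
    chosen, sequence = [], []
    remaining = list(predictors)
    while remaining and len(chosen) < max_size:
        best = max(remaining,
                   key=lambda p: avg_log_likelihood(rows, target, chosen + [p]))
        chosen.append(best)
        remaining.remove(best)
        sequence.append(list(chosen))
    return sequence
```

Each element of the returned sequence is one candidate subset, so the result has the shape that a later "best subset" criterion would choose from.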

The final subset is obtained by using one of two approaches:

  • By default, a maximum subset size is used. This approach creates a sequence of subsets from the initial subset to the maximum-size subset. The “best” subset is chosen by using a BIC-like criterion or a test data criterion.
  • A particular subset size may be used to select the subset with the specified size.

If model building is requested, the NAIVEBAYES procedure classifies cases based on the Naïve Bayes model for the input or selected predictors, depending on whether predictor selection is requested. For a given case, the classification (that is, the predicted category) is the dependent variable category with the highest posterior probability.
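
The classification rule can be illustrated with a small Python sketch of a categorical Naïve Bayes classifier. The names, the dict-based model layout, and the Laplace smoothing constant alpha are assumptions made for the example. The exact probability estimates that the procedure uses are not described here.

```python
import math
from collections import Counter


def fit_naive_bayes(rows, target, predictors, alpha=1.0):
    """Collect the counts needed for priors and conditional probabilities."""
    return {
        "n": len(rows),
        "class_n": Counter(r[target] for r in rows),
        "levels": {p: {r[p] for r in rows} for p in predictors},
        "joint": Counter((p, r[p], r[target]) for r in rows for p in predictors),
        "predictors": list(predictors),
        "alpha": alpha,
    }


def posteriors(model, case):
    """Return P(category | case) for every dependent variable category."""
    log_scores = {}
    for c, n_c in model["class_n"].items():
        s = math.log(n_c / model["n"])  # log prior
        for p in model["predictors"]:
            num = model["joint"][(p, case[p], c)] + model["alpha"]
            den = n_c + model["alpha"] * len(model["levels"][p])
            s += math.log(num / den)  # smoothed log conditional probability
        log_scores[c] = s
    m = max(log_scores.values())
    weights = {c: math.exp(s - m) for c, s in log_scores.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


def classify(model, case):
    """Predicted category = category with the highest posterior probability."""
    post = posteriors(model, case)
    return max(post, key=post.get)
```

The posterior probabilities returned by posteriors correspond conceptually to the predicted probabilities that can be saved, and classify corresponds to the predicted value.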

The NAIVEBAYES procedure uses the IBM® SPSS® Statistics random number generator in the following two scenarios: (1) if a percentage of cases in the active dataset is randomly assigned to the test dataset, and (2) if the procedure creates simulated data to compute the average log-likelihood when the training records cannot fit into memory. To obtain the same results across repeated runs of NAIVEBAYES in either scenario, specify a seed on the SET command. If a seed is not specified, a default random seed is used, and results may differ across runs of the NAIVEBAYES procedure.
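
The effect of fixing a seed can be demonstrated with a short Python sketch of a random training/test partition. Python's random module stands in for the SPSS Statistics generator, and the function name, its parameters, and the shuffle-then-cut partitioning scheme are assumptions for illustration only.

```python
import random


def split_train_test(n_cases, train_percent, seed):
    """Randomly assign case indices to training and test samples."""
    rng = random.Random(seed)  # a private generator seeded explicitly
    indices = list(range(n_cases))
    rng.shuffle(indices)
    n_train = int(n_cases * train_percent / 100)
    return sorted(indices[:n_train]), sorted(indices[n_train:])
```

Calling the function twice with the same seed yields identical samples, which is the behavior that a fixed seed is meant to guarantee. Without a fixed seed, the assignment can change from run to run.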

Frequency Weight

If a WEIGHT variable is in effect, its values are used as frequency weights by the NAIVEBAYES procedure.

  • Cases with missing weights or weights that are less than 0.5 are not used in the analyses.
  • The weight values are rounded to the nearest whole numbers before use. For example, 0.5 is rounded to 1, and 2.4 is rounded to 2.
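
The two weight rules can be expressed as a short Python sketch. The function name and the use of None or NaN for a missing weight are assumptions. Rounding is written as floor(w + 0.5) because Python's built-in round rounds halves to the nearest even number, which would turn 0.5 into 0 rather than 1.

```python
import math


def effective_weight(w):
    """Return the integer frequency weight, or None if the case is not used."""
    if w is None or (isinstance(w, float) and math.isnan(w)):
        return None  # missing weight: case is not used
    if w < 0.5:
        return None  # weight below 0.5: case is not used
    # Round half up, so 0.5 becomes 1 and 2.4 becomes 2.
    return math.floor(w + 0.5)
```

Under this sketch, a case with weight 2.4 counts as 2 cases and a case with weight 0.4 is dropped.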

Limitations

SPLIT FILE settings are ignored by the NAIVEBAYES procedure.