Overview (NAIVEBAYES command)
The NAIVEBAYES procedure can be used in three ways:
- Predictor selection followed by model building. The procedure takes the submitted set of predictor variables and selects a smaller subset. It then classifies cases based on the Naïve Bayes model for the selected predictors.
- Predictor selection only. The procedure selects a subset of predictors for use in subsequent predictive modeling procedures but does not report classification results.
- Model building only. The procedure fits the Naïve Bayes classification model by using all input predictors.
NAIVEBAYES is available for categorical dependent variables only and is not intended for use with a very large number of predictors.
Options
Methods. The NAIVEBAYES procedure performs predictor selection followed by model building, predictor selection only, or model building only.
Training and test data. NAIVEBAYES optionally divides the dataset into training and test samples. Predictor selection uses the training data to compute the underlying model, and either the training or the test data can be used to determine the “best” subset of predictors. If the dataset is partitioned, classification results are given for both the training and test samples. Otherwise, results are given for the training sample only.
Binning. The procedure automatically distributes scale predictors into 10 bins, but the number of bins can be changed.
Memory allocation. The NAIVEBAYES procedure automatically allocates 128 MB of memory for storing training records when computing average log-likelihoods. The amount of memory that is allocated for this task can be modified.
Timer. The procedure automatically limits processing time to 5 minutes, but a different time limit can be specified.
Maximum or exact subset size. Either a maximum or an exact size can be specified for the subset of selected predictors. If a maximum size is used, the procedure creates a sequence of subsets, from an initial (smaller) subset to the maximum-size subset. The procedure then selects the “best” subset from this sequence.
Missing values. Cases with missing values for the dependent variable or for all predictors are excluded. The NAIVEBAYES procedure has an option for treating user-missing values of categorical variables as valid. User-missing values of scale variables are always treated as invalid.
Output. NAIVEBAYES displays pivot table output by default but offers an option for suppressing most such output. The procedure displays the lists of selected categorical and scale predictors in a text block. These lists can be copied for use in subsequent modeling procedures. The NAIVEBAYES procedure also optionally saves predicted values and probabilities based on the Naïve Bayes model.
Basic Specification
The basic specification is the NAIVEBAYES command followed by a dependent variable.
By default, NAIVEBAYES treats all variables (except the dependent variable and the weight variable, if one is defined) as predictors, with the dictionary setting of each predictor determining its measurement level. NAIVEBAYES selects the “best” subset of predictors (based on the Naïve Bayes model) and then classifies cases by using the selected predictors. User-missing values are excluded and pivot table output is displayed by default.
Syntax Rules
- All subcommands are optional.
- Subcommands may be specified in any order.
- Only a single instance of each subcommand is allowed.
- An error occurs if a keyword is specified more than once within a subcommand.
- Parentheses, equal signs, and slashes that are shown in the syntax chart are required.
- The command name, subcommand names, and keywords must be spelled in full.
- Empty subcommands are not honored.
Operations
The NAIVEBAYES procedure automatically excludes cases and predictors with any of the following properties:
- Cases with a missing value for the dependent variable.
- Cases with missing values for all predictors.
- Predictors with missing values for all cases.
- Predictors with the same value for all cases.
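The exclusion rules above can be illustrated with a minimal Python sketch. It is not the procedure's implementation: the function name, the use of `None` to stand for a missing value, and the order in which the rules are applied (cases first, then predictors, in a single pass) are all assumptions made for illustration.

```python
def filter_cases_and_predictors(rows, dependent, predictors):
    """Apply the four exclusion rules; None stands in for a missing value.

    Hypothetical helper: the order of the checks is an assumption.
    """
    # Drop cases with a missing dependent value or with every predictor missing.
    kept_rows = [
        row for row in rows
        if row[dependent] is not None
        and any(row[p] is not None for p in predictors)
    ]
    kept_predictors = []
    for p in predictors:
        observed = {row[p] for row in kept_rows if row[p] is not None}
        # An empty set means missing for all cases; one value means constant.
        if len(observed) > 1:
            kept_predictors.append(p)
    return kept_rows, kept_predictors


rows = [
    {"y": "a", "x1": 1, "x2": None, "x3": 5},
    {"y": None, "x1": 2, "x2": None, "x3": 5},      # missing dependent
    {"y": "b", "x1": None, "x2": None, "x3": None},  # all predictors missing
    {"y": "b", "x1": 3, "x2": None, "x3": 5},
]
kept_rows, kept_predictors = filter_cases_and_predictors(
    rows, "y", ["x1", "x2", "x3"]
)
```

In this example two cases survive, and only `x1` is kept: `x2` is missing for every case and `x3` has the same value for every case.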
The NAIVEBAYES procedure requires predictors to be categorical. Any scale predictors that are input to the procedure are temporarily binned into categorical variables for the procedure.
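As a rough illustration of temporary binning, the following Python sketch maps a scale predictor to bin indices. The source does not state which binning method the procedure uses; equal-width bins are assumed here, and the function name and the use of `None` for missing values are hypothetical.

```python
def bin_scale_values(values, n_bins=10):
    """Map each numeric value to a bin index in 0..n_bins-1.

    Equal-width bins are an assumption, not the documented method.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to bin
    lo, hi = min(observed), max(observed)
    width = (hi - lo) / n_bins
    binned = []
    for v in values:
        if v is None:
            binned.append(None)  # a missing value stays missing
        elif width == 0:
            binned.append(0)  # constant predictor: everything in one bin
        else:
            # The maximum value would fall into bin n_bins, so clamp it.
            binned.append(min(int((v - lo) / width), n_bins - 1))
    return binned


bins_default = bin_scale_values([0, 5, 10, None])
bins_two = bin_scale_values([0, 5, 10], n_bins=2)
```

The second call shows the effect of changing the number of bins from the default of 10.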
If predictor selection is used, the NAIVEBAYES procedure selects a subset of predictors that “best” predict the dependent variable, based on the training data. The procedure first creates a sequence of subsets, with an increasing number of predictors in each subset. The predictor that is added to each subsequent subset is the predictor that increases the average log-likelihood the most. The procedure uses simulated data to compute the average log-likelihood when the training dataset cannot fit into memory.
The final subset is obtained by using one of two approaches:
- By default, a maximum subset size is used. This approach creates a sequence of subsets from the initial subset to the maximum-size subset. The “best” subset is chosen by using a BIC-like criterion or a test data criterion.
- A particular subset size may be used to select the subset with the specified size.
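The subset sequence described above can be sketched in Python as a greedy forward search. This is a sketch under stated assumptions, not the procedure's algorithm: the average log-likelihood is taken to be the mean of log P(dependent category | predictors) on the training rows, add-one smoothing is used to avoid zero probabilities, the search starts from an empty subset, and missing values are not handled. The function names are hypothetical. The criterion that picks the “best” subset from the sequence is omitted because its exact form is not given here.

```python
import math
from collections import Counter


def avg_log_likelihood(rows, target, subset):
    """Mean log P(target | subset) under a Naive Bayes model fit on rows.

    Add-one smoothing and the conditional form are assumptions.
    """
    n = len(rows)
    class_counts = Counter(r[target] for r in rows)
    # cond[p][(category, value)] = number of rows with that combination
    cond = {p: Counter((r[target], r[p]) for r in rows) for p in subset}
    levels = {p: len({r[p] for r in rows}) for p in subset}
    total = 0.0
    for r in rows:
        log_joint = {}
        for c, count in class_counts.items():
            lp = math.log(count / n)
            for p in subset:
                lp += math.log((cond[p][(c, r[p])] + 1) / (count + levels[p]))
            log_joint[c] = lp
        # Normalize with log-sum-exp to get the log posterior of the true class.
        m = max(log_joint.values())
        log_norm = m + math.log(sum(math.exp(v - m) for v in log_joint.values()))
        total += log_joint[r[target]] - log_norm
    return total / n


def forward_sequence(rows, target, predictors, max_size):
    """Return nested subsets, each adding the predictor that helps most."""
    selected, remaining, sequence = [], list(predictors), []
    while remaining and len(selected) < max_size:
        best = max(
            remaining,
            key=lambda p: avg_log_likelihood(rows, target, selected + [p]),
        )
        selected.append(best)
        remaining.remove(best)
        sequence.append(list(selected))
    return sequence


rows = [
    {"y": "a", "x1": 0, "x2": 0},
    {"y": "a", "x1": 0, "x2": 1},
    {"y": "b", "x1": 1, "x2": 0},
    {"y": "b", "x1": 1, "x2": 1},
]
sequence = forward_sequence(rows, "y", ["x2", "x1"], max_size=2)
exact_size_one = sequence[0]  # subset of a specified size, here 1
```

In this toy data `x1` separates the two categories and `x2` does not, so `x1` enters the sequence first. With an exact subset size of k, the sketch simply takes the k-th subset in the sequence.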
If model building is requested, the NAIVEBAYES procedure classifies cases based on the Naïve Bayes model for the input or selected predictors, depending on whether predictor selection is requested. For a given case, the classification (or predicted category) is the dependent variable category with the highest posterior probability.
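The classification rule can be illustrated with a short Python sketch that computes posterior probabilities from log priors and log conditional probabilities and returns the category with the highest posterior. The function name, argument layout, and example probabilities are hypothetical; how the procedure estimates these probabilities is not shown here.

```python
import math


def classify(log_prior, log_cond, case):
    """Return (predicted category, posterior probabilities) for one case.

    log_prior: {category: log P(category)}
    log_cond:  {category: {predictor: {value: log P(value | category)}}}
    case:      {predictor: value}
    """
    # Naive Bayes score: log prior plus the sum of log conditionals.
    scores = {
        c: lp + sum(log_cond[c][p][v] for p, v in case.items())
        for c, lp in log_prior.items()
    }
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scores.values())
    norm = sum(math.exp(s - m) for s in scores.values())
    posterior = {c: math.exp(s - m) / norm for c, s in scores.items()}
    predicted = max(posterior, key=posterior.get)
    return predicted, posterior


log_prior = {"a": math.log(0.5), "b": math.log(0.5)}
log_cond = {
    "a": {"x": {0: math.log(0.75), 1: math.log(0.25)}},
    "b": {"x": {0: math.log(0.25), 1: math.log(0.75)}},
}
predicted, posterior = classify(log_prior, log_cond, {"x": 0})
```

Here category `a` has the higher posterior probability for the case with `x = 0`, so it is the predicted category.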
The NAIVEBAYES procedure uses the IBM® SPSS® Statistics random number generator in the following two scenarios: (1) if a percentage of cases in the active dataset is randomly assigned to the test dataset, and (2) if the procedure creates simulated data to compute the average log-likelihood when the training records cannot fit into memory. To obtain the same results across repeated runs of NAIVEBAYES in either scenario, specify a seed on the SET command. If a seed is not specified, a default random seed is used, and results may differ across runs of the NAIVEBAYES procedure.
Frequency Weight
If a WEIGHT variable is in effect, its values are used as frequency weights by the NAIVEBAYES procedure.
- Cases with missing weights or weights that are less than 0.5 are not used in the analyses.
- The weight values are rounded to the nearest whole numbers before use. For example, 0.5 is rounded to 1, and 2.4 is rounded to 2.
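The two weight rules can be expressed as a small Python sketch. The function name and the use of `None` for a missing weight are hypothetical, and rounding halves upward in general is inferred from the 0.5 example above. Note that Python's built-in `round()` rounds 0.5 to 0, so the sketch rounds explicitly.

```python
import math


def effective_weight(weight):
    """Return the integer frequency weight, or None if the case is not used."""
    if weight is None or weight < 0.5:
        return None
    # Round half up so that 0.5 becomes 1.
    return math.floor(weight + 0.5)


examples = [effective_weight(w) for w in (0.5, 2.4, 0.49, None)]
```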
Limitations
SPLIT FILE settings are ignored by the NAIVEBAYES procedure.