Analysis Node Analysis Tab
The Analysis tab enables you to specify the details of the analysis.
Coincidence matrices (for symbolic or categorical targets). Shows the pattern of matches between each generated (predicted) field and its target field for categorical targets (flag, nominal, or ordinal). A table is displayed with rows defined by actual values and columns defined by predicted values, with the number of records having that pattern in each cell. This is useful for identifying systematic errors in prediction. If there is more than one generated field related to the same output field but produced by different models, the cases where these fields agree and disagree are counted and the totals are displayed. For the cases where they agree, another set of correct/wrong statistics is displayed.
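The counting behind a coincidence matrix can be sketched as follows. This is an illustrative example using made-up actual and predicted values, not Modeler's internal implementation: each cell is simply the number of records sharing one (actual, predicted) pattern.

```python
from collections import Counter

# Hypothetical actual and predicted values for a flag target.
actual    = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no"]

# Each (actual, predicted) pair identifies one cell of the matrix;
# the count is the number of records with that pattern.
matrix = Counter(zip(actual, predicted))

for (a, p), n in sorted(matrix.items()):
    print(f"actual={a:3} predicted={p:3} count={n}")
```

Off-diagonal cells (actual and predicted disagree) are where systematic prediction errors show up.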
Performance evaluation. Shows performance evaluation statistics for models with categorical outputs. This statistic, reported for each category of the output field(s), is a measure of the average information content (in bits) of the model for predicting records belonging to that category. It takes the difficulty of the classification problem into account, so accurate predictions for rare categories will earn a higher performance evaluation index than accurate predictions for common categories. If the model does no better than guessing for a category, the performance evaluation index for that category will be 0.
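Modeler's exact formula is not spelled out here; the following sketch only illustrates the underlying information-theoretic idea, namely that correctly identifying a member of a rare category carries more information (in bits) than correctly identifying a member of a common one. The prior probabilities are hypothetical.

```python
import math

# Information content (in bits) of identifying a category with a
# given prior probability: -log2(prior). Rarer categories carry
# more information, which is why accurate predictions for rare
# categories earn a higher performance evaluation index.
def information_bits(prior):
    return -math.log2(prior)

rare_bits   = information_bits(0.05)   # a rare category (5% of records)
common_bits = information_bits(0.80)   # a common category (80% of records)
```

Here the rare category is worth about 4.32 bits per correct prediction versus about 0.32 bits for the common one.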
Evaluation metrics (AUC & Gini, binary classifiers only). For binary classifiers, this option reports the AUC (area under curve) and Gini coefficient evaluation metrics. Both metrics are calculated together for each binary model, and their values are reported in a table in the analysis output browser.
The AUC evaluation metric is calculated as the area under an ROC (receiver operating characteristic) curve, and is a scalar representation of the expected performance of a classifier. The AUC is always between 0 and 1, with a higher number representing a better classifier. A diagonal ROC curve between the coordinates (0,0) and (1,1) represents a random classifier, and has an AUC of 0.5. Therefore, a realistic classifier will not have an AUC of less than 0.5.
The Gini coefficient evaluation metric is sometimes used as an alternative to the AUC, and the two measures are closely related. The Gini coefficient is calculated as twice the area between the ROC curve and the diagonal, or as Gini = 2AUC - 1. For a realistic classifier, the Gini coefficient is between 0 and 1, with a higher number representing a better classifier; it is negative only in the unlikely event that the ROC curve falls below the diagonal.
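The AUC and the Gini = 2AUC - 1 relationship can be verified with a small sketch. This is an illustrative calculation on made-up labels and scores, not Modeler's implementation; it uses the standard equivalence between the area under the ROC curve and the probability that a random positive record outranks a random negative one.

```python
def auc(labels, scores):
    """AUC as the probability that a randomly chosen positive record
    scores higher than a randomly chosen negative one (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical binary labels and model scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

a = auc(labels, scores)
gini = 2 * a - 1        # Gini coefficient from the AUC
```

For a random classifier the same calculation yields an AUC near 0.5 and a Gini near 0.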
Confidence figures (if available). For models that generate a confidence field, this option reports statistics on the confidence values and their relationship to predictions. There are two settings for this option:
- Threshold for. Reports the confidence level above which the accuracy will be the specified percentage.
- Improve accuracy. Reports the confidence level above which the accuracy is improved by the specified factor. For example, if the overall accuracy is 90% and this option is set to 2.0, the reported value will be the confidence required for 95% accuracy (the 10% error rate halved to 5%).
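The threshold search behind these options can be sketched as follows. This is an illustrative reconstruction on hypothetical (confidence, correct) pairs, not Modeler's implementation: it finds the smallest confidence level at which the records at or above that level reach a target accuracy.

```python
# Hypothetical per-record (confidence, prediction-was-correct) pairs.
scored = [(0.95, True), (0.90, True), (0.85, True), (0.80, False),
          (0.70, True), (0.60, False), (0.50, True), (0.40, False)]

def threshold_for_accuracy(scored, target):
    """Smallest confidence level whose retained records meet the
    target accuracy; None if no threshold achieves it."""
    for conf in sorted({c for c, _ in scored}):
        kept = [ok for c, ok in scored if c >= conf]
        if sum(kept) / len(kept) >= target:
            return conf
    return None
```

With these values, `threshold_for_accuracy(scored, 0.75)` returns 0.70, while a stricter target of 0.90 pushes the threshold up to 0.85.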
Find predicted/predictor fields using. Determines how predicted fields are matched to the original target field.
- Model output field metadata. Matches predicted fields to the target based on model field information, allowing a match even if a predicted field has been renamed. Model field information can also be accessed for any predicted field from the Values dialog box using a Type node. See the topic Using the Values Dialog Box for more information.
- Field name format. Matches fields based on the naming convention. For example, predicted values generated by a C5.0 model nugget for a target named response must be in a field named $C-response.
Separate by partition. If a partition field is used to split records into training, test, and validation samples, select this option to display results separately for each partition. See the topic Partition Node for more information.
Note: When separating by partition, records with null values in the partition field are excluded from the analysis. This will never be an issue if a Partition node is used, since Partition nodes do not generate null values.
User defined analysis. You can specify your own analysis calculation to be used in evaluating your model(s). Use CLEM expressions to specify what should be computed for each record and how to combine the record-level scores into an overall score. Use the functions @TARGET and @PREDICTED to refer to the target (actual output) value and the predicted value, respectively.
- If. Specify a conditional expression if you need to use different calculations depending on some condition.
- Then. Specify the calculation if the If condition is true.
- Else. Specify the calculation if the If condition is false.
- Use. Select a statistic to compute an overall score from the individual scores.
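The If/Then/Else/Use structure can be mirrored outside CLEM. The following is a hypothetical sketch in Python, with @TARGET and @PREDICTED stood in for by plain lists and an invented scoring rule that penalizes overshooting twice as heavily; it is not Modeler's evaluation code.

```python
# Hypothetical target (actual) and predicted values for each record.
target    = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]

# If/Then/Else: an illustrative rule that doubles the penalty
# when the prediction overshoots the target.
def record_score(t, p):
    return 2 * (p - t) if p > t else (t - p)

scores = [record_score(t, p) for t, p in zip(target, predicted)]

# Use: combine the record-level scores into an overall score
# with a summary statistic (here, the mean).
overall = sum(scores) / len(scores)
```

In CLEM the same rule would reference @TARGET and @PREDICTED directly in the If/Then/Else fields, with Mean selected under Use.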
Break down analysis by fields. Shows the categorical fields available for breaking down the analysis. In addition to the overall analysis, a separate analysis will be reported for each category of each breakdown field.
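Breaking down an analysis by a categorical field amounts to grouping records by that field's value and computing the statistics per group. This is an illustrative sketch with a hypothetical breakdown field (region) and per-record correctness flags, not Modeler's implementation.

```python
from collections import defaultdict

# Hypothetical records: (breakdown category, prediction-was-correct).
records = [("north", True), ("north", False), ("south", True),
           ("south", True), ("north", True)]

# Group the correctness flags by breakdown category.
groups = defaultdict(list)
for region, correct in records:
    groups[region].append(correct)

# Per-category accuracy, alongside the overall figure.
accuracy = {region: sum(oks) / len(oks) for region, oks in groups.items()}
```

Each key of `accuracy` corresponds to one of the separate analyses reported for a breakdown field.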