Feature Tree Criteria (TwoStep-AS Cluster)
These settings determine how the cluster feature tree is built. By summarizing the records in a cluster feature tree, the TwoStep algorithm can build clusters from large data files that contain many cases.
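As a rough illustration of how a summary can stand in for the records themselves, the following Python sketch keeps only a count, per-field sums, and per-field sums of squares. The `ClusterFeature` class and its methods are hypothetical names, and the sketch assumes continuous fields only; it is not the TwoStep-AS implementation.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ClusterFeature:
    """Hypothetical summary of a group of records (continuous fields only)."""
    n: int                    # number of records summarized
    linear_sum: List[float]   # per-field sum of values
    square_sum: List[float]   # per-field sum of squared values

    @classmethod
    def from_record(cls, record: Sequence[float]) -> "ClusterFeature":
        # A single record is a summary of size 1.
        return cls(1, list(record), [x * x for x in record])

    def merge(self, other: "ClusterFeature") -> "ClusterFeature":
        # Summaries are additive, so the original records are not needed.
        return ClusterFeature(
            self.n + other.n,
            [a + b for a, b in zip(self.linear_sum, other.linear_sum)],
            [a + b for a, b in zip(self.square_sum, other.square_sum)],
        )

    def mean(self) -> List[float]:
        return [s / self.n for s in self.linear_sum]

    def variance(self) -> List[float]:
        # Population variance per field: E[x^2] - (E[x])^2.
        return [
            q / self.n - (s / self.n) ** 2
            for s, q in zip(self.linear_sum, self.square_sum)
        ]


records = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
summary = ClusterFeature.from_record(records[0])
for record in records[1:]:
    summary = summary.merge(ClusterFeature.from_record(record))

print(summary.n, summary.mean())
```

Because the summary is additive, means and variances can be recovered after any number of merges without revisiting the data, which is the property that makes a tree of such summaries practical for many cases.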
Distance Measure
This selection determines how the similarity between two clusters is computed.
- Log-likelihood
- The likelihood measure places a probability distribution on the fields. Continuous fields are assumed to be normally distributed, while categorical fields are assumed to be multinomial. All fields are assumed to be independent.
- Euclidean
- The Euclidean measure is the "straight line" distance between two clusters. The squared Euclidean measure and the Ward method are used to compute the similarity between clusters. The Euclidean measure can be used only when all of the fields are continuous.
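The Euclidean option can be pictured with a small Python sketch. It computes the straight-line distance between two cluster centers and a Ward-style distance, taken here as the increase in within-cluster sum of squares when two clusters are merged. The function name `ward_distance` and the sample values are hypothetical, and the exact formula that the node uses is not stated in this topic, so treat this as an assumption.

```python
import math


def ward_distance(n_a, mean_a, n_b, mean_b):
    """Increase in within-cluster sum of squares if clusters a and b merge.

    Uses the standard identity n_a * n_b / (n_a + n_b) * ||mean_a - mean_b||^2.
    """
    squared_euclidean = sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))
    return (n_a * n_b) / (n_a + n_b) * squared_euclidean


# Two hypothetical clusters, each holding 2 cases, with continuous fields only.
center_a, center_b = [0.0, 0.0], [3.0, 4.0]

straight_line = math.dist(center_a, center_b)      # Euclidean distance
ward = ward_distance(2, center_a, 2, center_b)     # Ward-style distance

print(straight_line, ward)
```

The Ward-style value grows with cluster sizes as well as with the separation of the centers, so merging two large, distant clusters is penalized more than merging two small ones.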
Outlier Clusters
- Include outlier clusters
- Include clusters for cases that are outliers from the regular clusters. If this option is not selected, all cases are included in regular clusters.
- Number of cases in feature tree leaf is less than.
- If the number of cases in the feature tree leaf is less than the specified value, the leaf is considered an outlier. The value must be an integer greater than 1. If you change this value, higher values are likely to result in more outlier clusters.
- Top percentage of outliers.
- When the cluster model is built, outliers are ranked by outlier strength. The outlier strength that is required to be in the top percentage of outliers is used as the threshold for determining whether a case is classified as an outlier. Higher values mean that more cases are classified as outliers. The value must be between 1 and 100.
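The two outlier thresholds can be sketched in Python as follows. The helper names `small_leaves` and `strength_threshold`, the leaf sizes, and the outlier strength values are all hypothetical; how outlier strength is computed is not described in this topic, so the numbers are placeholders.

```python
import math


def small_leaves(leaf_sizes, min_cases):
    """Return the indexes of leaves whose case count is below min_cases."""
    return [i for i, size in enumerate(leaf_sizes) if size < min_cases]


def strength_threshold(strengths, top_percent):
    """Return the smallest outlier strength that is still in the top percentage."""
    ranked = sorted(strengths, reverse=True)
    # Assumption: round up, and always keep at least one candidate.
    k = max(1, math.ceil(len(ranked) * top_percent / 100))
    return ranked[k - 1]


leaf_sizes = [120, 3, 45, 7]
flagged_low = small_leaves(leaf_sizes, min_cases=5)    # fewer leaves flagged
flagged_high = small_leaves(leaf_sizes, min_cases=10)  # more leaves flagged

strengths = [0.2, 1.5, 0.9, 3.1, 0.4, 2.2, 0.1, 0.7, 1.1, 0.3]
threshold = strength_threshold(strengths, top_percent=20)
outliers = [s for s in strengths if s >= threshold]

print(flagged_low, flagged_high, threshold, outliers)
```

Raising either setting widens the net: a higher minimum case count flags more leaves, and a higher top percentage lowers the strength threshold so that more cases qualify.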
Additional settings
- Initial distance change threshold
- The initial threshold that is used to grow the cluster feature tree. If inserting a case into a leaf of the tree yields a tightness less than this threshold, the leaf is not split. If the tightness exceeds this threshold, the leaf is split.
- Leaf node maximum branches
- The maximum number of child nodes that a leaf node can have.
- Non-leaf node maximum branches
- The maximum number of child nodes that a non-leaf node can have.
- Maximum tree depth
- The maximum number of levels that the cluster tree can have.
- Adjustment weight on measurement level
- Reduces the influence of categorical fields by increasing the weight for continuous fields. This value is the denominator that is used to reduce the weight for categorical fields. For example, the default of 6 gives categorical fields a weight of 1/6.
- Memory allocation
- The maximum amount of memory in megabytes (MB) that the cluster algorithm uses. If the procedure exceeds this maximum, it uses the disk to store information that does not fit in memory.
- Delayed split
- Delay rebuilding of the cluster feature tree. The clustering algorithm rebuilds the cluster feature tree multiple times as it evaluates new cases. This option can improve performance by delaying that operation and reducing the number of times the tree is rebuilt.
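Two of the settings above reduce to simple arithmetic, sketched here in Python. The function names `categorical_weight` and `should_split` are hypothetical, the tightness and threshold values are placeholders, and treating a tightness exactly equal to the threshold as "not split" is an assumption because this topic does not cover that case.

```python
def categorical_weight(adjustment_weight):
    """Weight applied to categorical fields: 1 / adjustment weight."""
    if adjustment_weight <= 0:
        raise ValueError("adjustment weight must be positive")
    return 1.0 / adjustment_weight


def should_split(tightness, threshold):
    """Split the leaf only when tightness exceeds the threshold.

    Assumption: equality counts as 'not split'.
    """
    return tightness > threshold


weight = categorical_weight(6)           # the example value of 6 gives 1/6
split_tight = should_split(0.20, 0.35)   # below threshold: leaf is kept
split_loose = should_split(0.50, 0.35)   # above threshold: leaf is split

print(weight, split_tight, split_loose)
```

A larger adjustment weight shrinks the categorical contribution further, and a larger threshold lets leaves absorb more cases before they split, which tends to produce a smaller tree.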