Supported nodes
Many SPSS® Modeler nodes are supported for execution on HDFS, but there may be some differences in the execution of certain nodes, and some are not currently supported. This topic details the current level of support.
- General
- Some characters that are normally acceptable within a quoted Modeler field name will not be accepted by Analytic Server.
- For a Modeler stream to be run in Analytic Server, it must begin with one or more Analytic Server Source nodes and end in a single modeling node or Analytic Server Export node.
- It is recommended that you set the storage of continuous targets to real rather than integer. Scoring models always write real values to the output data files for continuous targets, while the output data model for the scores follows the storage of the target. Thus, if a continuous target has integer storage, the written values will not match the data model for the scores, and this mismatch will cause errors when you attempt to read the scored data. A minimal conversion sketch follows this list.
- If a field's measurement level is Geospatial, the @OFFSET function is not supported.
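For example, here is a minimal sketch of the storage recommendation above, assuming a hypothetical integer-storage continuous target named revenue. A Derive node upstream of the modeling node creates a real-storage copy, which is then used as the target in place of the original field:

```
to_real(revenue)
```

The field name and node placement are illustrative; any conversion that gives the target real storage before modeling avoids the mismatch.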
- Source
- A stream that begins with anything other than an Analytic Server source node will be run locally.
- Record operations
- All Record operations are supported, with the exception of the Streaming TS and Space-Time-Boxes nodes. Further notes on supported node functionality follow.
- Select
- Supports the same set of functions supported by the Derive node (see the example below).
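For example, a hypothetical Select condition built from ordinary (non-sequence) Derive functions is supported, whereas a condition referencing sequence functions such as @OFFSET is not. Assuming hypothetical fields income and region:

```
log(income) > 10 and region = "North"
```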
- Sample
- Block-level sampling is not supported.
- Complex Sampling methods are not supported.
- First n sampling with "Discard sample" is not supported.
- First n sampling with N > 20000 is not supported.
- 1-in-n sampling is not supported when "Maximum sample size" is not set.
- 1-in-n sampling is not supported when N * "Maximum sample size" > 20000.
- Random % block-level sampling is not supported.
- Random % currently supports supplying a seed.
- Aggregate
- Contiguous keys are not supported. If you are reusing an existing stream that sorts the data and then uses the Contiguous keys setting in the Aggregate node, change the stream to remove the Sort node.
- Order statistics (Median, 1st Quartile, 3rd Quartile) are computed approximately, and supported through the Optimization tab.
- Sort
- The Optimization tab is not supported.
- A Sort followed by an Export node produces a sorted data source.
- A Sort followed by a Sample node with First record sampling returns the first N records.
- Merge
- Merge by Order is not supported.
- The Optimization tab is not supported.
- Merge operations are relatively slow. If you have available space in HDFS, it can be much faster to merge your data sources once and use the merged source in following streams than to merge the data sources in each stream.
- R Transform
- The R syntax in the node should consist of record-at-a-time operations; that is, each output record should be computed from the corresponding input record, without aggregating across or referencing other records.
- Field operations
- All Field operations are supported, with the exception of the Anonymize, Transpose, Time Intervals, and History nodes. Further notes on supported node functionality follow.
- Auto Data Prep
- Training the node is not supported. Applying the transformations in a trained Auto Data Prep node to new data is supported.
- Derive
- All Derive functions are supported, with the exception of sequence functions.
- Deriving a new field as a Count is essentially a sequence operation, and is thus not supported.
- Split fields cannot be derived in the same stream that uses them as splits; you will need to create two streams: one that derives the split field and one that uses the field as a split. A sketch of the first stream follows this list.
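As a sketch of the two-stream workaround, the first stream could use a Derive node such as the following (with a hypothetical numeric field age) and write out the result; the second stream would then read the materialized data and declare the derived field as a split:

```
if age < 40 then "young" else "older" endif
```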
- Filler
- Supports the same set of functions supported by the Derive node.
- Binning
- The following functionality is not supported.
- Optimal binning
- Ranks
- Tiles -> Tiling: Sum of values
- Tiles -> Ties: Keep in current and Assign randomly
- Tiles -> Custom N: values over 100, and any value of N where 100 % N is not equal to zero (that is, N must divide 100 evenly).
- RFM Analysis
- The Keep in current option for handling ties is not supported. RFM recency, frequency, and monetary scores will not always match those computed by Modeler from the same data. The score ranges will be the same but score assignments (bin numbers) may differ by one.
- Graphs
- All Graph nodes are supported.
- Modeling
- The following Modeling nodes are supported: Time Series, TCM, Isotonic-AS, Extension Model,
Tree-AS, C&R Tree, Quest, CHAID, Linear, Linear-AS, Neural Net, GLE, LSVM, TwoStep-AS, Random
Trees, STP, Association Rules, XGBoost-AS, Random Forest, and K-Means-AS. Further notes on those
nodes' functionality follow.
- Linear
- When building models on big data, you will typically want to change the objective to Very large
datasets, or specify splits.
- Continued training of existing PSM models is not supported.
- The Standard model building objective is only recommended if split fields are defined so that the number of records in each split is not too large, where the definition of "too large" depends upon the power of the individual nodes in your Hadoop cluster. At the same time, be careful not to define splits so finely that there are too few records to build a model.
- The Boosting objective is not supported.
- The Bagging objective is not supported.
- The Very large datasets objective is not recommended when there are few records; it will often either fail to build a model or build a degraded model.
- Automatic Data Preparation is not supported. This can cause problems when trying to build a model on data with many missing values; normally these would be imputed as part of automatic data preparation. A workaround is to use a tree model, or a neural network with the Advanced setting for imputing missing values selected.
- The accuracy statistic is not computed for split models.
- Neural Net
- When building models on big data, you will typically want to change the objective to Very large
datasets, or specify splits.
- Continued training of existing standard or PSM models is not supported.
- The Standard model building objective is only recommended if split fields are defined so that the number of records in each split is not too large, where the definition of "too large" depends upon the power of the individual nodes in your Hadoop cluster. At the same time, be careful not to define splits so finely that there are too few records to build a model.
- The Boosting objective is not supported.
- The Bagging objective is not supported.
- The Very large datasets objective is not recommended when there are few records; it will often either fail to build a model or build a degraded model.
- When there are many missing values in the data, use the Advanced setting to impute missing values.
- The accuracy statistic is not computed for split models.
- C&R Tree, CHAID, and Quest
- When building models on big data, you will typically want to change the objective to Very large
datasets, or specify splits.
- Continued training of existing PSM models is not supported.
- The Standard model building objective is only recommended if split fields are defined so that the number of records in each split is not too large, where the definition of "too large" depends upon the power of the individual nodes in your Hadoop cluster. At the same time, be careful not to define splits so finely that there are too few records to build a model.
- The Boosting objective is not supported.
- The Bagging objective is not supported.
- The Very large datasets objective is not recommended when there are few records; it will often either fail to build a model or build a degraded model.
- Interactive sessions are not supported.
- The accuracy statistic is not computed for split models.
- When a split field is present, tree models built locally in Modeler differ slightly from tree models built by Analytic Server, and thus produce different scores. The algorithms in both cases are valid; the algorithms used by Analytic Server are simply newer. Because tree algorithms tend to have many heuristic rules, some difference between the two components is normal.
- Model scoring
- All models supported for modeling are also supported for scoring. In addition, locally-built
model nuggets for the following nodes are supported for scoring: C&RT, Quest, CHAID, Linear, and
Neural Net (regardless of whether the model is standard, boosted, bagged, or for very large
datasets), Regression, C5.0, Logistic, Genlin, GLMM, Cox, SVM, Bayes Net, TwoStep, KNN, Decision
List, Discriminant, Self Learning, Anomaly Detection, Apriori, Carma, K-Means, Kohonen, R, and Text Mining.
- No raw or adjusted propensities will be scored. As a workaround, you can get the same effect by manually computing the raw propensity using a Derive node with the following expression: if 'predicted-value' == 'value-of-interest' then 'prob-of-that-value' else 1-'prob-of-that-value' endif. A concrete instance follows below.
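For instance, assuming a hypothetical flag target churn with value of interest "T", and assuming the nugget's prediction and probability fields are named '$R-churn' and '$RP-churn' (the actual field-name prefixes vary by model type), the Derive expression would be:

```
if '$R-churn' == "T" then '$RP-churn' else 1 - '$RP-churn' endif
```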
- Output
- The Matrix, Analysis, Data Audit, Transform, Set Globals, Statistics, Means, and Table nodes are supported.
- Export
- A stream can begin with an Analytic Server source node and end with an export node other than the Analytic Server export node, but data will move from HDFS to SPSS Modeler Server, and finally to the export location.