Supported nodes

Many SPSS® Modeler nodes are supported for execution on HDFS, but there may be some differences in the execution of certain nodes, and some are not currently supported. This topic details the current level of support.

General
  • Some characters that are normally acceptable within a quoted Modeler field name will not be accepted by Analytic Server.
  • For a Modeler stream to be run in Analytic Server, it must begin with one or more Analytic Server Source nodes and end in a single modeling node or Analytic Server Export node.
  • It is recommended that you set the storage of continuous targets to real rather than integer. Scoring models always write real values to the output data files for continuous targets, while the output data model for the scores follows the storage of the target. Thus, if a continuous target has integer storage, the written values and the data model for the scores will not match, and this mismatch will cause errors when you attempt to read the scored data. A sketch of one way to convert the target storage follows this list.
  • The @OFFSET function is not supported for fields with a Geospatial measurement level.
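One way to follow the storage recommendation above is to convert the target with a Filler node immediately before the modeling node. The following Python scripting sketch is an illustration only: it runs as a stream script inside SPSS Modeler, and the node type names ("asimport", "filler"), the property names, and the field name "target" are assumptions that should be checked against the node properties reference for your release.

    # Runs as a stream script inside SPSS Modeler, where modeler.script is predefined.
    stream = modeler.script.stream()

    # Assumed scripting type name for the Analytic Server Source node.
    source = stream.findByType("asimport", None)

    # Filler node that rewrites the (hypothetical) continuous target "target" with real storage.
    filler = stream.create("filler", "Target storage to real")
    filler.setPropertyValue("fields", ["target"])               # placeholder field name
    filler.setPropertyValue("replace_mode", "Always")           # assumed property value
    filler.setPropertyValue("replace_with", "to_real(@FIELD)")  # CLEM conversion to real storage

    stream.link(source, filler)
    # Link the Filler node to the downstream modeling node as usual.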
Source
  • A stream that begins with anything other than an Analytic Server source node will be run locally.
Record operations
All Record operations are supported, with the exception of the Streaming TS and Space-Time-Boxes nodes. Further notes on supported node functionality follow.
Select
  • Supports the same set of functions supported by the Derive node.
Sample
  • Block-level sampling is not supported.
  • Complex Sampling methods are not supported.
  • First n sampling with "Discard sample" is not supported.
  • First n sampling with N>20000 is not supported.
  • 1-in-n sampling is not supported when "Maximum sample size" is not set.
  • 1-in-n sampling is not supported when N * "Maximum sample size" > 20000.
  • Random % block level sampling is not supported.
  • Random % currently supports supplying a seed.
Aggregate
  • Contiguous keys are not supported. If you are reusing an existing stream that sorts the data and then relies on this setting in the Aggregate node, change the stream to remove the Sort node (see the scripting sketch after this list).
  • Order statistics (Median, 1st Quartile, 3rd Quartile) are computed approximately, and supported through the Optimization tab.
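If you are adapting an existing stream as described above, the Sort node can be removed by script rather than by hand. This sketch assumes the stream contains a single Sort node; the predecessors/successors and delete calls reflect the Modeler Python scripting API as commonly documented and should be verified for your release.

    # Runs as a stream script inside SPSS Modeler.
    stream = modeler.script.stream()

    sort_node = stream.findByType("sort", None)       # first Sort node in the stream, if any
    if sort_node is not None:
        upstream = stream.predecessors(sort_node)     # nodes feeding the Sort node
        downstream = stream.successors(sort_node)     # nodes fed by the Sort node
        stream.delete(sort_node)                      # remove the node and its links
        for up in upstream:
            for down in downstream:
                stream.link(up, down)                 # reconnect the neighbours directly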
Sort
  • The Optimization tab is not supported.
In a distributed environment, there are a limited number of operations that preserve the record order established by the Sort node.
  • A Sort followed by an Export node produces a sorted data source.
  • A Sort followed by a Sample node with First record sampling returns the first N records.
In general, you should place a Sort node as close as possible to the operations that need the sorted records.
Merge
  • Merge by Order is not supported.
  • The Optimization tab is not supported.
  • Merge operations are relatively slow. If you have space available in HDFS, it can be much faster to merge your data sources once and reuse the merged source in subsequent streams than to merge the data sources in each stream (a scripting sketch follows this list).
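A minimal sketch of the merge-once pattern is shown below. The node type names ("asimport", "asexport"), the Merge node properties, and the key field "customer_id" are assumptions for illustration; configure the export node to point at the shared data source you want later streams to read.

    # Runs as a stream script inside SPSS Modeler.
    stream = modeler.script.stream()

    left = stream.create("asimport", "Customers")          # assumed Analytic Server Source type name
    right = stream.create("asimport", "Transactions")

    merge = stream.create("merge", "Merge once")
    merge.setPropertyValue("method", "Keys")                # merge by key (merge by order is unsupported)
    merge.setPropertyValue("key_fields", ["customer_id"])   # placeholder key field

    export = stream.create("asexport", "Merged data source")  # assumed Analytic Server Export type name

    stream.link(left, merge)
    stream.link(right, merge)
    stream.link(merge, export)
    # Later streams read the exported data source instead of repeating the merge.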
R Transform
The R syntax in the node should consist of record-at-a-time operations.
Field operations
All Field operations are supported, with the exception of the Anonymize, Transpose, Time Intervals, and History nodes. Further notes on supported node functionality follow.
Auto Data Prep
  • Training the node is not supported. Applying the transformations in a trained Auto Data Prep node to new data is supported.
Derive
  • All Derive functions are supported, with the exception of sequence functions.
  • Deriving a new field as a Count is essentially a sequence operation, and thus not supported.
  • Split fields cannot be derived in the same stream that uses them as splits; you will need to create two streams: one that derives the split field and another that uses the field as splits (see the sketch after this list).
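A sketch of the two-stream workaround follows: the first stream derives the split field and writes the data back to Analytic Server; a second stream then reads that data source and declares the field as a split in a Type node. The node type names, the Derive node properties, and the example CLEM rule are assumptions for illustration.

    # Stream 1: derive the split field and export the data (runs as a Modeler stream script).
    stream = modeler.script.stream()

    source = stream.create("asimport", "Input data")

    derive = stream.create("derive", "Derive region split")
    derive.setPropertyValue("new_name", "region_split")
    derive.setPropertyValue("result_type", "Formula")
    derive.setPropertyValue("formula_expr",
                            "if revenue > 1000 then 'high' else 'low' endif")  # placeholder rule

    export = stream.create("asexport", "Data with split field")

    stream.link(source, derive)
    stream.link(derive, export)

    # Stream 2 (a separate stream): read the exported data source and mark the field as a split, e.g.
    #   typenode.setKeyedPropertyValue("direction", "region_split", "Split")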
Filler
  • Supports the same set of functions supported by the Derive node.
Binning
The following functionality is not supported.
  • Optimal binning
  • Ranks
  • Tiles -> Tiling: Sum of values
  • Tiles -> Ties: Keep in current and Assign randomly
  • Tiles -> Custom N: Values over 100, and any N value that does not divide 100 evenly (that is, where 100 % N is not equal to zero); for example, N=3 is not supported, while N=4 and N=5 are.
RFM Analysis
  • The Keep in current option for handling ties is not supported. RFM recency, frequency, and monetary scores will not always match those computed by Modeler from the same data. The score ranges will be the same but score assignments (bin numbers) may differ by one.
Graphs
All Graph nodes are supported.
Modeling
The following Modeling nodes are supported: Time Series, TCM, Isotonic-AS, Extension Model, Tree-AS, C&R Tree, Quest, CHAID, Linear, Linear-AS, Neural Net, GLE, LSVM, TwoStep-AS, Random Trees, STP, Association Rules, XGBoost-AS, Random Forest, and K-Means-AS. Further notes on those nodes' functionality follow.
Linear
When building models on big data, you will typically want to change the objective to Very large datasets, or specify splits (a scripting sketch follows this list).
  • Continued training of existing PSM models is not supported.
  • The Standard model building objective is only recommended if split fields are defined so that the number of records in each split is not too large, where the definition of "too large" is dependent upon the power of individual nodes in your Hadoop cluster. Conversely, you also need to be careful to ensure that splits are not defined so finely that there are too few records to build a model.
  • The Boosting objective is not supported.
  • The Bagging objective is not supported.
  • The Very large datasets objective is not recommended when there are few records; it will often either not build a model or will build a degraded model.
  • Automatic Data Preparation is not supported. This can cause problems when trying to build a model on data with many missing values; normally these would be imputed as part of automatic data preparation. A workaround would be to use a tree model or a neural network with the Advanced setting to impute missing values selected.
  • The accuracy statistic is not computed for split models.
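A minimal sketch of switching the objective by script is shown below; the same idea applies to the Neural Net and tree-building nodes described in the following sections. The property name "objective" and the value "psm" (assumed here to correspond to the Very large datasets objective) should be verified against the node properties reference for your release.

    # Runs as a stream script inside SPSS Modeler.
    stream = modeler.script.stream()

    linear = stream.findByType("linear", None)       # first Linear modeling node in the stream
    if linear is not None:
        linear.setPropertyValue("objective", "psm")  # assumed value for Very large datasets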
Neural Net
When building models on big data, you will typically want to change the objective to Very large datasets, or specify splits.
  • Continued training of existing standard or PSM models is not supported.
  • The Standard model building objective is only recommended if split fields are defined so that the number of records in each split is not too large, where the definition of "too large" is dependent upon the power of individual nodes in your Hadoop cluster. Conversely, you also need to be careful to ensure that splits are not defined so finely that there are too few records to build a model.
  • The Boosting objective is not supported.
  • The Bagging objective is not supported.
  • The Very large datasets objective is not recommended when there are few records; it will often either not build a model or will build a degraded model.
  • When there are many missing values in the data, use the Advanced setting to impute missing values.
  • The accuracy statistic is not computed for split models.
C&R Tree, CHAID, and Quest
When building models on big data, you will typically want to change the objective to Very large datasets, or specify splits.
  • Continued training of existing PSM models is not supported.
  • The Standard model building objective is only recommended if split fields are defined so that the number of records in each split is not too large, where the definition of "too large" is dependent upon the power of individual nodes in your Hadoop cluster. Conversely, you also need to be careful to ensure that splits are not defined so finely that there are too few records to build a model.
  • The Boosting objective is not supported.
  • The Bagging objective is not supported.
  • The Very large datasets objective is not recommended when there are few records; it will often either not build a model or will build a degraded model.
  • Interactive sessions are not supported.
  • The accuracy statistic is not computed for split models.
  • When a split field is present, tree models built locally in Modeler differ slightly from tree models built by Analytic Server, and thus produce different scores. The algorithms in both cases are valid; the algorithms used by Analytic Server are simply newer. Because tree algorithms tend to have many heuristic rules, some difference between the two components is normal.
Model scoring
All models supported for modeling are also supported for scoring. In addition, locally built model nuggets for the following nodes are supported for scoring: C&RT, Quest, CHAID, Linear, and Neural Net (regardless of whether the model is standard, boosted, bagged, or for very large datasets), Regression, C5.0, Logistic, Genlin, GLMM, Cox, SVM, Bayes Net, TwoStep, KNN, Decision List, Discriminant, Self Learning, Anomaly Detection, Apriori, Carma, K-Means, Kohonen, R, and Text Mining.
  • No raw or adjusted propensities will be scored. As a workaround, you can get the same effect by manually computing the raw propensity using a Derive node with the following expression (a scripting sketch using it follows): if 'predicted-value' == 'value-of-interest' then 'prob-of-that-value' else 1-'prob-of-that-value' endif
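A minimal scripting sketch of this workaround is shown below. The field names in the CLEM expression are the placeholders used above and must be replaced with the prediction and probability fields produced by your model nugget; the Derive node properties are assumptions to verify for your release.

    # Runs as a stream script inside SPSS Modeler.
    stream = modeler.script.stream()

    derive = stream.create("derive", "Raw propensity")
    derive.setPropertyValue("new_name", "raw_propensity")
    derive.setPropertyValue("result_type", "Formula")
    derive.setPropertyValue("formula_expr",
        "if 'predicted-value' == 'value-of-interest' then 'prob-of-that-value' "
        "else 1-'prob-of-that-value' endif")

    # Link the Derive node downstream of the applied model nugget before exporting the scores.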
R
The R syntax in the nugget should consist of record-at-a-time operations.
Output
The Matrix, Analysis, Data Audit, Transform, Set Globals, Statistics, Means, and Table nodes are supported. Further notes on supported node functionality follow.
Data Audit
The Data Audit node cannot produce the mode for continuous fields.
Means
The Means node cannot produce a standard error or 95% confidence interval.
Table
The Table node is supported by writing a temporary Analytic Server data source containing the results of upstream operations. The Table node then pages through the contents of that data source.
Export
A stream can begin with an Analytic Server source node and end with an export node other than the Analytic Server export node, but data will move from HDFS to SPSS Modeler Server, and finally to the export location.