XGBoost-AS node Build Options

Use the Build Options tab to specify build options for the XGBoost-AS node, including general options for model building and handling imbalanced datasets, learning task options for objectives and evaluation metrics, and booster parameters for specific boosters. For more information about these options, see the online resources cited at the end of this section.

General

Number of Workers. Number of workers used to train the XGBoost model.

Number of Threads. Number of threads used per worker.

Use External Memory. Whether to use external memory as cache.

Booster Type. The booster to use (gbtree, gblinear, or dart).

Boosting Round Number. The number of rounds for boosting.

Scale pos weight. This setting controls the balance of positive and negative weights, and is useful for unbalanced classes.

Random Seed. Click Generate to generate the seed used by the random number generator.
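
The general options correspond to the scripting property names listed in Table 1 below. As a rough illustration, the following sketch sets them from an SPSS Modeler Python script; the node-type string passed to createAt and all of the values, including the scale-pos-weight figure, are assumptions for illustration only, not values taken from this document.

# Sketch only: configure the XGBoost-AS General build options by script.
# Runs inside the SPSS Modeler Python scripting environment, where
# modeler.script is available. Property names are taken from Table 1;
# the node-type string "xgboostas" is an assumption.
stream = modeler.script.stream()
xgb = stream.createAt("xgboostas", "XGBoost-AS", 192, 96)

xgb.setPropertyValue("nWorkers", 4)               # Number of Workers
xgb.setPropertyValue("numThreadPerTask", 2)       # Number of Threads per worker
xgb.setPropertyValue("useExternalMemory", False)  # Use External Memory
xgb.setPropertyValue("boosterType", "gbtree")     # gbtree, gblinear, or dart
xgb.setPropertyValue("numBoostRound", 100)        # Boosting Round Number

# For Scale pos weight, a common rule of thumb from the XGBoost documentation
# is count(negative instances) / count(positive instances); 9.0 is illustrative.
xgb.setPropertyValue("scalePosWeight", 9.0)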

Learning Task

Objective. Select from the following learning task objective types: reg:linear, reg:logistic, reg:gamma, reg:tweedie, rank:pairwise, binary:logistic, or multi.

Evaluation Metrics. Evaluation metrics for validation data. A default metric is assigned according to the objective (rmse for regression, error for classification, or mean average precision for ranking). Available options are rmse, mae, logloss, error, merror, mlogloss, auc, ndcg, map, or gamma-deviance (default is rmse).
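
Continuing the same sketch, the Learning Task options map to the objectiveType and evalMetric properties from Table 1; the values shown are one plausible configuration for a binary classification target, not defaults.

# Learning Task options (continuing the sketch above; values are illustrative).
xgb.setPropertyValue("objectiveType", "binary:logistic")  # Objective
xgb.setPropertyValue("evalMetric", "auc")                  # Evaluation Metrics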

Booster Parameters

Lambda. L2 regularization term on weights. Increasing this value will make the model more conservative.

Alpha. L1 regularization term on weights. Increasing this value will make the model more conservative.

Lambda bias. L2 regularization term on bias. (There is no L1 regularization term on bias because it is not important.)

Tree method. Select the XGBoost tree construction algorithm to use.

Max depth. Specify the maximum depth for trees. Increasing this value will make the model more complex and more likely to overfit.

Min child weight. Specify the minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node whose sum of instance weight is less than this value, the building process stops further partitioning. In linear regression mode, this simply corresponds to the minimum number of instances needed in each node. The larger the value, the more conservative the algorithm will be.

Max delta step. Specify the maximum delta step to allow for each tree's weight estimation. If set to 0, there is no constraint. If set to a positive value, it can help the update step be more conservative. Usually this parameter is not needed, but it may help in logistic regression when a class is extremely imbalanced.

Sub sample. The ratio of training instances used to grow each tree. For example, if you set this to 0.5, XGBoost randomly collects half of the data instances to grow trees, which helps prevent overfitting.

Eta. The step size shrinkage used during the update step to prevent overfitting. After each boosting step, the weights of new features can be obtained directly, and eta shrinks these feature weights to make the boosting process more conservative.

Gamma. The minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the gamma setting, the more conservative the algorithm will be.

Colsample by tree. Sub sample ratio of columns when constructing each tree.

Colsample by level. Sub sample ratio of columns for each split, in each level.

Normalization Algorithm. The normalization algorithm to use when the dart booster type is selected under General options. Available options are tree or forest (default is tree).

Sampling Algorithm. The sampling algorithm to use when the dart booster type is selected under General options. The uniform algorithm uniformly selects dropped trees. The weighted algorithm selects dropped trees in proportion to weight. The default is uniform.

Dropout Rate. The dropout rate to use when the dart booster type is selected under General options.

Probability of Skip Dropout. The skip dropout probability to use when the dart booster type is selected under General options. If a dropout is skipped, new trees are added in the same manner as gbtree.
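
The booster parameters likewise map to the scripting properties in Table 1. The sketch below continues the earlier example; the values are illustrative only, and the last four properties take effect only when Booster Type is set to dart.

# Booster parameters (continuing the sketch above; values are illustrative).
xgb.setPropertyValue("lambda", 1.0)           # L2 regularization on weights
xgb.setPropertyValue("alpha", 0.0)            # L1 regularization on weights
xgb.setPropertyValue("maxDepth", 6)           # deeper trees mean a more complex model
xgb.setPropertyValue("minChildWeight", 1.0)   # minimum sum of instance weight per child
xgb.setPropertyValue("eta", 0.3)              # step size shrinkage
xgb.setPropertyValue("gamma", 0.0)            # minimum loss reduction to split
xgb.setPropertyValue("sampleSize", 0.8)       # Sub sample ratio of training instances

# dart-only settings; ignored unless boosterType is "dart".
xgb.setPropertyValue("normalizeType", "tree")
xgb.setPropertyValue("sampleType", "uniform")
xgb.setPropertyValue("rateDrop", 0.1)
xgb.setPropertyValue("skipDrop", 0.5)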

The following table shows the relationship between the settings in the SPSS® Modeler XGBoost-AS node dialog and the XGBoost Spark parameters.
Table 1. Node properties mapped to Spark parameters
SPSS Modeler setting          Script name (property name)  XGBoost Spark parameter
Target                        target_fields
Predictors                    input_fields
Number of Workers             nWorkers                      nWorkers
Number of Threads             numThreadPerTask              numThreadPerTask
Use External Memory           useExternalMemory             useExternalMemory
Booster Type                  boosterType                   boosterType
Boosting Round Number         numBoostRound                 round
Scale Pos Weight              scalePosWeight                scalePosWeight
Objective                     objectiveType                 objective
Evaluation Metrics            evalMetric                    evalMetric
Lambda                        lambda                        lambda
Alpha                         alpha                         alpha
Lambda bias                   lambdaBias                    lambdaBias
Tree Method                   treeMethod                    treeMethod
Max Depth                     maxDepth                      maxDepth
Min child weight              minChildWeight                minChildWeight
Max delta step                maxDeltaStep                  maxDeltaStep
Sub sample                    sampleSize                    sampleSize
Eta                           eta                           eta
Gamma                         gamma                         gamma
Colsample by tree             colsSampleRation              colSampleByTree
Colsample by level            colsSampleLevel               colsSampleLevel
Normalization Algorithm       normalizeType                 normalizeType
Sampling Algorithm            sampleType                    sampleType
Dropout Rate                  rateDrop                      rateDrop
Probability of Skip Dropout   skipDrop                      skipDrop

1 "Scalable and Flexible Gradient Boosting." Web. © 2015-2016 DMLC.

2 "XGBoost Parameters" Scalable and Flexible Gradient Boosting. Web. © 2015-2016 DMLC.

3 "ml.dmlc.xgboost4j.scala.spark Params." DMLC for Scalable and Reliable Machine Learning. Web. 3 Oct 2017.