GenLin Node Expert Options
If you have detailed knowledge of generalized linear models, expert options allow you to fine-tune the training process. To access expert options, set Mode to Expert on the Expert tab.
Target Field Distribution and Link Function
Distribution.
This selection specifies the distribution of the dependent variable. The ability to specify a non-normal distribution and non-identity link function is the essential improvement of the generalized linear model over the general linear model. There are many possible distribution-link function combinations, and several may be appropriate for any given dataset, so your choice can be guided by a priori theoretical considerations or by which combination seems to fit best.
- Binomial. This distribution is appropriate only for variables that represent a binary response or number of events.
- Gamma. This distribution is appropriate for variables with positive scale values that are skewed toward larger positive values. If a data value is less than or equal to 0 or is missing, then the corresponding case is not used in the analysis.
- Inverse Gaussian. This distribution is appropriate for variables with positive scale values that are skewed toward larger positive values. If a data value is less than or equal to 0 or is missing, then the corresponding case is not used in the analysis.
- Negative binomial. This distribution can be thought of as the number of trials required to observe k successes and is appropriate for variables with non-negative integer values. If a data value is non-integer, less than 0, or missing, then the corresponding case is not used in the analysis. The fixed value of the negative binomial distribution's ancillary parameter can be any number greater than or equal to 0. When the ancillary parameter is set to 0, using this distribution is equivalent to using the Poisson distribution.
- Normal. This is appropriate for scale variables whose values take a symmetric, bell-shaped distribution about a central (mean) value. The dependent variable must be numeric.
- Poisson. This distribution can be thought of as the number of occurrences of an event of interest in a fixed period of time and is appropriate for variables with non-negative integer values. If a data value is non-integer, less than 0, or missing, then the corresponding case is not used in the analysis.
- Tweedie. This distribution is appropriate for variables that can be represented by Poisson mixtures of gamma distributions; the distribution is "mixed" in the sense that it combines properties of continuous (takes non-negative real values) and discrete distributions (positive probability mass at a single value, 0). The dependent variable must be numeric, with data values greater than or equal to zero. If a data value is less than zero or missing, then the corresponding case is not used in the analysis. The fixed value of the Tweedie distribution's parameter can be any number greater than one and less than two.
- Multinomial. This distribution is appropriate for variables that represent an ordinal response. The dependent variable can be numeric or string, and it must have at least two distinct valid data values.
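The case-exclusion rules stated in the list above can be summarized in a short Python sketch. This is only an illustration of the documented rules, not the node's implementation; the function name `is_usable` and the lowercase distribution labels are hypothetical.

```python
import math

def is_usable(distribution, value):
    """Return True if a target value would be kept under the rules listed above.

    `distribution` is a hypothetical lowercase label, not a product setting.
    """
    # Missing values (None or NaN) are never used in the analysis.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return False
    if distribution in ("gamma", "inverse_gaussian"):
        # Values less than or equal to 0 are excluded.
        return value > 0
    if distribution in ("poisson", "negative_binomial"):
        # Negative or non-integer values are excluded.
        return value >= 0 and float(value).is_integer()
    if distribution == "tweedie":
        # Negative values are excluded; zero is allowed.
        return value >= 0
    # The remaining distributions are not filtered in this sketch.
    return True

values = [3, 0, -1, 2.5, None]
kept = {
    name: [v for v in values if is_usable(name, v)]
    for name in ("gamma", "poisson", "tweedie")
}
```

In this example the same five values yield a different set of usable cases for each distribution, which is worth checking before comparing fits across distributions.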
Link Functions.
The link function is a transformation of the dependent variable that allows estimation of the model. The following functions are available:
- Identity. f(x)=x. The dependent variable is not transformed. This link can be used with any distribution.
- Complementary log-log. f(x)=log(−log(1−x)). This is appropriate only with the binomial distribution.
- Cumulative Cauchit. f(x) = tan(π (x – 0.5)), applied to the cumulative probability of each category of the response. This is appropriate only with the multinomial distribution.
- Cumulative complementary log-log. f(x)=ln(−ln(1−x)), applied to the cumulative probability of each category of the response. This is appropriate only with the multinomial distribution.
- Cumulative logit. f(x)=ln(x / (1−x)), applied to the cumulative probability of each category of the response. This is appropriate only with the multinomial distribution.
- Cumulative negative log-log. f(x)=−ln(−ln(x)), applied to the cumulative probability of each category of the response. This is appropriate only with the multinomial distribution.
- Cumulative probit. f(x)=Φ^(−1)(x), applied to the cumulative probability of each category of the response, where Φ^(−1) is the inverse standard normal cumulative distribution function. This is appropriate only with the multinomial distribution.
- Log. f(x)=log(x). This link can be used with any distribution.
- Log complement. f(x)=log(1−x). This is appropriate only with the binomial distribution.
- Logit. f(x)=log(x / (1−x)). This is appropriate only with the binomial distribution.
- Negative binomial. f(x)=log(x / (x+k^(−1))), where k is the ancillary parameter of the negative binomial distribution. This is appropriate only with the negative binomial distribution.
- Negative log-log. f(x)=−log(−log(x)). This is appropriate only with the binomial distribution.
- Odds power. f(x)=[(x/(1−x))^α − 1]/α, if α ≠ 0. f(x)=log(x/(1−x)), if α=0, which is the limit of the first expression as α approaches 0. α is the required number specification and must be a real number. This is appropriate only with the binomial distribution.
- Probit. f(x)=Φ^(−1)(x), where Φ^(−1) is the inverse standard normal cumulative distribution function. This is appropriate only with the binomial distribution.
- Power. f(x)=x^α, if α ≠ 0. f(x)=log(x), if α=0. α is the required number specification and must be a real number. This link can be used with any distribution.
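Several of the link functions listed above are simple enough to write out directly. The following Python sketch uses only the standard library and is meant as an illustration of the formulas, not as the code the node runs; all function names are hypothetical. The last part shows what "applied to the cumulative probability of each category" can look like for an ordinal response with three categories.

```python
import math
from itertools import accumulate
from statistics import NormalDist

def logit(x):
    return math.log(x / (1 - x))

def probit(x):
    # Inverse of the standard normal cumulative distribution function.
    return NormalDist().inv_cdf(x)

def cloglog(x):
    return math.log(-math.log(1 - x))

def negative_loglog(x):
    return -math.log(-math.log(x))

def log_complement(x):
    return math.log(1 - x)

def cauchit(x):
    return math.tan(math.pi * (x - 0.5))

def power(x, alpha):
    # The power link falls back to log(x) when alpha is 0.
    return math.log(x) if alpha == 0 else x ** alpha

# Cumulative links operate on cumulative category probabilities.
# The probabilities below are made-up values for three ordered categories.
category_probs = [0.2, 0.5, 0.3]
# Drop the last cumulative value (it is 1), where the logit is undefined.
cumulative = list(accumulate(category_probs))[:-1]
cumulative_logits = [logit(p) for p in cumulative]
```

Each function expects an argument inside the domain where the formula is defined, for example a probability strictly between 0 and 1 for the logit.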
Parameters. The controls in this group allow you to specify parameter values when certain distribution options are chosen.
- Parameter for negative binomial. For negative binomial distribution, choose either to specify a value or to allow the system to provide an estimated value.
- Parameter for Tweedie. For Tweedie distribution, specify a number between 1.0 and 2.0 for the fixed value.
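To give a feel for what these two parameters control, the sketch below writes out the variance functions commonly associated with each distribution. It assumes the usual parameterizations, in which the negative binomial variance is mu + k*mu^2 and the Tweedie variance function is mu^p; these formulas and the function names are assumptions for illustration and are not taken from the product documentation.

```python
def negative_binomial_variance(mu, k):
    """Variance function under the assumed form mu + k * mu**2."""
    if k < 0:
        raise ValueError("the ancillary parameter must be >= 0")
    return mu + k * mu ** 2

def tweedie_variance(mu, p):
    """Variance function under the assumed form mu**p, with 1 < p < 2."""
    if not 1.0 < p < 2.0:
        raise ValueError("the Tweedie parameter must be between 1 and 2")
    return mu ** p

# With k = 0 the negative binomial variance equals the mean,
# which is the Poisson case.
poisson_like = negative_binomial_variance(3.0, 0)
overdispersed = negative_binomial_variance(3.0, 0.5)
tweedie_mid = tweedie_variance(4.0, 1.5)
```

Under these assumptions, larger values of k add more overdispersion relative to Poisson, and a Tweedie parameter closer to 2 makes the variance grow faster with the mean.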
Parameter Estimation. The controls in this group allow you to specify estimation methods and to provide initial values for the parameter estimates.
- Method. You can select a parameter estimation method. Choose between Newton-Raphson, Fisher scoring, or a hybrid method in which Fisher scoring iterations are performed before switching to the Newton-Raphson method. If convergence is achieved during the Fisher scoring phase of the hybrid method before the maximum number of Fisher iterations is reached, the algorithm continues with the Newton-Raphson method.
- Scale Parameter Method. You can select the scale parameter estimation method. Maximum-likelihood jointly estimates the scale parameter with the model effects; note that this option is not valid if the response has a negative binomial, Poisson, or binomial distribution. The deviance and Pearson chi-square options estimate the scale parameter from the value of those statistics. Alternatively, you can specify a fixed value for the scale parameter.
- Covariance matrix. The model-based estimator is the negative of the generalized inverse of the Hessian matrix. The robust (also called the Huber/White/sandwich) estimator is a "corrected" model-based estimator that provides a consistent estimate of the covariance, even when the specification of the variance and link functions is incorrect.
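The hybrid estimation method described above can be sketched on a deliberately small problem. The example assumes a Poisson response with an identity link and a single coefficient, so that mu_i = beta * x_i, because in that setting the expected (Fisher) and observed (Newton-Raphson) information differ. The function name `fit_hybrid`, its arguments, and the step-size convergence rule are all hypothetical; the node's actual algorithm and convergence criteria may differ.

```python
def fit_hybrid(x, y, beta=1.0, max_fisher=1, max_iter=100, tol=1e-10):
    """Estimate beta for a Poisson model with identity link, mu_i = beta * x_i.

    Assumes all x_i are positive and beta stays positive so every mu_i > 0.
    """
    def score(b):
        # First derivative of the log-likelihood with respect to beta.
        return sum((yi - b * xi) * xi / (b * xi) for xi, yi in zip(x, y))

    def expected_information(b):
        # Used by Fisher scoring.
        return sum(xi ** 2 / (b * xi) for xi in x)

    def observed_information(b):
        # Negative second derivative, used by Newton-Raphson.
        return sum(yi * xi ** 2 / (b * xi) ** 2 for xi, yi in zip(x, y))

    iterations = 0
    # Phase 1: at most max_fisher Fisher scoring iterations.
    for _ in range(max_fisher):
        step = score(beta) / expected_information(beta)
        beta += step
        iterations += 1
        if abs(step) < tol:
            break
    # Phase 2: continue with Newton-Raphson, even if phase 1 converged early.
    while iterations < max_iter:
        step = score(beta) / observed_information(beta)
        beta += step
        iterations += 1
        if abs(step) < tol:
            break
    return beta, iterations

x = [1.0, 2.0, 3.0, 4.0]
y = [2, 3, 7, 8]
beta_hat, n_iter = fit_hybrid(x, y)
```

For this particular model the maximum likelihood estimate has the closed form sum(y) / sum(x), which gives a convenient check on the result.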
Iterations. These options allow you to control the parameters for model convergence. See the topic Generalized Linear Models Iterations for more information.
Output. These options allow you to request additional statistics that will be displayed in the advanced output of the model nugget built by the node. See the topic Generalized Linear Models Advanced Output for more information.
Singularity Tolerance. Singular (or non-invertible) matrices have linearly dependent columns, which can cause serious problems for the estimation algorithm. Even near-singular matrices can lead to poor results, so the procedure will treat a matrix whose determinant is less than the tolerance as singular. Specify a positive value.
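The determinant check described above can be illustrated for a 2x2 matrix. This is a minimal sketch of the idea only; the function name and the tolerance value used here are hypothetical and are not the product's defaults.

```python
def is_treated_as_singular(matrix, tolerance):
    """Return True if a 2x2 matrix has a determinant below the tolerance."""
    if tolerance <= 0:
        raise ValueError("the tolerance must be a positive value")
    (a, b), (c, d) = matrix
    determinant = a * d - b * c
    # The absolute value is taken so that the check does not depend on sign.
    return abs(determinant) < tolerance

tolerance = 1e-4  # hypothetical value chosen for illustration

exactly_singular = [[1.0, 2.0], [2.0, 4.0]]    # second column = 2 * first
nearly_singular = [[1.0, 2.0], [2.0, 4.000001]]
well_conditioned = [[2.0, 0.0], [0.0, 3.0]]

results = [
    is_treated_as_singular(m, tolerance)
    for m in (exactly_singular, nearly_singular, well_conditioned)
]
```

The second matrix is technically invertible, but its determinant is so small that a check of this kind treats it the same way as the exactly singular one.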