IBM Support

Apply SPSS Logistic Regression results to predict response for new cases

Troubleshooting


Problem

I have run the SPSS Logistic Regression procedure with one data set and wish to apply the results to predict probabilities on the dependent variable (DV) in a new file with the same variables. How can this be accomplished?

Resolving The Problem

Two methods are described below. The first method uses the /SELECT subcommand of the LOGISTIC REGRESSION procedure and requires the analysis cases and the application cases to be in the same SPSS data file. The second method uses SPSS transformation commands to compute the predicted response from the parameter estimates that the LOGISTIC REGRESSION procedure reports for the analysis data set. If the analysis cases and application cases are in the same file, or can be merged into one file, then the /SELECT subcommand is the simpler solution. If merging these data sets is not feasible, use Method 2.


Method 1: The /SELECT subcommand:
If you can merge the original analysis file and the new cases into one SPSS data file, with a variable that identifies these two data sources, then you can use the /SELECT subcommand in LOGISTIC REGRESSION to base the analysis on one set of cases but to compute estimated probabilities and response categories for all cases. For example, suppose that the original analysis cases have a value of 1 for the variable DATSET, while the new application cases have DATSET = 2. The LOGISTIC REGRESSION command would look like:

LOGISTIC REGRESSION VAR=dv
  /SELECT datset EQ 1
  /METHOD=ENTER age edlevel sal jobcat region
  /CONTRAST (region)=Indicator
  /SAVE PRED (dvprob) PGROUP (dvpred)
  /CRITERIA PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .

The selection variable and value can also be identified in the Logistic Regression dialog box, i.e. the graphical user interface (GUI) for the procedure. The Select button is in the lower left corner of the main Logistic Regression dialog. When you click that button, a new box appears near the bottom of the dialog. Move the selection variable (DATSET in this example) into the Selection Variable box and click the Rule button. Type the value for the analysis cases (1 in this example) into the Value box in the "Set Rule" dialog box that opens and then click Continue.
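
Once the command has run, the saved variables can be summarized for the application cases alone. The following is a minimal sketch, assuming the DATSET coding from the example above (2 for the application cases) and the variable names DVPROB and DVPRED from the /SAVE subcommand:

```spss
* Predicted categories for the application cases only.
TEMPORARY.
SELECT IF (datset EQ 2).
FREQUENCIES VARIABLES=dvpred.
* TEMPORARY covers one procedure only, so the selection is repeated.
TEMPORARY.
SELECT IF (datset EQ 2).
DESCRIPTIVES VARIABLES=dvprob /STATISTICS=MEAN MIN MAX.
```

Because the selection is temporary, all cases remain in the active file afterwards.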

If the analysis and application cases are in separate files and the selection variable does not exist in either file, you can create this variable as part of the process of merging the two files. For example:

ADD FILES /FILE=* /IN=datset
  /FILE='appdat.sav' .
EXECUTE .

In the above ADD FILES command, the analysis file is the current active file. The cases from the application file, appdat.sav, are added to these cases. The /IN subcommand creates the new variable DATSET. For cases from the current file, DATSET is set to 1. For cases from appdat.sav, DATSET is set to 0. This arrangement is determined by the location of the /IN subcommand, which follows the designation of the current file. If the /IN subcommand had followed the name of the application file, then cases from that file would have values of 1 assigned to DATSET. In the GUI, the files can be merged from the Data->Merge Files->Add Cases menu. In the "Add Cases from..." dialog, there is a checkbox titled "Indicate case source as variable". Checking that box will create a variable that indicates the source file for each case. You can replace the default variable name of SOURCE01, but note that the cases from the new file (i.e. the file that was not active in the data editor) will have the value 1 in the source file variable, so the analysis cases will have the value 0.
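
Before running LOGISTIC REGRESSION on the merged file, it can help to confirm that the source indicator is coded as expected. The following is a minimal sketch, assuming the merge was done with the ADD FILES command above, so that DATSET is 1 for analysis cases and 0 for application cases; the label text is illustrative only:

```spss
* Label the source indicator created by /IN and check the case counts.
VALUE LABELS datset 1 'Analysis cases' 0 'Application cases'.
FREQUENCIES VARIABLES=datset.
```

If the indicator was created in the GUI instead, the analysis cases would be expected to have the value 0, so the selection rule in LOGISTIC REGRESSION would need to use 0 rather than 1.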

Note that the variables to be analyzed must have the same variable names in both data sets. If they do not, you can rename the variables for one data set when you merge the files. In the GUI, you can highlight variable names in the "Unpaired Variables" box and rename them before clicking OK in the "Add Cases from..." box. In command syntax, use the /RENAME subcommand of the ADD FILES command. For example:

ADD FILES /FILE=* /IN=datset
  /FILE='appdat.sav'
  /RENAME (edstatus zone = edlevel region) .
EXECUTE .

In the above example, the application file variables EDSTATUS and ZONE correspond to EDLEVEL and REGION in the analysis file. Because the /RENAME subcommand follows the /FILE subcommand for appdat.sav, it applies to that file: EDSTATUS is renamed to EDLEVEL and ZONE is renamed to REGION during the merge. A variable can be renamed in the "Add Cases from..." dialog by highlighting it in the "Unpaired Variables" box and clicking the Rename button.


Method 2: Applying Regression Coefficients in Transformation Commands:
Suppose that the following Logistic Regression command was run on the model-building dataset:

LOGISTIC REGRESSION VAR=dv
  /METHOD=ENTER age edlevel sal jobcat region
  /CONTRAST (region)=Indicator
  /SAVE PRED (dvprob) PGROUP (dvpred)
  /CRITERIA PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .

Note that REGION is a 4-level categorical predictor. We have chosen the default indicator contrasts for this predictor, so indicator (dummy) variables will be constructed internally for each of the first 3 categories. The last category, i.e. the fourth category, is the reference category by default.
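
To make the internal coding concrete, the same indicator variables could be built by hand. This is only an illustrative sketch: the names REGION1 to REGION3 are hypothetical, and it assumes that REGION is coded 1, 2, 3 and 4. LOGISTIC REGRESSION does this internally, so the commands are not needed for the analysis itself.

```spss
* Hypothetical names region1 to region3, assuming REGION is coded 1 to 4.
COMPUTE region1 = (region = 1).
COMPUTE region2 = (region = 2).
COMPUTE region3 = (region = 3).
EXECUTE.
```

Cases in the fourth (reference) category of REGION receive 0 on all three variables.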

Suppose that the regression coefficients (the 'B' column in the 'Variables in the Equation' table in the Logistic Regression output) from the analysis were as follows:

Variable             B
AGE            .219399
EDLEVEL        .726163
SAL           -.065517
JOBCAT        -.023818
REGION
REGION(1)     2.549343
REGION(2)     2.019601
REGION(3)     -.281382
Constant    -19.581611

Note that there are coefficients for each of the indicator variables that were internally created for REGION. Cases with the fourth value of REGION will have values of 0 for each of the internal indicator variables. There is no overall coefficient for REGION.

You can use the coefficients from the Logistic Regression output to build a set of SPSS syntax commands that will compute predicted log odds, predicted probability of the target event on the DV, and predicted outcome for the cases in the new data file. Once the file with the application cases has been opened in SPSS, you can run these commands. The following example commands are based on the above coefficients.

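* Predicted log odds from the coefficients above.
COMPUTE z = .219399*AGE + .726163*EDLEVEL - .065517*SAL - .023818*JOBCAT
  + 2.549343*(REGION=1) + 2.019601*(REGION=2) - .281382*(REGION=3) - 19.581611 .
* Predicted probability of the target event.
COMPUTE predprob = 1/(1 + EXP(-z)) .
* Predicted category, using a cutoff of .5.
COMPUTE predcat = (predprob > .5) .
EXECUTE .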

In the above commands, the coefficients are applied directly to compute the predicted log odds, which are stored in the new variable Z. The value of Z is then used to compute the predicted probability of the target event, which is stored in the new variable PREDPROB. Cases are then predicted to be a 1 on the DV if their predicted probability is greater than .5, and a 0 otherwise. You can choose a different cutoff if you wish. Note that logical expressions in a COMPUTE command, such as "(REGION=1)" or "(predprob > .5)", return a 1 if true for the case and a 0 if false. The first 2 COMPUTE commands can be combined into a single command by replacing "z" in the computation of PREDPROB with the numeric expression for the calculation of Z. (Don't forget to enclose that expression in parentheses and multiply it by -1 inside the EXP function.) The intermediate command for the calculation of Z was used to help clarify the role of the coefficients in calculating the predicted log odds and the role of the predicted log odds in calculating the predicted probabilities.
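
The single-command form described above can be sketched as follows, using the same example coefficients. The whole log-odds expression is wrapped in parentheses and negated inside EXP:

```spss
* Single-command version: the log-odds expression is negated inside EXP.
COMPUTE predprob = 1/(1 + EXP(-(.219399*AGE + .726163*EDLEVEL - .065517*SAL - .023818*JOBCAT
  + 2.549343*(REGION=1) + 2.019601*(REGION=2) - .281382*(REGION=3) - 19.581611))).
COMPUTE predcat = (predprob > .5).
EXECUTE.
```

This version does not create the intermediate variable Z, so the predicted log odds are not saved in the file.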

If you have the output (.spo) file from the original logistic regression analysis, you can copy the coefficients from the "Variables in the Equation" pivot table and paste them into a syntax window, then build the COMPUTE command around these values. This will help avoid errors in entering the values and reduce the tedium. To copy the values from the pivot table, right-click the mouse with the cursor pointing anywhere in the "Variables in the Equation" table. When the Pivot Table Editor opens, highlight the column of coefficients. At this point, it's advisable to open the Format menu and choose "Cell Properties", then increase the value in the Decimals box of that dialog. The example commands use 6 decimals, but 10 or 12 may be preferable, as small rounding errors can noticeably affect results. After closing the Cell Properties dialog, copy the highlighted coefficients and paste them into a syntax window. You can then build the computation of Z around these values.
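
One way to gauge the effect of rounding is to run the Method 2 commands on the original analysis file and compare the result with the probabilities saved by LOGISTIC REGRESSION. The following is a minimal sketch, assuming that PREDPROB has been computed as shown earlier and that DVPROB was saved by the /SAVE subcommand; the name PROBDIFF is hypothetical:

```spss
* Run on the analysis file. Compares hand-computed and saved probabilities.
COMPUTE probdiff = ABS(predprob - dvprob).
DESCRIPTIVES VARIABLES=probdiff /STATISTICS=MEAN MAX.
```

A maximum difference that is not close to 0 suggests that more decimals are needed in the copied coefficients, or that a coefficient was entered incorrectly.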

Note that you can run each of the COMPUTE commands in the Transform->Compute dialog box, although you may find the manipulation of the copied coefficients to be easier in a syntax window.
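
To try the Method 2 commands without an application file, they can be run on a single made-up case. The values below (AGE 40, EDLEVEL 16, SAL 30, JOBCAT 2, REGION 1) are invented for illustration and do not come from any real data. Note that DATA LIST replaces the active data set, so save any open data first.

```spss
* One made-up case, for illustration only.
DATA LIST FREE / age edlevel sal jobcat region.
BEGIN DATA
40 16 30 2 1
END DATA.
COMPUTE z = .219399*AGE + .726163*EDLEVEL - .065517*SAL - .023818*JOBCAT
  + 2.549343*(REGION=1) + 2.019601*(REGION=2) - .281382*(REGION=3) - 19.581611.
COMPUTE predprob = 1/(1 + EXP(-z)).
COMPUTE predcat = (predprob > .5).
* Show more decimals than the default display format.
FORMATS z predprob (F8.6).
LIST.
```

With the example coefficients, this case should have a Z of about 1.349 and a PREDPROB of about 0.794, so PREDCAT is 1.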

Historical Number

24831

Document Information

Modified date:
16 April 2020

UID

swg21477297