Examples for PCA
This example shows how to identify principal components in the CUSTOMER_CHURN data set, store them in the PCA model, and apply this model to additional data.
You can use the created PCA models to predict principal components for new data.
This example creates a PCA model for the customer churn data set.
The SAMPLES.CUSTOMER_CHURN table contains the data that is used. In addition, a table that contains the training data, and a table that contains the testing data are required.
The following call shows how to create these tables from the SAMPLES.CUSTOMER_CHURN table.
CALL IDAX.SPLIT_DATA('intable=samples.customer_churn, traintable=customer_churn_train, testtable=customer_churn_project_pre, id=cust_id, fraction=0.9');
The following call shows how to create a PCA model.
CALL IDAX.PCA('model=customer_churn_pca, intable=customer_churn_train, id=cust_id, coldefrole=ignore, incolumn=DURATION;AVG_SPENT_RETAIN_PM;AVG_SQ_SPENT_RETAIN_PM;ANNUAL_REVENUE_MIL;TOTAL_EMPLOYEES;TOTAL_BUY;TOTAL_BUY_FREQ;TOTAL_BUY_FREQ_SQ, centerData=true, scaleData=true, forceEigenSolve=false, saveScores=true');
Because the PCA function only supports continuous input columns, all non-numeric columns or discrete columns are filtered out. The input data is centered and scaled, and principal component computation is carried out by using singular value decomposition (SVD).
In addition, the scores for each principal component are stored in the PCA model.
The following meta tables are specified through the 'model' argument.
<model_name>_META
<model_name>_PCA
<model_name>_ATTMEAN
<model_name>_ATTSD
<model_name>_ATTSD_DIV
<model_name>_SCORES
<model_name>_SDEV
<model_name>_RCV
The PCA model comprised of these tables can be viewed using the PRINT_MODEL procedure as follows:
Result set 1
--------------
ID PC1 PC2 PC3 PC4 ...
--- ------------------------ ------------------------ ------------------------ ------------------------ ...
1] +1.65727166025521E+000 +1.40976803961328E+000 +1.05428317056634E+000 +1.00661921778098E+000 ...
1 record(s) selected.
Result set 2
--------------
ID PC1 PC2 PC3 PC4 ...
--------------------------- ------------------------ ------------------------ ------------------------ ------------------------ ...
DURATION +5.60318143412660E-001 -2.18647274716772E-002 +5.10079834520816E-002 -1.90423990300730E-001 ...
AVG_SPENT_RETAIN_PM -5.64039266932060E-001 +3.82013477352810E-003 +6.97059518304552E-002 -2.04239857563087E-001 ...
AVG_SQ_SPENT_RETAIN_PM -5.84291068949257E-001 -1.22230471272086E-002 +6.90425476061470E-002 -1.76276233237407E-001 ...
ANNUAL_REVENUE_MIL +8.88768424472973E-002 +4.41767150633714E-002 +6.06883648826900E-001 +1.62114024933152E-001 ...
TOTAL_EMPLOYEES +8.44624028753465E-002 -1.75756296874039E-002 +6.04577189935120E-001 -6.53914336812468E-001 ...
TOTAL_BUY +1.06723780559009E-001 +8.81594438912170E-002 -5.02786941873035E-001 -6.60181838993008E-001 ...
TOTAL_BUY_FREQ +8.87399480111440E-003 -7.03193685540747E-001 -3.25360238114661E-002 -2.95322362404112E-002 ...
TOTAL_BUY_FREQ_SQ -2.35085602543812E-003 -7.03451494800843E-001 -9.88680721058580E-003 -1.88240169390858E-002 ...
8 record(s) selected.
Return Status = 0
This representation of the PCA model allows for an easy comparison to a PCA model that is created in R by using the functions 'prcomp()' and 'print()'.
Another representation of the PCA model is the summary view (analog to R's function 'summary()' that looks as follows:
CALL IDAX.PRINT_PCA('model=customer_churn_pca, mode=summary');
Result set 1
--------------
MEASURE PC1 PC2 PC3 PC4 ...
--------------------------- ------------------------ ------------------------ ------------------------ ------------------------ ...
Standard deviation +1.65727166025521E+000 +1.40976803961328E+000 +1.05428317056634E+000 +1.00661921778098E+000 ...
Proportion of Variance +3.43318669485631E-001 +2.48430740689382E-001 +1.38939125467427E-001 +1.26660281200748E-001 ...
Cumulative Proportion +3.43318669485631E-001 +5.91749410175013E-001 +7.30688535642440E-001 +8.57348816843189E-001 ...
3 record(s) selected.
Return Status = 0
You can predict principal components for new data by using the PROJECT_PCA function and a previously created PCA model. The structure of the input data that is used for prediction, also known as projection for PCA) must be identical to the structure that was used to create the PCA model, that is, both structures must have the same columns.
To use the 'customer_churn_project_pre' table for projection, the following example shows how to transform this table. This step is necessary because the table 'customer_churn_project_pre' contains columns that are not reflected in the PCA model that was created in the previous step. So, before you can apply the PCA model, these columns must be filtered out.
CREATE TABLE customer_churn_project_in AS (SELECT CUST_ID,
DURATION, AVG_SPENT_RETAIN_PM, AVG_SQ_SPENT_RETAIN_PM,
ANNUAL_REVENUE_MIL, TOTAL_EMPLOYEES, TOTAL_BUY,
TOTAL_BUY_FREQ, TOTAL_BUY_FREQ_SQ FROM
customer_churn_project_pre) WITH NO DATA;
INSERT INTO customer_churn_project_in (SELECT CUST_ID,
DURATION, AVG_SPENT_RETAIN_PM, AVG_SQ_SPENT_RETAIN_PM,
ANNUAL_REVENUE_MIL, TOTAL_EMPLOYEES, TOTAL_BUY,
TOTAL_BUY_FREQ, TOTAL_BUY_FREQ_SQ FROM
customer_churn_project_pre);
DROP TABLE customer_churn_project_pre;
CALL IDAX.PROJECT_PCA('model=customer_churn_pca, intable=customer_churn_project_in, outtable=customer_churn_project_out');
In this example, the output table contains one row for each instance from the specified input data set.
The following columns show the principal components:
ID PC1 PC2 PC3 PC4 ...
-------------------- ------------------------ ------------------------ ------------------------ ------------------------ ...