MTX_PCA - Principal Component Analysis

This procedure performs a Principal Component Analysis (PCA) using data stored in a matrix.

Usage

The MTX_PCA stored procedure has the following syntax:
MTX_PCA(modelName,dataMatrixName,forceSufficientStats,centerData,scaleData,saveScores)
Parameters
modelName
The name of the created model.
Type: NVARCHAR(ANY)
dataMatrixName
The name of the matrix containing the data.
Type: NVARCHAR(ANY)
forceSufficientStats
Specifies whether the PCA should be based on a covariance matrix even if SVD can be performed.
Type: BOOLEAN
Default: FALSE
centerData
Specifies whether the model should include data centering, that is, subtraction of the mean estimator.
Type: BOOLEAN
Default: TRUE
scaleData
Specifies whether the model should include data scaling, that is, division by a non-zero standard deviation estimator. When data scaling is performed, the resulting PCA model is equivalent to a model based on the correlation matrix.
Type: BOOLEAN
Default: TRUE
saveScores
Specifies whether the PCA scores of individual observations are to be saved.
Type: BOOLEAN
Default: FALSE
Returns
BOOLEAN TRUE always.

Details

This procedure constructs a PCA model of the data and provides a corresponding transformation into principal components, which can then be applied using MTX_PCA_APPLY. Input data should be provided as Database Matrix Objects, with observations provided in rows, and attributes in columns.

The PCA can be constructed using one of two strategies: singular value decomposition (SVD) of the data matrix, which is more accurate but slower and more memory-intensive, or eigendecomposition of the unbiased covariance matrix estimator. Unless forceSufficientStats is TRUE, the procedure uses the best strategy, that is, the one providing the most accurate solution for the given data size and available memory. Depending on the centerData and scaleData parameters, the data matrix can be centered and scaled. In that case, the corresponding mean and variance estimators are calculated and become part of the model. When centering and scaling are included in the model, they are also performed during the application step.
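The covariance-based strategy can be sketched as follows. The sketch assumes, based only on the parameter name forceSufficientStats, that the covariance matrix is accumulated from sufficient statistics (the observation count, the column sums, and the sums of cross-products) in a single pass over the rows. How MTX_PCA computes it internally is not stated here, and all function names below are hypothetical.

```python
def covariance_from_sufficient_stats(rows):
    """Unbiased covariance from one pass over the observations (rows)."""
    p = len(rows[0])
    n = 0
    sums = [0.0] * p                       # sum of each attribute
    cross = [[0.0] * p for _ in range(p)]  # sums of pairwise products
    for row in rows:
        n += 1
        for i in range(p):
            sums[i] += row[i]
            for j in range(p):
                cross[i][j] += row[i] * row[j]
    # cov[i][j] = (sum(x_i * x_j) - sum(x_i) * sum(x_j) / n) / (n - 1)
    return [[(cross[i][j] - sums[i] * sums[j] / n) / (n - 1)
             for j in range(p)] for i in range(p)]


def covariance_two_pass(rows):
    """Reference computation: center first, then average the products."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[i] for r in rows) / n for i in range(p)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]


# Observations in rows, attributes in columns.
data = [[2.0, 1.0], [4.0, 3.0], [6.0, 2.0], [8.0, 6.0]]
cov_stats = covariance_from_sufficient_stats(data)
cov_direct = covariance_two_pass(data)
```

The single-pass form only needs storage proportional to the square of the number of attributes, which is consistent with the speed and memory trade-off described above. It is also more sensitive to rounding when the means are large relative to the spread, which is one reason an SVD of the data itself can be more accurate.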

Data centering (ensuring that the mean of each attribute is equal to 0) is an assumption of the PCA method; failing to meet it usually causes serious model degradation. Data scaling (ensuring that the variance of each attribute is equal to 1) usually provides a better approximation of the data when attributes differ by orders of magnitude. It is equivalent to performing the PCA using the correlation matrix instead of the covariance matrix.
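The equivalence between scaling and the correlation matrix can be checked numerically. The following is a minimal sketch in plain Python, not the procedure's implementation. The function names and sample data are hypothetical, and the treatment of constant attributes (a divisor of 1) is an assumption based on the description of the ATTSD_DIV matrix.

```python
import math


def column_means(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def covariance(rows):
    """Unbiased covariance matrix; observations in rows, attributes in columns."""
    n, p = len(rows), len(rows[0])
    m = column_means(rows)
    return [[sum((r[i] - m[i]) * (r[j] - m[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]


def standardize(rows):
    """Center each attribute, then divide it by its standard deviation."""
    p = len(rows[0])
    m = column_means(rows)
    cov = covariance(rows)
    sd = [math.sqrt(cov[j][j]) for j in range(p)]
    # Reciprocal of each non-zero standard deviation, or 1 for a constant attribute.
    div = [1.0 / s if s > 0.0 else 1.0 for s in sd]
    return [[(r[j] - m[j]) * div[j] for j in range(p)] for r in rows]


def correlation(rows):
    """Correlation matrix; assumes no attribute is constant."""
    cov = covariance(rows)
    p = len(cov)
    return [[cov[i][j] / math.sqrt(cov[i][i] * cov[j][j]) for j in range(p)]
            for i in range(p)]


# Two attributes that differ by three orders of magnitude.
data = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0], [4.0, 6000.0]]
cov_of_scaled = covariance(standardize(data))
corr_of_raw = correlation(data)
```

Because the covariance matrix of the standardized data equals the correlation matrix of the raw data, the eigenvectors found from either one are the same.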

To express the model being created, the procedure creates a set of matrices, using the modelName parameter as the prefix of each matrix name. The set consists of the following matrices:
  • {prefix}_PCA_ATTMEAN: row vector containing mean values of the attributes (when centerData is TRUE)
  • {prefix}_PCA_ATTSD: row vector containing standard deviations of the attributes (when scaleData is TRUE)
  • {prefix}_PCA_ATTSD_DIV: row vector containing reciprocals of non-zero standard deviations of the attributes or value 1 (when scaleData is TRUE)
  • {prefix}_PCA_SDEV: row vector containing standard deviations of the principal components
  • {prefix}_PCA: the matrix of loadings (a matrix whose columns contain the eigenvectors of the covariance matrix)
  • {prefix}_PCA_SCORES: the matrix of scores containing projections of individual observations to principal components (when saveScores is TRUE)
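
The naming rules in the list above can be summarized in a small helper that predicts which matrices a call creates. This is an illustrative sketch only: the function pca_model_matrices is hypothetical and not part of the product, and its keyword defaults mirror the parameter defaults documented above.

```python
def pca_model_matrices(model_name, center_data=True, scale_data=True,
                       save_scores=False):
    """Return the matrix names expected for a model, per the list above."""
    names = []
    if center_data:
        names.append(model_name + "_PCA_ATTMEAN")
    if scale_data:
        names.append(model_name + "_PCA_ATTSD")
        names.append(model_name + "_PCA_ATTSD_DIV")
    # These two matrices are listed without any condition.
    names.append(model_name + "_PCA_SDEV")
    names.append(model_name + "_PCA")
    if save_scores:
        names.append(model_name + "_PCA_SCORES")
    return names


# Same flags as the call in the Examples section:
# centerData FALSE, scaleData FALSE, saveScores TRUE.
example_names = sorted(pca_model_matrices("PCA_TEST_MOD", False, False, True))
```

For those flags the helper yields PCA_TEST_MOD_PCA, PCA_TEST_MOD_PCA_SCORES, and PCA_TEST_MOD_PCA_SDEV, which are the model matrices that appear in the LIST_MATRICES output of the example.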

Examples

call nzm..shape('1,2,3,4,5,6,7,8,9',1,3,'PCA_TEST'); 
call nzm..shape('9,8,7,6,5,4,3,2,1',10,1,'PCA_TEST_SOURCE_PRE');
---expected value is 0.0
call nzm..SCALAR_OPERATION('PCA_TEST_SOURCE_PRE',
 'PCA_TEST_SOURCE', '-', 0.5);
call nzm..gemm('PCA_TEST_SOURCE','PCA_TEST', 'PCA_TEST_VALS');
call nzm..mtx_pca('PCA_TEST_MOD','PCA_TEST_VALS',FALSE,FALSE,
 FALSE, TRUE);
call nzm..list_matrices();
---std dev in each direction (in this example the real value of all
---components other than the first one should be 0)
call nzm..print('PCA_TEST_MOD_PCA_SDEV'); 
call nzm..print('PCA_TEST_MOD_PCA_SCORES');
---projecting on the original value (first column)
call nzm..gemm_large('PCA_TEST_VALS',FALSE,'PCA_TEST_MOD_PCA',
 FALSE,'PCA_TEST_PROJ');
---resulting value (first column of PCA_TEST_PROJ) is
---proportional to original one (PCA_TEST_VALS):
---PCA_TEST_PROJ[1,] ~~PCA_TEST_SOURCE*sqrt(nzm..red_ssq('PCA_TEST'))
call nzm..delete_matrix_like('PCA\_TEST%');

 SHAPE
-------
 t
(1 row)

 SHAPE
-------
 t
(1 row)

 SCALAR_OPERATION
------------------
 t
(1 row)

 GEMM
------
 t
(1 row)

 MTX_PCA
---------
 t
(1 row)

                   LIST_MATRICES
----------------------------------------------------
 PCA_TEST
 PCA_TEST_MOD_PCA
 PCA_TEST_MOD_PCA_SCORES
 PCA_TEST_MOD_PCA_SDEV
 PCA_TEST_SOURCE
 PCA_TEST_SOURCE_PRE
 PCA_TEST_VALS
(1 row)

                      PRINT
----------------------------------------------------
-- matrix: PCA_TEST_MOD_PCA_SDEV --
 22.118368434905
 2.4603199788269e-16
 9.9446202776076e-17
(1 row)

                                     PRINT
-----------------------------------------------------------------------------------------
 -- matrix: PCA_TEST_MOD_PCA_SCORES --
 31.804087787578, -1.4567015001103e-16, 1.2617124776816e-16
 -28.062430400805, -4.1645192033268e-17, -2.6226789760277e-16 
 -24.320773014031, -3.6092499762165e-17, 3.1261282628732e-17 
 -20.579115627257, -3.0539807491063e-17, 2.6451854532004e-17 
 -16.837458240483, -2.4987115219961e-17, 2.1642426435276e-17 
 -13.095800853709, 7.1866157069922e-16, 1.6832998338548e-17 
 -9.3541434669349, -1.3881730677756e-17, 1.202357024182e-17 
 -5.6124860801609, -8.3290384066535e-18, 7.2141421450919e-18 
 -1.870828693387, -2.7763461355512e-18, 2.404714048364e-18
 -31.804087787578, -4.719788430437e-17, 4.0880138822188e-17
(1 row)

 GEMM_LARGE
------------
 t
(1 row)

 DELETE_MATRIX_LIKE
--------------------
 t
(1 row)
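
The magnitudes in the output above can be reproduced with a short sketch in plain Python. It rests on two assumptions that are not stated in this section: that SHAPE reuses the supplied values to fill the requested 10x1 matrix, so the tenth element is 9, and that PCA_TEST is filled with the row 1, 2, 3. Because PCA_TEST_VALS is then the product of a column and a row, it has rank 1, and its only non-trivial principal direction is the normalized row. The sign of a principal component is arbitrary, so only absolute values are compared.

```python
import math

# Assumed contents of the example matrices (see the assumptions above).
weights = [1.0, 2.0, 3.0]                                    # PCA_TEST, 1x3
source = [v - 0.5 for v in [9, 8, 7, 6, 5, 4, 3, 2, 1, 9]]   # PCA_TEST_SOURCE, 10x1

# gemm: (10x1) * (1x3) gives a 10x3 matrix of rank 1.
vals = [[s * w for w in weights] for s in source]

# sqrt of the sum of squares of PCA_TEST, that is, sqrt(14).
norm = math.sqrt(sum(w * w for w in weights))
direction = [w / norm for w in weights]    # first principal direction

# Scores: projection of each observation onto the first direction.
# No centering or scaling, matching centerData FALSE and scaleData FALSE.
scores = [sum(x * d for x, d in zip(row, direction)) for row in vals]

# Standard deviation of the first component with an n - 1 divisor.
sdev = math.sqrt(sum(s * s for s in scores) / (len(scores) - 1))
```

Under these assumptions, the largest score has magnitude 8.5 * sqrt(14), about 31.804, and sdev is about 22.118, which agree with the first column of PCA_TEST_MOD_PCA_SCORES and the first entry of PCA_TEST_MOD_PCA_SDEV. The remaining components in the printed output are on the order of 1e-16, that is, zero up to rounding.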