MTX_PCA - Principal Component Analysis
This procedure performs a Principal Component Analysis (PCA) using data stored in a matrix.
Usage
- MTX_PCA(modelName,dataMatrixName,forceSufficientStats,centerData,scaleData,saveScores)
- Parameters
- modelName
- The name of the created model.
- dataMatrixName
- The name of the matrix containing the data.
- forceSufficientStats
- Specifies whether the PCA should be based on a covariance matrix even if SVD can be performed.
- centerData
- Specifies whether the model should include data centering, that is, subtraction of the mean estimator.
- scaleData
- Specifies whether the model should include data scaling, which is division by a non-zero standard deviation estimator. When data scaling is performed the resulting PCA model is equivalent to a model based on the correlation matrix.
- saveScores
- Specifies whether the PCA scores of individual observations are to be saved.
Details
This procedure constructs a PCA model of the data and provides a corresponding transformation into principal components, which can then be applied using MTX_PCA_APPLY. Input data should be provided as Database Matrix Objects, with observations provided in rows, and attributes in columns.
The PCA can be constructed using one of two strategies: singular value decomposition (SVD), which is more accurate but slower and more memory-intensive, or eigendecomposition of the unbiased covariance matrix estimator. Unless the parameter forceSufficientStats is TRUE, the procedure uses the strategy that provides the most accurate solution for the given data size and available memory. Depending on the specified parameters, the data matrix can be centered and scaled. In that case the corresponding parameters (the mean and variance estimators) are calculated and become part of the model. When included in the model, centering and scaling are also performed during the application step.
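The covariance-based strategy can be illustrated outside the database. The following is a minimal Python sketch, not the procedure's implementation: the function name covariance_pca_2d is hypothetical, the data is restricted to two attributes so that the eigenvalues have a closed form, and the data is always centered. The square roots of the eigenvalues are the standard deviations of the principal components, and the eigenvectors are the loadings.

```python
import math

def covariance_pca_2d(rows):
    """Return (sdev, loadings) for two-attribute data (illustrative only)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    # Unbiased covariance estimator: divide by n - 1, not n.
    sxx = sum((x - mean_x) ** 2 for x, _ in rows) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in rows) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    # Closed-form eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2.0
    radius = math.hypot((sxx - syy) / 2.0, sxy)
    eigenvalues = [half_trace + radius, max(half_trace - radius, 0.0)]
    # Angle of the leading eigenvector relative to the first attribute axis.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    loadings = [(math.cos(theta), math.sin(theta)),
                (-math.sin(theta), math.cos(theta))]
    sdev = [math.sqrt(ev) for ev in eigenvalues]
    return sdev, loadings

# Three points on the line y = x: all variance lies along the direction (1, 1).
sdev, loadings = covariance_pca_2d([(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)])
print(sdev)         # first value is sqrt(2), second is 0
print(loadings[0])  # both coordinates are equal (about 0.7071)
```

An SVD of the centered data would give the same directions, with singular values equal to these standard deviations multiplied by sqrt(n - 1).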
Data centering (ensuring that the mean of each attribute is equal to 0) is an assumption of the PCA method; failing to meet it usually causes serious model degradation. Data scaling (ensuring that the variance of each attribute is equal to 1) usually provides a better approximation of the data when attributes differ by orders of magnitude. It is equivalent to performing the PCA on the correlation matrix instead of the covariance matrix.
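The equivalence between scaling and the correlation matrix can be checked numerically. The sketch below is plain Python with hypothetical helper names and made-up sample data whose two attributes differ by orders of magnitude. It assumes the unbiased (n - 1) estimators and divides by 1 when a standard deviation is zero, mirroring the description of the ATTSD_DIV vector.

```python
import math

def column_stats(rows):
    """Per-attribute mean and unbiased standard deviation."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(p)]
    return means, sds

def covariance(rows):
    """Unbiased covariance matrix of the attributes."""
    n, p = len(rows), len(rows[0])
    means, _ = column_stats(rows)
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]

def standardize(rows):
    """Center each attribute, then scale it by 1/sd (or by 1 if sd is zero)."""
    means, sds = column_stats(rows)
    divs = [1.0 / sd if sd > 0.0 else 1.0 for sd in sds]
    return [[(r[j] - means[j]) * divs[j] for j in range(len(r))] for r in rows]

def correlation(rows):
    """Correlation matrix; assumes every attribute has non-zero variance."""
    cov = covariance(rows)
    _, sds = column_stats(rows)
    p = len(sds)
    return [[cov[i][j] / (sds[i] * sds[j]) for j in range(p)] for i in range(p)]

# Hypothetical data: the second attribute is about 1000 times larger.
data = [[1.0, 1200.0], [2.0, 900.0], [3.0, 1500.0], [4.0, 2100.0]]
cov_of_scaled = covariance(standardize(data))
corr_of_raw = correlation(data)
max_gap = max(abs(a - b)
              for row_a, row_b in zip(cov_of_scaled, corr_of_raw)
              for a, b in zip(row_a, row_b))
print(max_gap)  # differs from 0 only by floating-point rounding
```

Because the two matrices coincide, their eigenvectors coincide as well, so a PCA of the centered and scaled data matches a PCA based on the correlation matrix.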
The model is stored in the following matrices, where {prefix} is the value of modelName:
- {prefix}_PCA_ATTMEAN: row vector containing mean values of the attributes (when centerData is TRUE)
- {prefix}_PCA_ATTSD: row vector containing standard deviations of the attributes (when scaleData is TRUE)
- {prefix}_PCA_ATTSD_DIV: row vector containing reciprocals of non-zero standard deviations of the attributes or value 1 (when scaleData is TRUE)
- {prefix}_PCA_SDEV: row vector containing standard deviations of the principal components
- {prefix}_PCA: the matrix of loadings (a matrix whose columns contain the eigenvectors of the covariance matrix)
- {prefix}_PCA_SCORES: the matrix of scores containing projections of individual observations to principal components (when saveScores is TRUE)
Examples
call nzm..shape('1,2,3,4,5,6,7,8,9',1,3,'PCA_TEST');
call nzm..shape('9,8,7,6,5,4,3,2,1',10,1,'PCA_TEST_SOURCE_PRE');
---expected value is 0.0
call nzm..SCALAR_OPERATION('PCA_TEST_SOURCE_PRE',
'PCA_TEST_SOURCE', '-', 0.5);
call nzm..gemm('PCA_TEST_SOURCE','PCA_TEST', 'PCA_TEST_VALS');
call nzm..mtx_pca('PCA_TEST_MOD','PCA_TEST_VALS',FALSE,FALSE,
FALSE, TRUE);
call nzm..list_matrices();
---std dev in each direction (in this example the true value of all
---components other than the first one should be 0)
call nzm..print('PCA_TEST_MOD_PCA_SDEV');
call nzm..print('PCA_TEST_MOD_PCA_SCORES');
---projecting on the original value (first column)
call nzm..gemm_large('PCA_TEST_VALS',FALSE,'PCA_TEST_MOD_PCA',
FALSE,'PCA_TEST_PROJ');
---resulting value (first column of PCA_TEST_PROJ) is
---proportional to the original one (PCA_TEST_VALS):
---PCA_TEST_PROJ[1,] ~~PCA_TEST_SOURCE*sqrt(nzm..red_ssq('PCA_TEST'))
call nzm..delete_matrix_like('PCA\_TEST%');
SHAPE
-------
t
(1 row)
SHAPE
-------
t
(1 row)
SCALAR_OPERATION
------------------
t
(1 row)
GEMM
------
t
(1 row)
MTX_PCA
---------
t
(1 row)
LIST_MATRICES
----------------------------------------------------
PCA_TEST
PCA_TEST_MOD_PCA
PCA_TEST_MOD_PCA_SCORES
PCA_TEST_MOD_PCA_SDEV
PCA_TEST_SOURCE
PCA_TEST_SOURCE_PRE
PCA_TEST_VALS
(1 row)
PRINT
----------------------------------------------------
-- matrix: PCA_TEST_MOD_PCA_SDEV --
22.118368434905
2.4603199788269e-16
9.9446202776076e-17
(1 row)
PRINT
-----------------------------------------------------------------------------------------
-- matrix: PCA_TEST_MOD_PCA_SCORES --
31.804087787578, -1.4567015001103e-16, 1.2617124776816e-16
-28.062430400805, -4.1645192033268e-17, -2.6226789760277e-16
-24.320773014031, -3.6092499762165e-17, 3.1261282628732e-17
-20.579115627257, -3.0539807491063e-17, 2.6451854532004e-17
-16.837458240483, -2.4987115219961e-17, 2.1642426435276e-17
-13.095800853709, 7.1866157069922e-16, 1.6832998338548e-17
-9.3541434669349, -1.3881730677756e-17, 1.202357024182e-17
-5.6124860801609, -8.3290384066535e-18, 7.2141421450919e-18
-1.870828693387, -2.7763461355512e-18, 2.404714048364e-18
-31.804087787578, -4.719788430437e-17, 4.0880138822188e-17
(1 row)
GEMM_LARGE
------------
t
(1 row)
DELETE_MATRIX_LIKE
--------------------
t
(1 row)
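The printed values can be cross-checked by hand. The following Python sketch rebuilds the rank-one input of the example and recomputes the first component without centering or scaling. It is an illustration under two assumptions, not a description of the procedure's internals: that shape recycles the nine listed values to fill the ten rows (which the printed scores suggest), and that the standard deviation uses the n - 1 divisor. Only magnitudes are compared, because the sign of a principal component is arbitrary.

```python
import math

direction = [1.0, 2.0, 3.0]                                   # PCA_TEST (1 x 3)
source = [v - 0.5 for v in (9, 8, 7, 6, 5, 4, 3, 2, 1, 9)]    # PCA_TEST_SOURCE (10 x 1)
vals = [[s * d for d in direction] for s in source]           # PCA_TEST_VALS (10 x 3)

# Every row is a multiple of `direction`, so the only principal direction
# with non-zero variance is `direction` normalized to unit length.
norm = math.sqrt(sum(d * d for d in direction))               # sqrt(14)
loading = [d / norm for d in direction]

# Scores are the projections of the rows onto the loading vector.
scores = [sum(x * w for x, w in zip(row, loading)) for row in vals]

# Uncentered standard deviation with the n - 1 divisor (centerData is FALSE).
sdev = math.sqrt(sum(s * s for s in scores) / (len(scores) - 1))

print(sdev)                      # compare with the first entry of PCA_TEST_MOD_PCA_SDEV
print([abs(s) for s in scores])  # compare with the magnitudes of the first scores column
```

Each score magnitude equals the corresponding source value multiplied by sqrt(14), which is the proportionality stated in the comment before the cleanup call.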