Question about how the function pca() calculates the covariance matrix internally
I was puzzled by the output of pca() when using mean centering or not. I am using Matlab 2024a.
pca.m uses the internal function c = ncnancov(x,Rows,centered) which seems to provide the covariance matrix of x
however,
1) it uses the formula for the population covariance, i.e. it calculates x'*x/n not x'*x/(n-1) - what is the rationale behind that?
2) it does not mean center x. This is surprising because without mean centering x the formula x'*x/n (or x'*x/(n-1) for that matter) does NOT provide the covariance matrix
The second point causes the call [coeff,score,latent]=pca(D, 'Algorithm','eig','Centered','off') to produce different coeff and latent from the call [coeff,score,latent]=pca(D, 'Algorithm','eig'). The scores will obviously be different, but coeff and latent should not be affected by mean centering, as can be shown by comparing the output of:
load('Data_Table8p1.mat');
Dm = D-mean(D);
[coeff,eigValues] = eig(cov(D));
[eigValues, idx] = sort(diag(eigValues), 'descend'); % sort
coeff = coeff(:, idx);
score = D/coeff'; % get scores of the original (not mean centered) data
with:
[coeff_m,eigValues_m] = eig(cov(Dm));
[eigValues_m, idx] = sort(diag(eigValues_m), 'descend'); % sort
coeff_m = coeff_m(:, idx);
score_m = Dm/coeff_m'; % get scores of mean centered data
Probably I am missing something, but the internal function ncnancov() as used in pca is unclear to me. Any explanation is much appreciated!
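The comparison above can be checked without the data file. The following is a minimal sketch, assuming a synthetic matrix in place of Data_Table8p1.mat (the random data and the names C_cov, C_cov_m, C_centered and C_raw are illustrative, not from the original post). It suggests that cov() gives the same matrix for D and Dm, whereas the raw product D'*D/(n-1) only matches cov(D) once the data are mean centered.

```matlab
% Synthetic stand-in for D (assumption: not the data from Data_Table8p1.mat)
rng(0);                               % reproducible random numbers
D  = randn(20, 3) + [5 -2 10];        % 20 observations, 3 variables, nonzero column means
Dm = D - mean(D);                     % mean-centered copy
n  = size(D, 1);

C_cov      = cov(D);                  % cov() subtracts the column means internally
C_cov_m    = cov(Dm);                 % so centering beforehand changes nothing
C_centered = (Dm' * Dm) / (n - 1);    % explicit sample covariance of centered data
C_raw      = (D'  * D ) / (n - 1);    % uncentered product: not a covariance matrix

% Largest absolute differences between the matrices
fprintf('max |cov(D) - cov(Dm)|       = %g\n', max(abs(C_cov - C_cov_m),    [], 'all'));
fprintf('max |cov(D) - Dm''*Dm/(n-1)| = %g\n', max(abs(C_cov - C_centered), [], 'all'));
fprintf('max |cov(D) - D''*D/(n-1)|   = %g\n', max(abs(C_cov - C_raw),      [], 'all'));
```

The first two differences should be at the level of rounding error, while the third is large whenever the column means are far from zero.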
Answers (1)
Divyam
on 18 Jul 2024
Hi Florian, the "pca" and the "cov" functions perform "mean centering" by default as mentioned here:
- https://www.mathworks.com/help/stats/pca.html#:~:text=Description-,on,-Default.%20pca
- https://www.mathworks.com/help/matlab/ref/cov.html#:~:text=is%20defined%20as-,cov,),-where%20%CE%BCA
The example in the question yields identical coefficients ("coeff" and "coeff_m") because "cov" mean centers its input by default, so cov(D) and cov(Dm) are the same matrix. To illustrate this, I wrote code that computes the covariance without mean centering and ran it on your data; the coefficients are different in this scenario. The code is added below for your reference:
% Not using the "cov" function
[N,M] = size(D);
cov_matrix = (1/(N-1)) * (D' * D);
[coeffFinal, eigValuesFinal] = eig(cov_matrix);
[eigValuesFinal, idx] = sort(diag(eigValuesFinal), 'descend');
coeffFinal = coeffFinal(:, idx);
Here is the output of the code:

4 comments
Florian Meirer
on 18 Jul 2024
Edited: Florian Meirer
on 18 Jul 2024
Hi @Florian Meirer, you are correct in stating that the covariance matrix is the same regardless of mean centering. If you want the results for the eigenvalues, make sure to switch the algorithm used by 'pca' to 'eig': https://in.mathworks.com/help/stats/pca.html#:~:text=Algorithm%20%E2%80%94%20Principal%20component%20algorithm
The data is mean centered in PCA because PCA is used as a regression model with no intercepts (the regression line passes through the origin), which works well when the data is mean centered (data points lie around the origin) and saves us from misleading assertions.
When PCA is performed without mean centering, the eigenvectors are calculated for the uncentered matrix x'*x/n. This is not the norm, but it is not incorrect either: for sparse samples or time series data, turning off mean centering can be useful, because mean centering the data can cause a loss of structure or trends. Hence, there is no mathematical incorrectness in how "pca" computes its result; it depends on the use case.
Using the population formula (dividing by n rather than n-1) merely leads to smaller values. The eigenvectors generated are still orthogonal. To use the sample covariance matrix for non mean centered data, copy the "pca.m" file, paste it into a new script, and edit the "ncnancov" function. It is perfectly fine to define custom "pca" functions as long as they suit the use case and produce statistically correct assertions for the data.
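The effect of the divisor can be illustrated with a short sketch. It assumes synthetic, uncentered data (the matrix X and all variable names are hypothetical) and uses only eig(), not pca() itself. It suggests that dividing by n instead of n-1 rescales every eigenvalue by the same factor (n-1)/n and leaves the eigenvectors unchanged up to sign.

```matlab
rng(1);                               % reproducible random numbers
X = randn(15, 4) + [3 0 -1 2];        % hypothetical uncentered data, 15 x 4
n = size(X, 1);

S_pop    = (X' * X) / n;              % divisor n   (population-style formula)
S_sample = (X' * X) / (n - 1);        % divisor n-1 (sample-style formula)

[V_pop, L_pop]       = eig(S_pop);
[V_sample, L_sample] = eig(S_sample);

% Sort both decompositions by descending eigenvalue
[l_pop, i1]    = sort(diag(L_pop), 'descend');
[l_sample, i2] = sort(diag(L_sample), 'descend');
V_pop    = V_pop(:, i1);
V_sample = V_sample(:, i2);

ratio     = l_pop ./ l_sample;                 % each entry should be close to (n-1)/n
alignment = abs(sum(V_pop .* V_sample, 1));    % |cosine| between matching eigenvectors

disp(ratio');
disp(alignment);
```

Because the two matrices differ only by a scalar factor, the choice of divisor affects the reported variances but not the principal directions.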
Florian Meirer
on 19 Jul 2024
Edited: Florian Meirer
on 19 Jul 2024
Divyam
on 22 Jul 2024
Hi @Florian Meirer, the data set used for PCA is very small and sparse (as is evident in your plot), so using the population covariance matrix is not helpful here. You are correct to use the sample covariance matrix. For this specific case, running "pca" with mean centering will unequivocally lead to correct results. In the code you will find that when mean centering is turned on, the sample covariance matrix is used to compute the results, which is exactly what you are doing in your non-"pca" code.
% In "ncnancov"
% Line 542
d = d + centered; % Here d becomes 1 when mean centering is on
% Line 551
c = x'*x/(n-d) % This becomes the result of sample covariance matrix
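The divisor logic quoted above can be mimicked in a few lines. This is only a sketch under stated assumptions, not the actual "ncnancov" implementation: it assumes that d starts at 0, that the data are centered before the product is formed when centering is on, and it ignores NaN handling. The data X and the names c_on and c_off are hypothetical.

```matlab
rng(2);                               % reproducible random numbers
X = randn(12, 3) + [4 1 -3];          % hypothetical data with nonzero column means
n = size(X, 1);

% Mean centering on: divisor becomes n-1
centered = true;
x    = X - mean(X, 1);                % assumption: centering happens before the product
d    = 0 + centered;                  % assumption: d starts at 0, so d = 1 here
c_on = (x' * x) / (n - d);            % should match the sample covariance cov(X)

% Mean centering off: divisor stays n
centered = false;
x     = X;                            % data used as is
d     = 0 + centered;                 % d = 0 here
c_off = (x' * x) / (n - d);           % uncentered product divided by n

fprintf('max |c_on - cov(X)|   = %g\n', max(abs(c_on - cov(X)), [], 'all'));
fprintf('max |c_off - X''*X/n| = %g\n', max(abs(c_off - (X' * X) / n), [], 'all'));
```

Under these assumptions the centered branch reproduces cov(X), while the uncentered branch is the x'*x/n matrix discussed in the question.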