question about how the function pca() calculates the covariance matrix internally

I was puzzled by the output of pca() when using mean centering or not. I am using Matlab 2024a.
pca.m uses the internal function c = ncnancov(x,Rows,centered) which seems to provide the covariance matrix of x
however,
1) it uses the formula for the population covariance, i.e. it calculates x'*x/n not x'*x/(n-1) - what is the rationale behind that?
2) it does not mean center x. This is surprising because without mean centering x the formula x'*x/n (or x'*x/(n-1) for that matter) does NOT provide the covariance matrix
The second point causes the call [coeff,score,latent]=pca(D, 'Algorithm','eig’,'Centered','off') to produce different coeff, and latent from the call [coeff,score,latent]=pca(D, 'Algorithm','eig’). The scores will obviosuly be different but coeff and latent should not be affected by mean centering as can be shown by comparing the output of:
load('Data_Table8p1.mat');
Dm = D-mean(D);
[coeff,eigValues] = eig(cov(D));
[eigValues, idx] = sort(diag(eigValues), 'descend'); % sort
coeff = coeff(:, idx);
score = D/coeff'; % get scores of mean centered data
with:
[coeff_m,eigValues_m] = eig(cov(Dm));
[eigValues_m, idx] = sort(diag(eigValues_m), 'descend'); % sort
coeff_m = coeff_m(:, idx);
score_m = Dm/coeff_m'; % get scores of mean centered data
Probably I am missing something, but the internal function ncnancov() as used in pca is unclear to me. Any explanation is much appreciated!

Réponses (1)

Hi Florian, the "pca" and the "cov" functions perform "mean centering" by default as mentioned here:
The example in the question leads to the same coefficients since both the "cov" calls return the same "coeff" and "coeff_m" as the data "D" is being mean centered by default. To illustrate this, I have written a code for calculating the covariance without mean centering and ran it on your data, the coefficients are different in this scenario. The code is added below for your reference:
% Not using the "cov" function
[N,M] = size(D);
cov_matrix = (1/(N-1)) * (D' * D);
[coeffFinal, eigValuesFinal] = eig(cov_matrix);
[eigValuesFinal, idx] = sort(diag(eigValuesFinal), 'descend');
coeffFinal = coeffFinal(:, idx);
Here is the output of the code:

4 commentaires

Hi Divyam, many thanks for your answer and your efforts! However, unfortunately it does not answer my question.
I know that cov() mean centers by default, and this is correct because otherwise the formula it uses, which is X'*X/(n-1), would be wrong. Mathematically, the covariance matrix of a mean centered and non-mean-centered matrix is the same.
However, pca() has the option to turn off mean centering (see https://nl.mathworks.com/help/stats/pca.html#namevaluepairarguments ), so, it is mean centering by default but that can be turned off (and that option is good, because some applications need that). However, if that is done, pca() uses the internal function ncnancov() to determine the covariance matrix, and that function seemingly does NOT mean center the data by itself before calculating the covariance matrix using X'*X/n. And that is, as far as I can see it, incorrect mathematically.
Maybe to clarify: in the case that ('Centered' is set to 'off') pca() then produces wrong results, and because ncnancov() also uses X'*X/n and not X'*X/(n-1) in that case, it also produces an inconsistent (incorrect?) latent variable even when feeding already mean centered data when compared to the results of eig(cov()), which should give the same results. Please try my example using pca() as I write it above to see what I mean.
As I wrote I am puzzled by this, because if it is true, pca() should never be used without the default mean centering, but then the option 'Centered' = 'off' should be removed. Which would be a pity because it is useful e.g. in multivariate calibration.
Divyam
Divyam le 19 Juil 2024
Modifié(e) : Divyam le 19 Juil 2024
Hi @Florian Meirer, you are correct in stating that the covariance matrix is same regardless of mean centering. Ensure that when you want the results for eigenvalues, switch the algorithm for running 'pca' to 'eig': https://in.mathworks.com/help/stats/pca.html#:~:text=Algorithm%20%E2%80%94%20Principal%20component%20algorithm
The data is mean centered in PCA because PCA is used as a regression model with no intercepts (the regression line passes through the origin), which works well when the data is mean centered (data points lie around the origin) and saves us from misleading assertions.
When PCA is performed without mean centering, the eigenvectors are being calculated for . This is not the norm but is also not incorrect as for sparse samples or time series data turning off mean centering is useful. This is because mean centering the data can cause a loss of structures or trends. Hence, there is no mathemical incorrectness in how "pca" computes, it is up to the use case.
Using the result for population covariance merely leads to a smaller result value. The eigen-vectors generated are still orthogonal. To use the sample covariance matrix for non mean centered data copy the "pca.m" file, paste it in new script and edit the "ncnancov" function. It is perfectly fine to define custom "pca" functions as long as it suits the use case and produces statistically correct assertions for the data.
many thanks for your reply - this is really useful and interesting and you are fully correct that eig(X'*X/(N-1)) (or eig(X'*X/N)) provides an orthonormal basis set that can be used to express the data. However, this is not the correct way of doing PCA, because then it is not guaranteed that the eigenvectors describe the data set in terms of its variance, which is the whole point of doing PCA. I have pasted code below you can run with the data set I provided (use only 2 dimensions of the data set to see the 2D plot, but you can use any two dimensions). There you can see that using the eigenvalue decomposition of X'*X/N leads to completely incorrect results, that is, a rotation of the basis set that does not describe the data in terms of its largest, second largest, third largest, etc variance and in some cases it even changes PC1 into PC2 and vice versa. The eigenvalues used to determine the CVE are extremly different, suggesting, for example, that the data set would be very sufficiently described by only 1 PC while actually 2 are needed to achieve that level of CVE. That is, because these eigenvalues do not describe the data set in terms of its variance.
Please let me know what you think, I really appreciate this conversation!
To run the code:
testPCA(D(:,2:3));
or testPCA(D(:,[1 3])) or testPCA(D(:,1:2)) etc
function testPCA(D)
% use only 2D data for th 2D plot
% 1) Using the covariance matrix
[m,~]=size(D);
[eigVect,EV] = eig(cov(D));
[EV, idx] = sort(diag(EV), 'descend');
eigVect = eigVect(:, idx);
% 2) Using the way pca() determines the eigenvectors and eigenvalues:
[eigVect1,EV1] = eig(D'*D/m);
[EV1, idx] = sort(diag(EV1), 'descend');
eigVect1 = eigVect1(:, idx);
% plot the eigenvectors and the data:
mD = mean(D);
x_0 = mD(1,1);
y_0 = mD(1,2);
d = sqrt(EV);
d1 = sqrt(EV1);
figure();
subplot(1,2,1);
plot(D(:,1),D(:,2),'o');
hold on;
quiver(x_0,y_0,eigVect(1,1),eigVect(2,1),d(1),'k','LineWidth',5);
quiver(x_0,y_0,eigVect(1,2),eigVect(2,2),d(2),'r','LineWidth',5);
title('using eig(cov(D))');
hold off;
axis equal;
subplot(1,2,2);
plot(D(:,1),D(:,2),'o');
hold on;
quiver(x_0,y_0,eigVect1(1,1),eigVect1(2,1),d1(1),'k','LineWidth',5);
quiver(x_0,y_0,eigVect1(1,2),eigVect1(2,2),d1(2),'r','LineWidth',5);
title('using eig(D''*D/m)');
hold off;
axis equal;
Hi @Florian Meirer, the data used for PCA is very small and sparse (as evident in your plot) and thus using population covariance matrix is not helpful here. You are correct in using a sample covariance matrix. For this specific case, running "pca" with mean centering will unequivocably lead to correct results. In the code you will find that when you turn mean centering on, the sample covariance matrix is used to compute the results, which is exactly what you are doing in your non 'pca' code.
% In "ncnancov"
% Line 542
d = d + centered; % Here d becomes 1 when mean centering is on
% Line 551
c = x'*x/(n-d) % This becomes the result of sample covariance matrix

Connectez-vous pour commenter.

Catégories

Produits

Version

R2024a

Commenté :

le 22 Juil 2024

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by