## How to apply PCA correctly?

### Sepp

on 12 Dec 2015
Latest activity Commented on by the cyclist

on 24 Jul 2019

Hello
I'm currently struggling with PCA in MATLAB. Let's say we have a data matrix X and a response y (a classification task). X consists of 12 rows and 4 columns: the rows are the data points and the columns are the predictors (features).
Now, I can do PCA with the following command:
[coeff, score] = pca(X);
As I understand from the MATLAB documentation, coeff contains the loadings and the columns of score contain the principal components. That means the first column of score contains the first principal component (associated with the highest variance) and the first column of coeff contains the loadings for the first principal component.
Is this correct?
But if this is correct, why then is X * coeff not equal to score?

### the cyclist

on 12 Dec 2015

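Maybe this script will help. By default, pca centers the data before computing the components, so score equals (X - mean(X))*coeff rather than X*coeff. The script de-means X first so that the two match.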
rng 'default'
M = 7; % Number of observations
N = 5; % Number of variables observed
X = rand(M,N);
% De-mean
X = bsxfun(@minus,X,mean(X));
% Do the PCA
[coeff,score,latent] = pca(X);
% Calculate eigenvalues and eigenvectors of the covariance matrix
covarianceMatrix = cov(X);
[V,D] = eig(covarianceMatrix);
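% "coeff" are the principal component vectors. These are the eigenvectors of the covariance matrix. Note that eig does
% not sort by variance (for a symmetric matrix the eigenvalues typically come out in ascending order), so the columns of
% V may appear in the reverse order of coeff, and each column may differ by a sign. Compare ...
coeff
V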
% Multiply the original data by the principal component vectors to get the projections of the original data on the
% principal component vector space. This is also the output "score". Compare ...
dataInPrincipalComponentSpace = X*coeff
score
% The columns of X*coeff are orthogonal to each other and, because X is centered, uncorrelated. This is shown by the
% (approximately) identity matrix returned by ...
corrcoef(dataInPrincipalComponentSpace)
% The variances of these vectors are the eigenvalues of the covariance matrix, and are also the output "latent". Compare
% these three outputs
var(dataInPrincipalComponentSpace)'
latent
sort(diag(D),'descend')
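
To tie the script back to the original question, here is a minimal sketch of what happens when X is not de-meaned beforehand. It assumes the Statistics and Machine Learning Toolbox and uses the 12-by-4 shape from the question. The random data and the names `Xraw`, `errRaw`, and `errCentered` are illustrative only. The sixth output of `pca`, `mu`, is the estimated column mean that `pca` subtracts when centering is on (the default).

```matlab
rng('default')
Xraw = rand(12,4) + 5;   % illustrative raw data with a clearly non-zero mean

% mu is the estimated mean of each column, which pca removes internally
[coeff,score,~,~,~,mu] = pca(Xraw);

% Projecting the raw data does not reproduce score ...
errRaw = max(max(abs(Xraw*coeff - score)));

% ... but projecting the centered data does, up to round-off
errCentered = max(max(abs(bsxfun(@minus,Xraw,mu)*coeff - score)));

fprintf('max |Xraw*coeff - score|        = %g\n', errRaw);
fprintf('max |(Xraw - mu)*coeff - score| = %g\n', errCentered);
```

The first difference should be large and the second should be at round-off level.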

### the cyclist

on 22 Mar 2019
Yes, bsxfun is a built-in function. It applies an element-wise operation to two arrays, implicitly expanding either one if necessary. In MATLAB R2016b and later, implicit expansion happens automatically, so one could replace that line with
X = X - mean(X);
Take a look at this Cross Validated answer about why centering (i.e. de-meaning) can be important.
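
As a rough illustration of how much the centering step matters to the output, `pca` accepts a `'Centered'` name-value option. The sketch below assumes the Statistics and Machine Learning Toolbox. The names `coeffU`, `scoreU`, and `errU` are made up for this example. With centering switched off, `score` is simply `X*coeff`, but the resulting components are not the same as the centered ones, so this is a different analysis rather than a shortcut.

```matlab
rng('default')
X = rand(7,5);   % deliberately NOT de-meaned

% Uncentered PCA: the data are projected as they are
[coeffU,scoreU] = pca(X,'Centered',false);
errU = max(max(abs(X*coeffU - scoreU)));   % expected to be round-off only

% Default (centered) PCA for comparison
coeffC = pca(X);

fprintf('max |X*coeffU - scoreU| = %g\n', errU);
disp('First component, uncentered vs centered:')
disp([coeffU(:,1) coeffC(:,1)])
```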

### Jaime de la Mota

on 24 Jul 2019
This is very interesting, but a question comes to mind. The coeffs are the eigenvectors and the scores are the projections of the data onto the principal component space.
Are these then equivalent to the eigenfunctions and random variables of the Karhunen-Loève expansion?

### the cyclist

on 24 Jul 2019
That's a math question, not a MATLAB question. :-)
I don't really know, but this abstract -- I did not access or read the paper itself -- suggests that KL and PCA are not strictly equivalent.

### Yaser Khojah

on 17 Apr 2019

Dear the cyclist, thanks for showing this example. I have a question regarding the order of the columns of coeff, since they differ from those of V. Is there any way to see what order these columns are in? In other words, which variables does each column correspond to?

### the cyclist

on 17 Apr 2019
Quoting from the first section of the documentation for the pca function:
"Each column of coeff contains coefficients for one principal component, and the columns are in descending order of component variance."
You can see that
var(dataInPrincipalComponentSpace)
has values in descending order.
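
To make the relation between coeff and V concrete, here is a small sketch that reorders and sign-aligns the eigenvectors from `eig` so they can be compared with `coeff` directly. It reuses the 7-by-5 random data from the script above. The names `Vsorted`, `Valigned`, and `maxDifference` are illustrative. It sorts explicitly rather than relying on the order `eig` happens to return, and it assumes the eigenvalues are distinct, so each eigenvector is defined up to sign.

```matlab
rng('default')
X = rand(7,5);
X = bsxfun(@minus,X,mean(X));   % de-mean, as in the script above

coeff = pca(X);
[V,D] = eig(cov(X));

% Put the eigenvectors in descending order of eigenvalue, like coeff
[eigenvalues,order] = sort(diag(D),'descend');
Vsorted = V(:,order);

% Eigenvectors are only defined up to sign: flip columns to match coeff
signs = sign(sum(Vsorted .* coeff));   % +1 or -1 for each column
Valigned = bsxfun(@times,Vsorted,signs);

maxDifference = max(max(abs(Valigned - coeff)));
fprintf('max |Valigned - coeff| = %g\n', maxDifference);
```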

### Yaser Khojah

on 17 Apr 2019
I understand that, but I do not see how each PC is related to the columns of the original data (X). How can I know which variables from the original data have the strongest impact?
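
One possible way to look at this, offered as a sketch rather than a definitive answer: row i of coeff corresponds to column i of X (variable i), and column j corresponds to principal component j. The entries with the largest absolute value in a column therefore show which original variables weigh most in that component. The data and the variable names in `varNames` below are hypothetical.

```matlab
rng('default')
X = rand(12,4);                     % illustrative data: 12 observations, 4 variables
varNames = {'x1','x2','x3','x4'};   % hypothetical names for the columns of X

% explained holds the percentage of total variance carried by each component
[coeff,~,~,~,explained] = pca(X);

% For each component (column), find the variable (row) with the largest absolute loading
[~,dominant] = max(abs(coeff));

for j = 1:size(coeff,2)
    fprintf('PC%d (%.1f%% of variance): largest loading on %s (%.3f)\n', ...
        j, explained(j), varNames{dominant(j)}, coeff(dominant(j),j));
end
```

Loadings depend on the scale of each variable, so if the columns of X have different units it may be worth standardizing them (for example with zscore) before calling pca.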