How to select the components that show the most variance in PCA
I have a huge data set that I need for training (32000×2500). This seems to be too much for my classifier, so I decided to do some reading on dimensionality reduction, and specifically on PCA.
From my understanding, PCA takes the current data and re-plots it on another domain/scale. These new coordinates don't mean anything by themselves, but the data is rearranged so that the first axis captures the maximum variation. Once I have these new coefficients, I can drop the ones having minimum variation.
Now I am trying to implement this in MATLAB and am having trouble with the output provided. MATLAB always considers rows as observations and columns as variables. So my input to the pca function would be my matrix of size 32000×2500. This returns the PCA coefficients in an output matrix of size 2500×2500.
The help for pca states:
Each column of coeff contains coefficients for one principal component, and the columns are in descending order of component variance.
In this output, which dimension holds the observations of my data? I mean, if I have to give this to the classifier, will the rows of coeff represent my data's observations, or is it now the columns of coeff?
And how do I remove the coefficients having the least variation, and thus effectively reduce the dimension of my data?
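For reference, a minimal sketch of how the outputs of pca are typically used to answer both questions (the variable names and the 95% threshold are illustrative, not from the thread):

```matlab
% X is the 32000-by-2500 data matrix: rows = observations, columns = variables.
[coeff, score, latent, ~, explained] = pca(X);

% score keeps the same convention as X: rows are still the observations,
% but the columns are now principal components, in descending order of
% variance. This is what you would feed to the classifier.

% Keep only the leading components that together explain, say, 95% of
% the total variance:
k = find(cumsum(explained) >= 95, 1);
Xreduced = score(:, 1:k);   % 32000-by-k, with k <= 2500
```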
Accepted Answer
More Answers (3)
naghmeh moradpoor
on 1 Jul 2017
1 vote
Dear Cyclist,
I used your code and successfully found all the PCs for my dataset. Thank you! On my dataset, PC1, PC2 and PC3 explained more than 90% of the variance. I would like to know how to find which variables from my dataset are related to PC1, PC2 and PC3.
Could you please help me with this? Regards, Ngh
1 comment
Abdul Haleem Butt
on 3 Nov 2017
dataInPrincipalComponentSpace has the same layout as the original data: each row is an observation, and each column is a dimension.
Sahil Bajaj
on 12 Feb 2019
0 votes
Dear Cyclist,
Thanks a lot for your helpful explanation. I used your code and successfully found 4 PCs explaining 97% of the variance for my dataset, which initially had 14 variables in total. I was just wondering how to find which variables from my dataset are related to PC1, PC2, PC3 and PC4, so that I can ignore the others and know which parameters I should use for further analysis.
Thanks !
Sahil
9 comments
the cyclist
on 12 Feb 2019
Edited: the cyclist on 4 Dec 2020
In general, every variable contributes to every principal component. (The m-th element of the n-th column of the variable coeff tells you what percentage of the m-th original variable is included in the n-th principal component.) For example, I have done analyses in which the first principal component was made up of approximately equal proportions of every initial variable. They were all highly correlated, and had about the same amount of impact on the total variation!
PCA can be a dimensional reduction technique, but not necessarily. It depends on what the data say, and your needs.
There are techniques that go beyond simple PCA (e.g. varimax), which apply a further "rotation" to the variables and try to do variable reduction. It looks like MATLAB has the rotatefactors command. I've never used it, so I can't advise.
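For anyone who wants to experiment, a rough, untested sketch of applying such a rotation to the leading coefficients (rotatefactors defaults to a varimax-style orthomax rotation; the choice of 3 components is illustrative):

```matlab
[coeff, score, ~, ~, explained] = pca(X);

k = 3;                                        % number of components to keep
rotatedCoeff = rotatefactors(coeff(:, 1:k));  % default rotation is varimax

% After rotation, each column of rotatedCoeff tends to have a few large
% coefficients and many near-zero ones, making it easier to associate
% each component with a subset of the original variables.
```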
Yaser Khojah
on 18 Apr 2019
Is there an answer for this question?
Which variables from my dataset are related to PC1, PC2, PC3 and PC4?
Here is the explanation of each output; each one relates to the PCs, but nothing relates back to the original data:
- coeff: each column contains the coefficients for one principal component, and the columns are in descending order of component variance
- score: Rows of score correspond to observations, and columns correspond to components.
- explained: the percentage of the total variance explained by each principal component
- latent: Principal component variances, that is the eigenvalues of the covariance matrix of X, returned as a column vector.
I have used your code, and I see that coeff and V do not match in order:
coeff =
-0.5173 0.7366 -0.1131 0.4106 0.0919
0.6256 0.1345 0.1202 0.6628 -0.3699
-0.3033 -0.6208 -0.1037 0.6252 0.3479
0.4829 0.1901 -0.5536 -0.0308 0.6506
0.1262 0.1334 0.8097 0.0179 0.5571
V =
0.0919 0.4106 -0.1131 -0.7366 -0.5173
-0.3699 0.6628 0.1202 -0.1345 0.6256
0.3479 0.6252 -0.1037 0.6208 -0.3033
0.6506 -0.0308 -0.5536 -0.1901 0.4829
0.5571 0.0179 0.8097 -0.1334 0.1262
However, dataInPrincipalComponentSpace matches score, and var(dataInPrincipalComponentSpace)' matches latent. Does that mean that the first row of latent is related to the first column of the original data? I think any new user is confused about how to relate these outputs to the original data's variables. Can you please explain? Thank you.
the cyclist
on 19 Apr 2019
Edited: the cyclist on 15 Jun 2020
Your first question
Recall that the original data is an array with M observations of N variables. There will also be N principal components. The relationship between the original data and the nth PC is
nth PC = X*coeff(:,n) % This is pseudocode, not valid MATLAB syntax.
For example, PC1 is given by
PC1 = X*coeff(:,1)
You can recover the original data from the principal components by
dataInPrincipalComponentSpace * coeff'
Your second question
The first row of latent is not related to the first column of the original data. It is related to the first principal component (which you can see is a linear combination of the original data).
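To make the pseudocode above concrete: MATLAB's pca centers the data by default, so in runnable form the relationships are (a sketch, exact up to floating-point error):

```matlab
[coeff, score, latent] = pca(X);
mu = mean(X, 1);

% The principal-component scores are the centered data times coeff:
% score == (X - mu) * coeff
scoreCheck = (X - mu) * coeff;

% The original data can be recovered by inverting the projection:
% X == score * coeff' + mu
Xrecovered = score * coeff' + mu;

% And latent is the variance of each score column:
% latent == var(score)'
```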
Salma Hassan
on 18 Sep 2019
For all the researchers above, I need your help please.
Dear @the cyclist
Regarding the answer to your first question.
Let's say I have found the eigenvalues sorted in descending order, which is the case after following your code above.
Using the eigenvectors corresponding to the sorted eigenvalues, I would like to recover the original data, but only those variables (or columns) of the original matrix that correspond to the first 3 principal vectors, for example.
Please advise on the backwards transformation.
% Can I do this? Does this correspond to the first 3,
% or to the 3 most needed, variable columns?
Xexp = dataInPrincipalComponentSpace(:, 1:3) * coeff(1:3, 1:3)' + meanX(1:3);
% Where meanX = mean(X, 1);
the cyclist
on 28 Feb 2020
Please carefully read the question asked by Sahil Bajaj in this sequence of comments, and my answer to it.
I'll quote myself here: "In general, every variable contributes to every principal component." In my example with 5 variables, if they had all been very highly correlated with each other, then all 5 of them would have contributed significantly to the first principal component. You could not eliminate any of the original variables without significant loss of information.
Referring again to the figure at the top of the wikipedia page on PCA: you can't eliminate either the x-axis variable or the y-axis variable. Instead, you choose a linear combination of them that captures the maximal variation.
And, repeating myself one more time ... there are techniques like varimax, applied after PCA, that do allow you to remove some of the original variables.
Darren Lim
on 2 Feb 2021
Thanks for answering this post; you wouldn't imagine how much time I have saved by studying your answer, so thank you!
I just picked up PCA a few days ago to solve a financial trading problem, so I am very new to PCA. Just to confirm my understanding, in the coeff example you provided:
coeff =
-0.5173 0.7366 -0.1131 0.4106 0.0919
0.6256 0.1345 0.1202 0.6628 -0.3699
-0.3033 -0.6208 -0.1037 0.6252 0.3479
0.4829 0.1901 -0.5536 -0.0308 0.6506
0.1262 0.1334 0.8097 0.0179 0.5571
Can I clarify that, for column 1, the variable with coefficient 0.6256 carries the largest "weight" in PC1? So if my Variable(2,1) is, say, the Mathematics (0.6256) subject of my 7 sample students (observations), can I say that Mathematics accounts for the largest variance among all 7 students in the whole data set (since PC1 has the highest variance and accounts for 42.2% of the entire data set)?
And if, say, Variable(1,1) is English (-0.5173), does it mean that English tends to anti-correlate with Mathematics?
...and for PC2, does Variable(1,2), English (0.7366), describe the differences among the sample students the most?
In essence, I think I roughly understand PCA at a high level; what I am not so sure about is how to interpret the output, as I think PCA is powerful but won't be useful if the output is misinterpreted. Any help interpreting coeff will be appreciated :) (My challenge is to find out which variables are useful for my trading and to eliminate unnecessary variables so that I can optimise a trading strategy.)
Thanks in advance !
the cyclist
on 2 Feb 2021
Edited: the cyclist on 2 Feb 2021
I'm happy to hear you have found my answer to be helpful.
The way you are trying to interpret the results is a little confusing to me. Using your example of school subjects, I'll try to explain how I would interpret.
Let's suppose that the original dataset variables (X) are scores on a standardized exam:
- Math (column 1)
- Writing
- History
- Art
- Science
[Sorry I changed up your subject ordering.]
Each row is one student's scores. Row 3 is the 3rd student's scores, and X(3,4) is the 3rd student's Art score.
Now we do the PCA, to see what combination of variables explains the variation among observations (i.e. students).
coeff contains the coefficients of the linear combinations of the original variables. coeff(:,1) holds the coefficients that take you from the original variables to the first new variable (the one that explains the most variation between observations):
-0.5173*Math + 0.6256*Writing -0.3033*History + 0.4829*Art + 0.1262*Science
At this point, the researcher might try to interpret these coefficients. For example, because Writing and Art are very positively weighted, maybe this variable -- which is NOT directly measured! -- is something like "Creativity".
Similarly, maybe the coefficients coeff(:,2), which weights Math very heavily, corresponds to "Logic".
And so on.
So, interpreting that single value of 0.6256, I think you can say, "Writing is the most highly weighted original variable in the new variable that explains the most variation."
But, it also seems to me that to answer a couple of your questions, you actually want to look at the original variables, and not the PCA-transformed data. If you want to know which school subject had the largest variance -- just calculate that on the original data. Similarly for the correlation between subjects.
PCA is (potentially) helpful for determining if there is some underlying variable that explains the variation among multiple variables. (For example, "Creativity" explaining variation in both Writing and Art.) But factor analysis and other techniques are more explicitly designed to find those latent factors.
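As a concrete illustration of the point above, the per-subject variances and the correlations between subjects come straight from the original data, not from the PCA outputs (a sketch, assuming X is the observations-by-subjects exam-score matrix from the example):

```matlab
% Variance of each original variable (school subject):
subjectVariance = var(X);       % 1-by-5 row vector, one entry per subject

% Pairwise correlations between subjects, e.g. Math (col 1) vs. Writing (col 2):
R = corrcoef(X);                % 5-by-5 correlation matrix
mathWritingCorr = R(1, 2);
```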
Darren Lim
on 3 Feb 2021
Crystal clear! I think many others will find this answer helpful as well. Thanks again for your insights and time!
Darren
Salma Hassan
on 18 Sep 2019
0 votes
I still do not understand.
I need an answer to my question: how many eigenvectors do I have to use?
from these figures
3 comments
the cyclist
on 19 Sep 2019
It is not a simple answer. The first value of the explained variable is about 30. That means that the first principal component explains about 30% of the total variance of all your variables. The next value of explained is 14. So, together, the first two components explain about 44% of the total variation. Is that enough? It depends on what you are trying to do. It is difficult to give generic advice on this point.
You can plot the values of explained or latent, to see how the explained variance is captured as you add each additional component. See, for example, the wikipedia article on scree plots.
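A sketch of the plots mentioned above (assuming explained comes from pca as in the earlier code):

```matlab
[~, ~, ~, ~, explained] = pca(X);

% Scree-style plot: variance explained per component, plus cumulative line:
figure;
pareto(explained);
xlabel('Principal component');
ylabel('Variance explained (%)');

% Or plot the cumulative curve directly:
figure;
plot(cumsum(explained), '-o');
xlabel('Number of components');
ylabel('Cumulative variance explained (%)');
```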
Salma Hassan
on 19 Sep 2019
If we say that the first two components, which explain about 44%, are enough for me, what does this mean for latent and coeff? How can this lead me to the number of eigenvectors?
Thanks for your interest in replying. I appreciate it.
the cyclist
on 20 Sep 2019
Edited: the cyclist on 16 Jun 2020
It means that the first two columns of coeff are the coefficients you want to use.
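In code, keeping just those first two components looks like this (a sketch using the standard pca outputs):

```matlab
[coeff, score] = pca(X);

coeff2 = coeff(:, 1:2);   % the two leading eigenvector directions
X2 = score(:, 1:2);       % the data expressed in those two components
```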