Hi Kamil,
When using clustering methods like Ward's algorithm for feature selection, the goal is to group similar features together and then select representative features from each cluster. You're right that you need to cluster the features rather than the records, which means transposing your dataset. However, as you've noticed, computing pairwise distances for a large number of features is memory-intensive: for p features, pdist produces p*(p-1)/2 distances, so memory grows quadratically with the feature count.
Here are some strategies to handle this problem and proceed with feature selection:
Strategies for Clustering Features
Dimensionality Reduction Before Clustering:
- Consider applying a dimensionality reduction technique, like Principal Component Analysis (PCA), to reduce the number of features before clustering. This can help alleviate memory issues.
- You can use the top principal components as a lower-dimensional representation of your features.
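In MATLAB, one way to realize this is to cluster the rows of the PCA coefficient matrix, since each row gives a low-dimensional representation of one feature. This is a sketch, assuming X is an n-by-p matrix (records in rows, features in columns); the component count and cluster count are placeholders you would tune:

```matlab
% Represent each feature by its loadings on the top k principal components,
% then cluster those short vectors instead of the raw n-element columns.
k = 20;                             % number of components to keep (tune this)
coeff = pca(X);                     % coeff is p-by-min(n-1,p)
featRep = coeff(:, 1:k);            % p-by-k: one k-dimensional point per feature
Z = linkage(featRep, 'ward');       % hierarchical clustering of the features
idx = cluster(Z, 'maxclust', 50);   % e.g. 50 feature clusters
```

This keeps the pairwise-distance computation at p points of dimension k, rather than p points of dimension n.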
Sample a Subset of Features:
- Randomly sample a subset of features to perform the clustering. Once you have identified clusters, you can evaluate the importance of features within those clusters on the full dataset.
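A sketch of this idea, again assuming X is n-by-p; the subset size, cluster count, and the use of cluster mean profiles as prototypes are all illustrative choices:

```matlab
% Cluster a random subset of features, then assign the remaining features
% to the nearest cluster prototype built from the sample.
p = size(X, 2);
m = 2000;                               % subset size that fits in memory
nClust = 50;
sel = randperm(p, m);
Z = linkage(X(:, sel)', 'ward');        % cluster the sampled feature vectors
idx = cluster(Z, 'maxclust', nClust);
% Mean profile (prototype) of each cluster, built from the sample
proto = zeros(size(X, 1), nClust);
for c = 1:nClust
    proto(:, c) = mean(X(:, sel(idx == c)), 2);
end
% Assign every feature (not just the sample) to its nearest prototype
[~, assignAll] = min(pdist2(X', proto'), [], 2);
```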
Incremental or Batch Processing:
- Process the data in smaller batches. Although this can be complex to implement for clustering, it might be necessary if memory constraints are severe.
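For the distance computation specifically, a block-wise approach can avoid large temporaries: fill the feature-by-feature distance matrix a few hundred columns at a time. This sketch uses 1 - |correlation| as the distance (an assumption; substitute whatever metric you need) and a placeholder block size:

```matlab
% Build the p-by-p distance matrix in column blocks so no single temporary
% holds all the intermediate results at once.
p = size(X, 2);
D = zeros(p, 'single');          % full p-by-p matrix, but in single precision
blk = 500;                       % columns per block (tune to available RAM)
for j = 1:blk:p
    cols = j:min(j + blk - 1, p);
    D(:, cols) = single(1 - abs(corr(X, X(:, cols))));
end
D = (D + D.') / 2;               % enforce exact symmetry for squareform
D(1:p+1:end) = 0;                % zero the diagonal
% Ward assumes Euclidean input, so use average linkage with this metric
Z = linkage(squareform(double(D)), 'average');
```

Note that the full p-by-p matrix must still fit in memory once; the blocking only bounds the size of the intermediate computations.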
Use Efficient Data Structures:
- Ensure that your data is stored in a memory-efficient format. MATLAB's tall arrays can help for out-of-memory data, though note that functions like pdist and linkage do not operate on tall arrays directly, so you would still need to reduce the data to an in-memory size before clustering.
Reduce Precision:
- If possible, reduce the precision of your data (e.g., use single instead of double) to roughly halve memory use.
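For example:

```matlab
% Casting to single halves memory for the data, and pdist on single input
% computes and returns the distance vector in single precision as well.
Xs = single(X);      % 4 bytes per element instead of 8
D = pdist(Xs');      % distance vector between features, in single
```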
Correcting Your Approach
Given your goal, here's how you can adjust your approach:
Transpose the Data:
- Use X' to transpose the data, so you are clustering the features instead of the records.
Compute Pairwise Distances:
- Compute the pairwise distances between features. If pdist(X') causes memory issues, reduce the number of features first using one of the strategies above.
Linkage and Clustering:
- Use the linkage function to perform hierarchical clustering on the features. Note that Ward linkage assumes Euclidean distances; if you use a different metric (e.g., correlation-based), prefer 'average' or 'complete' linkage.
Select Features:
- After clustering, select representative features from each cluster. You can choose features that are closest to the centroid of each cluster or use domain knowledge to select features.
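Putting these four steps together, here is a minimal end-to-end sketch, assuming X is an n-by-p numeric matrix small enough (after any reduction above) for pdist on the features, and a placeholder of 50 clusters:

```matlab
% Cluster the features and keep one representative per cluster.
nClust = 50;                          % desired number of feature groups
D = pdist(X');                        % pairwise distances between features
Z = linkage(D, 'ward');               % Ward linkage (Euclidean distances)
idx = cluster(Z, 'maxclust', nClust);
% From each cluster, pick the feature closest to the cluster's mean profile
selected = zeros(nClust, 1);
for c = 1:nClust
    members = find(idx == c);
    centroid = mean(X(:, members), 2);
    [~, best] = min(sum((X(:, members) - centroid).^2, 1));
    selected(c) = members(best);
end
Xreduced = X(:, selected);            % records-by-nClust reduced dataset
```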
Hope this helps.