Hi Kamil,
When using clustering methods like Ward's algorithm for feature selection, the goal is to group similar features together and then select representative features from each cluster. You're right that you need to cluster the features rather than the records, which means transposing your dataset. However, as you've noticed, computing pairwise distances for a large number of features is memory-intensive: for p features, pdist produces p*(p-1)/2 distances, so memory grows quadratically with the feature count.
Here are some strategies to handle this problem and proceed with feature selection:
Strategies for Clustering Features
Dimensionality Reduction Before Clustering:
- Consider applying a dimensionality reduction technique, like Principal Component Analysis (PCA), to reduce the number of features before clustering. This can help alleviate memory issues.
- You can use the top principal components as a lower-dimensional representation of your features.
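In MATLAB, one way to realize this is to cluster the rows of the PCA coefficient matrix, since each row gives a low-dimensional representation of one feature. This is a sketch, assuming X is an n-by-p matrix (records in rows, features in columns); the component count and cluster count are placeholders you would tune:

```matlab
% Represent each feature by its loadings on the top k principal components,
% then cluster those short vectors instead of the raw n-element columns.
k = 20;                             % number of components to keep (tune this)
coeff = pca(X);                     % coeff is p-by-min(n-1,p)
featRep = coeff(:, 1:k);            % p-by-k: one k-dimensional point per feature
Z = linkage(featRep, 'ward');       % hierarchical clustering of the features
idx = cluster(Z, 'maxclust', 50);   % e.g. 50 feature clusters
```

This keeps the pairwise-distance computation at p points of dimension k, rather than p points of dimension n.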
Sample a Subset of Features:
- Randomly sample a subset of features to perform the clustering. Once you have identified clusters, you can evaluate the importance of features within those clusters on the full dataset.
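A sketch of this idea, again assuming X is n-by-p; the subset size, cluster count, and the use of cluster mean profiles as prototypes are all illustrative choices:

```matlab
% Cluster a random subset of features, then assign the remaining features
% to the nearest cluster prototype built from the sample.
p = size(X, 2);
m = 2000;                               % subset size that fits in memory
nClust = 50;
sel = randperm(p, m);
Z = linkage(X(:, sel)', 'ward');        % cluster the sampled feature vectors
idx = cluster(Z, 'maxclust', nClust);
% Mean profile (prototype) of each cluster, built from the sample
proto = zeros(size(X, 1), nClust);
for c = 1:nClust
    proto(:, c) = mean(X(:, sel(idx == c)), 2);
end
% Assign every feature (not just the sample) to its nearest prototype
[~, assignAll] = min(pdist2(X', proto'), [], 2);
```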
Incremental or Batch Processing:
- Process the data in smaller batches. Although this can be complex to implement for clustering, it might be necessary if memory constraints are severe.
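For the distance computation specifically, a block-wise approach can avoid large temporaries: fill the feature-by-feature distance matrix a few hundred columns at a time. This sketch uses 1 - |correlation| as the distance (an assumption; substitute whatever metric you need) and a placeholder block size:

```matlab
% Build the p-by-p distance matrix in column blocks so no single temporary
% holds all the intermediate results at once.
p = size(X, 2);
D = zeros(p, 'single');          % full p-by-p matrix, but in single precision
blk = 500;                       % columns per block (tune to available RAM)
for j = 1:blk:p
    cols = j:min(j + blk - 1, p);
    D(:, cols) = single(1 - abs(corr(X, X(:, cols))));
end
D = (D + D.') / 2;               % enforce exact symmetry for squareform
D(1:p+1:end) = 0;                % zero the diagonal
% Ward assumes Euclidean input, so use average linkage with this metric
Z = linkage(squareform(double(D)), 'average');
```

Note that the full p-by-p matrix must still fit in memory once; the blocking only bounds the size of the intermediate computations.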
Use Efficient Data Structures:
- Ensure that your data is stored in a memory-efficient format. MATLAB's tall arrays can help for out-of-memory data, though note that functions like pdist and linkage do not operate on tall arrays directly, so you would still need to reduce the data to an in-memory size before clustering.
Reduce Precision:
- If possible, reduce the precision of your data (e.g., use single instead of double) to roughly halve memory use.
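For example:

```matlab
% Casting to single halves memory for the data, and pdist on single input
% computes and returns the distance vector in single precision as well.
Xs = single(X);      % 4 bytes per element instead of 8
D = pdist(Xs');      % distance vector between features, in single
```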
Correcting Your Approach
Given your goal, here's how you can adjust your approach:
Transpose the Data:
- Use X' to transpose the data, so you are clustering the features instead of the records.
Compute Pairwise Distances:
- Compute the pairwise distances between features. If pdist(X') causes memory issues, reduce the number of features first using one of the strategies above.
Linkage and Clustering:
- Use the linkage function to perform hierarchical clustering on the features. Note that Ward linkage assumes Euclidean distances; if you use a different metric (e.g., correlation-based), prefer 'average' or 'complete' linkage.
Select Features:
- After clustering, select representative features from each cluster. You can choose features that are closest to the centroid of each cluster or use domain knowledge to select features.
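Putting these four steps together, here is a minimal end-to-end sketch, assuming X is an n-by-p numeric matrix small enough (after any reduction above) for pdist on the features, and a placeholder of 50 clusters:

```matlab
% Cluster the features and keep one representative per cluster.
nClust = 50;                          % desired number of feature groups
D = pdist(X');                        % pairwise distances between features
Z = linkage(D, 'ward');               % Ward linkage (Euclidean distances)
idx = cluster(Z, 'maxclust', nClust);
% From each cluster, pick the feature closest to the cluster's mean profile
selected = zeros(nClust, 1);
for c = 1:nClust
    members = find(idx == c);
    centroid = mean(X(:, members), 2);
    [~, best] = min(sum((X(:, members) - centroid).^2, 1));
    selected(c) = members(best);
end
Xreduced = X(:, selected);            % records-by-nClust reduced dataset
```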
Hope this helps.