kmeans for dataset that's too big for memory
Hi,
I'd like to run kmeans on a dataset that is too big to be loaded into memory. What options do I have? I can load part of the dataset into the workspace using MATFILE, but the general kmeans function requires loading the entire dataset, which is not possible on my computer. Please help.
Answers (1)
Jan
on 7 Oct 2013
Edited: Jan
on 7 Oct 2013
The obvious strategy is to install more RAM.
If you start from the claim that the function you want to use requires the complete data set to be loaded at once, there is no way to escape the need to load the complete data set at once. If this is not possible on your computer, upgrade it or run the code on a more powerful computer.
Sorry, I know that this answer is trivial. But it is the only possible answer that matches your question exactly. Of course you could compress the data in RAM or condense the data first. But this is not exactly what you are asking for, and in addition, it would most likely be much more expensive.
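If the premise is relaxed, so that k-means may be reimplemented instead of calling the stock function, the data can be processed one block at a time. The sketch below is only an illustration of that idea, not something proposed in the answer above: it runs plain Lloyd iterations in Python with NumPy and keeps nothing between blocks except per-cluster sums and counts. The names `iter_chunks` and `chunked_kmeans` are hypothetical, and the small in-memory array stands in for a file-backed source such as an `np.memmap` (in MATLAB, each block could presumably be read through MATFILE, as the question describes).

```python
import numpy as np


def iter_chunks(data, chunk_size):
    """Yield consecutive row blocks; `data` may be an np.memmap backed by a file."""
    for start in range(0, data.shape[0], chunk_size):
        yield np.asarray(data[start:start + chunk_size])


def chunked_kmeans(data, centroids, chunk_size=1000, n_iter=20):
    """Lloyd's k-means that only ever holds one chunk of rows in memory."""
    centroids = np.array(centroids, dtype=float)  # copy, so the caller's array is untouched
    k, dim = centroids.shape
    for _ in range(n_iter):
        sums = np.zeros((k, dim))
        counts = np.zeros(k)
        for chunk in iter_chunks(data, chunk_size):
            # Squared Euclidean distance from every row of the chunk to every centroid.
            # This temporary has chunk_size * k * dim entries, so shrink chunk_size
            # when k is large.
            d2 = ((chunk[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # Accumulate per-cluster sums and counts: the only state kept between chunks.
            np.add.at(sums, labels, chunk)
            counts += np.bincount(labels, minlength=k)
        nonempty = counts > 0
        # Clusters that received no points keep their previous centroid.
        centroids[nonempty] = sums[nonempty] / counts[nonempty, None]
    return centroids


# Toy data (an assumption for illustration): two well-separated blobs.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 0.1, (500, 2)),
                  rng.normal(5.0, 0.1, (500, 2))])
init = data[[0, 999]]  # one starting centroid taken from each blob
centers = chunked_kmeans(data, init, chunk_size=128, n_iter=5)
print(centers)
```

Each iteration makes one full pass over the data, so the cost moves from RAM to repeated disk reads. For a given initialization the result should match what an in-memory Lloyd iteration would produce, since the assignment and update steps are unchanged.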
4 comments
Dominique
on 5 Jun 2025
Although the question is old, I hope my answer is still helpful to others. My needs may be unusual: I have a dataset of 100000 samples, but I need to run kmeans with on the order of ~10000 clusters, so memory usage of tens of GB is very common for me in MATLAB.
So I compared kmeans runs in MATLAB and sklearn, and found that MATLAB takes less than half as long as sklearn, but sklearn is far ahead on memory consumption, using probably only a few hundred MB. If you use sklearnex, an acceleration of sklearn for Intel chips, the memory consumption is slightly lower still, and the speedup is simply unrivaled.
As an example, I ran 10,000 samples for a 20,000 clustering: MATLAB took 60 s and ~35,000 MB, sklearn took 130 s and ~470 MB, and sklearn accelerated with sklearnex took 17 s and ~450 MB. So I think the answer to this one is probably obvious.
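For readers who want to try the sklearn route, a minimal sketch follows. It is not the benchmark script behind the figures above: the data is random placeholder input, the sizes are kept small so it runs quickly, and the sklearnex patch is applied only if that package happens to be installed.

```python
import time

import numpy as np

try:
    # Optional: Intel Extension for Scikit-learn. The patch has to be applied
    # before the estimator is imported from sklearn.
    from sklearnex import patch_sklearn
    patch_sklearn()
except ImportError:
    pass  # fall back to stock scikit-learn

from sklearn.cluster import KMeans

# Placeholder data (an assumption for illustration), not the dataset discussed above.
rng = np.random.default_rng(0)
X = rng.random((2000, 16), dtype=np.float32)

start = time.perf_counter()
model = KMeans(n_clusters=50, n_init=1, random_state=0).fit(X)
elapsed = time.perf_counter() - start

print(f"{model.cluster_centers_.shape[0]} clusters fitted in {elapsed:.2f} s")
```

Timings and memory use will depend on the hardware, the library versions, and the data, so the numbers quoted above should not be expected to reproduce exactly.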