How can I use the pdist2 function for big data?

mina movahed on 29 Apr 2016
I want to implement k-means in MATLAB. My data set is a 9,000,000-by-1 matrix. When I used the Euclidean distance to find the distances between points, I got the following error:
Error using pdist2mex
Out of memory. Type HELP MEMORY for your options.
Error in pdist2 (line 343)
D = pdist2mex(X',Y',dist,additionalArg,smallestLargestFlag,radius);
Error in k_means_new (line 38)
dist = pdist2(d,centroids,distance); % distance between all data points and centroids
I'd like to mention that I am running MATLAB on Windows 8 with the following configuration:
RAM: 8 GB
CPU: Intel Core i5-3230M
Could you please help me?
Thanks in advance.
  2 comments
Walter Roberson on 29 Apr 2016
What are size(d) and size(centroids)?
mina movahed on 30 Apr 2016
Edited: mina movahed on 30 Apr 2016
size(d) = 9000000 * 1
size(centroids) = 240


Answers (2)

Image Analyst on 30 Apr 2016
Chances are you don't need all of that in memory at the same time. What are you really trying to do? Find the two points farthest from each other? If so, a simple double for loop that stores only the max distance (one value) instead of an 18 gigapixel array would work. Or you might be able to get what you need by taking a subsample of your original 9 million element array. So tell us the big picture: what are you really trying to accomplish, so we can advise you on a better, less memory-intensive approach?
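A minimal sketch of the two ideas above, using made-up data. The stand-in vector `d`, the sample size `nSample`, and the names `pick`, `dSub`, and `maxDist` are assumptions for illustration only. Because the data here are one-dimensional, the largest pairwise distance does not even need a double loop: it is simply the maximum minus the minimum.

```matlab
% Stand-in for the real 9,000,000-by-1 vector (toy size, illustrative only).
rng(0);                               % reproducible random numbers
d = rand(9000, 1);

% Idea 1: work on a random subsample instead of the full array.
nSample = 1000;                       % hypothetical subsample size
pick = randperm(numel(d), nSample);   % random indices without repetition
dSub = d(pick);                       % nSample-by-1 subsample

% Idea 2: largest distance between any two points, storing one value.
% For 1-D data this is the range of the data, so no pairwise matrix is needed.
maxDist = max(d) - min(d);
```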
  1 comment
mina movahed on 2 May 2016
First of all, sorry, I did not see your comment. As Walter said, it is better to rewrite the algorithm so that it does not need as much memory. I want to implement some data mining algorithms in MATLAB and then analyze the data.



Walter Roberson on 30 Apr 2016
Why are you bothering with Euclidean distance between one-dimensional objects? That is the same as abs() of the difference between them:
abs(bsxfun(@minus, d, centroids(:).'))
This is only going to be 9000000 * 240 entries, each of 8 bytes, which is only 17.28 gigabytes. An additional working storage of 9000000 * 8 bytes (72 megabytes) would also be required. Just make sure your swap space is set large enough to hold the array, and set your preferences to not prevent large arrays. It should probably only take 5 or 6 hours to compute.
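A small sketch of the equivalence described above, on toy data. The sizes below are assumptions chosen so the full matrix fits comfortably in memory; only `d` and `centroids` are names from the question. The check against `pdist2` runs only if that function is on the path, and the last lines reproduce the size estimate for the full problem.

```matlab
% Toy data: sizes are illustrative assumptions, not the real data set.
rng(0);                          % reproducible random numbers
d = rand(1000, 1);               % N-by-1 column of 1-D points
centroids = rand(5, 1);          % K-by-1 column of 1-D centroids

% For 1-D data, Euclidean distance is the absolute difference.
% d is N-by-1 and centroids(:).' is 1-by-K, so D is N-by-K.
D = abs(bsxfun(@minus, d, centroids(:).'));

% Optional cross-check against pdist2 (Statistics Toolbox), if available.
if exist('pdist2', 'file')
    D2 = pdist2(d, centroids);                 % default distance is Euclidean
    assert(max(abs(D(:) - D2(:))) < 1e-12);
end

% Memory needed for the full-size problem from the question (8-byte doubles).
bytesNeeded = 9000000 * 240 * 8;
fprintf('Full distance matrix: %.2f GB\n', bytesNeeded / 1e9);
```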
  6 comments
mina movahed on 2 May 2016
Thanks a lot. I will try this and let you know if it works. The task is to implement k-means, so I need to find the distance between all points and the centroids.
Walter Roberson on 2 May 2016
For k-means you do not need to retain those distances; you only need to figure out which centroid is closest. That takes the long-term storage requirement down by a factor of length(centroids).
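One way to act on this, sketched under assumptions: process the points in blocks, compute the distances for one block at a time, and keep only the index of the nearest centroid. The block size `chunk`, the toy data sizes, and the names `c`, `idx`, and `Dblock` are hypothetical; this is an illustration of the idea, not code from the thread.

```matlab
% Toy data standing in for the real 9,000,000-by-1 vector and 240 centroids.
rng(0);
d = rand(50000, 1);              % N-by-1 column of 1-D points
centroids = rand(240, 1);        % K centroids
c = centroids(:).';              % 1-by-K row for bsxfun expansion

N = numel(d);
chunk = 8000;                    % hypothetical block size; tune to available RAM
idx = zeros(N, 1);               % index of the closest centroid for each point

for first = 1:chunk:N
    last = min(first + chunk - 1, N);
    % Distances for this block only: at most chunk-by-K doubles in memory.
    Dblock = abs(bsxfun(@minus, d(first:last), c));
    % Keep only the column index of the smallest distance in each row.
    [~, idx(first:last)] = min(Dblock, [], 2);
end
```

With these settings the temporary block holds at most 8000 * 240 doubles (about 15 MB), and the only array that grows with the data is `idx`, one value per point.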

