Faster array comparison with big datasets

Question

0 votes

Benchmark.m

I am trying to compare two large arrays to each other and do some extra operations. So far it works but it is very slow. Coming from a little C++ background, I can use pointers to their location instead of taking the value of each. I have found how to do so in matlab using shrlib but I am not sure if it works for doubles.

What I have available:

Two textfiles each in the following format: name in col 1, value1 in col2 and value2 in col 3

filename       value1       value2
---------------------------------------
foo0001         100          200
foo0001         140          200
foo0001         135          155
foo0002         124          145
foo0002         135          155
foo0003         1419         1943
   .             .            .
   .             .            .

I would want to:

compare the values of value1 and value2 from the first file to those of the second file
be tolerant to errors with respect to elements from value1 and value2
the file comparison to run fast ( my current method is very slow)

So here is what I have done so far:

file1 = readtable('file1.txt', 'Delimiter',' ','ReadVariableNames',false);
fclose(fileID);
file2 = readtable('file2.txt', 'Delimiter',' ','ReadVariableNames',false);
fclose(fileID);
n = 1000;
detectedCorn = zeros(n,2);   %initialize large nx2 vectors to save value1 and value2
referenceCorn = zeros(n,2);
null = ['foo000',int2str(ip)]; % initialize filename as "dynamic" string
for ip = 1:100
    iv = 1;   % initialize index variables to use later on
    iv2 = 1;  %
      for ii = 1:numel(file1.Var1) %%all Var1 -> name in first col in file1 table
          if strcmp(file1.Var1(ii), null) == 1;
              var = [file1.Var2(ii),file1.Var3(ii)];
              detectedCorn(iv,:) = var;
              var = zeros(1,2);
              iv = iv + 1;
          end
          %REPEAT SAME STEP AS ABOVE FOR file2( I will skip that so the code doesnt get too crowded)
          for kk = 1:1000
              for rr = 1:1000
                  var3 = [0,0];
                  if all(ismember(detectedCorn(kk,:),var3)) == 0  &&...
                     all(ismember(referenceCorn(kk,:),var3)) == 0   %%do not count the empty spaces in the array
                      if all(ismember(detectedCorn(kk,:),referenceCorn(rr,:))) == 1  %%iterate through the two vectors and compare
                          % do something
                      else
                          % do something else
                      end
                  end
              end
          end
      end
  end
          ---------------------------------------------------------------------------
  Clarification: 
For example I have these files:
file 1: Reference
imagename.png       value1    value2
000000004.png       100        200
000000004.png       120        130
000000004.png       150        1500
000000005.png       1214       501
file 2: Detected
imagename.png        value1    value2
000000004.png       100       200
000000004.png       204       300
000000004.png       300       400
000000004.png       129       140
000000004.png       152       1499        
000000005.png       1200      500

What I mean by compare, I want to do the following: for example the first row in the Detected file (file2) I have row1. First thing I have to make sure that the filenames in the two files are the same. If that so then I can compare the data in x and y. if difference of value1 from Detected file (in this example 100) and the value1 from Reference file is equal to 0,1,2,-1,-2 then, check if difference of value2 of the same row in both files equal to 0,1,2,-1,-2, increment the variable TP (Positive match) with 1. And so on for all files... By that logic the for the above data TP = 2, FP = data found in Detected but not in reference and FN is data found in reference but not in detected.

Notice that file detected has way more entries than the first file and does not necessarily have the same filenames

9 commentaires
Afficher 7 commentaires plus anciens Masquer 7 commentaires plus anciens

Stephen23 le 5 Mar 2015

Modifié(e) : Stephen23 le 5 Mar 2015

That is why I wrote "then you need to learn about code vectorization", because vectorization is the best, fastest, and neatest MATLAB syntax. Forget about solving problems using loops, that is C++ syntax.

Vectorization is your first choice in MATLAB, loops are your second.

The documentation states: MATLAB® is optimized for operations involving matrices and vectors. The process of revising loop-based, scalar-oriented code to use MATLAB matrix and vector operations is called vectorization. Vectorizing your code is worthwhile for several reasons:

Appearance: Vectorized mathematical code appears more like the mathematical expressions found in textbooks, making the code easier to understand.
Less Error Prone: Without loops, vectorized code is often shorter. Fewer lines of code mean fewer opportunities to introduce programming errors.
Performance: Vectorized code often runs much faster than the corresponding code containing loops.

You might also be interested in these webinars:

http://de.mathworks.com/matlabcentral/fileexchange/41989-matlab-for-c-c++-programmers

If you uploaded your whole code files, we could give some more concrete advice too!

KBK007 le 5 Mar 2015

Modifié(e) : KBK007 le 5 Mar 2015

Ouvrir dans MATLAB Online

Hey stephen, at the moment I do not have the right to upload the dataset I have (the reference one since it is not produced by me). Unfortunately I will have to tell you (in words) what I require from my program. The input style is like the one above, I have two text files having the format:

   imagename.png    value1    value2
   000000004.png       x        y
   000000004.png       x2       y2
   000000004.png       x3       y3
   000000005.png       x4       y4
 (this explains the ip = 4:169 the last image i have is 0000000004.png and the last is 0000000169.png hence the 2 if statements)

The reference file has less entries for EACH filename than the ones in the DetectedCorners file. So i could have 100 rows for 0000000004.png in the first file and 700 in the other. However since the data in the ReferenceCorner are the ones I take reference, all the data belonging to image 0000000004.png in the detected corners have to be compared to the ones corresponding only to 0000000004.png(that is same filename). And yes the goal is counting and producing how many of those are true. (to give you an idea why there are 3 values in each row, I have processed images and I am comparing the pixel locations I detected with my algorithm w.r.t a reference to see how good or bad my algorithm is and perform optimization operations on it)

KBK007 le 5 Mar 2015

stephen I know this is gona be a dumb question to ask, but I guess it is something you guys have encountered before, isn't a table "vectorization" ? it is a nx3 dimensional array is it not ? or should I put it in a matrix structure and then pull some operations on it? I am going to check out the definition of a table and try to put the data in a nx2 (for each image) matrix. Is that what you mean ?

Oleg Komarov le 2 Avr 2015

Have you tried running the profiler to see where the bottlenecks are?

Connectez-vous pour commenter.

Connectez-vous pour répondre à cette question.

Follow Question

Answer 1

Stephen23 le 5 Mar 2015

Modifié(e) : Stephen23 le 5 Mar 2015

Ouvrir dans MATLAB Online

1 vote

My version of MATLAB is a little long in the tooth and does not support tables, so I used simple numeric arrays instead, but this code should get you started on your way. One big hint for using MATLAB: learn to navigate and use the documentation , it is actually very helpful.

fid = fopen('tempDet.txt','rt');
DetC = textscan(fid,'%s%f%f','HeaderLines',2,'CollectOutput',true);
fclose(fid);
fid = fopen('tempRef.txt','rt');
RefC = textscan(fid,'%s%f%f','HeaderLines',2,'CollectOutput',true);
fclose(fid);
[DetV,~,DetX] = unique(DetC{1});
[RefV,~,RefX] = unique(RefC{1});
assert(all(strcmp(DetV,RefV)),'Some filenames are missing in one of the files!')
for k = reshape(unique(DetX),1,[])
    DetK = DetX==k;
    RefK = RefX==k;
    DetN = sum(DetK)
    RefN = sum(RefK)
    intX = intersect(DetC{2}(DetK,:),RefC{2}(RefK,:),'rows')
    intN = size(intX,1)
end

It opens each file (attached), checks that the filenames correspond, then for each filename finds the ones with matching numeric values (both values in the row must be the same). The sums and the intN tell you how values there are for each file, and many match correctly, respectively. The (rather simple) data files that I used are attached here:

1 commentaire
Afficher -1 commentaires plus anciens Masquer -1 commentaires plus anciens

KBK007 le 6 Mar 2015

Thank you stephen for your time and contribution. My only concern is this program seems to require that for example the foo0001 from the first file MUST have a corresponding foo0001 in the other file. However since I am testing an algorithm against another, I may detect some false positives and may have data for foo0001 despite the fact that in the reference file there is no foo. I will look at it myself but thank you for the example of vectorization, now i sorta get what you were aiming for.

Connectez-vous pour commenter.

Answer 2

Peter Perkins le 5 Mar 2015

0 votes

KBK007, as Stephen said, your code is unvectorized, and that's why it is slow. It is also complicated enough that its purpose kind of hard to figure out. It seems like you're doing something along the lines of finding data in one file that are also present in another file. That may be related to a join operation, or some kind of intersection, or even some kind of grouped calculation. There are functions specifically for those operations, but it's really hard to say what you should be using without a clear description of what data you have and what you want do with it. In words, not in code. You should be able to make a very small pair of example data files, and show exactly what results you want from that.

0 commentaires
Afficher -2 commentaires plus anciens Masquer -2 commentaires plus anciens

Connectez-vous pour commenter.

Answer 3

Peter Perkins le 9 Mar 2015

Ouvrir dans MATLAB Online

0 votes

Given Stephen's interpretation of your code, and some similar data, here is a vectorized version that prints out

filenames that occur in only one of the two tables
the number of rows for each filename in each table
rows with filename/value1 in common between the two tables

Not sure if this does what you're looking for, but at least it will give you an idea of what is meant by "vectorization". Hope it helps.

>> DetC = table( ...
       {'foo0001'; 'foo0001'; 'foo0001'; 'foo0002'; 'foo0002'; 'foo0003'; 'foo0004'}, ...
       [101; 141; 133; 124; 135; 1419; 1420], ...
       [199; 200; 154; 145; 152; 1933; 1934], ...
       'VariableNames',{'filename' 'value1' 'value2'});
>> RefC = table( ...
       {'foo0001'; 'foo0002'; 'foo0002'; 'foo0003'}, ...
       [141; 124; 123; 1419], ...
       [200; 144; 130; 1933], ...
       'VariableNames',{'filename' 'value1' 'value2'});
>> % Find filenames that occur in only one of the two tables
>> setxor(DetC.filename,RefC.filename)
ans = 
    'foo0004'
>> % Find the number of rows for each filename in each table
>> [~,~,uindices] = unique(DetC.filename);
>> accumarray(uindices,1)
ans =
     3
     2
     1
     1
>> [~,~,uindices] = unique(RefC.filename);
>> accumarray(uindices,1)
ans =
     1
     2
     1
>> % Find the rows with filename/value1 in common between the two tables
>> intersect(DetC(:,1:2),RefC(:,1:2))
ans = 
    filename     value1
    _________    ______
    'foo0001'     141  
    'foo0002'     124  
    'foo0003'    1419

3 commentaires
Afficher 1 commentaire plus ancien Masquer 1 commentaire plus ancien

KBK007 le 10 Mar 2015

Modifié(e) : KBK007 le 10 Mar 2015

I don't think vectorization can be applied here. I don't have the same data size and when comparing matrices or the difference between the values, each value has to be present in both. This means the dimensions should agree otherwise I think it is loopy loop for me. Also I really want to get the difference of each value against all others in the other file, it is compulsory to use a loop

Stephen23 le 2 Avr 2015

Modifié(e) : Stephen23 le 2 Avr 2015

Ouvrir dans MATLAB Online

You can still compare data in differently-sized groups by using bsxfun. It is a very fast and efficient tool that can be used in vectorized code, and allows you to compare differently sized vectors very quickly. You can do something like this:

>> X = [1,2,4,6,8];
>> Y = [1,3,4,5,7,8,9];
>> bsxfun(@eq,X(:),Y)
ans =
   1     0     0     0     0     0     0
   0     0     0     0     0     0     0
   0     0     1     0     0     0     0
   0     0     0     0     0     0     0
   0     0     0     0     0     1     0

You can then use any and all or other logical operations to select which values you want to look at. And of course you can choose or define the function use in bsxfun (I used the inbuilt eq).

If you decide to go with using a loop, then you really need to read this:

http://www.mathworks.com/help/matlab/matlab_prog/techniques-for-improving-performance.html

Connectez-vous pour commenter.

Faster array comparison with big datasets

9 commentaires
Afficher 7 commentaires plus anciens Masquer 7 commentaires plus anciens

Réponses (3)

1 commentaire
Afficher -1 commentaires plus anciens Masquer -1 commentaires plus anciens

0 commentaires
Afficher -2 commentaires plus anciens Masquer -2 commentaires plus anciens

3 commentaires
Afficher 1 commentaire plus ancien Masquer 1 commentaire plus ancien

Catégories

Tags

Community Treasure Hunt

Faster array comparison with big datasets

9 commentaires Afficher 7 commentaires plus anciens Masquer 7 commentaires plus anciens

Réponses (3)

1 commentaire Afficher -1 commentaires plus anciens Masquer -1 commentaires plus anciens

0 commentaires Afficher -2 commentaires plus anciens Masquer -2 commentaires plus anciens

3 commentaires Afficher 1 commentaire plus ancien Masquer 1 commentaire plus ancien

Catégories

Tags

Voir également

Community Treasure Hunt

9 commentaires
Afficher 7 commentaires plus anciens Masquer 7 commentaires plus anciens

1 commentaire
Afficher -1 commentaires plus anciens Masquer -1 commentaires plus anciens

0 commentaires
Afficher -2 commentaires plus anciens Masquer -2 commentaires plus anciens

3 commentaires
Afficher 1 commentaire plus ancien Masquer 1 commentaire plus ancien