Faster array comparison with big datasets

I am trying to compare two large arrays to each other and do some extra operations. So far it works but it is very slow. Coming from a little C++ background, I can use pointers to their location instead of taking the value of each. I have found how to do so in matlab using shrlib but I am not sure if it works for doubles.
What I have available:
Two textfiles each in the following format: name in col 1, value1 in col2 and value2 in col 3
filename value1 value2
---------------------------------------
foo0001 100 200
foo0001 140 200
foo0001 135 155
foo0002 124 145
foo0002 135 155
foo0003 1419 1943
. . .
. . .
I would want to:
  1. compare the values of value1 and value2 from the first file to those of the second file
  2. be tolerant to errors with respect to elements from value1 and value2
  3. the file comparison to run fast ( my current method is very slow)
So here is what I have done so far:
file1 = readtable('file1.txt', 'Delimiter',' ','ReadVariableNames',false);
fclose(fileID);
file2 = readtable('file2.txt', 'Delimiter',' ','ReadVariableNames',false);
fclose(fileID);
n = 1000;
detectedCorn = zeros(n,2); %initialize large nx2 vectors to save value1 and value2
referenceCorn = zeros(n,2);
null = ['foo000',int2str(ip)]; % initialize filename as "dynamic" string
for ip = 1:100
iv = 1; % initialize index variables to use later on
iv2 = 1; %
for ii = 1:numel(file1.Var1) %%all Var1 -> name in first col in file1 table
if strcmp(file1.Var1(ii), null) == 1;
var = [file1.Var2(ii),file1.Var3(ii)];
detectedCorn(iv,:) = var;
var = zeros(1,2);
iv = iv + 1;
end
%REPEAT SAME STEP AS ABOVE FOR file2( I will skip that so the code doesnt get too crowded)
for kk = 1:1000
for rr = 1:1000
var3 = [0,0];
if all(ismember(detectedCorn(kk,:),var3)) == 0 &&...
all(ismember(referenceCorn(kk,:),var3)) == 0 %%do not count the empty spaces in the array
if all(ismember(detectedCorn(kk,:),referenceCorn(rr,:))) == 1 %%iterate through the two vectors and compare
% do something
else
% do something else
end
end
end
end
end
end
---------------------------------------------------------------------------
Clarification:
For example I have these files:
file 1: Reference
imagename.png value1 value2
000000004.png 100 200
000000004.png 120 130
000000004.png 150 1500
000000005.png 1214 501
file 2: Detected
imagename.png value1 value2
000000004.png 100 200
000000004.png 204 300
000000004.png 300 400
000000004.png 129 140
000000004.png 152 1499
000000005.png 1200 500
What I mean by compare, I want to do the following: for example the first row in the Detected file (file2) I have row1. First thing I have to make sure that the filenames in the two files are the same. If that so then I can compare the data in x and y. if difference of value1 from Detected file (in this example 100) and the value1 from Reference file is equal to 0,1,2,-1,-2 then, check if difference of value2 of the same row in both files equal to 0,1,2,-1,-2, increment the variable TP (Positive match) with 1. And so on for all files... By that logic the for the above data TP = 2, FP = data found in Detected but not in reference and FN is data found in reference but not in detected.
Notice that file detected has way more entries than the first file and does not necessarily have the same filenames

9 commentaires

If you come from a C++ background you may want to create a mex function to do the comparison in C++ instead.
doc mex
Stephen23
Stephen23 le 5 Mar 2015
Modifié(e) : Stephen23 le 5 Mar 2015
You can upload your files using the paperclip button, this way we can see the whole code.
If you come from a C++ background, then you need to learn about code vectorization , which is much faster than using loops in MATLAB.
KBK007
KBK007 le 5 Mar 2015
I am just curious how to do it in MATLAB syntax
Stephen23
Stephen23 le 5 Mar 2015
Modifié(e) : Stephen23 le 5 Mar 2015
That is why I wrote "then you need to learn about code vectorization", because vectorization is the best, fastest, and neatest MATLAB syntax. Forget about solving problems using loops, that is C++ syntax.
Vectorization is your first choice in MATLAB, loops are your second.
The documentation states: MATLAB® is optimized for operations involving matrices and vectors. The process of revising loop-based, scalar-oriented code to use MATLAB matrix and vector operations is called vectorization. Vectorizing your code is worthwhile for several reasons:
  • Appearance: Vectorized mathematical code appears more like the mathematical expressions found in textbooks, making the code easier to understand.
  • Less Error Prone: Without loops, vectorized code is often shorter. Fewer lines of code mean fewer opportunities to introduce programming errors.
  • Performance: Vectorized code often runs much faster than the corresponding code containing loops.
You might also be interested in these webinars:
If you uploaded your whole code files, we could give some more concrete advice too!
KBK007
KBK007 le 5 Mar 2015
I do understand that, thank you for the hints. I uploaded my file it is edited in my question.
Stephen23
Stephen23 le 5 Mar 2015
Thank you for uploading the code. Although I can't run it without your data, it is clear that will be quite slow due to calling ismember multiple times in the deepest loops, and some slow referencing style, and the use of nested loops anyway.
I think that it would much faster if we start from scratch: what is the result that you want, any what inputs do you have? As far as I can tell, you are basically counting the total matches in the two columns, subject to some conditions. Can you explain this in words ?
Some uploaded data files would be nice too (can be shortened to a few tens of lines).
KBK007
KBK007 le 5 Mar 2015
Modifié(e) : KBK007 le 5 Mar 2015
Hey stephen, at the moment I do not have the right to upload the dataset I have (the reference one since it is not produced by me). Unfortunately I will have to tell you (in words) what I require from my program. The input style is like the one above, I have two text files having the format:
imagename.png value1 value2
000000004.png x y
000000004.png x2 y2
000000004.png x3 y3
000000005.png x4 y4
(this explains the ip = 4:169 the last image i have is 0000000004.png and the last is 0000000169.png hence the 2 if statements)
The reference file has less entries for EACH filename than the ones in the DetectedCorners file. So i could have 100 rows for 0000000004.png in the first file and 700 in the other. However since the data in the ReferenceCorner are the ones I take reference, all the data belonging to image 0000000004.png in the detected corners have to be compared to the ones corresponding only to 0000000004.png(that is same filename). And yes the goal is counting and producing how many of those are true. (to give you an idea why there are 3 values in each row, I have processed images and I am comparing the pixel locations I detected with my algorithm w.r.t a reference to see how good or bad my algorithm is and perform optimization operations on it)
KBK007
KBK007 le 5 Mar 2015
stephen I know this is gona be a dumb question to ask, but I guess it is something you guys have encountered before, isn't a table "vectorization" ? it is a nx3 dimensional array is it not ? or should I put it in a matrix structure and then pull some operations on it? I am going to check out the definition of a table and try to put the data in a nx2 (for each image) matrix. Is that what you mean ?
Oleg Komarov
Oleg Komarov le 2 Avr 2015
Have you tried running the profiler to see where the bottlenecks are?

Connectez-vous pour commenter.

Réponses (3)

Stephen23
Stephen23 le 5 Mar 2015
Modifié(e) : Stephen23 le 5 Mar 2015
My version of MATLAB is a little long in the tooth and does not support tables, so I used simple numeric arrays instead, but this code should get you started on your way. One big hint for using MATLAB: learn to navigate and use the documentation , it is actually very helpful.
fid = fopen('tempDet.txt','rt');
DetC = textscan(fid,'%s%f%f','HeaderLines',2,'CollectOutput',true);
fclose(fid);
fid = fopen('tempRef.txt','rt');
RefC = textscan(fid,'%s%f%f','HeaderLines',2,'CollectOutput',true);
fclose(fid);
[DetV,~,DetX] = unique(DetC{1});
[RefV,~,RefX] = unique(RefC{1});
assert(all(strcmp(DetV,RefV)),'Some filenames are missing in one of the files!')
for k = reshape(unique(DetX),1,[])
DetK = DetX==k;
RefK = RefX==k;
DetN = sum(DetK)
RefN = sum(RefK)
intX = intersect(DetC{2}(DetK,:),RefC{2}(RefK,:),'rows')
intN = size(intX,1)
end
It opens each file (attached), checks that the filenames correspond, then for each filename finds the ones with matching numeric values (both values in the row must be the same). The sums and the intN tell you how values there are for each file, and many match correctly, respectively. The (rather simple) data files that I used are attached here:

1 commentaire

KBK007
KBK007 le 6 Mar 2015
Thank you stephen for your time and contribution. My only concern is this program seems to require that for example the foo0001 from the first file MUST have a corresponding foo0001 in the other file. However since I am testing an algorithm against another, I may detect some false positives and may have data for foo0001 despite the fact that in the reference file there is no foo. I will look at it myself but thank you for the example of vectorization, now i sorta get what you were aiming for.

Connectez-vous pour commenter.

Peter Perkins
Peter Perkins le 5 Mar 2015

0 votes

KBK007, as Stephen said, your code is unvectorized, and that's why it is slow. It is also complicated enough that its purpose kind of hard to figure out. It seems like you're doing something along the lines of finding data in one file that are also present in another file. That may be related to a join operation, or some kind of intersection, or even some kind of grouped calculation. There are functions specifically for those operations, but it's really hard to say what you should be using without a clear description of what data you have and what you want do with it. In words, not in code. You should be able to make a very small pair of example data files, and show exactly what results you want from that.
Given Stephen's interpretation of your code, and some similar data, here is a vectorized version that prints out
  • filenames that occur in only one of the two tables
  • the number of rows for each filename in each table
  • rows with filename/value1 in common between the two tables
Not sure if this does what you're looking for, but at least it will give you an idea of what is meant by "vectorization". Hope it helps.
>> DetC = table( ...
{'foo0001'; 'foo0001'; 'foo0001'; 'foo0002'; 'foo0002'; 'foo0003'; 'foo0004'}, ...
[101; 141; 133; 124; 135; 1419; 1420], ...
[199; 200; 154; 145; 152; 1933; 1934], ...
'VariableNames',{'filename' 'value1' 'value2'});
>> RefC = table( ...
{'foo0001'; 'foo0002'; 'foo0002'; 'foo0003'}, ...
[141; 124; 123; 1419], ...
[200; 144; 130; 1933], ...
'VariableNames',{'filename' 'value1' 'value2'});
>> % Find filenames that occur in only one of the two tables
>> setxor(DetC.filename,RefC.filename)
ans =
'foo0004'
>> % Find the number of rows for each filename in each table
>> [~,~,uindices] = unique(DetC.filename);
>> accumarray(uindices,1)
ans =
3
2
1
1
>> [~,~,uindices] = unique(RefC.filename);
>> accumarray(uindices,1)
ans =
1
2
1
>> % Find the rows with filename/value1 in common between the two tables
>> intersect(DetC(:,1:2),RefC(:,1:2))
ans =
filename value1
_________ ______
'foo0001' 141
'foo0002' 124
'foo0003' 1419

3 commentaires

KBK007
KBK007 le 9 Mar 2015
hey peter, thanks for the answer. I get how vectorization works, I read through some tutorials and some books and figured it out. I edited my question and clarified it with an example
KBK007
KBK007 le 10 Mar 2015
Modifié(e) : KBK007 le 10 Mar 2015
I don't think vectorization can be applied here. I don't have the same data size and when comparing matrices or the difference between the values, each value has to be present in both. This means the dimensions should agree otherwise I think it is loopy loop for me. Also I really want to get the difference of each value against all others in the other file, it is compulsory to use a loop
Stephen23
Stephen23 le 2 Avr 2015
Modifié(e) : Stephen23 le 2 Avr 2015
You can still compare data in differently-sized groups by using bsxfun. It is a very fast and efficient tool that can be used in vectorized code, and allows you to compare differently sized vectors very quickly. You can do something like this:
>> X = [1,2,4,6,8];
>> Y = [1,3,4,5,7,8,9];
>> bsxfun(@eq,X(:),Y)
ans =
1 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 1 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 1 0
You can then use any and all or other logical operations to select which values you want to look at. And of course you can choose or define the function use in bsxfun (I used the inbuilt eq).
If you decide to go with using a loop, then you really need to read this:

Connectez-vous pour commenter.

Tags

Question posée :

le 5 Mar 2015

Modifié(e) :

le 2 Avr 2015

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by