Matlab comparison of two large matricies

3 vues (au cours des 30 derniers jours)
Benvaulter
Benvaulter le 26 Mai 2017
Commenté : Benvaulter le 29 Mai 2017
I am trying to retrieve the index of exact matches (row-specific) between two large matricies. I have a n x 61 matrix A containing values from 0 to 9 and another n x 61 matrix B , whereas each row here contains values from 0 to 9 but mostly NaN (only 2 to 8 columns in each row of matrix B contain actual numbers). Matrix A can be expected to have between 1.5 million and 3 million rows, whereas matrix B has around 0.2 to 0.5 million rows. Here is an example of the setup:
% create matrix a with random data
dataSample = [0 9];
numRows = 1000000;
numCols = 61;
A = randi(dataSample,numRows,numCols);
% create matrix B with random data
numRows = 100000;
numCols = 61;
numColsUse = 2:8;
dataRange = 0:9;
B = NaN(numRows,numCols);
for i = 1:size(B,1)
% randomly selet number of columns to fill
numColsFill = datasample(numColsUse,1);
% randomly select column index from available columns
colIdx = datasample([1:numCols],numColsFill);
% randomly select values from 0 to 9
numFill = datasample([0:9],numColsFill);
% insert numbers at respective column in matrix B
B(i,colIdx) = numFill;
end
I want to compare every single row of matrix A with the entire matrix B and find exact matches, where the numbers of matrix B match the numbers of matrix A at their respective positions (columns) - hence the NaN in matrix B are to be ignored.
I can achieve the desired result using cellfun, where I slice matrix A in several subsets and then use a custom function to compare the rows of the subset with each row in matrix B, like so:
% put all rows of matrix B in single cell
cellB = {B};
% take subset of matrix A and convert to cell array
subA = A(1000:5000,:);
subA = num2cell(subA,2);
% prepare cellB to meet cellfun conditions
cellB = repmat(cellB, [size(subA,1) 1]);
% apply cellfun to retrieve index of each exact match
idxContainer = cellfun(@findMatch, cellB, subA, 'UniformOutput', false);
Function findMatch looks as follows:
function [ idx ] = LTableEval( cellB, subA )
idxCheckLT = lt(cellB, repmat(subA, [size(cellB,1) 1]));
idxCheckGT = gt(cellB, repmat(subA, [size(cellB,1) 1]));
idxCheck = idxCheckLT + idxCheckGT;
idxSum = sum(idxCheck,2);
idx = find(idxSum == 0);
end
This approach works, but it appears to be very inefficient, especially RAM-wise, as the cellfun requires all inputs to have the same size and hence a multiplication of the same data sets. Any ideas on how to tackle this problem in a more efficient way? Many thanks!

Réponse acceptée

Guillaume
Guillaume le 28 Mai 2017
This is how I'd do it:
matches = cell(size(B, 1), 1);
for Brow = 1:size(B, 1)
Bcols = find(~isnan(B(Brow, :)));
matches{Brow} = find(all(A(:, Bcols) == B(Brow, Bcols), 2)); %requires R2016b or later
end
It's certainly a lot more efficient than any of the solutions you already have.
Note: in R2015b or earlier replace the relevant line by:
matches{Brow} = find(all(bsxfun(@eq, A(:, Bcols), B(Brow, Bcols)), 2));
  4 commentaires
Guillaume
Guillaume le 29 Mai 2017
Try this:
matches = cell(size(B, 1), 1);
for Brow = 1:size(B, 1)
Bcols = find(~isnan(B(Brow, :)));
matchedrows = find(all(A(:, Bcols) == B(Brow, Bcols), 2));
matches{Brow} = [matchedrows, repmat(Brow, size(matchedrows))];
end
matches = cell2mat(matches);
finalScores = accumarray(matches(:, 1), matches(:, 2), [size(A, 1), 1], @(Brows) mean(Bscores(Brows)), nan);
However, if most rows of A all match at least some rows of B it may be more efficient to have just one loop over the rows of A. (I regard accumarray as another loop)
Benvaulter
Benvaulter le 29 Mai 2017
It works like a charm - absolutely brillinat - thank you so much!

Connectez-vous pour commenter.

Plus de réponses (1)

Matthew Eicholtz
Matthew Eicholtz le 26 Mai 2017
A couple comments:
1. Did you mean to convert to the cell array in this manner?
subA = num2cell(subA);
If you want to look at each row as its own cell, I think you need:
subA = num2cell(subA,2);
2. I am not sure how much more efficient this solution will be, but you can replace
idxContainer = cellfun(@findMatch, cellB, subA, 'UniformOutput', false);
with
idxContainer = cellfun(@(x,y) find(all(isnan(x)|x==y,2)), cellB, subA, 'UniformOutput', false);
Let me know if this helps at all.
  1 commentaire
Benvaulter
Benvaulter le 28 Mai 2017
Hi Matthew,
thanks for your suggestions. First of all, you are correct, I has to be
subA = num2cell(subA,2);
as I want to conduct the operation on each row. Your alternative suggestion regarding the cellfunction returns the following error for me:
Error using ==
Matrix dimensions must agree.
Error in @(x,y)find(all(isnan(x)|x==y,2))
I have found other solutions suggesting matrix substraction, which makes sense to me. However, I am facing two challenges:
1. How to maximize performance when comparing each single row (i.e. what you suggested)
2. How to best setup the overall code (i.e. using cellfun vs. for-loops, vs. other ideas) to maximize speed overall (with these amounts of data, I also believe RAM might be a factor in what I am able to do.
Someone suggested the following solution:
for br = 1:size(B,1)
abs_diff = abs(repmat(B(br,:),[size(A,1) 1]) - A);
abs_diff(isnan(abs_diff)) = 0;
match = abs_diff == 0;
ind = find(sum(match,2)==size(match,2));
matches{br} = [repmat(br,[length(ind) 1]) ind];
end
matches = cell2mat(matches');
which is already a step into the right correction and which does not use "slicing the dataset" into smaller parts. However, each iteration of the loop here is still over a second long and I am wondering whether there are still better ways to do this. Any help is much appreciated - thanks!

Connectez-vous pour commenter.

Catégories

En savoir plus sur Matrix Indexing dans Help Center et File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by