Using strcmpi (or similar) to extract indexes with multiple strings

7 vues (au cours des 30 derniers jours)
MILLER BIGGELAAR
MILLER BIGGELAAR le 25 Oct 2022
Commenté : Walter Roberson le 25 Oct 2022
Hi team,
I need to be able to locate the row in a spreadsheet which contains a string in column A AND a string in column B
The strings are different but remain the same throughout my large set of files, can I use strcmpi to do this (potentially with an & or something)?
I have ~90k files to go through, some files contain additional data above the specific row I need which is why I need to use an index for it and not use a simple range technique
The code below is for a different set of data that I used strcmpi for, is there anything that can be done to simply modify this?
rowix = [];
file = convertCharsToStrings(filedir(3).name);
[num,txt,raw] = xlsread(file);
for k = 1:length(duid)
p = strcmpi(duid(k,1),raw(:,6));
rownum = find(p==1);
rowix = [rowix; rownum];
end
rowix = sort(rowix);
will the ismember() function be better here?
Thanks
  1 commentaire
Walter Roberson
Walter Roberson le 25 Oct 2022
I have ~90k files to go through
I recommend that you switch from xlsread() to readtable() .

Connectez-vous pour commenter.

Réponses (1)

Walter Roberson
Walter Roberson le 25 Oct 2022
will the ismember() function be better here?
You specifically use strcmpi() so we will assume that you need a case-insensitive match. In such a situation, to use ismember() you would need to ismember(lower(TERM), lower(TARGET))
ismember() is known to work by sorting the target and then doing binary searches. sorting is an O(n*log(n)) execution time, and then binary search is O(log(n)) so the overall search time for ismember(A,B) would be roughly numel(B)*log(numel(B)) + numel(A) * log(numel(B))... which would work out as roughly O( log(numel(B)) * (numel(A) + numel(B)) )
The method used by strcmpi() is probably linear. Where ismember(A,B) permits looking for each member of A in each member of B, strcmpi(A,B) requires A to be scalar, or else A must be the same size as B and corresponding elements are compared. Thus, strcmpi(A,B) does not have any need to be able to efficiently compare multiple elements in A to multiple elements in B. A linear search that compared each element of A to each element of B would require numel(A) * numel(B) time -- which could be considerably higher than O( log(numel(B)) * (numel(A) + numel(B)) ) . So having both algorithms available is useful.
You are comparing multiple duid to multiple raw(:,6) so potentially ismember() would be faster if you compared everything in one call
Let us try some timing...
letters = ['A' : 'Z', 'a' : 'z', '0' : '9'];
numletters = numel(letters);
N = 1e5;
raw6 = cell(N, 1);
for K = 1 : N
raw6{K} = letters(randi(numletters, 1, randi(20)));
end
M = 25;
duid = letters(randi(numletters, 1, 10));
strcmp_nomatch = zeros(M,1);
member_nomatch = zeros(M,1);
lower_nomatch = zeros(M,1);
for K = 1 : M; start = tic; strcmpi(duid, raw6); stop = toc(start); strcmp_nomatch(K) = stop; end
for K = 1 : M; start = tic; ismember(lower(duid), lower(raw6)); stop = toc(start); member_nomatch(K) = stop; end
for K = 1 : M; start = tic; lower(duid); lower(raw6); stop = toc(start); lower_nomatch(K) = stop; end
duid = upper(raw6{1});
strcmp_first = zeros(M,1);
member_first = zeros(M,1);
lower_first = zeros(M,1);
for K = 1 : M; start = tic; strcmpi(duid, raw6); stop = toc(start); strcmp_first(K) = stop; end
for K = 1 : M; start = tic; ismember(lower(duid), lower(raw6)); stop = toc(start); member_first(K) = stop; end
for K = 1 : M; start = tic; lower(duid); lower(raw6); stop = toc(start); lower_first(K) = stop; end
duid = upper(raw6{floor(end/2)});
strcmp_middle = zeros(M,1);
member_middle = zeros(M,1);
lower_middle = zeros(M,1);
for K = 1 : M; start = tic; strcmpi(duid, raw6); stop = toc(start); strcmp_middle(K) = stop; end
for K = 1 : M; start = tic; ismember(lower(duid), lower(raw6)); stop = toc(start); member_middle(K) = stop; end
for K = 1 : M; start = tic; lower(duid); lower(raw6); stop = toc(start); lower_middle(K) = stop; end
duid = upper(raw6{end});
strcmp_last = zeros(M,1);
member_last = zeros(M,1);
lower_last = zeros(M,1);
for K = 1 : M; start = tic; strcmpi(duid, raw6); stop = toc(start); strcmp_last(K) = stop; end
for K = 1 : M; start = tic; ismember(lower(duid), lower(raw6)); stop = toc(start); member_last(K) = stop; end
for K = 1 : M; start = tic; lower(duid); lower(raw6); stop = toc(start); lower_last(K) = stop; end
strcmp_means = mean( [strcmp_nomatch, strcmp_first, strcmp_middle, strcmp_last] );
member_means = mean( [member_nomatch, member_first, member_middle, member_last] );
lower_means = mean( [lower_nomatch, lower_first, lower_middle, lower_last] );
T = array2table([strcmp_means; member_means; lower_means], ...
'VariableNames', {'no match', 'first', 'middle', 'last'}, ...
'RowNames', {'strcmpi', 'ismember', 'lower() only'} );
disp(T)
no match first middle last ________ __________ __________ _________ strcmpi 0.000768 0.00083788 0.00085612 0.0008514 ismember 0.020717 0.020665 0.020921 0.021554 lower() only 0.01916 0.018824 0.019362 0.019111
plot([strcmp_nomatch, strcmp_first, strcmp_middle, strcmp_last, ...
member_nomatch, member_first, member_middle, member_last, ...
lower_nomatch, lower_first, lower_middle, lower_last] );
legend( {'strcmpi no match', 'strcmpi first', 'strcmpi middle', 'strcmpi last', ...
'ismember no match', 'ismember first', 'ismember middle', 'ismember last', ...
'lower() no match', 'lower() first', 'lower() middle', 'lower() last' }, ...
'location', 'eastoutside');
What does this tell us? Well, it tells us that the great majority of the time difference between the strcmpi() approach and the ismember() approach is mostly due to the calls to lower() that are needed for the case where the cases of the values do not match. If it happened to be the case that the stored data in raw(:,6) was already all lower-case or all upper-case, then most of the time could be avoided.
(But if you look at the difference between the lower() only version and the ismember() time, you can see that the ismember() version must still have been slower than the strcmp()
  1 commentaire
Walter Roberson
Walter Roberson le 25 Oct 2022
And what if case sensitivity is used so we do not need to take into account converting case?
letters = ['A' : 'Z', 'a' : 'z', '0' : '9'];
numletters = numel(letters);
N = 1e5;
raw6 = cell(N, 1);
for K = 1 : N
raw6{K} = letters(randi(numletters, 1, randi(20)));
end
M = 25;
duid = letters(randi(numletters, 1, 10));
strcmp_nomatch = zeros(M,1);
member_nomatch = zeros(M,1);
for K = 1 : M; start = tic; strcmp(duid, raw6); stop = toc(start); strcmp_nomatch(K) = stop; end
for K = 1 : M; start = tic; ismember(duid, raw6); stop = toc(start); member_nomatch(K) = stop; end
duid = raw6{1};
strcmp_first = zeros(M,1);
member_first = zeros(M,1);
for K = 1 : M; start = tic; strcmp(duid, raw6); stop = toc(start); strcmp_first(K) = stop; end
for K = 1 : M; start = tic; ismember(lower(duid), lower(raw6)); stop = toc(start); member_first(K) = stop; end
duid = raw6{floor(end/2)};
strcmp_middle = zeros(M,1);
member_middle = zeros(M,1);
for K = 1 : M; start = tic; strcmp(duid, raw6); stop = toc(start); strcmp_middle(K) = stop; end
for K = 1 : M; start = tic; ismember(lower(duid), lower(raw6)); stop = toc(start); member_middle(K) = stop; end
duid = raw6{end};
strcmp_last = zeros(M,1);
member_last = zeros(M,1);
for K = 1 : M; start = tic; strcmp(duid, raw6); stop = toc(start); strcmp_last(K) = stop; end
for K = 1 : M; start = tic; ismember(lower(duid), lower(raw6)); stop = toc(start); member_last(K) = stop; end
strcmp_means = mean( [strcmp_nomatch, strcmp_first, strcmp_middle, strcmp_last] );
member_means = mean( [member_nomatch, member_first, member_middle, member_last] );
T = array2table([strcmp_means; member_means], ...
'VariableNames', {'no match', 'first', 'middle', 'last'}, ...
'RowNames', {'strcmpi', 'ismember'} );
disp(T)
no match first middle last __________ __________ _________ __________ strcmpi 0.00071792 0.00063516 0.0006624 0.00067148 ismember 0.001008 0.018796 0.01851 0.018458
plot([strcmp_nomatch, strcmp_first, strcmp_middle, strcmp_last, ...
member_nomatch, member_first, member_middle, member_last] );
legend( {'strcmp no match', 'strcmp first', 'strcmp middle', 'strcmp last', ...
'ismember no match', 'ismember first', 'ismember middle', 'ismember last'}, ...
'location', 'eastoutside');
You can see that strcmp() slows down as it has to search further to find the match, but that ismember() does not. But strcmp() is still more than twice as fast.

Connectez-vous pour commenter.

Catégories

En savoir plus sur Large Files and Big Data dans Help Center et File Exchange

Produits


Version

R2022a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by