How to use map and reduce efficiently?

6 vues (au cours des 30 derniers jours)
Daniel Pinto
Daniel Pinto le 8 Août 2019
Modifié(e) : Daniel Pinto le 9 Août 2019
I have the following table, which has over 40 million rows and 5 columns:
Screenshot 2019-08-08 at 15.58.48.png
The first column is irrelevant. The second column is a YYYYMMDD date, and the frequency of the data is quarterly. The third column is a firmID - some firm IDs include letters as well as numbers. The fourth and fifth columns are values assigned to 2 different variables.
I wish to do 2 things:
1) for every rdate-cusip pair, sum shares across all different identifiers of mgrno that exist for that rdate-cusip combination. Call this value A.
2) for every rdate-cusip pair, obtain the mode value of shrout2 across the different identifiers of mgrno that exist for that rdate-cusip combination. Call this value B.
3) divide A by B.
This would normally be straightforward, but due to the big dimensions of the data, I am struggling to do it. I have tried to use the functions map and reduce, without really loading the file into the workspace, but I believe I am mkaing some kind of mistake. I was getting error messages trying to conduct the division inside the mapping phase, so I decided to skip the division and just have as output a table in which the first column is quarter-CUSIP identifier, second column is A, and third column is B.
ds = datastore('myFile.csv');
ds.TextscanFormats{3} = '%q';
ds.TextscanFormats{4} = '%q';
outds = mapreduce(ds, @gvkeyMapFun2, @gvkeyReduceFun2);
output = readall(outds);
where the functions are defined as
function gvkeyMapFun2(data, ~, intermKVStore)
% gets quarter variables
vQuarter = num2str(data.rdate); % char format
% gets cusip in char format
vNCUSIP = cell2mat(data.cusip);
% creates quarter-ncusip identifer
IDnum = strcat(vQuarter,vNCUSIP);
IDnum = cellstr(IDnum);
% finds unique NCUSIPS-quarter
[intermKeys,~,idx] = unique(IDnum, 'stable'); % intermKeys is cell of characters (some cusips have letters), idx is double
% gets variables of intersst
dataOwnership = cellfun(@(x) str2double(x),data.shares);
dataTotalShares = data.shrout2;
for ii = 1:numel(intermKeys)
totalOwnership = sum(dataOwnership(idx==ii));
totalShares = mode(dataTotalShares(idx==ii));
totalOwnershipInfo(ii,1:3) = [repmat(intermKeys(ii),size(totalOwnership,1),1), totalOwnership,repmat(totalShares,size(totalOwnership,1),1) ];
add(intermKVStore, intermKeys{ii}, totalOwnershipInfo);
end
end
and
function gvkeyReduceFun2(intermKey, intermValIter, outKVStore)
databasereducedFinal = array2table([]);
while hasnext(intermValIter)
databasereducedFinal = [databasereducedFinal; getnext(intermValIter)];
end
add(outKVStore, 'output', databasereducedFinal);
end
I then run
output = readall(outds);
c = vertcat(output{:, 2});
tableBig = vertcat(c{:});
to try and get the table because "output" looks like this:
Screenshot 2019-08-08 at 12.12.29.png
I feel this is still quite inefficient. Is there anyway do this more efficiently? (also, I believe there's some other mistake somewhere, because the final table "tableBig" is larger than I would expect given the possible number of unique CUSIP-quarters.
thank you.
  5 commentaires
Daniel Pinto
Daniel Pinto le 8 Août 2019
Thank you. Im trying splitapply. I ll update here on my progress.
Yes, it's weird imposing shares be read as text. so with map and replace, if I do not set
ds.TextscanFormats{4} = '%q';
I was getting an error message in the mapping phase . so i was doing that to avoid that error message, and then get it back to numeric:
dataOwnership = cellfun(@(x) str2double(x),data.shares);
Daniel Pinto
Daniel Pinto le 8 Août 2019
I am trying splitapply. Before doing the splitapply, I am trying to define a group based on a quarter-cusip identifier. to do that, i try to concatenate quarter and cusip in string format. the problem is that the strcat of those two is taking for hours now... my computer has 16gb of ram, shouldn't that be enough despite the fact that I am dealing with 40 million rows?

Connectez-vous pour commenter.

Réponse acceptée

Guillaume
Guillaume le 8 Août 2019
Assuming your shares variable is numeric and assuming your grouping variables are {'rdate', 'cusip'},
t = tall(ds);
[group, rdate, cusip] = findgroups(t.rdate, t.cusip);
shareratio = splitapply(@(shares, shrout2) sum(shares) / mode(shrout2), t.shares, t.shrout2, group);
result = gather(table(rdate, cusip, shareratio));
I was getting an error message
What was the error message. I would suspect that the a posteriori str2double conversion would really slow things down and really it shouldn't be necessary.
  5 commentaires
Guillaume
Guillaume le 9 Août 2019
Modifié(e) : Guillaume le 9 Août 2019
Your mapreduce algorithm is problematic as it assumes that you get all the values (sum and mode) associated with a key (pair of rdate, cusip) in just one map pass. If the data in your table is not sorted by the key, it certainly won't be the case and even if it is, it may be that a key straddles two calls to map.
E.g. let's assume your data is sorted and in one of the call to gvkeyMapFun3, you get:
rdata cusip shares shrout2
---------------------------------------------
... ... ... ...
... ... ... ...
'J' 'KK' 123 88
'X' 'YY' 111 1
'X' 'YY' 123 1
'X' 'YY' 245 2
'X' 'YY' 785 2
'X' 'YY' 951 1
On the next call to gvkeyMapFun3, you get:
rdata cusip shares shrout2
---------------------------------------------
'X' 'YY' 632 2
'X' 'YY' 421 3
'X' 'YY' 248 2
'X' 'YY' 892 3
'X' 'YY' 230 3
'Z' 'NN' 673 45
... ... ... ...
As you can see the mode of your key pair {'X', 'YY'} is 2 which you never got with your current algorithm.
Coping with that significantly complicate the mode calculation, as you can't calculate the mode in the mapping function. Instead you have to keep around the histogram of all the values of shrout2 and calculate the mode yourself in the reduce function, once you've combined all the histograms for the key.
Here is something that should work properly:
function gvkeyMapFun3(data, ~, intermKVStore)
keys = strcat(data.rdate, '-', data.cusip); %may be a better way to build your keys. Will be easier to separate afterwards
[intermKeys, ~, keyid] = unique(keys, 'stable'); %I don't see the point of 'stable'. Probably makes no difference
sharesum = accumarray(keyid, data.shares); %sum up the shares that match the same keyid
shrouthist = accumarray(keyid, data.shrout2, [], @tempmode); %compute histogram of shrout2 for each keyid
values = num2cell([num2cell(sharesum), shrouthist], 2); %for each key store a 1x2 cell array, 1st element is the share sum, 2nd element is a 2 column matrix [values, counts]
addmulti(intermKVStore, intermKeys, values);
end
%can be local in gvKeyMapFun3.m
function modehist = tempmode(shrout2)
%compute the histogram of shrout2, receives values that match just one key
%returns a scalar cell array containing a Nx2 matrix. First column is unique values, 2nd column is their histogram
[shroutval, ~, id] = unique(shrout2);
shroutcount = accumarray(id, 1);
modehist = {[shroutval, shroutcount]};
end
function gvkeyReduceFun3(intermKey, intermValIter, outKVStore)
sharesum = 0;
shrouthist = zeros(0, 2);
while hasnext(intermValIter)
values = getnext(intermValIter);
sharesum = sharesum + values{1};
shrouthist = [shrouthist; values{2}]; %#ok<AGROW>
end
%now that we have the full histogram of shrout2, we can calculate the mode
[allshrout, ~, id] = unique(shrouthist(:, 1));
count = accumarray(id, shrouthist(:, 2));
[~, moderow] = max(count);
shroutmode = allshrout(moderow);
result = sharesum / 1000 / shroutmode; %Not sure where the 1000 comes from. Was never mentioned before
add(outKVStore, intermKey, result);
end
Daniel Pinto
Daniel Pinto le 9 Août 2019
Modifié(e) : Daniel Pinto le 9 Août 2019
Thank you Guillaume for taking the time. It turns out in this case the correlation between the 2 outputs is 0.999, but you do make a very good point and I changed the code accordingly. This was very helpful and instructive, thank you!
ps: the multiplication by 1000 is just to scale the variable properly.

Connectez-vous pour commenter.

Plus de réponses (0)

Tags

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by