Find and reduce a numeric array with identical columns

Dear Sir/Madam,
I would like to ask you the following question:
I have a data file like this
tmp = [...
121 12 6914 0.5625
122 -48 6853 0.29688
119 48 6914 0.17188
125 -12 6853 0.078125
125 4 6853 0.4375
119 5 6832 0.20313
119 4 6832 0.039063
119 -4 6832 0.023438]
I would like re-group (or reduce) it with following conditions:
For any row, if column 1 AND column 3 of this row is identical with any column 1 AND column 3 of any other row. Then reduce to one new row with new value of column 2, this new value of column 2 is the sum of original values of column 2. Column 1 is kept the same, Column 4 is not important.
So, for above data, I expect to have the answer:
119 5 6832 0.20313 % 5+4-4=5
122 -48 6853 0.29688
125 -8 6853 0.4375 % -12+4=-8
121 12 6914 0.5625
119 48 6914 0.17188
What Matlab command to use? I would greatly appreciate it if you left your code and running output.
I am using MATLAB R2014a.
Thank you very much

3 commentaires

Stephen23
Stephen23 le 30 Déc 2018
Modifié(e) : Stephen23 le 30 Déc 2018
The order of the rows in your output is not clearly defined. What is the rule to get that order?
For example, both 121 and 122 each only occur once in the first column, but in the output matrix are listed neither in the sequence that they occur in the input matrix, nor in numeric order. How is this order supposed to be determined?
I was wondering the same thing. Hopefully the order doesn't matter. I'm sure you could write the code afterwards in such a ways that it didn't matter.
John Smith
John Smith le 30 Déc 2018
Modifié(e) : John Smith le 30 Déc 2018
In tmp (data file ), the order of the rows were randomly inputed by hand, no order at all.

Connectez-vous pour commenter.

 Réponse acceptée

Stephen23
Stephen23 le 30 Déc 2018

0 votes

>> [~,X,Y] = unique(tmp(:,[1,3]),'rows');
>> out = tmp(X,:);
>> out(:,2) = accumarray(Y,tmp(:,2),[],@sum)
out =
119.000000 5.000000 6832.000000 0.023438
119.000000 48.000000 6914.000000 0.171880
121.000000 12.000000 6914.000000 0.562500
122.000000 -48.000000 6853.000000 0.296880
125.000000 -8.000000 6853.000000 0.437500

7 commentaires

John Smith
John Smith le 30 Déc 2018
Modifié(e) : John Smith le 30 Déc 2018
Thank you very much Stephen,
I need further the output file "out" to be re-rewriten as classified by the value of first column and column 3 of "out":
  1. If the values of the third column are not the same. then separated them: according to the value of third column:
out_new(:,:,1) =
119 5 6832 0.20313
0 0 0 0
out_new(:,:,2) =
122 -48 6853 0.29688
125 -8 6853 0.078125
out_new(:,:,3) =
119 48 6914 0.17188
121 12 6914 0.5625
2 then the output rewritten according to the value of first column and aligned all group according to the value of the first column.
3. The value of column 1 should be decreased order in value.
I expect the output file as the following format:
125 -8 6853 0.078125
122 -48 6853 0.29688
121 12 6914 0.5625
119 5 6832 0.20313 48 6914 0.17188
By the way, my code to obtain out_new is (may be not so good:)
length(out(:,1));
[C, ia, ic] = unique(out(:,3),'sorted');
L = length(C)
for j=1 : L
k=0;
for i=1 : length(out(:,1))
if C(j) == out(i,3)
k=k+1;
out_new(k,1,j) = out(i,1) ;
out_new(k,2,j) = out(i,2) ;
out_new(k,3,j) = out(i,3) ;
out_new(k,4,j) = out(i,4) ;
end
end
end
format shortG
out_new
Thank you very much
Wow, this is starting to get very quirky and complicated. Can't you take it from here? Why do you need this super specialized format anyway? What's the use case?
John Smith
John Smith le 31 Déc 2018
Modifié(e) : John Smith le 1 Jan 2019
Dear Sir/Madam,
Actually, I also need column 3 to be converted to date, then the requried output format is more meaningful: (column 4 is the costs of options)
out(:,3)
6832
6914
6914
6853
6853
datestr(out(:,3))
ans =
14-Sep-0018
05-Dec-0018
05-Dec-0018
05-Oct-0018
05-Oct-0018
then the requried output format should be:
125 -8 05-Oct-0018 0.078125
122 -48 05-Oct-0018 0.29688
121 12 05-Dec-0018 0.5625
119 5 14-Sep_0018 0.20313 48 05-Dec-0018 0.17188
Now you can see, the horizontal axis is time, the vertical axis (column 1) is strik-price of option.
In this way my option trading positions are:
e. g. for 119 strike price, I have two kinds of options: bought 5 options with expiration date 14-September, and bought 48 options with expirarion date 05-Dec.
for 122 strike price option, I have only one kind option: sold 48 options with expiration date 05-Oct.
That is why this kind of output format in important.
Thank you very much for your help.
[~,X,Y] = unique(tmp(:,[1,3]),'rows');
out = tmp(X,:);
out(:,2) = accumarray(Y,tmp(:,2),[],@sum)
% Convert to character:
baz = num2cell(out(:,2:4)).';
baz(2,:) = cellstr(datestr(out(:,3)));
baz = regexp(sprintf('%3d %s %g\v',baz{:}),'[^\v]+','match');
% Arrange into matrix:
[U,~,X1] = unique(out(:,1));
[V,~,X3] = unique(out(:,3));
C = accumarray([X1,X3],baz,[],@(s)s);
L = cellfun('length',C(:));
C(L==0) = {''};
C = flipud([num2cell(U),C]).'
% Print:
fmt = sprintf('%%-%ds',2+max(L));
fmt = repmat(fmt,1,numel(V));
fmt = ['%4d',fmt,'\n'];
fprintf(fmt,C{:})
Prints this:
125 -8 05-Oct-0018 0.4375
122 -48 05-Oct-0018 0.29688
121 12 05-Dec-0018 0.5625
119 5 14-Sep-0018 0.023438 48 05-Dec-0018 0.17188
John Smith
John Smith le 1 Jan 2019
Modifié(e) : Image Analyst le 1 Jan 2019
Dear Stephen,
Thank you very much for your help, and Happy NEW YEAR 2019. Running this:
tmp = [...
121 12 6914 0.5625
122 -48 6853 0.29688
119 48 6914 0.17188
125 -12 6853 0.078125
125 4 6853 0.4375
119 5 6832 0.20313
119 4 6832 0.039063
119 -4 6832 0.023438]
[~,X,Y] = unique(tmp(:,[1,3]),'rows');
out = tmp(X,:);
out(:,2) = accumarray(Y,tmp(:,2),[],@sum)
% Convert to character:
baz = num2cell(out(:,2:4)).';
baz(2,:) = cellstr(datestr(out(:,3)));
baz = regexp(sprintf('%3d %s %g\v',baz{:}),'[^\v]+','match');
% Arrange into matrix:
[U,~,X1] = unique(out(:,1));
[V,~,X3] = unique(out(:,3));
C = accumarray([X1,X3],baz,[],@(s)s);
L = cellfun('length',C(:));
C(L==0) = {''};
C = flipud([num2cell(U),C]).'
% Print:
fmt = sprintf('%%-%ds',2+max(L));
fmt = repmat(fmt,1,numel(V));
fmt = ['%4d',fmt,'\n'];
fprintf(fmt,C{:})
I got this error:
Error using accumarray
Second input VAL must be a full numeric, logical, or char vector or scalar.
Error in main_ZN_Dec_7_2018_trade_4 (line 364)
C = accumarray([X1,X3],baz,[],@(s)s)
Stephen23
Stephen23 le 1 Jan 2019
Modifié(e) : Stephen23 le 1 Jan 2019
Replace the line with accumarray with these three lines:
S = max([X1,X3]);
C = cell(S);
C(sub2ind(S,X1,X3)) = baz;
and a Happy New Year!
John Smith
John Smith le 2 Jan 2019
Modifié(e) : John Smith le 3 Jan 2019
Dear Stephen,
By changing those three lines, you code works.
Back to the question in the very beginning, I said " Column 4 is not important". but now, I need to treat column 4 the same way as column 2:
tmp =
121 12 6914 0.5625
122 -48 6853 0.29688
119 48 6914 0.17188
125 -12 6853 0.078125
125 4 6853 0.4375
119 5 6832 0.20313
119 4 6832 0.039063
119 -4 6832 0.023438
out =
119 5 6832 0.265631 (%=0.20313+0.039063+0.023438)
119 48 6914 0.17188
121 12 6914 0.5625
122 -48 6853 0.29688
125 -8 6853 0.515625 (%=0.078125+0.4375)
how do you change your three line code:
[~,X,Y] = unique(tmp(:,[1,3]),'rows');
out = tmp(X,:);
out(:,2) = accumarray(Y,tmp(:,2),[],@sum)
I tried to modify the following way, it did not work:
out(:,2) = accumarray(Y,tmp(:,2), tmp(:,4), [],@sum)
However, when I use two lines (two time) then it worked:
out(:,2) = accumarray(Y,tmp(:,2),[],@sum)
out(:,4) = accumarray(Y,tmp(:,4),[],@sum)
Thank you very much.

Connectez-vous pour commenter.

Plus de réponses (1)

Image Analyst
Image Analyst le 30 Déc 2018
Modifié(e) : Image Analyst le 30 Déc 2018
What about using grpstats(), if you have the Statistics and Machine Learning Toolbox.
tmp = [...
121 12 6914 0.5625
122 -48 6853 0.29688
119 48 6914 0.17188
125 -12 6853 0.078125
125 4 6853 0.4375
119 5 6832 0.20313
119 4 6832 0.039063
119 -4 6832 0.023438]
col5 = 10000*tmp(:, 1) + tmp(:, 3)
tmp = [tmp, col5];
% No sum in grpstats, so have to do it twice.
% Once to get the mean and once to get the count.
outputMean = grpstats(tmp, tmp(:, 5), 'mean')
outputNumel = grpstats(tmp, tmp(:, 5), 'numel')
% Crop off temporary 5th column
output = outputMean(:, 1:4) % Initialize
% Column 2 is the sum = mean * count
output(:, 2) = outputMean(:, 2) .* outputNumel(:, 2)
The output seems to be sorted by the first column though:
output =
119 5 6832 0.088544
119 48 6914 0.17188
121 12 6914 0.5625
122 -48 6853 0.29688
125 -8 6853 0.25781
That might be a problem for you. I'm not sure. Of course column 4 can be cropped off or ignored since you say it's not important.

Produits

Version

R2014a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by