Rearrange cell content by groups

There's got to be a simple way to do this...
I have a cell array of data by date wherein there are N (or fewer) instances of M quantities for each date. I need to rearrange it to reflect each of the M quantities for a given date on a single row. For example, given
Date Fund Shares Value
___________ _____________________________________________ _______________ ______
30-Nov-2015 'American International Growth & Income Fund' '67.958 shares' 1979.6
30-Nov-2015 'American Mutual Fund' '54.930 shares' 1997.3
30-Nov-2015 'Capital Income Builder' '34.843 shares' 1997.2
30-Nov-2015 'Europacific Growth Fund' '41.528 shares' 1988.8
30-Nov-2015 'Income Fund of America Fund' '95.102 shares' 2001.9
31-Dec-2015 'American International Growth & Income Fund' '67.958 shares' 1936.6
31-Dec-2015 'American Mutual Fund' '54.930 shares' 1952.1
31-Dec-2015 'Capital Income Builder' '34.843 shares' 1964.2
31-Dec-2015 'Europacific Growth Fund' '41.528 shares' 1941.1
31-Dec-2015 'Income Fund of America Fund' '95.102 shares' 1975.3
31-Jan-2016 'American International Growth & Income Fund' '67.958 shares' 1847.1
31-Jan-2016 'American Mutual Fund' '54.930 shares' 1894.5
31-Jan-2016 'Capital Income Builder' '34.843 shares' 1939.6
31-Jan-2016 'Europacific Growth Fund' '41.528 shares' 1823
31-Jan-2016 'Income Fund of America Fund' '95.102 shares' 1930.4
...
30-Sep-2017 'American International Growth & Income Fund' '67.958 shares' 44.02
30-Sep-2017 'Europacific Growth Fund' '41.528 shares' 101.21
31-Oct-2017 'American International Growth & Income Fund' '67.958 shares' 44.3
31-Oct-2017 'Europacific Growth Fund' '41.528 shares' 103.35
30-Nov-2017 'American International Growth & Income Fund' '67.958 shares' 44.3
30-Nov-2017 'Europacific Growth Fund' '41.528 shares' 103.35
one observes there are five Funds for each of the first few dates but only two at the end. The desired output would be
Date Fund1 Shares1 Value1 Fund2 Shares2 Value2 ...
___________ _________________ _______________ ______ _________________ _______________ ____________ ...
30-Nov-2015 'American Int...' '67.958 shares' 1979.6 'American Mut...' '54.930 shares' 1997.3 ...
31-Dec-2015 'American Int...' '67.958 shares' 1936.6 'American Mut...' '54.930 shares' 1952.1 ...
...
30-Nov-2017 'American Int...' '67.958 shares' 44.3 '' '' '' ...
where I truncated the lines, showing only the first two of five funds. Of course, there would be empty cells at the end of rows for dates where only two of the five funds are still extant.
It's easy enough to get grouping indices; I just haven't figured out the simple way to process them to build the output array from the grouping numbers.

1 comment

dpb
dpb on 23 Dec 2018
It's been a year since I posted... I had to drop the problem for many months owing to more pressing matters, until this past week I returned (with yet another year's data to add to the mix!) still with individual entries only documented as comments...
However, I did get the initial "stripper" completed last night and now do have all those from all years and all months from the inception of the capital campaign's initial gift, so I will be able to go through entry by entry and discover where the disconnect first occurred and, hopefully, that will point to the data entry error...
MC & HNY to all!!! Just wanted to let you know that the input did bear fruit, eventually... :)


Accepted Answer

Sean de Wolski
Sean de Wolski on 5 Jan 2018
Edited: Sean de Wolski on 5 Jan 2018
If you use the table datatype, which I strongly encourage over cells(!), then this is a simple one-liner. You can also specify the new variable names as an option if the defaults don't suit.
unstack(dpbtable, {'Shares', 'Value'}, 'Fund')
Attached is a MAT file with dpbtable.
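For instance, a minimal toy sketch of the naming option (the data and variable names here are hypothetical, not those in the attached MAT file):

```matlab
% Toy table with one data variable per (Date, Fund) pair
Date  = [datetime(2015,11,30); datetime(2015,11,30); datetime(2015,12,31)];
Fund  = categorical({'FundA'; 'FundB'; 'FundA'});
Value = [1979.6; 1997.3; 1936.6];
T = table(Date, Fund, Value);

% One row per date; names supplied instead of the auto-generated ones
W = unstack(T, 'Value', 'Fund', 'NewDataVariableNames', {'FundA_Value','FundB_Value'});
```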
More: Just for fun and because accumarray is my favorite MATLAB function, here's a cell version that gets roughly what you want:
dpbcell = table2cell(dpbtable)
% Unique of each date and fund
[udate,~,idxdate] = unique(vertcat(dpbcell{:,1}));
[ufund,~,idxfund] = unique(vertcat(dpbcell{:,2}));
% Accumulate date x fund
[num2cell(udate), accumarray([idxdate, idxfund], (1:numel(idxfund))', [], @(x){dpbcell(x, 2:4)}, {nan})]

13 comments

Cedric
Cedric on 5 Jan 2018
Ah! This looks immediately more concise and clean!
Sean, do you know if tables will be implemented as a built-in type? I am a bit ambivalent about them now, because they suffer limitations inherent to MATLAB OOP (associated with indexing, overloading SUBSASGN/REF, NUMEL/nargout, and nesting).
Sean de Wolski
Sean de Wolski on 5 Jan 2018
Edited: Sean de Wolski on 5 Jan 2018
I don't know. Is performance with subsref/etc. the limitation you see with the MATLAB class version or something else? Parfor and C code generation support are the two reasons I've had to revert but otherwise use them for pretty much everything now.
Peter Perkins, who often posts on here, is the lead developer for tables and timetables; he would probably know or take feedback.
Cedric
Cedric on 5 Jan 2018
Edited: Cedric on 5 Jan 2018
Limitations with CSL outputs and nesting. Some are listed in the header of the overloaded subsref and subsasgn in the class def. of tabular. (It is another topic, but if you look at the overload of numel, I think that it will explain [but not answer] several questions that we had on the forum, and point at undocumented mechanisms.)
I have code somewhere that was testing this, maybe on MATLAB 2016a/b, and I'll have to dig it out of my backups. I checked quickly with 2018a-pre, but I can't seem to reproduce what I observed earlier.
I also had indexing issues when I was storing tables as properties of my own classes, but again this was with older MATLAB versions and I haven't tested since.
dpb
dpb on 7 Jan 2018
I was unaware of unstack (I had just within the month been able to upgrade from R2012b, which predates it), so thanks for making me aware of it.
The one issue w/ table is that most of the datasheets aren't regular, so the raw output from xlsread is the basic starting point. Granted, once one manages to cull out the desired pieces, one can then put them into a table, which is often helpful, particularly as it continues to get enhanced.
Sean de Wolski
Sean de Wolski on 8 Jan 2018
I've completely abandoned xlsread. I find between readtable (with spreadsheetimportoptions) and the import tool for interactive things, I can usually get what I need.
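For example, a minimal sketch of that workflow (the file, sheet, and variable names below are placeholders, not the actual workbooks discussed here):

```matlab
% Detect options once, tweak them, then reuse for each month's file
opts = detectImportOptions('ledger.xlsx', 'Sheet', 'Nov17');
opts.DataRange = 'A3';                       % skip title rows, if any
opts = setvartype(opts, 'Value', 'double');  % force a column's type
T = readtable('ledger.xlsx', opts);
```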
dpb
dpb on 8 Jan 2018
Edited: dpb on 9 Jan 2018
If the data are tabular and reasonably regular, one can generally get there from here. However, these particular spreadsheets aren't so much so; the basic data are, but there are multiple sections within each sheet with small but significant differences between them, plus each month may have a different number of sections/data per section as new funds and/or investment pools are created or removed. The only way I've found to deal with these programmatically(*) is to just start with the raw data and find the various pieces via lookup for keys.
I've made a local version of XLSREAD that doesn't do the trimming on the numeric and text returns the builtin version does, but returns the same-sized array for each output; this makes lookups between the data types consistent in location, which helps a bunch. In the original there is always a (variable) offset from (say) the text array to the raw and/or numeric arrays, so locating where one particular element was within the spreadsheet is problematic. This is particularly a pit(proverbial)a(ppendage) since none of the search functions are tolerant of different data types within the target; hence one has to do string searches in the text array, or go through excessive gyrations to remove/convert all non-text elements from the raw array first (or vice versa if searching for numerics). Simply indexing on locations doesn't work either, as the result of the location search is based on the length of the resulting logical addressing vector, which has the aforementioned issue of being inconsistent in its origin relative to the actual location in the worksheet.
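A sketch of the same-sized-outputs idea that avoids patching xlsread itself: derive both padded arrays from raw, so an index found in one applies directly to the others. (File, sheet, and the 'Fund' search key are placeholders.)

```matlab
[~, ~, raw] = xlsread('ledger.xlsx', 'Nov17');   % placeholder file/sheet

% Numeric view: NaN everywhere a cell isn't a numeric scalar
numData = nan(size(raw));
isNum   = cellfun(@(x) isnumeric(x) && isscalar(x), raw);
numData(isNum) = cell2mat(raw(isNum));

% Text view: '' everywhere a cell isn't char
txtData = repmat({''}, size(raw));
isTxt   = cellfun(@ischar, raw);
txtData(isTxt) = raw(isTxt);

% Now a search in one view yields row/col valid in all three
[r, c] = find(strcmp(txtData, 'Fund'));   % hypothetical key lookup
```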
I just ran into another thing with readtable this AM that I was going to post to the "What's Missing" thread Walter started -- it's lacking a number of the options needed, particularly in this case on the text side rather than spreadsheet; the 'CommentStyle' parameter, for one...
(*) And, since there's one of these workbooks for each calendar month and I am interested in multiple years, the option of interactively importing each is not attractive at all. Within each month there are also some six working sheets of the same format, each with a different level of processing through the accounting month. (And, yes, this is no way to run a railroad, but it is the system that has evolved over some 20 years, and while we're working on transitioning to a true accounting system, that will take some time yet before we can get there.)
Sean de Wolski
Sean de Wolski on 9 Jan 2018
Can you provide a representative anonymized Excel sheet?
dpb
dpb on 10 Jan 2018
Edited: dpb on 11 Jan 2018
Well, it dawned on me it wasn't too difficult to write a script after all; here's a sample of two sheets selected from two different years pasted together into one spreadsheet. The names of donors have been randomized, but I left in the subsection headings and some of the generic fund descriptions/names so it makes a little sense in reading what is there; totally randomized, even I couldn't figure out what I was looking at!!! :)
What I am interested in now most specifically is the data from the capital campaign pools, as the way the bookkeeper chose to enter gifts was by adding comments in the cell instead of creating a separate table or some other way. Hence, if there's a balancing problem, as there is now, the only way to track it down is by manually reading the comments, which are visible only by hovering over the particular cell or by showing all comments, which then occludes almost the whole rest of the sheet. I was stripping out the comment fields(*) and building a usable database from them -- oh, and that alone kills readtable because it won't return comments anyway.
But, if you could illustrate readtable on the data as it exists, that would be good too, as I will eventually be able to move the actual data out of the comments.
(*) That's the other thing the local copy of XLSREAD does; it returns another optional cell array, the size of the others, that contains the comments from each cell; I couldn't make the optional function argument of the original work for that purpose.
dpb
dpb on 11 Jan 2018
Edited: dpb on 11 Jan 2018
Well, the above is so for the actual cell data; "not so much" for the comments... I'll have to work some more to anonymize them or remove them entirely, although I'd be interested to see whether you have alternate ways to deal with their existence -- even if not at present, it illustrates potential useful enhancements TMW could at least think about.
OK, hopefully the attached is mostly clean -- I re-edited comments in the NOV17 sheet for the campaign totals, then inadvertently overwrote the edits, so I only saved one cell's worth; that should be enough to get the picture.
I wrote my function as a first pass to be specific; I'll have to work on it some to make it generic; bear with me for a while...
Sean de Wolski
Sean de Wolski on 12 Jan 2018
I'll try to look into this early next week.
dpb
dpb on 12 Jan 2018
Appreciate your efforts a bunch, Sean... the November worksheet didn't balance, and it's taken the bookkeeper an incredible amount of time to track down why; obviously, having data buried in the comments is a major reason it is so difficult, plus there's just a tremendous excess of complexity and redundancy in the whole mess.
If I can just get to the point of building a table from the comments and being able to use it for data entry, and populate the rest of the spreadsheet from it for new gifts, it'll be a major step forward. If I could dump XLSREAD in favor of READTABLE, or an augmented version thereof, it would likely make things easier on my end; knowing virtually nothing about Excel/VBA, I've found the learning curve for writing Excel-only solutions even more painful...
Sean de Wolski
Sean de Wolski on 16 Jan 2018
readtable, even with supplied options, is embarrassingly slow for those sheets. I'll be reporting it to dev.
Using the import tool, I was able to pull all of the data into a table pretty quickly and generate the function to automate it for any sheet that looks like this. The advantage is that the data come in as strings, numeric, and categorical automatically and you could perhaps take advantage of exclusion/default rules.
Unfortunately, I don't know that this offers much over just managing xlsread/raw yourself, especially if the sheets have changing columns.
The generated function is attached.
dpb
OK, again, thanks for taking a look, Sean. Hopefully, if nothing else, it can serve as useful fodder for dev. to improve the routine.
The penchant for the import tool to build functions with hardcoded column numbers is an issue; over the years, additional columns have been inserted here and there, either to contain some additional information or, more often than not, just to add some spacing for aesthetics. Consequently, I'm reduced to having to search for column headers to locate which is where. That's not too bad; at least the headings of the data columns that are carried over haven't changed over the time frame I've had to work with, and that goes back a number of years.
Below is a snippet that really is all I've done uniquely; everything else is the home-brew xlsread routine that I just copied in order to add the optional TRIM flag to keep consistency between the various types.
% parse data into numeric and string arrays, keep sizes commensurate
trimData = false;
[numericData, textData] = SplitNumericAndText(rawData, trimData);
commentData = xlsComments(Excel, rangeObject);
% Would be more efficient to find stuff inside the range rather than read
% all, but Matlab syntax is much simpler when don't know VBA/Object Model
% So, first find the Pool heading rows, columns
[ir, ic] = cellfun(@(txt) findtxtincell(txt, textData), pools, 'uniform', 0);
for i = 1:length(ir)                  % for each pool we found
    rr = [];                          % initialize accumulator for funds
    j = ir{i}+1;                      % begin with row after pool name
    while ~isempty(textData{j,2})     % find account rows in pool
        rr = [rr; j];                 % any non-empty row is an account
        j = j+1;
    end
    rows{i,1} = [ir{i}; rr];          % save row indices for each pool
end
idx = cell2mat(rows);                 % get indices as addressing vector
textData    = textData(idx,:);
numericData = numericData(idx,:);
rawData     = rawData(idx,:);
commentData = commentData(idx,:);
if ~isempty(customFun)
    customOutput = customOutput(idx,:);
end
The section returns the desired Pool subset(s) plus the comments associated with them. At this point it is then trivial to build the table representation as the returned output is now regular and includes the pieces wanted by identifiable section indices.
It took me a while to work out the VBA-equivalent for the COM interface to get at the comments but I posted a Question and the subsequent solution on that separately.
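For reference, a minimal sketch of the comment-retrieval idea over the COM interface (the path and sheet name are placeholders; property names follow the Excel object model's Range.Comment.Text, not the actual posted solution):

```matlab
Excel = actxserver('Excel.Application');
wb    = Excel.Workbooks.Open('C:\data\ledger.xlsx');   % placeholder path
sheet = wb.Sheets.Item('Nov17');                       % placeholder sheet
rng   = sheet.UsedRange;

% One cell of text per worksheet cell; '' where there is no comment
comments = repmat({''}, rng.Rows.Count, rng.Columns.Count);
for r = 1:rng.Rows.Count
    for c = 1:rng.Columns.Count
        cmt = rng.Cells.Item(r, c).Comment;   % empty if no comment
        if ~isempty(cmt)
            comments{r, c} = cmt.Text;
        end
    end
end
wb.Close(false);
Excel.Quit;
```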
Again, thanks a lot for the time and interest...


More Answers (1)

Cedric
Cedric on 31 Dec 2017
Edited: Cedric on 31 Dec 2017
Hi dpb,
Assuming that you have a cell array (and that you are using this table type of output just for a display purpose), here is a "neither-too-elegant-nor-too-satisfactory" approach:
data = {'10/8/17','A',10,100; '10/8/17','B',8,30; ...
'11/8/17','E',17,70; '11/8/17','A',14,80; '11/8/17','C',5,110; ...
'12/8/17','B',9,50} ;
[groupId, dates] = findgroups( data(:,1) ) ;
% - Build row/col IDs in output array.
rowId = reshape( (groupId * ones(1,3)).', [], 1 ) ;
colId = arrayfun( @(n)1+(1:3*n), accumarray(groupId, 1), 'Unif', 0 ) ;
colId = [colId{:}].' ;
% - Build output array.
outData = cell( max(rowId), max(colId) ) ;
outData(:,1) = dates ;
outData(sub2ind( size(outData), rowId, colId )) = reshape( data(:,2:end).', [], 1 ) ;
where the input data is:
>> data
data =
6×4 cell array
{'10/8/17'} {'A'} {[10]} {[100]}
{'10/8/17'} {'B'} {[ 8]} {[ 30]}
{'11/8/17'} {'E'} {[17]} {[ 70]}
{'11/8/17'} {'A'} {[14]} {[ 80]}
{'11/8/17'} {'C'} {[ 5]} {[110]}
{'12/8/17'} {'B'} {[ 9]} {[ 50]}
and the output is:
>> outData
outData =
3×10 cell array
{'10/8/17'} {'A'} {[10]} {[100]} {'B' } {[ 8]} {[ 30]} {0×0 double} {0×0 double} {0×0 double}
{'11/8/17'} {'E'} {[17]} {[ 70]} {'A' } {[ 14]} {[ 80]} {'C' } {[ 5]} {[ 110]}
{'12/8/17'} {'B'} {[ 9]} {[ 50]} {0×0 double} {0×0 double} {0×0 double} {0×0 double} {0×0 double} {0×0 double}
Hoping that you'll get a more elegant approach...
Happy new year!
Cedric
EDIT 1 : Using SPLITAPPLY can help to some extent:
>> outData = splitapply( @(x){reshape(x.',[],1).'}, data(:,2:end), groupId )
outData =
3×1 cell array
{1×6 cell}
{1×9 cell}
{1×3 cell}
but then we must implement padding for the concatenation.
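One way to sketch that padding step, reusing groupId and dates from the code above (a hedged sketch, not the attached M-File):

```matlab
% Flatten each group to one row, pad each row to the widest, concatenate
rows  = splitapply( @(x){reshape(x.',[],1).'}, data(:,2:end), groupId ) ;
width = max( cellfun(@numel, rows) ) ;
rows  = cellfun( @(r)[r, cell(1, width-numel(r))], rows, 'Unif', 0 ) ;
outData = [dates, vertcat(rows{:})] ;   % dates back in column 1
```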

7 comments

dpb
Cedric -- Yes, you read what was sorta' said: the initial point is a cell array, and the table was just a convenient output format.
I was working along those lines but got lost in the weeds before I got the result. :) It seems like there should be a parallel accumcell to accumarray that would let one build a set of indexing vectors from the group indices and use them directly.
It seems to me that the grouping computations/arranging, etc., are woefully inadequate without way too many detailed gyrations. It's been a long time now since I actually used it, but ISTR that I didn't have nearly these issues in SAS with doing such, but maybe it's just the haze of distance that makes it seem as though it was simpler and I've just forgotten -- or mayhaps I had just learned it more thoroughly than I've yet mastered this part of Matlab.
Anyways, I had coded it up just using a looping construct, but it just seemed as though there should be a "straightahead" vectorized solution. (Altho I ended up converting to a table for the named access from the cell by the time I got here...)
% Rearrange and consolidate by date; first strip out redundant info into table
EdJW = cell2table([EdJWeeast(:,1), strip(EdJWeeast(:,3)), EdJWeeast(:,5), EdJWeeast(:,7)], 'VariableNames', hdrs);
[GD, IDD] = findgroups(EdJW.Date);       % grouping variable by date
[GF, IDF] = findgroups(EdJW.Fund);       % and by fund name
nGD = histcounts(GD, length(IDD));       % counts for each date
T = cell(length(IDD), 1+3*length(IDF));  % empty cell array to fill
ixF = 0;                                 % counter for funds index
for i = 1:length(IDD)                    % each date found
    T(i,1) = {IDD(i)};
    for j = 1:nGD(i)
        ixF = ixF+1;
        T(i, 3*GF(ixF)-1) = EdJW.Fund(ixF);
        T(i, 3*GF(ixF)  ) = EdJW.Shares(ixF);
        T(i, 3*GF(ixF)+1) = {EdJW.Value(ixF)};
    end
end
Cedric
Cedric on 31 Dec 2017
I agree that there should be a builtin for advanced reshaping by group. I updated my code (attached) a little, using UNIQUE with the 'stable' arg, because FINDGROUPS forcefully sorts. It seems to output what you are after.
dpb
dpb on 1 Jan 2018
Edited: dpb on 1 Jan 2018
Thanks, Cedric; the arrayfun over accumarray step is where I got myself flummoxed in my first attempt at something very similar, which prompted me to post the query... very good!!
HNY!!! --dpb
PS. I think I will submit the enhancement request for accumcell and see where it goes...
I also realized overnight that with SAS one doesn't have to deal with the intimacies of the internal data manipulations nearly so much; you can create/have such a disparate DATASET as in the ML table and just PROC it with BY clauses and it all just happens. Not sure if there's ever the possibility of TMW getting to that kind of integration or not; the two fundamental bases for what the end intent was were so far removed from each other at inception it's hard to see how. ML has "just growed" into something far removed from its initial goal but to retain those features is somewhat akin to the problems integrating OOP and other modern constructs into Fortran yet retain backward compatibility there as well...choices that could be made for new features just can't be accepted because it would totally invalidate too much extant code.
Cedric
Cedric on 2 Jan 2018
Edited: Cedric on 2 Jan 2018
My pleasure, dpb! Note that one of the most elegant/concise ways to start is probably:
[~, dateId, groupId] = unique( data(:,1), 'stable' ) ;
outData = splitapply( @(x){reshape(x.',[],1).'}, data(:,2:end), groupId )
If you run this on data as defined in the M-File above, you will see that you get all the rows in a cell array (minus the first column, which is trivial). Yet we must then obviously pad this output before we can concatenate it (or at least perform some processing). Surprisingly, this takes several lines of code to perform, so overall I would not say that it is really cleaner!
Cheers!
Cedric
dpb
dpb on 3 Jan 2018
splitapply is still a struggle for me; it's hard to think of how to apply it for more than the most straightforward cases shown in the documentation (one of the problems in the doc is that the most obvious uses are often the only examples given).
I was looking at another poster's query a few days ago (and I need to get back and see if anyone ever did post a working solution) that needed a two-tier grouping; I couldn't seem to find a way to write a nested expression that would work correctly.
I don't know what the correct answer is, but I can't help but think there's a better overall approach than the tack TMW has headed down--but again, it may be that the constraints of adding it on top of existing Matlab means such higher abstractions just aren't feasible/possible.
Cedric
Cedric on 3 Jan 2018
Edited: Cedric on 3 Jan 2018
I have the same feeling, there should be a better approach!
About SPLITAPPLY, I think that it is not intuitive because, in the beginning, we have a hard time guessing how the input array is split and to what the function passed as 1st arg is applied. A good way to check that out is to DISP it with no output arg:
>> splitapply( @disp, data(:,2:end), groupId )
'American International Growt…' '67.958 shares' [1.9796e+03]
'American Mutual Fund' '54.930 shares' [1.9973e+03]
'Capital Income Builder' '34.843 shares' [1.9972e+03]
'Europacific Growth Fund' '41.528 shares' [1.9888e+03]
'Income Fund of America Fund' '95.102 shares' [2.0019e+03]
'American International Growt…' '67.958 shares' [1.9366e+03]
'American Mutual Fund' '54.930 shares' [1.9521e+03]
'Capital Income Builder' '34.843 shares' [1.9642e+03]
'Europacific Growth Fund' '41.528 shares' [1.9411e+03]
'Income Fund of America Fund' '95.102 shares' [1.9753e+03]
'American International Growt…' '67.958 shares' [1.8471e+03]
'American Mutual Fund' '54.930 shares' [1.8945e+03]
'Capital Income Builder' '34.843 shares' [1.9396e+03]
'Europacific Growth Fund' '41.528 shares' [ 1823]
'Income Fund of America Fund' '95.102 shares' [1.9304e+03]
'American International Growt…' '67.958 shares' [ 44.0200]
'Europacific Growth Fund' '41.528 shares' [101.2100]
'American International Growt…' '67.958 shares' [ 44.3000]
'Europacific Growth Fund' '41.528 shares' [103.3500]
'American International Growt…' '67.958 shares' [ 44.3000]
'Europacific Growth Fund' '41.528 shares' [103.3500]
Seeing this, it becomes obvious why I used:
@(x){reshape(x.',[],1).'}
in my previous comment, and why I pass all columns but the first.
dpb
dpb on 4 Jan 2018
Yeah, I often do that disp "trick" with all the various XXXXfun incantations to discover just what the argument list passed to the (often anonymous) function really is -- it isn't always all that intuitive, as you point out.

