using acummarray to average several columns at a time?

Hello
I have a data array (mat) with the following dimensions: 149016x93
The columns are
year | month | day | hour | data 1 | data 2 | data 3 | and so on until data 89
2001 | 1 | 1 | 0 | random numbers ...
... | ... | ... | ... | random numbers ...
2017 | 12 | 31 | 23 | random numbers ...
The data is random and it is what I want to average.
I found this example (MathWorks example) and it is fine, however I've been strugling in how to run it over column 5 to 93...
[ah,~,ch] = unique(mat(:,2:4),'rows');
hraverage = [ah,accumarray(ch,mat(:,5),[],@nanmean)];
My problem is that I'm not being able to have as an output the 8784x93 array, only an 8784* x 4, I've tried loops but i'm missing something that I am not aware of...
*The dataset has several years of data. I want the hour average for each each day of the year. So it's 366 days * 24hours = 8784
for the sake of example, please feel free to consider a smaller array.
thank you for the attention! will keep digging on this...
sample data in attachment. randomly generated:
4 first collumns are: year, month, day, hour, and columns 5 to 7 are data columns.
the final result should be a 8784x7 file.

3 commentaires

Why 8784 out of 14901 ?
Osnofa
Osnofa le 29 Déc 2018
Modifié(e) : Osnofa le 29 Déc 2018
great question. missed that explanation in the opening post.
The dataset has several years of data. I want the hour average for each day of the year. So it's 366 days * 24hours = 8784
that is why I use:
[ah,~,ch] = unique(mat(:,2:4),'rows');
column 2 is month,3 is days and 4 is hours.
dpb
dpb le 29 Déc 2018
Convert to timetable and use retime and/or findgroups/splitapply pair

Connectez-vous pour commenter.

 Réponse acceptée

There's a simpler way of doing this with groupsummary
I imported the sampledata and made the table with 7 columns: Year,Month,Day,Hour,VarName5,VarName6,VarName7
Then used the following commands to take advantage of binning in groupsummary and of being able to include empty groups:
sampledata.Time = datetime(sampledata.Year,sampledata.Month,sampledata.Day)
result = groupsummary(sampledata,{'Time','Hour'},{'dayofyear','none'},'mean',{'VarName5','VarName6','VarName7'},'IncludeEmptyGroups',1)

2 commentaires

I didn't know about this function. So is this kind of like grpstats() but it's in base MATLAB so you don't need the Stats toolbox, and it has additional computations?
It's a fairly new function from R2018a in base MATLAB. Yes it is very similar to grpstats but should be nicer to use for tables.
The extra options it has relate to binning, missing data and empty groups.

Connectez-vous pour commenter.

Plus de réponses (2)

dpb
dpb le 30 Déc 2018
t=readtable('sampledata.xls','ReadVariableNames',0); % read file to table
t.Properties.VariableNames={'Year','Month','Day','Hour','D1','D2','D3'}; % convenient names
tt=table2timetable(t(:,5:end),'RowTimes',datetime(t.Year,t.Month,t.Day,t.Hour,0,0)); % to timetable
mnDaily=retime(tt,'daily','mean'); % averages by day
See what we gots...
>> mnDaily(1:4,:)
ans =
4×3 timetable
Time D1 D2 D3
____________________ ______ ______ ______
01-Jan-2001 00:00:00 701 173.5 639.5
02-Jan-2001 00:00:00 223 614 484
03-Jan-2001 00:00:00 642.33 196 598.33
04-Jan-2001 00:00:00 318 243.25 534.5
>>
For real case with a very large number of variables, rather than naming them all sequentially, I'd sugget reading the spreasheet data as an array and build the table from it instead...
data=xlsread('sampledata.xls');
t=table(datetime(data(:,1),data(:,2),data(:,3),data(:,4),0,0),data(:,5:end));
t.Properties.VariableNames={'Date','Data'};
tt=table2timetable(t);
mnDaily=retime(tt,'daily','mean');
and will have same result excepting the means will be an array instead of individual variables.

5 commentaires

dpb
dpb le 30 Déc 2018
BTW, NB: the use of the text string 'mean' for the agglomeration function in retime; with the array there seems to be what appears to me at first blush to be a problem; using @mean or @nanmean failed with an indication the function didn't return a commensurately-size row vector which is just patently false, both by defaul operate over the first dimension of the input array -- unless somehow internally retime isn't returning an array of the size but a vector of the results, maybe??? I've got stuff I needs to be getting done so don't have time to try to get to the bottom of that, now...but since the builtins don't return NaN elements automagically, you should be "good to go"...
dpb
dpb le 31 Déc 2018
I see did want hourly instead of daily--just change the argument to retime...
dpb, usually that's an indication of a group with only one row, in which case the default behavior of mean is to return the vector's mean. That's one of the reasons for 'mean' -- it accounts for that. Otherwise, you'd need @(x) mean(x,1).
dpb
dpb le 2 Jan 2019
Modifié(e) : dpb le 3 Jan 2019
Ah...you beat me to it, Peter! It came to me while doing the other mind-numbing data cleanup task that had taken break from and just came back to comment on the "why"...
I didn't double-check for absolute certain of whether was one or none but there were a lot of NaN elements in the sample data set and I suspect there was at least one subset that turned out empty altho the symptom also fits the one-row scenario.
Hmm....that's an interesting result...
mean([])
returns NaN, not []. What's the logic in that?
Answers Self... :)
Makes the PP example work in returning vector result...
A little bit of shameless blog promotion for that exact use case:

Connectez-vous pour commenter.

Image Analyst
Image Analyst le 29 Déc 2018

1 vote

Why not simply use grpstats() if you have the Statistics and Machine Learning Toolbox?
Attach your data if you need help.

7 commentaires

Osnofa
Osnofa le 29 Déc 2018
Modifié(e) : Osnofa le 29 Déc 2018
thanks for the tip.
I'm going to check it. meanwhile, I've generated a smaller file and attached it.
it is as 25000x7 file.
4 first collumns are: year, month, day, hour, and columns 5 to 7 are data columns.
the final result should be a 8784x7 array.
This seems to work fine, doesn't it?
data = xlsread('sampledata.xls');
% Extract data into conveniently named vectors.
years = data(:, 1);
months = data(:, 2);
days = data(:, 3);
hours = data(:, 4);
data1 = data(:, 5);
data2 = data(:, 6);
data3 = data(:, 7);
% Get yearly, monthly, daily, and hourly means for data1 column.
yearlyMeans = grpstats(data1, years);
monthlyMeans = grpstats(data1, months);
dailyMeans = grpstats(data1, days);
hourlyMeans = grpstats(data1, hours);
% Repeat for other columns of data.
Osnofa
Osnofa le 29 Déc 2018
Modifié(e) : Osnofa le 29 Déc 2018
Not quite what I need.
I need the average hourly value for each day of each month. 366 (days) * 24hours= 8784 number of data rows. And then I have the same problem of repeating the process for every other column (in this example only 3 but on my real file there are 89 data columns).
Get the hour of the month? You can figure that out can't you? You just multiply the day by 24 and add the hours. Simple, right?
hourOfTheMonth = (days - 1) * 24 + hours;
hourlyMeans = grpstats(data1, hourOfTheMonth);
Osnofa
Osnofa le 30 Déc 2018
Modifié(e) : Osnofa le 30 Déc 2018
well, maybe i'm not being clear.
I need the average* for hour 1 of day 1 of month 1, hour 2 of day 1 of month 1, ... , hour 23 of day 31 of month 12. 24*365 or *366 depending on the year.
using:
hourlyMeans = grpstats(data1, hours);
will provide the hourly means for the 24 hours. it will average the value of hour 1 without caring if it is day 1 of month 2 or day 10 of month 9. I will get a 24x1 array.
If you try this maybe it will be clear:
[ah,~,ch] = unique(mat(:,2:4),'rows');
hraverage = [ah,accumarray(ch,mat(:,5),[],@nanmean)];
mat is the data array I provided. you will no get a 8784x4 because some days/hours were not randomly generated, you will get 8391x4.
thank you for your time.
dpb
dpb le 30 Déc 2018
As noted above, convert the date column data to datetimes and put into a timetable and then use findgroups/splitapply pair or retime
This sort of thing is precisely what they're for...
Osnofa
Osnofa le 30 Déc 2018
will take a look into it, didn't notice yesterday. thanks for the reminder.

Connectez-vous pour commenter.

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by