Identifying years with missing data and retaining years with maximum value

Poulomi Ganguli on 12 Aug 2017
Commented: dpb on 13 Aug 2017
Hello,
I have a matrix of meteorological data in the following format: the 1st column is the year (1993 to 2014), the 2nd is the month, the 3rd is the day, the 4th is the hour, and the last column is the value. The 4th column runs from 0 to 23, making a full day. A winter period is defined as running from October of one year to March of the next. Since the data are hourly, each season should have 4368 rows (= 24 hr × 182 days) for non-leap years and 4392 rows (= 24 hr × 183 days) for leap years. I need to check whether the winter period of successive years has more than 25% of the data available (1092 rows for non-leap years, 1098 for leap years). If less than 25% is available, I have to check in which half of the period (i.e., which year) more data are missing (in the present case it is 1993) and exclude that year's rows completely from the output file while retaining the next year's rows. Likewise, I have to check all successive years from 1993 to 2014.
1993 10 1 0 2.44
1993 10 1 1 2.04
1993 10 1 2 1.79
1993 10 1 3 1.72
1993 10 1 4 1.395
1993 10 1 5 1.154
1993 10 1 6 0.913
1993 10 1 7 0.672
1993 10 1 8 0.431
1993 10 1 9 0.19
1993 10 1 10 2.44
1993 10 1 11 2.04
1993 10 1 12 Nan
1993 10 1 13 Nan
1993 10 1 14 Nan
1993 10 1 15 Nan
1993 10 1 16 Nan
1993 10 1 17 Nan
1993 10 1 18 Nan
1993 10 1 19 Nan
1993 10 1 20 Nan
1993 10 1 21 Nan
1993 10 1 22 Nan
1993 10 1 23 Nan
...................................
...................................
1994 3 31 23 3.82
1994 3 31 23 3.9
1994 3 31 23 3.66
  4 comments
dpb on 13 Aug 2017
Edited: dpb on 13 Aug 2017
Well, actually I've been off 'spearmint-ing with the timeseries object, which, while it isn't really all that new, I've never actually used. It seemed as though it should be suited to such manipulations...
Of course, it's not terribly difficult to simply use datetime or datenum and do the manipulations directly on the times, but I thought I'd see if the timeseries actually has anything to add here. A start along the way is in the Answer, albeit not complete as yet, but I think it should lead to a solution if you pursue that path.
However, you really didn't answer the question regarding the 50% rule you gave; you did answer the "which year does the data belong?" question so I'll just presume the time during which data are missing within the season actually is immaterial -- seems to me it has to be, anyway. You can adjust however you see fit if there is some other reason/pattern that is significant.
dpb on 13 Aug 2017
...
1994 3 31 23 3.82
1994 3 31 23 3.9
1994 3 31 23 3.66
Whassup w/ that? Are there really such duplicates in the file or is that just an error in the post that those are supposed to be hours 21, 22, 23?

Accepted Answer

dpb on 13 Aug 2017
As noted in the comment, I wondered about the timeseries object's ability to help solve the problem, so I began with a trial...
>> ts=timeseries(abs(randn(2000,1))); % form a test time series of 2000 points
>> ts.TimeInfo.StartDate='01-Jan-1993'; % and set a fixed calendar start time to match
>> ts.TimeInfo.Units='days'; % set the interval to days so will cover some years
Leaving off the semicolon echoes basic info to the command line; clicking on a hotspot gives useful info on the content. The display, followed by what the timemetadata link yields:
timeseries
Common Properties:
Name: 'unnamed'
Time: [2000x1 double]
TimeInfo: [1x1 tsdata.timemetadata]
Data: [2000x1 double]
DataInfo: [1x1 tsdata.datametadata]
<More properties> , <Methods>
tsdata.timemetadata
Package: tsdata
Uniform Time:
Length 2000
Increment 1 days
Time Range:
Start 01-Jan-1993 00:00:00
End 23-Jun-1998 00:00:00
Common Properties:
Units: 'days'
Format: ''
StartDate: '01-Jan-1993'
...
which shows we do have a series covering a range of times of interest, over which we can look at retrieving winter seasons. In your case, you'll use the start of your real dataset and an hourly interval over the length of the actual data, but the idea is the same.
So now let's create fall/spring dates as events:
>> ts=addevent(ts,{'Fall'},{'01-Oct-1993','01-Oct-1994','01-Oct-1995','01-Oct-1996','01-Oct-1997','01-Oct-1998'})
Error using timeseries/addevent (line 66)
When adding event(s) by name, the name and time cell arrays must have the same size.
>>
Well that's a bummer; it won't expand the name string to match multiple dates automagically so we must use:
>> ts=addevent(ts,{'Fall','Fall','Fall','Fall','Fall','Fall'}, ...
{'01-Oct-1993','01-Oct-1994','01-Oct-1995','01-Oct-1996','01-Oct-1997','01-Oct-1998'});
>> ts=addevent(ts,{'Spring','Spring','Spring','Spring','Spring','Spring'}, ...
{'31-Mar-1994','31-Mar-1995','31-Mar-1996','31-Mar-1997','31-Mar-1998','31-Mar-1999'})
timeseries
Common Properties:
Name: 'unnamed'
Time: [2000x1 double]
TimeInfo: [1x1 tsdata.timemetadata]
Data: [2000x1 double]
DataInfo: [1x1 tsdata.datametadata]
Events: [1x12 tsdata.event]
More properties, Methods
>>
Where we see we've now got 12 events defined, six fall and six spring dates corresponding to the times you've defined as winter season. While I wrote these explicitly as dates, the syntax will accept datenum values or you could build the cellstr date strings dynamically over the time span of interest to automate the process--this is just testing concept here...
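A minimal sketch of building those event lists dynamically instead of typing them out. The year range and the variable names (yrs, fallDates, springDates, fallNames, springNames) are illustrative assumptions, and the test series is the same one constructed above:

```matlab
% Same test series as in the trial above
ts = timeseries(abs(randn(2000,1)));
ts.TimeInfo.StartDate = '01-Jan-1993';
ts.TimeInfo.Units = 'days';

yrs = 1993:1998;                          % fall years to cover (assumed range)

% One date string per year, as 1-by-N cell arrays of 'dd-mmm-yyyy' text
fallDates   = cellstr(datestr(datenum(yrs,   10,  1), 'dd-mmm-yyyy'))';
springDates = cellstr(datestr(datenum(yrs+1,  3, 31), 'dd-mmm-yyyy'))';

% addevent wants the name cell array to match the size of the time cell array
fallNames   = repmat({'Fall'},   size(fallDates));
springNames = repmat({'Spring'}, size(springDates));

ts = addevent(ts, fallNames,   fallDates);
ts = addevent(ts, springNames, springDates);
```

Changing yrs to the span of the real dataset should be the only adjustment needed.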
So now we want to see if we can retrieve the values and whether they match the expected season for calculations...
>> gettsbetweenevents(ts,'Fall','Spring',1,1)
timeseries
Common Properties:
Name: 'unnamed'
Time: [182x1 double]
TimeInfo: [1x1 tsdata.timemetadata]
Data: [182x1 double]
DataInfo: [1x1 tsdata.datametadata]
Events: [1x12 tsdata.event]
More properties, Methods
>>
And, lo and behold, indeed they do! We've got the expected 182 elements in a new timeseries object (we didn't bother saving it, btw; don't forget the LHS assignment), and it's now easy enough to check the number of data values that are missing (isnan) and decide whether to keep the given year or not.
Obviously the above can be put in a loop for the years and get the subsequent events or you could only create/delete a given event on a year-by-year basis. It does not appear that any of these methods are vectorized to operate over arrays of indices, however, so loops will be needed. But, it's a pretty convenient way to collect the data it appears.
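A rough sketch of what such a loop might look like, assuming the k-th 'Fall' and k-th 'Spring' events bracket the k-th winter. The artificial NaN gap, the number of seasons, and the names nSeasons and fracMissing are made up for illustration; this has not been checked against real data:

```matlab
% Daily test series with an artificial gap inside winter 1993/94
d = abs(randn(2000,1));
d(300:400) = NaN;                         % fake missing values (assumption)
ts = timeseries(d);
ts.TimeInfo.StartDate = '01-Jan-1993';
ts.TimeInfo.Units = 'days';

% Four winters that lie fully inside the series
ts = addevent(ts, {'Fall','Fall','Fall','Fall'}, ...
    {'01-Oct-1993','01-Oct-1994','01-Oct-1995','01-Oct-1996'});
ts = addevent(ts, {'Spring','Spring','Spring','Spring'}, ...
    {'31-Mar-1994','31-Mar-1995','31-Mar-1996','31-Mar-1997'});

nSeasons = 4;
fracMissing = zeros(1, nSeasons);
for k = 1:nSeasons
    % k-th Fall event through k-th Spring event
    w = gettsbetweenevents(ts, 'Fall', 'Spring', k, k);
    fracMissing(k) = mean(isnan(w.Data)); % fraction of NaN values in the season
end
```

Each entry of fracMissing can then be compared with whatever availability threshold applies.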
Unfortunately, unlike with grouping variables, it does not appear that TMW has implemented a facility like that of the grpstats routine to apply an arbitrary function handle to the collected data.
The alternative approach is to take your input file as shown above, create the datetime value associated with each record, and then simply use the isbetween function to return the values between the given dates. Again this would be in a loop over the years in the dataset; it would look something like
t=datetime(X(:,1),X(:,2),X(:,3),X(:,4),0,0); % convert to datetime array
for yr=1993:2013 % loop over the years (fall)
  iswint=isbetween(t,datetime(yr,10,1),datetime(yr+1,3,31,23,0,0)); % logical array of rows from 1 Oct 00:00 through 31 Mar 23:00
  data=X(iswint,5); % the data for the year in the range
  % do the test on number, etc., here
  ...
end
The above ought to be a roadmap of a couple of ways you could proceed. The timeseries route may be more elegant but is more verbose; datetime operations are probably simpler, with less time invested in reading documentation to get a result...
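For what it's worth, here is one way the datetime route might be filled in, run on synthetic data so it is self-contained. The matrix X, the 25% threshold variable minFrac, and the names keep, drop and Xkept are assumptions for illustration. The sketch simply drops every row of a winter that fails the availability test, which is only one reading of the rule in the question:

```matlab
% Synthetic stand-in for the [year month day hour value] matrix (assumption)
t = (datetime(1993,10,1,0,0,0):hours(1):datetime(1996,3,31,23,0,0))';
X = [year(t) month(t) day(t) hour(t) rand(numel(t),1)];
X(t < datetime(1994,3,1), 5) = NaN;       % wipe out most of winter 1993/94

minFrac = 0.25;                           % required fraction of available data
yrs  = 1993:1995;                         % fall years present in X
keep = false(size(yrs));                  % winters that pass the test
drop = false(size(t));                    % rows to exclude from the output

t = datetime(X(:,1),X(:,2),X(:,3),X(:,4),0,0);
for k = 1:numel(yrs)
    first  = datetime(yrs(k),   10,  1,  0, 0, 0);
    last   = datetime(yrs(k)+1,  3, 31, 23, 0, 0);
    iswint = isbetween(t, first, last);   % rows in this winter
    nExpected = hours(last - first) + 1;  % 4368, or 4392 when February has 29 days
    nAvail    = nnz(~isnan(X(iswint,5))); % rows present and not NaN
    keep(k)   = nAvail/nExpected >= minFrac;
    if ~keep(k)
        drop(iswint) = true;              % mark the failing winter for removal
    end
end
Xkept = X(~drop,:);                       % output with failing winters removed
```

Here only March 1994 survives in the first winter (744 of 4368 hours, about 17%), so that winter is dropped and the other two are kept.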
