Creating a matrix after giving initial conditions
Afficher commentaires plus anciens
Hi,
I need to create a matrix when given the begin year, month, day, time and end year, month, day and time.
for example; (I do not need to see the header in the matrix)
I would like to enter:
begin year = 1998
begin month = 1
begin day = 1
begin time = 0
end year = 1998
end month = 2
end day = 15
end time = 12
(as shown in the table below)
But, if I enter leap years (e.g. 2000,2004, etc) it should include Feb 29 and do the same time steps (0 to 21) as shown in the table.
Then, the program should generate the output for the time periods given above. See the sample below.
Year Month Day Time
1998 1 1 0
1998 1 1 3
1998 1 1 6
1998 1 1 9
1998 1 1 12
1998 1 1 15
1998 1 1 18
1998 1 1 21
1998 1 2 0
.
.
.
1998 2 15 12
Any help is appreciated.
Thanks in advance.
Réponse acceptée
Plus de réponses (2)
Since dir returns a sorted list, I think I'd just process it similarly as you are but simply test the days within the loop by parsing the date/time string within a double loop. One could, as mentioned, get fancy and do some serious attempts at matching datenums but just doesn't seem to be needed here...
I'd start with something like the following which builds an array of the summed third column value in matching days by year. Each day is in a given column, only a small additional logic is needed to keep track of which day is present, a month and day row filled in by column while in the loop...
d=dir('TRMM_*_newntcl.csv');
ymdh=cell2mat(textscan([d.name], ...
'TRMM_%4d_%2d_%2d%2d_newntcl.csv', ...
'collectoutput',true));
idx=find(ymdh(:,end)==0); % get the beginning day 0 hr locations
N=length(idx);
idx=[idx;length(ymdh)+1];
precip=nan(51681,fix(N)); % preallocate on num whole days
k=0;
for i=1:length(idx)-1
isOK=~any(diff(ymdh(idx(i):idx(i+1)-1,3))) & idx(i+1)-idx(i)==8;
if ~isOK, continue, end % either aren't eight or it's a wrong day mixed in
data=zeros(51681,8); % preallocate based on num lat/lon points
for j=0:7 % ok, process each of these, they're one day's worth
data(:,j+1)=csvread(d(idx(i)+j).name,0,2); % add hourly data
end
k=k+1;
precip(:,k)=sum(data,2); % the daily sums are in an array
fn=d(idx(i)).name; fn(16:17)=[]; % fn(1:1)='O';
disp(fn)
csvwrite(fn,precip(:,k))
end
precip(:,k+1:end)=[]; % remove columns to leave complete days only
This snippet reads only the third column on the assumption you can easily add the constant position columns as you see fit.
I've never used a netCDF file so on casting it into that form I'm not of much use but the above should provide the basic building block to get the data consolidated by day. Note I specifically avoided cell arrays here and did the summing in place while accumulating the data to keep total memory down to minimal amount...as you've already done, use this inside the higher level loop to traverse the directories.
Unless there aren't multiple partial days, the final value of k will be the same as the floor(L/8) calculation. If there are there will be that many zero columns that can simply remove.
Hope that helps...
30 commentaires
Damith
le 18 Mar 2015
dpb
le 18 Mar 2015
"...Does it removes/disregard partial days"
Yes, that's what the logic test checks for--that there are eight sets with the same day consecutively in the grouping. If not, the continue goes on to the next 0 hours location.
Sorry, a typo; I had used ix1 as the index variable in a test at the command line on the logic in that line and missed editing out one 'x' in the cleanup. In "real" code I'd probably turn i1 into idx or somesuch as the '1' is hard to read. Initially I was thinking I'd have an ix1 and ix2 as the starting/stopping indices but decided on the one instead and didn't clean it up to reflect that. Your choice...
Seems functional here...I copied your file from the link to local...
>> dir TRM*.csv
TRMM_1998_01_0100_newntcl.csv
>>
>> d=dir('TRM*.csv');
>> for i=1:length(d),data=csvread(d(i).name,0,2);end
>> whos data
Name Size Bytes Class Attributes
data 51681x1 413448 double
>> d.name
ans =
TRMM_1998_01_0100_newntcl.csv
How about the section inline from the command window showing everything in context? Maybe can spot something that way I can't see from the above...certainly nothing sticks out to me.
ADDENDUM
for i=idx:length(idx)
Above should be
for i=1:length(idx)
Since idx is a vector, this reassigned the loop index i to that vector so then the argument of d(i) is a multiple group of results instead of just one.
That's a remnant of my initial planning on use i1:i2 as a range and I didn't catch in the cleanup either. The loop is processing each entry in the index vector; the index vector is those locations in the directory list for which the time field is identically zero.
Got the whole vector idx where should just be i subscript to process from the first of the set thru the 8, offset j=0:7 ...
...
data(:,j+1)=csvread(d(idx(i)+j).name,0,2);
...
I believe I caught all references in the original code in toto...
Anyway, if you understand what the logic is doing, should be able to fix any remaining typos pretty quickly.
Again, we're
- finding the locations of each 00 hour in the directory listing, then
- iterating through that list
- makomg sure are eight entries with the same day before the next 00 hours
- if so, loop over the initial plus seven more reading a single file
- when that loop finishes, sum across the eight columns and store
Damith
le 19 Mar 2015
Oh, yeah, the last 00hrs element won't have a subsequent to test. I don't have time to really think too much about it at the moment but run the loop from 1:length(idx)-1 instead of over the full length.
That change handles all except the end case. So, after the loop finishes on the next-to-last, add one more section that checks to see if the length of the directory listing is greater than idx(end); if so, and is seven more it's another complete (assuming again the days are all the same so it's not just some mishmash of random dates/times that happens to add up to the right number of entries) then run the j loop one more time and increment k one more time and accumulate the sum as well.
Alternatively, you could add an additional if...end clause in the loop for the last iteration and change the test of isOK to the above instead of the existing for that...or you could add one more record as a dummy at the end of the list altho that's probably more effort than either of the other two choices.
Anyway, it's just an end case that needs handling just a little differently than the rest...
ADDENDUM
Woke up this morning and the (almost) trivial fixup came to me...the modification of the index vector is the easiest after all. When create it, simply augment it with the value of length(ymdh)+1 as if were another 00hr entry, but run the outer loop over the original length. Then the +1 index doesn't run over the end on the spacing check but you also don't try to run the loop too many times. See the modification in the ANSWER above...
The changes are...
Add the additional to the end of the index vector...
idx=[idx; length(ymdh)+1];
and change the loop upper limit to length(idx)-1
for i=1:length(idx)-1
and should work.
dpb
le 19 Mar 2015
One last thing occurred to me in thinking about the end effect...the case of possibly there being more missing days than just a partial at the end/beginning would leave the preallocated array with that many zero columns. If you're keeping a running indicator of the month/day as an extra couple of rows or separate vectors as suggested, you can check against those for the number and k is also the size. But, a little more robustness could be added if use nan instead of zeros for the preallocation; then any columns with NaN are clearly unused and not possible to confuse with a real value of zero as a result.
dpb
le 19 Mar 2015
OK, give me a little while to look at the actual data files...we're fortunately getting a little rain today so am not farming... :)
Damith
le 19 Mar 2015
Well, how stupid...the data array is getting reinitialized to zeros every pass thru the j loop so only the last column is ever saved. DOH!!! Can't believe I hadn't noticed that misplacement long before now. Problem with "air code"; often one doesn't pick up on the obvious.
OK, I made a couple of other minor cleanups, and tested it on your subset of data...
>> damith
precip(1:10,:)
ans =
3.5000 3.0000
3.5000 3.0000
3.5000 3.0000
3.5000 3.0000
3.5000 3.0000
3.5000 3.0000
3.5000 3.0000
3.5000 3.0000
3.5000 3.0000
3.5000 3.0000
>>
That look "more better"???
I updated the script in the Answer section to keep it all together; I just replaced the previous in its entirety rather than try to keep all the iterations...
ADDENDUM
BTW, it seemed to take longer than I'dve thunk it should here so I checked on whether the column-specific read in csvread was the culprit...turns out it's quicker with it than without. I didn't try any other alternative input functions, will leave that as "exercise for the student". :)
I didn't convert the script to a function so that all JIT effects are active, either; that may fix a multitude of sins (or again, may not).
dpb
le 19 Mar 2015
"I achieved the task like the way below"
...
idx=[idx; length(ymdh)];
The above is wrong; see my code. This will fail if the last day is full because the length computed won't be eight but seven. To prove it to yourself try it with your sample subdirectory if you remove the last two entries and leave 16 instead of 18; you'll only get one day's result instead of two.
If you don't have any reason to have the daily sums as a group there's no point in saving the whole array; I had presumed having those in one location for further analyses would be a primary point of the exercise; continuing to have so many files with so much repetitive data seems counter productive to the max. But, sure, inside the loop on i simply use a substring operation on the file name to remove the time section; doesn't matter which one so may as well use the last when the j loop has completed...
fnameout=d(idx(i)+j).name; % save to a temporary
fnameout(16:17)=[]; % remove the two hour digits from ddhh
csvwrite(fnameout,[xy precip(:,i)] % write to a csv file
xy is the array of coordinates saved from an initial read of one file previously, of course, since they're constant across all files.
Again, if this is all you're going to be doing with the sums I'd not even bother to save the precip array at all; just use the result of the sum directly in the concatenation.
I know you've said your goal is a netCDF file but if you're really going to be doing something with these data in Matlab a .mat file would be much smaller and faster keeping stuff together in a larger conglomerate.
Damith
le 19 Mar 2015
dpb
le 19 Mar 2015
Well, it won't put it out for each day unless it's inside the loop over i, no. That you've got it outside all the loops is the problem.
Again, if this is all you're going to do, I'd remove the precip array entirely and make it look something like...
for i=1:N
isOK=~any(diff(ymdh(idx(i):idx(i+1)-1,3))) & idx(i+1)-idx(i)==8;
if ~isOK, continue, end
data=zeros(51681,8);
for j=0:7
data(:,j+1)=csvread(d(idx(i)+j).name,0,2);
end
fname=d(idx(i)+j).name; fname(16:17)=[];
csvwrite([xy fnsum(data,2)]
end
There's no point in saving what you're never going to use. I ran the filename creation portion here w/o a trailing semicolon and got...
>> damith
fn =
TRMM_1998_01_01_newntcl.csv
fn =
TRMM_1998_01_02_newntcl.csv
>>
(Clearly I just used "fn" as the variable name for testing)
dpb
le 20 Mar 2015
Well, the first thing to do is to check what the list of files returned is and the length of the index vector--firstly, does it have the proper number of entries and then secondly is all(diff(idx)==8) true (1) or false (0)? If false, how many are true (sum(diff(idx)))? That's how many would be at least candidates. Double check that the directory returned is, indeed, sorted and that the names are such that the sorting is correct.
After that, don't see a particular reason otomh, but it surely can't be too complex if you do some diagnostic checking.
As for monthly and yearly sums, you're changing ground rules--I don't know how many times I mentioned "if this is all you're going to do" with the data and you never said word boo! it wasn't... :(
In that case you've got to start accumulating the daily sums again and keeping them such that you can then sum over them over the proper dimensions. Question I'd have is, however, how do you define what to sum over for a month and a year--do you need complete data sets for all days of the month, for example, since we've got cases here where we've got missing days since we ignored days for which didn't have a complete set of 3-hour observations.
ERRATUM Was too late last night, actually the above should be sum(diff(idx)==8) as well to see how many valid days there are found. Turns out here that returns 45 and
>> idx(end-1)/8
ans =
45.1250
>>
so '45' is the right answer. I did download the files and it ran to completion here ok as expected. Just to make it easy I changed the output file name to have the first character be an 'O' for output just so could do an easy dir...the result is
>> o=dir('o*.csv');
>> length(o)
ans =
45
>> o(end).name
ans =
ORMM_1998_02_14_newntcl.csv
>>
Hence, I conclude the code as I posted in the above 'Answer' in one location is functioning as expected. Just to be certain it's in synch with what is here I just replaced the code there with that taken from the editor window here that produced the above results (with the 'O' patch commented out)...
dpb
le 20 Mar 2015
"...to obtain the monthly and yearly sum ..."
OK, it's just an extension of the above idea but as noted above you've got to decide what it is you want to do about the missing cases. Assuming the data are complete-enough you can do as above and ignore the missing entirely, something like the following logic should work...
Within each subdirectory with this loop you've got a year's worth of data if it is complete.
While processing the above loop for each day, we've built an array of preciptation values by day for complete days but incomplete days were skipped over entirely and the array has no space for them. If, otoh, we built a 3D array that is 31 columns wide for the longest months and 12 planes deep for each month and stored the valid daily data in the proper column then can simply sum over the 2nd dimension (columns) for the monthly sums and over the planes or the yearly. Now whether either of these is valid is dependent upon whether there are any "holes" in it or not and how you want to treat those.
Unless you also want to aggregate over multiple years, that's all you have to do other than simply iterate the above over the various subdirectories.
Damith
le 20 Mar 2015
dpb
le 20 Mar 2015
Then please "Accept" the answer...
Not the second with the working code base...
As for the second part, NB you have ymdh array around and available for the indexing mentioned above...
ADDENDUM
Having a working function for the days, you might look at a File Exchange submission <filefun--apply a function to files> to abstract the code for the higher levels. While I've not used it yet, it is highly rated and just might make writing the iteration stuff much simpler.
UPDATE
I did download and look at filefun -- while very nicely done and would be very useful for many iterative/repetitive tasks, in this particular case "not so much" since the files needs must be processed as groups instead of actually applying a function to each individually. So, you do need the loop to traverse the subdirectories...guess the function needed here would be one higher level of abstraction dirfun that would apply the function to the contents of a directory which will be your outer loop when you're done.
Damith
le 23 Mar 2015
Maybe once you accepted the first on the dates it won't let you accept a 2nd answer, too, maybe? I hadn't tried, I only respond rarely ask questions (and the ones I have have been rather esoteric so don't necessarily get answers... :) )
That was discussed way up earlier in response to your first question after the posting of the code:
"...Does it removes/disregard partial days"
Yes, that's what the logic test checks for--that there are eight sets with the same day consecutively in the grouping. If not, the continue goes on to the next 0 hours location.
It's looking at the 3rd column (days) of the ymdh array and checking that there are no differences >0 amongst those in the group between the current 0-hr entry and the next and then that there are a total of eight between groups. The first is the same as
all(ymdh(idx(i):idx(i+1)-1,3)==ymdh(idx(i),3))
I just chose to check the differences were all zero instead of that all were the same value as the first, explicitly. It's the same end result.
If you're going to process the daily files for complete days I don't think that's the place to be modifying...you should be able to simply process the daily data from the precip array as mentioned before after accumulating it for each day (again, presuming you keep track as suggested of the actual day of month instead of just incrementing a counter and store in the column based on day.
Damith
le 24 Mar 2015
dpb
le 24 Mar 2015
Go back an reread the comments starting on Mar 20@17:45 where I described how to build the 3D array of daily totals that you can subsequently simply sum by column for each plane and by plane for monthly and yearly totals for each year. What's the issue in doing that, it's simply modifying the size of the precip array initially and then using the ymdh array info on day and month to set the indices instead of just incrementing a counter k as doing now.
d=dir('TRMM_*_newntcl.csv');
ymdh=cell2mat(textscan([d.name], ...
'TRMM_%4d_%2d_%2d%2d_newntcl.csv', ...
'collectoutput',true));
idx=find(ymdh(:,end)==0); % get the beginning day 0 hr locations
N=length(idx);
idx=[idx;length(ymdh)+1];
precip=nan(51681,31,12); % preallocate on num whole days
%fprintf('Building daily sums for %d: \n',ymdh(1,1))
dispstat('','init')
for i=1:length(idx)-1
isOK=~any(diff(ymdh(idx(i):idx(i+1)-1,3))) & idx(i+1)-idx(i)==8;
if ~isOK, continue, end % either aren't eight or it's a wrong day mixed in
%msg=sprintf(' Month: %2d Day: %2d',ymdh(idx(i),2:3));
%dispstat(msg,'keepthis')
data=zeros(51681,8); % preallocate based on num lat/lon points
for j=0:7 % ok, process each of these, they're one day's worth
%msg=sprintf('Reading hourly file: %s: ',d(idx(i)+j).name);
%dispstat(msg)
data(:,j+1)=csvread(d(idx(i)+j).name,0,2); % add hourly data
end
mo=ymdh(idx(i),2); % this month
da=ymdh(idx(i),3); % this day
precip(:,da,mo)=sum(data,2);
fn=d(idx(i)).name; fn(16:17)=[]; fn(1:1)='O';
%disp(fn)
csvwrite(fn,precip(:,k))
end
mTot=squeeze(sum(precip,2)); % monthly totals in each column 1-12
yTot=sum(mTot,2); % yearly total
Now just write mTot and yTot to files as you want them. As noted before, you'll probably want to do some checking also on whether there's data in every day of the month before accumulating that total and then for the months before the yearly total or you may have totals that aren't actually representative of the entire month/year.
Using nan to initialize instead of zero is one way, then any sums with NaN as a value are indicative altho there you've got to fix up the detail of how many days are in each month but that's pretty simple bookkeeping...
30 commentaires
Damith
le 24 Mar 2015
Reused d again, change the day index retrieved from the ymdh array to something else; that just wiped out the directory structure. That's a lesson in debugging, use the debugger and see what is going on...
BTW w/ that correction (I used da and mo instead of d and m), the above ran to completion with your subdirectory of files.
Again, you'll have to decide how to deal with the issue of missing data somehow; perhaps that's not an issue in general but the samples you've given do have at least one day missing a 3-hr period that was ignored so that means by similarity that that corresponding month/year is missing a day and hence also is missing which means that the year altogether would be missing. That seems like probably too onerous of a criterion but I can't tell you how to treat missing values; that's a problem you've got to address.
dpb
le 25 Mar 2015
for ii=1:length(m)-1
There is no "m"???
The array is complete already after the previous sum; simply write each column to a file of choice; the squeeze results in the original 3D array being converted to 2D as the collapsed previous 2nd dimension of 1 is removed.
At this point I'm going to leave accomplishing the above as "exercise for the student"; you need to develop some facility on your own, I'm not always going to be around. But, the loop if you do want a single file for every month and year goes after the sum, not before.
"...does not produce the monthly sum for the months of 2 and 4..."
That would indicate there's missing values at least one day within those months. That's what I outlined just a couple of comments ago that you've got to decide how to treat missing data--
"Again, you'll have to decide how to deal with the issue of missing data somehow; perhaps that's not an issue in general but the samples you've given do have at least one day missing a 3-hr period that was ignored so that means by similarity that that corresponding month/year is missing a day and hence also is missing which means that the year altogether would be missing. That seems like probably too onerous of a criterion but I can't tell you how to treat missing values; that's a problem you've got to address."
As that note says, if you carry thru to the higher levels the same logic of ignoring a missing hour or two meaning to ignore that day, then if that day is missing, the corresponding month isn't complete and so that month will be missing. Consequently, at the next step, that month is missing and so the year will be incomplete and the year will also be missing.
I don't (and can't) know what you want to do about missing values...you can pretend they're zero, or try to impute something for those missing times or you can simply live with the resulting holes. The data are what they are, I don't think there's anything wrong with the code altho not having that particular set of data I can't reproduce it here (and won't, this is, after all in the end, your research, not mine...)
If you want to "just see what happens" you can either
- Use zeros instead of nan to initialize the array which is the "assume zero" option, or
- Use nansum if have the Stat Toolbox to only sum the finite values (which is, in essence, the same assumption, just computed differently and that you do have the info to know which are and are not complete datasets that gets lost the other way)
Alternatively, you could go back and forget about the "missing hours" test and at least pick up partial days where they do exist, but I expect there will probably still be some holes where a full day may have been uncollected.
It's not necessarily an easy problem to solve cleanly...missing data are a pita albeit a fact of life. Again, I can't tell you what is most appropriate for you to do...
Damith
le 25 Mar 2015
Damith
le 25 Mar 2015
"due to NaN for months which do not have 31 days..."
Ah, yes. I didn't include it explicitly but my thinking was that one would use another logical array that overlays the rectangular array that is T for existing days in year by month. It would automagically account for leap year by looking at the year.
I tend to simply assume an implementer/end user will do the dirty work of finishing off a basic idea thrown together here, sorry...
ADDENDUM
I presume you've got this all working by now, and I've not done anything else beyond this point to sand off the rough edges but it came to me that perhaps the "clever" way is while doing the loop on days to zero fill the particular column/plane if there is data; then the summation will work as wanted for existing data but you'll still have the NaN indication for the missing locations. Still don't know how significant that's going to be in the larger scheme of things, but it's a starting point that will let you know rather than hiding a potential problem if simply either all zero fill or just nansum w/o checking.
Damith
le 26 Mar 2015
Damith
le 27 Mar 2015
The difference in the two is that the first simply increments a counter for the column irrespective of whether there is or is not data specifically for that day. The latter, otoh, puts the daily data into the column based on the day of the month.
ADDENDUM
Also remember that the former adds a new column for the same day for the subsequent month whereas the latter puts the total for the day in the same column but in the subsequent plane based on the actual month, 1-12.
I'd probably automate some checks--as suggested once very early on, you could add two rows to the first solution and keep the month and day for each column k in them. Then, use those when traversing the columns to look up the column/plane intersection of the second and see that all() of those corresponding elements are the same. I'm virtually certain it'll check out ok. END
I don't see any reason there should be any other difference between the two nor why there should be any problem in storing the data in the 3D array. Use the debugger to step through a few steps to satisfy yourself the indexing is correct but I'd first suspect k isn't the same as the day value is the cause of any discrepancy.
I suppose it's possible there's some other logic flaw but in all the testing I did here I didn't see any sign of it.
OK, I did the above test...I modified the original version using simply an index by adding
moda=zeros(2,N); % allocate space for da, mo
and inside the loop when the precip() array is computed and stored, added
mo=ymdh(idx(i),2); % this month
da=ymdh(idx(i),3); % this day
moda(1,k)=mo; moda(2,k)=da; % save the date
which is the same logic as the second for the month and day so that's consistent with its storage. I then saved this set of precipitation totals as precip1 to not overwrite the 3D precip array and ran the two versions on the largest set of data I have here which is that for 1998.
After that, I wrote the following test at the command line --
>> for i=1:length(moda)
if ~all(precip1(:,i)==precip(:,moda(2,k),moda(1,k))),
fprintf('Mo%d Da%d',moda(:,i)),end,end
>>
Note I got no output so that there were no discrepancies found(*); the storage in the 3D array is, commensurate to that in the 2D array and the data in the corresponding vectors when retrieved for the same month and date are identical, as expected.
I see no reason the 3D won't work correctly for every data set as written.
() Just to prove to meself the logic was actually doing what intended, I did things like put a counter in an *else clause to calculate the number of matches and checked that it did in the end equal the length of the array (45 in this case).
As said, I ran two scripts...the one that uses the incremented column irrespective of what the day really is and the second that builds the the 3D array; so that I still had both in memory when they were done, I just renamed precip in the former to precip1.
The if simply tests that all values in the column in the 3D array for the specific month and day are identical to those for the same month and day in the 2D array accounting for where the two are stored relative to each other. As noted, the linear column addressing doesn't account for which month/day it is, specifically, it just adds another column if there are data; if, for example, there's a day missing for Jan 2 of the first year then Jan 3 will be in column 2 and everybody else will also move up one. The same thing happens if there's any other missing days; the subsequent then will be that many days shifted from the nominal position of day-of-year. OTOH, the 3D solution has the advantage it does put all the data in the right location automagically, each 3rd day of the month in column 3, each month of the year in plane M for the correct month, 1 thru 12. The indexing in that loop simply extracts the day and month that were saved for each of the k columns and uses that to translate into the 3D array to get the same actual month/day data which are, of course, then shown to be identical.
The result of the test is to show that the logic building the 3D array works correctly; I don't see why you would want to do anything different...whatever the issue you've got has to be related to the fact that the columns you're comparing aren't the ones that should be compared owing to the storage location absolute column number difference outlined.
ADDENDUM
Both scripts computed the results independently from the .csv files; yes, that was the point of the exercise--to prove that the results are, in fact, identical if you compare the correct columns to each other.
dpb
le 27 Mar 2015
Well, w/o trying to debug what you've modified needlessly, my first guess is that the change you made is wrong. You're now mixing up the k indexing with the 3D and I'm betting it's not consistent that way w/o really working on it which I'm not spending time doing since the original 3D version was shown to be correct above. If you're going to use the 3D array, use it; if not do whatever you want however you see fit.
With the properly aligned 3D array wherein the months are all in the same column on top of each other, squeeze after the summation does "do the right thing"; it collapses the resulting one dimension leaving a subsequent 2D array.
To illustrate, we'll use some dummy data on a smaller scale; the principle is identical...
>> p=repmat([0.1:0.1:0.7],10,1); % small set of a vector to 2D
>> for i=2:3, p(:,:,i)=i*p(:,:,1); % add two more planes for months
>> p
p(:,:,1) =
0.1000 0.2000 0.3000 0.4000 0.5000 0.6000 0.7000
0.1000 0.2000 0.3000 0.4000 0.5000 0.6000 0.7000
0.1000 0.2000 0.3000 0.4000 0.5000 0.6000 0.7000
...
0.1000 0.2000 0.3000 0.4000 0.5000 0.6000 0.7000
0.1000 0.2000 0.3000 0.4000 0.5000 0.6000 0.7000
p(:,:,2) =
0.2000 0.4000 0.6000 0.8000 1.0000 1.2000 1.4000
0.2000 0.4000 0.6000 0.8000 1.0000 1.2000 1.4000
0.2000 0.4000 0.6000 0.8000 1.0000 1.2000 1.4000
...
0.2000 0.4000 0.6000 0.8000 1.0000 1.2000 1.4000
0.2000 0.4000 0.6000 0.8000 1.0000 1.2000 1.4000
0.2000 0.4000 0.6000 0.8000 1.0000 1.2000 1.4000
p(:,:,3) =
0.3000 0.6000 0.9000 1.2000 1.5000 1.8000 2.1000
0.3000 0.6000 0.9000 1.2000 1.5000 1.8000 2.1000
0.3000 0.6000 0.9000 1.2000 1.5000 1.8000 2.1000
0.3000 0.6000 0.9000 1.2000 1.5000 1.8000 2.1000
0.3000 0.6000 0.9000 1.2000 1.5000 1.8000 2.1000
0.3000 0.6000 0.9000 1.2000 1.5000 1.8000 2.1000
0.3000 0.6000 0.9000 1.2000 1.5000 1.8000 2.1000
0.3000 0.6000 0.9000 1.2000 1.5000 1.8000 2.1000
0.3000 0.6000 0.9000 1.2000 1.5000 1.8000 2.1000
0.3000 0.6000 0.9000 1.2000 1.5000 1.8000 2.1000
>>
So, above we've got a small test case of 7 days by 3 months...let's create the "monthly" and "yearly" sums...
>> mTot=sum(p,2)
mTot(:,:,1) =
2.8000
2.8000
...
2.8000
2.8000
2.8000
2.8000
mTot(:,:,2) =
5.6000
5.6000
5.6000
...
5.6000
mTot(:,:,3) =
8.4000
8.4000
...
8.4000
8.4000
>> whos mTot
Name Size Bytes Class Attributes
mTot 10x1x3 240 double
OK, the above is obviously the correct sum for the seven columns but it's spread out over three planes with the values for each coordinate a vector in that plane. That's inconvenient so instead
>> mTot=squeeze(mTot)
mTot =
2.8000 5.6000 8.4000
2.8000 5.6000 8.4000
2.8000 5.6000 8.4000
2.8000 5.6000 8.4000
2.8000 5.6000 8.4000
2.8000 5.6000 8.4000
2.8000 5.6000 8.4000
2.8000 5.6000 8.4000
2.8000 5.6000 8.4000
2.8000 5.6000 8.4000
>>
So the squeeze operation removes the singleton second subscript and leaves us with a 2D array of the same values where now the 2nd dimension is that of the months. It's identical in result whether use two steps as above or collapse into one as I did in the original code.
Now, the yearly total is again the sum across the columns which will leave a single vector of length of the number of coordinates.
>> yTot=sum(mTot,2);
yTot =
16.8000
16.8000
16.8000
16.8000
16.8000
16.8000
16.8000
16.8000
16.8000
16.8000
>>
The size of the arrays is absolutely immaterial to the shape operations as outlined above the only difference in this example and the real thing is the size of the arrays; the resulting shape after the summation and dimension reduction is identical and is, as shown above, indeed what you want/need.
The only "trick" is again treating the shorter months and missing days; as I suggested if you fill the full column with zeros initially for each day that has data, then the only NaN left will be those for missing days and sum will reflect that in the totals.
dpb
le 28 Mar 2015
"...monthly values are WRONG for some months."
How about a particular month and value you think is in error?
I ran the identical code as that I had in the updated "Answer" http://www.mathworks.com/matlabcentral/answers/182710-creating-a-matrix-after-giving-initial-conditions#answer_172554 with the exception of a progress indicator of which file was being processed and the day being built which have no bearing on the actual logic and looked at a particular location...the first node for all days for the first month...
>> precip(1,:,1)
ans =
Columns 1 through 9
0 0.0400 0 0 0 0.9400 0 0 0
Columns 10 through 18
0 0 0 0.3900 0.0100 0 0 0 0
Columns 19 through 27
0 0 0.1600 0 0 8.8100 0.0200 2.1000 0
Columns 28 through 31
0 0.0200 0 0
That's the first month...day 24 seems interesting; it's got a sizable total accumulation so let's see if it's ok...
>> d24=dir('TRMM_2004_01_24*_newntcl.csv'); % return all files for Jan24
>> tot=0;for i=1:length(d24)
d=csvread(d24(i).name,0,2);
tot=tot+d(1); disp([d(1) tot]),end
0 0
0.1700 0.1700
8.6400 8.8100
0 8.8100
0 8.8100
0 8.8100
0 8.8100
0 8.8100
>>
Voila!! They match. That a given day/month matches since it's a generic routine seems pretty conclusive to me there's not a logic error in the building of the 3D array nor the summations. I have essentially 100% confidence the same result would hold for any combination of node and day and month. I can still only guess that you've got something amiss in the indexing.
Note there are only six months included and only the first 24 days of that month are complete in the data that was in the download unless if you somehow have additional data other than that and are getting some other values that aren't in the processing by a file-naming issue or such.
ADDENDUM
Oh, the other thing that would screw up a comparison would be to include the values from the incomplete June 25 data in your totals as compared to the processed data that requires a full day (or any other missing days; I glanced at the list as it was processed and it looked complete but I didn't check that religiously; it's possible there's another incomplete day in there that I didn't notice...but if you aren't culling out those incomplete days your comparisons will be different assuming there's any accumulation in one of those hourly files that was discarded because the day wasn't complete.
If that were the case then we're back to the previous discussion of your having to decide how you want to treat them.
ADDENDUM 2
Just to check on the monthly totals from raw data as well...
>> jTot=0;
>> dJan=dir('TRMM_2004_01_*_newntcl.csv');
>> for i=1:length(dJan)
p=csvread(dJan(i).name,0,2);
jTot=jTot+p(1);
if p(1)>0
disp([dJan(i).name ' ' num2str([p(1) jTot])]),end,end
TRMM_2004_01_0206_newntcl.csv 0.04 0.04
TRMM_2004_01_0612_newntcl.csv 0.94 0.98
TRMM_2004_01_1315_newntcl.csv 0.39 1.37
TRMM_2004_01_1412_newntcl.csv 0.01 1.38
TRMM_2004_01_2115_newntcl.csv 0.16 1.54
TRMM_2004_01_2403_newntcl.csv 0.17 1.71
TRMM_2004_01_2406_newntcl.csv 8.64 10.35
TRMM_2004_01_2515_newntcl.csv 0.02 10.37
TRMM_2004_01_2603_newntcl.csv 1.55 11.92
TRMM_2004_01_2606_newntcl.csv 0.55 12.47
TRMM_2004_01_2909_newntcl.csv 0.02 12.49
>> mTot(1,:)
ans =
Columns 1 through 9
12.4900 21.4800 3.0000 8.7200 27.4500 9.7000 0 0 0
Columns 10 through 12
0 0 0
>>
Again, the values add up correctly.
March looks ok to me...
>> mTot(1,:)
ans =
Columns 1 through 9
12.4900 21.4800 3.0000 8.7200 27.4500 9.7000 ...
Columns 10 through 12
0 0 0
>> dMar=dir('TRMM_2004_03_*_newntcl.csv');
>> length(dMar)/8 % check it's got full complement
ans =
31
>> mrTot=0;
>> for i=1:length(dMar),p=csvread(dMar(i).name,0,2);mrTot=mrTot+p(1);if p(1)>0,disp([dMar(i).name ' ' num2str([p(1) mrTot])]),end,end
TRMM_2004_03_0209_newntcl.csv 0.15 0.15
TRMM_2004_03_0212_newntcl.csv 0.02 0.17
TRMM_2004_03_0218_newntcl.csv 0.02 0.19
TRMM_2004_03_0821_newntcl.csv 0.04 0.23
TRMM_2004_03_1200_newntcl.csv 0.41 0.64
TRMM_2004_03_1306_newntcl.csv 0.17 0.81
TRMM_2004_03_1312_newntcl.csv 0.05 0.86
TRMM_2004_03_1618_newntcl.csv 0.01 0.87
TRMM_2004_03_1709_newntcl.csv 0.23 1.1
TRMM_2004_03_1721_newntcl.csv 0.01 1.11
TRMM_2004_03_2009_newntcl.csv 0.56 1.67
TRMM_2004_03_2115_newntcl.csv 0.09 1.76
TRMM_2004_03_2303_newntcl.csv 1.01 2.77
TRMM_2004_03_2412_newntcl.csv 0.01 2.78
TRMM_2004_03_2709_newntcl.csv 0.1 2.88
TRMM_2004_03_2915_newntcl.csv 0.04 2.92
TRMM_2004_03_3003_newntcl.csv 0.08 3
>>
That matches mTot(1,3) as expected. Try another random point...
>> idx-randi(length(precip))
idx =
9399
>> dMar=dir('TRMM_2004_03_*_newntcl.csv');
>> mrTot=0;
>> for i=1:length(dMar),p=csvread(dMar(i).name,0,2);mrTot=mrTot+p(idx);
if p(idx)>0,disp([dMar(i).name ' ' num2str([p(idx) mrTot])]),end,end
TRMM_2004_03_0809_newntcl.csv 0.01 0.01
TRMM_2004_03_0815_newntcl.csv 2.83 2.84
TRMM_2004_03_1300_newntcl.csv 0.04 2.88
TRMM_2004_03_1512_newntcl.csv 0.08 2.96
TRMM_2004_03_1806_newntcl.csv 0.02 2.98
TRMM_2004_03_2303_newntcl.csv 11.07 14.05
TRMM_2004_03_2306_newntcl.csv 0.04 14.09
TRMM_2004_03_2309_newntcl.csv 1.43 15.52
TRMM_2004_03_2706_newntcl.csv 0.03 15.55
TRMM_2004_03_3015_newntcl.csv 0.08 15.63
TRMM_2004_03_3103_newntcl.csv 0.29 15.92
>> mTot(idx,3)
ans =
15.9200
>>
I can't seem to find a value that doesn't compare ok as long as there are full days.
I don't understand your other comment regarding what you prefer; as requested the script ignores all days which do not have all eight 3-hr readings as we've discussed before.
Again, how you want to handle the missing values is a decision you have to make as to what it means to have data recorded for a partial day and then how that is going to be propagated up to monthly and yearly sums. I've no real input on how you should do that or whether you even should or not.
As previously noted, however, the zero-fill and the nansum options both equate to the same thing as presuming the data are complete with no precipitation for any of those hours/days which clearly isn't the right answer in general altho it undoubtedly is for specific hourly observations.
It's why I had suggested NaN for the initial full array allocation and then to zero-fill for a full column that is processed before storing the data. That then only propagates a NaN for sums that do include no day but avoids doing so for the months with less than 31 days without negative consequences on those months totals but will show you where there are holes in you cumulative data. As noted, I fear that may result in having no yearly totals because you probably have no records that are truly complete. If so, yet again, then, you're back to the problem of what's a reasonable alternative that doesn't corrupt the database too badly? I've no klew on what to tell you about that conundrum, unfortunately, sorry...
But, my conclusion is the code is working as intended to process the 3-hr data files to aggregate them; I don't see any case that actually generates a bogus result for the summations as compared to the individual files.
"As previously noted, however, the zero-fill and the nansum options both equate to the same thing as presuming the data are complete with no precipitation for any of those hours/days ..."
That is, once you accumulate the sums by either of these ways you've lost the indication of which were or were not full days; that piece of information has been lost. Consequently, it appears from all that anybody could tell by reading your newly-created data files that that is what the results were as measured in toto; it's a false impression, however, we know there are at least some days that have partial records that aren't being reported and we also know there are large chunks missing entirely (or at least with the data I've seen so far). That surely should be made clear to whoever is planning on using these data for anything other than just observing numbers where there happen to be some--surely one could make no claim of an annual total of any accuracy that way or compare to other historical data or whatever one had in mind to do with the data.
dpb
le 28 Mar 2015
OK, one last try to verify the previous conjecture--let's look at June.
>> juTot=0;
>> for i=1:length(dJun)
p=csvread(dJun(i).name,0,2);
juTot=juTot+p;
if any(p)>0,disp([dJun(i).name ]),end,end
TRMM_2004_06_0100_newntcl.csv
TRMM_2004_06_0106_newntcl.csv
TRMM_2004_06_0112_newntcl.csv
TRMM_2004_06_0118_newntcl.csv
...
TRMM_2004_06_2200_newntcl.csv
TRMM_2004_06_2206_newntcl.csv
TRMM_2004_06_2212_newntcl.csv
TRMM_2004_06_2218_newntcl.csv
TRMM_2004_06_2300_newntcl.csv
TRMM_2004_06_2306_newntcl.csv
TRMM_2004_06_2312_newntcl.csv
TRMM_2004_06_2318_newntcl.csv
TRMM_2004_06_2400_newntcl.csv
TRMM_2004_06_2406_newntcl.csv
TRMM_2004_06_2412_newntcl.csv
TRMM_2004_06_2418_newntcl.csv
TRMM_2004_06_2500_newntcl.csv
TRMM_2004_06_2506_newntcl.csv
TRMM_2004_06_2512_newntcl.csv
TRMM_2004_06_2518_newntcl.csv
>>
>> max(abs(juTot-mTot(:,6))) % oooh, there is a problem!!!
ans =
49.6600
>>
OK, now let's limit the loop to only the full days of 1-24...
>> juTot=0;
>> N=8*floor(length(dJun)/8)
for i=1:N
p=csvread(dJun(i).name,0,2);
juTot=juTot+p;
if any(p)>0,disp([dJun(i).name ]),end,end
TRMM_2004_06_0100_newntcl.csv
TRMM_2004_06_0106_newntcl.csv
TRMM_2004_06_0112_newntcl.csv
TRMM_2004_06_0118_newntcl.csv
TRMM_2004_06_0200_newntcl.csv
TRMM_2004_06_0206_newntcl.csv
TRMM_2004_06_0212_newntcl.csv
TRMM_2004_06_0218_newntcl.csv
TRMM_2004_06_0300_newntcl.csv
TRMM_2004_06_0306_newntcl.csv
...
TRMM_2004_06_2400_newntcl.csv
TRMM_2004_06_2406_newntcl.csv
TRMM_2004_06_2412_newntcl.csv
TRMM_2004_06_2418_newntcl.csv
>> i
i =
192
>> max(abs(juTot-mTot(:,6)))
ans =
8.5265e-14
>> all(abs(juTot-mTot(:,6))<5*eps(max(juTot)))
ans =
1
>>
OK, they're not absolutely identical owing to the difference in order in the summations between the script and the one-at-a-time processing, but the difference is in the expected magnitude of floating point roundoff.
NB: also that this test is accumulating the full-length summation rather than just a single index so we've proven the contention made earlier than the point index used is immaterial to testing the logic; testing one value is as good as testing them all for storage location.
It also proves the point made earlier that as long as one processes only complete data then the sums being built in the script are ok based on the storage location by column/plane by day/month; it's when you add in the values that are collected for a day that isn't complete that a real discrepancy occurs.
So, it's a decision you've got to make about how to handle the incomplete and missing data.
That's testing whether there is NaN in location that is valid for the day of month for the given month (remembering leap year of course). That's the idea I outlined earlier of the logic array to select those valid days. How much effort you must go to is dependent upon whether it's possible there are months with "holes" in them or whether it's always only whether there are 1:N<=M where N is the number of days of data and M is the number of days in the given month.
Another possible tack is to add a sum for the number of days that have been accumulated for each month by month in the script at the point the precip array is being populated; then you can simply check that the content of that array is the same as the expected number of days in the month for the given year.
Unfortunately, the simple expedient of zero-ing the day vector for the month doesn't work to prevent the partial months from being totalled; it then would include June in the 2004 sum so have to go to the (slight) extra effort.
The second idea above would be something like
nDays=zeros(12,1); % initialize before the outer loop
and then after the computation of the month,day variables mo, da insert
nDays(mo)=nDays(mo)+1;
This will give you the count of the actual full days that were found which you can compare to the expected number.
dpb
le 29 Mar 2015
Well, of course it does unless you make use of the information collected...as requested, I suggested (two, actually) a way to find out how to sum the full months only. It's an exercise-for-the-student to finish the detail of completing the summation of the subset of those for which
nDays(month)==Expected(month)
Somewhat against my better judgment (I think you need to develop some more critical thinking of solving the details of the logic problem outlined rather than relying on somebody else to provide complete solutions down to every detail), the following should work for any combination of missing days...
In the early portion of the script after building the ymdh array of dates/times from the file names insert
edpm=bsxfun(@eomday,double(ymdh(1)),[1:12].');
Then, after building the precip array, before the summation for the totals,
precip(isnan(precip(:,:,nDays==edpm)))=0;
Then revert to using builtin sum instead of nansum in
mTot=squeeze(sum(precip,2));
and joy should ensue.
I'll leave it to you to decipher why/how it works; hopefully from that exercise you'll gain a little bit of the thought process of using logical addressing within Matlab which is a key concept to grasp in effective use thereof.
ADDENDUM
The above will ensure complete months, if you do want to include any month in which there is a complete day, change the logic test from "==" to ">0". If you want there to be some minimum number of days in a month, then that would replace the 0, of course. In this case there's no need for the expected number of days/month, it's simply a test of whether there were any (or some number of) days counted.
I'll also reiterate the caution made previously in that if you do this, the data files created need to somewhere contain the information that some month(s) may be incomplete. This information is lost in the final output above; only keeping the result of the logic test for the specific month along with the totals will provide any way for the user of such data to know that is so from that file alone. Doing anything less is, in my opinion, a serious breach of scientific ethics.
One way, at a minimum, if you for some reason can't include the data in the files themselves would be to add an additional numeric code into the file name that would be the nDays(mo) value for the month. Then, at least the user would be able to check that against the expected days/month for complete datasets.
Similar arguments apply to the yearly totals, of course, but one presumes there may not be any of those that are complete in having every day of every month for the entire year. Still, encoding the days that are included would be significant information if can't include in the file the complete record of which days are missing/present.
dpb
le 31 Mar 2015
BTW, if you haven't yet figured this out, you could save tons of time if you were to run the above script once for each year subdirectory and then
save precip
in that working directory. Then you can simply
load precip
for that year and experiment with how to create and save the monthly and yearly sums to your heart's content and build whatever files from there without having to reread the same values over and over and over which is quite slow.
Catégories
En savoir plus sur Dates and Time dans Centre d'aide et File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!




