Combine data from multiple netCDF files
12 vues (au cours des 30 derniers jours)
Afficher commentaires plus anciens
Michiel Smit
le 8 Nov 2021
Commenté : Michiel Smit
le 9 Nov 2021
Dear reader,
I have multiple datasets for hourly wave height in the NE Atlantic Ocean that is stored in netCDF files for each month (Jan 1990 till Dec 2019). To properly work with the data I wish to combine these datasets such that I get the following output:
Vectors for longitude and latitude
Datetime array from Jan 1990 till Dec 2019
3D data for wave height stored as longitude x latitude x time\
For this, I have written a script (see below). However, at the moment this script is very inefficient and it takes hours to do the job. Is there any way to optimize this script or is there another way to efficiently combine data from multiple NETCDF files?
Your help would be very much appreciated.
param = 'hs';
year = (1990:2019)';
years = num2str(year);
month = ['01'; '02'; '03'; '04'; '05'; '06'; '07'; '08'; '09'; '10'; '11'; '12'];
%% Step 2: Process the data
% Load longitude bands
longitude = ncread(fileIN,'longitude');
% Load latitude bands
latitude = ncread(fileIN,'latitude');
% Create the timeTotal and paramTotal here
timeTotal = datetime(1900,1,1,0,0,0,'Format','dd-MMM-yyyy HH:mm:ss');
paramDataTotal = zeros(length(longitude),length(latitude),1);
% The following for loops first checks if the file exists, then extracts
% the time and param arrays and adds those to the "total dataset."
for i = 1:length(years(:,1));
for j = 1:length(month(:,1));
fileIN = ['WW3-ATNE-10M_' years(i,:) month(j,:) '_hs.nc']
if isfile([Directory_load '\' fileIN]) == 1
% Load all time arrays time array
time = ncread(fileIN,'time');
epoch = datetime(1990,01,01);
time = epoch + days(time);
% add time array to the total time array
timeTotal = cat(1,timeTotal,time);
% Load parameter data
paramData = ncread(fileIN,param);
% add param data to the total 3D paramdata
paramDataTotal = cat(3,paramDataTotal,paramData);
else
continue
end
end
end
% Remove the first entry in the total data.
timeTotal(1) = [];
paramDataTotal(:,:,1) = [];
0 commentaires
Réponse acceptée
Seth Furman
le 8 Nov 2021
Take a look at the documentation for reading large files and big data.
One simple approach would be to read your data into a tall timetable, e.g.
% Create some sample NetCDF files.
createSampleNetCDFFiles();
ls data
% Read data as a tall timetable.
ds = fileDatastore("data", "ReadFcn", @readNetCDF, "UniformRead", true);
tTall = tall(ds)
function t = readNetCDF(filename)
Time = datetime(1990, 1, 1) + days(ncread(filename, "time"));
Latitude = ncread(filename, "time");
Longitude = ncread(filename, "time");
Data = ncread(filename, "data");
t = timetable(Time, Latitude, Longitude, Data);
end
function createSampleNetCDFFiles()
rng default
y = 1990;
latitude = sort(180 * rand([10 1]) - 90);
longitude = sort(360 * rand([10 1]) - 180);
mkdir data
cd data
for m = 1:12
filename = sprintf("WW3-ATNE-10M_%d%02d_hs.nc", y, m);
d = 28 * sort(rand([10 1]));
time = days(datetime(y, m, 0) + days(d-1) - datetime(1990, 1, 1));
data = 1000 * rand(10, 2);
nccreate(filename, "time", "Dimensions", {'r', size(time,1), 'c', size(time,2)});
nccreate(filename, "data", "Dimensions", {'r', 'c1', size(data,2)});
nccreate(filename, "latitude", "Dimensions", {'r', 'c2', size(latitude,2)});
nccreate(filename, "longitude", "Dimensions", {'r', 'c3', size(longitude,2)});
ncwrite(filename, "time", time);
ncwrite(filename, "data", data);
ncwrite(filename, "latitude", latitude);
ncwrite(filename, "longitude", longitude);
end
cd ..
end
Plus de réponses (1)
Kelly Kearney
le 8 Nov 2021
What are the dimensions of the total dataset? And do you really need to hold the entire thing in memory at once?
I think you could speed things up in your current script by preallocating the third dimension of paramDataTotal; right now, the constant expansion of the array in each loop iteration requires Matlab to reallocate memory, which can slow things down considerably. You may want to take a look at some of the netCDF-reading tools in the Climate Data Toolbox (particularly ncstruct.m); they are designed to optimize reading in netCDF data from a series of files.
That said, reading in the entirety of a large dataset may be inherently slow regardless of how optimized your code is, simply due to the size. Which brings me back to the question of whether you really need to hold the entire dataset in memory at once. I find it often makes more sense to rethink the rest of my calculations and only read in specific hyperslabs of data as needed (again, ncstruct may simplify the reading-in syntax for you). This read-as-needed technique is similar to how talltable and other Matlab large-dataset functions work under the hood... unfortunately, tabular constructs aren't ideal for working with 3D data, in my opinion.
Voir également
Catégories
En savoir plus sur NetCDF dans Help Center et File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!