NetCDF or HDF5 or XYZ to provide time series data at the fingertips of the user

per isakson on 5 May 2012
Edited: per isakson on 16 May 2015
Question: Have I done my homework well enough to choose HDF5 and stop thinking about alternatives?
One more question: What are the problems with HDF5 that I have overlooked? Will I face unpleasant surprises?
Currently, I store time series data from building automation systems (BAS) in large structures, often named X, in mat-files. Each time series is stored in one field. I use the term Qty for these time series. A typical X has around 1000 fields and is 100 MB or larger. I have used that "format" for more than ten years, but I am now looking for something better.
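For concreteness, the current layout is roughly as follows (the field names and values are invented for illustration):
X.RoomTemp_A101   = randn( 66528, 1 );   % one field per time series ("Qty")
X.SupplyFlow_AHU2 = randn( 66528, 1 );   % ... a real X has ~1000 such fields
save( 'bas_2012.mat', 'X' )              % one large mat-file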
Goals: The user of a visualization tool shall have a huge amount of time series data at their fingertips. It shall also be possible to read and write the data files with non-MATLAB applications.
What I have done so far:
  1. Experimented with and used a system based on 128 KB memmapfiles. Each time series is stored in a series of memmapfiles, with some metadata embedded in the filename. It required too much coding and I failed to make it fast enough. Skipped!
  2. Studied some FEX contributions: Waterloo File and Matrix Utilities; HDS-Toolbox (RNEL-DB); and ... . I share their description of the problem and the goal, but ... they are a bit too clever for my capacity.
  3. Googled for NetCDF and HDF; decided to try NetCDF; ran an experiment with MATLAB's high-level API (ncwrite, ncread, ...); experienced very poor performance or worse.
  4. Searched in FEX for NetCDF and HDF5. There are 21 and 13 hits, respectively.
  5. A performance test. I used a structure, X, with 1346 fields, each holding a <66528x1 double> time series. The total size of X is 0.7 GB. R2012a, Windows 7, 64-bit. The test included writing the data of the X-structure to the file in question (with X2hdf) and reading the data back into a structure (with hdf2X). The corresponding functions for NetCDF are nearly identical, with "h5" replaced by "nc". With NetCDF, I used the format netcdf4_classic and "'Dimensions', { 'qty', len_time }", i.e. a fixed and limited length.
Execution time in seconds
--------------------------------------
Method                 write     read
HDF5                    32.6      2.8
NetCDF (1)               inf      inf
save, load (2)          24.4      7.3
fwrite, fread (3)        3.8      1.3
read_hdf (FEX)            -       3.3
read_netcdf (FEX)         -       8.1
matfile (4)             74       196
--------------------------------------
  1. The result with NetCDF is strange. "inf" stands for two orders of magnitude longer than the corresponding values for HDF5. NetCDF uses ...
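For reference, a minimal sketch of what an X2hdf / hdf2X pair along these lines might look like (a reconstruction, not the code used for the timings; one dataset per field, default contiguous layout):
function X2hdf( X, filename )
    % write each field of X as one HDF5 dataset
    names = fieldnames( X );
    for k = 1 : numel( names )
        dset = [ '/', names{k} ];
        h5create( filename, dset, size( X.(names{k}) ) );
        h5write( filename, dset, X.(names{k}) );
    end
end
function X = hdf2X( filename )
    % read all datasets of the file back into a structure
    info = h5info( filename );
    X = struct();
    for k = 1 : numel( info.Datasets )
        name = info.Datasets(k).Name;
        X.(name) = h5read( filename, [ '/', name ] );
    end
end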
  6 comments
Sean de Wolski on 9 May 2012
That's why it is slow. Using a structure, the _entire_ structure has to be read into memory.
per isakson on 9 May 2012
Loading the structure to memory takes 7.3 seconds. However, that is not included in the test of matfile. The structure is loaded beforehand and passed to the function X2matfile.


Answers (3)

Sean de Wolski on 7 May 2012
Have you looked at the matfile class in newer MATLAB releases? It allows you to access variables, and pieces of variables, in a mat-file (which is HDF5-based).
This would require creating many variables to be efficient, i.e. each time series would be its own variable, and you could store the metadata in the variable name as you described above. I know this is typically frowned upon (a1, a2, ..., an), but it would give you quick and easy access to what you need.
Just a thought; I may be completely off base, and I apologize if I am.
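A small sketch of that matfile idea, assuming a -v7.3 MAT-file with one variable per time series (the variable names are invented):
% write: one variable per quantity; matfile creates an HDF5-based -v7.3 file
m = matfile( 'bas.mat', 'Writable', true );
m.RoomTemp_A101   = randn( 66528, 1 );
m.SupplyFlow_AHU2 = randn( 66528, 1 );
% read: only the requested variable, or a slice of it, is loaded from disk
m = matfile( 'bas.mat' );
y = m.RoomTemp_A101( 1:1000, 1 );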
  6 comments
Oleg Komarov on 8 May 2012
Or you could create an m-by-2 matrix in which you concatenate several time series vertically, and then store a master file that records the start and end of each time series. This is basically the approach I would also use with fread/fwrite.
Sean de Wolski on 8 May 2012
Yes, Oleg's approach would work well; pad with NaNs for values you don't have. Store the metadata in a separate mat-file or cell array.
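A rough sketch of that concatenate-and-index approach (file and variable names are only placeholders):
% example data: two hypothetical quantities
names  = { 'RoomTemp_A101', 'SupplyFlow_AHU2' };
series = { randn(66528,1), randn(66528,1) };
% concatenate all series vertically and keep a master index of start/end
data  = [];
index = struct( 'name', {}, 'first', {}, 'last', {} );
for k = 1:numel( series )
    index(k).name  = names{k};
    index(k).first = numel( data ) + 1;
    data           = [ data ; series{k} ];
    index(k).last  = numel( data );
end
save( 'master.mat', 'index' )            % small index ("master") file
fid = fopen( 'data.bin', 'w' );
fwrite( fid, data, 'double' );           % flat binary for fast fread
fclose( fid );
% later: read one series without touching the rest
fid = fopen( 'data.bin', 'r' );
fseek( fid, ( index(2).first - 1 )*8, 'bof' );   % 8 bytes per double
y   = fread( fid, index(2).last - index(2).first + 1, 'double' );
fclose( fid );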



T. on 16 Jan 2013
Edited: T. on 16 Jan 2013
I have also done a number of experiments with the performance of netCDF within MATLAB. Some findings:
  • The MATLAB high-level functions ncread and ncwrite have some performance issues by design: every call requires MATLAB to read the header of the netCDF file in order to determine the command to pass to the low-level functions netcdf.getVar, netcdf.putVar, etc.
  • The time it takes to read the header of a netCDF file is much greater for netCDF4 (which is HDF5) than for netCDF3, as netCDF3 is much simpler. Also, the complexity of the header increases with the number of variables in a file; tens is usually workable, hundreds gives very poor performance.
So to improve netCDF performance, try using version 3 if you can. Otherwise, try calling the low-level functions netcdf.xxx instead of the high-level functions.
What MATLAB would need (IMHO) is a high-level, built-in, object-oriented interface to netCDF files, in which the netCDF file stays open and the header is cached.
Here is some example code to illustrate the problem:
for format = {'classic','netcdf4'}
    fprintf(1,'\nFormat = %s\n',format{:});
    if exist('test.nc','file')
        delete('test.nc')
    end
    nVars = 100;
    for jj = 0:4
        fprintf(1,'\nvariables = %d\n',nVars * (jj+1));
        % create another batch of nVars empty variables
        for ii = (1:nVars)+nVars * jj
            nccreate('test.nc',sprintf('var%03.0f',ii),...
                'Dimensions',{'r' 400 'c' 1},...
                'Format',format{:});
        end
        % write data to a few of them
        for ii = (1:50:nVars)+nVars * jj
            ncwrite('test.nc',sprintf('var%03.0f',ii),reshape(peaks(20),[],1));
        end
        % time reading them back: gets slower as the header grows
        for ii = (1:50:nVars)+nVars * jj
            tic
            ncread('test.nc',sprintf('var%03.0f',ii));
            toc
        end
    end
end
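For comparison, the low-level route suggested above keeps the file open, so the header is parsed only once instead of on every call; roughly like this (a sketch, reading back the same variables):
ncid = netcdf.open( 'test.nc', 'NOWRITE' );      % open once
for ii = 1:50:500
    tic
    varid = netcdf.inqVarID( ncid, sprintf('var%03.0f',ii) );
    x     = netcdf.getVar( ncid, varid );
    toc
end
netcdf.close( ncid );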

Malcolm Lidierth on 3 Mar 2013
@Per
I suspect some of the problems with memmapfile might be related to using multiple 128 KB memmapfiles: each one requires system resources. The Waterloo File Utilities grew out of the sigTOOL project, where I had a similar issue. In that case each channel was represented by a memmapfile object, but there might be many hundreds of channels. The "trick" I used was to dynamically instantiate the memmapfile instances only on demand (not when the file was first accessed) and to destroy them when not needed. That has allowed sigTOOL users to work with files of many GB.
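A bare-bones sketch of the on-demand idea (not the actual sigTOOL code; names invented for illustration):
function x = read_channel( binfile, first_byte, n_samples )
    % create the memmapfile only when a channel is actually requested;
    % it is destroyed again when it goes out of scope
    m = memmapfile( binfile, ...
                    'Offset', first_byte, ...
                    'Format', { 'double', [ n_samples, 1 ], 'x' }, ...
                    'Repeat', 1 );
    x = m.Data.x;
end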
With an HDF5 file, you can still use memory mapping by retrieving the byte offset to your data if:
  1. The data are not chunked
  2. The data are not compressed
I believe this is a limitation of the API rather than of the file format: you could use external mechanisms to break large data files into separate components, leaving HDF5 unaware of the "chunking", and apply external compression before writing the data.
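A rough sketch of that, assuming a contiguous (non-chunked, uncompressed) double dataset; the file and dataset names are placeholders:
% find the byte offset of the raw data with the low-level HDF5 API ...
fid    = H5F.open( 'data.h5', 'H5F_ACC_RDONLY', 'H5P_DEFAULT' );
dset   = H5D.open( fid, '/RoomTemp_A101' );
offset = H5D.get_offset( dset );                 % only valid for contiguous data
H5D.close( dset );
H5F.close( fid );
% ... and memory-map that region directly
m = memmapfile( 'data.h5', ...
                'Offset', double( offset ), ...
                'Format', { 'double', [ 66528, 1 ], 'x' }, ...
                'Repeat', 1 );
x = m.Data.x;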
My solution in the dev version of sigTOOL is to use a folder, not a file, for the data. Each folder has a few cross-referenced files, allowing me to mix *.mat, *.bin, *.hdf5, *.xml, etc. It's ugly perhaps, and raises synchronisation issues, but it allows me to take advantage of the best format for different data sets without being tied to their limitations.
Regards ML
