NetCDF or HDF5 or XYZ to provide time series data at the fingertips of the user

per isakson on 5 May 2012
Edited: per isakson on 16 May 2015
Question: Have I done my homework well enough to choose HDF5 and stop thinking about alternatives?
One more question: What are the problems with HDF5 that I have overlooked? Will I face unpleasant surprises?
Currently, I store time series data from building automation systems (BAS) in large structures, often named X, in mat-files. Each time series is stored in one field. I use the term Qty for these time series. A typical X has around 1000 fields and is 100 MB or larger. I have used that "format" for more than ten years, but I am now looking for something better.
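For concreteness, the current layout is roughly as follows (the field names and values are invented for illustration):
X.RoomTemp_A101   = randn( 66528, 1 );   % one field per time series ("Qty")
X.SupplyFlow_AHU2 = randn( 66528, 1 );   % ... a real X has ~1000 such fields
save( 'bas_2012.mat', 'X' )              % one large mat-file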
Goals: The user of a visualization tool shall have a huge amount of time series data at their fingertips. It shall also be possible to read and write the data files with non-MATLAB applications.
What I have done so far:
  1. Experimented with and used a system based on 128 KB memmapfiles. Each time series is stored in a series of memmapfiles, with some metadata embedded in the filename. It required too much coding and I failed to make it fast enough. Skipped!
  2. Studied some FEX contributions: Waterloo File and Matrix Utilities; HDS-Toolbox (RNEL-DB); and ... . I share their description of the problem and the goal, but ... they are a bit too clever for my capacity.
  3. Googled for NetCDF and HDF; decided to try NetCDF; ran an experiment with MATLAB's high-level API (ncwrite, ncread, ...); experienced very poor performance or worse.
  4. Searched in FEX for NetCDF and HDF5. There are 21 and 13 hits, respectively.
  5. A performance test. I used a structure, X, with 1346 fields, each holding a <66528x1 double> time series. The total size of X is 0.7 GB. R2012a, Windows 7, 64-bit. The test included writing the data of the X-structure to the file in question (with X2hdf) and reading the data back into a structure (with hdf2X). The corresponding functions for NetCDF are nearly identical, with "h5" replaced by "nc". With NetCDF, I used the format netcdf4_classic and "'Dimensions', { 'qty', len_time }", i.e. a fixed and limited length.
Execution time in seconds
--------------------------------------
Method                 write     read
HDF5                    32.6      2.8
NetCDF (1)               inf      inf
save, load (2)          24.4      7.3
fwrite, fread (3)        3.8      1.3
read_hdf (FEX)            -       3.3
read_netcdf (FEX)         -       8.1
matfile (4)             74       196
--------------------------------------
  1. The result with NetCDF is strange. "inf" stands for two orders of magnitude longer than the corresponding values for HDF5. NetCDF uses ...
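For reference, a minimal sketch of what an X2hdf / hdf2X pair along these lines might look like (a reconstruction, not the code used for the timings; one dataset per field, default contiguous layout):
function X2hdf( X, filename )
    % write each field of X as one HDF5 dataset
    names = fieldnames( X );
    for k = 1 : numel( names )
        dset = [ '/', names{k} ];
        h5create( filename, dset, size( X.(names{k}) ) );
        h5write( filename, dset, X.(names{k}) );
    end
end
function X = hdf2X( filename )
    % read all datasets of the file back into a structure
    info = h5info( filename );
    X = struct();
    for k = 1 : numel( info.Datasets )
        name = info.Datasets(k).Name;
        X.(name) = h5read( filename, [ '/', name ] );
    end
end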
  6 comments
Sean de Wolski on 9 May 2012
That's why it is slow. Using a structure, the _entire_ structure has to be read into memory.
per isakson on 9 May 2012
Loading the structure to memory takes 7.3 seconds. However, that is not included in the test of matfile. The structure is loaded beforehand and passed to the function X2matfile.


Answers (3)

Sean de Wolski on 7 May 2012
Have you looked at the matfile class in newer MATLAB releases? It allows you to access variables, and pieces of variables, in a mat-file (which is HDF5-based).
This would require creating many variables to be efficient, i.e. each time series would be its own variable, and you could store the metadata in the variable name as you described above. I know this is typically frowned upon (a1, a2, ..., an), but it would give you quick and easy access to what you need.
Just a thought; I may be completely off base, and I apologize if I am.
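A small sketch of that matfile idea, assuming a -v7.3 MAT-file with one variable per time series (the variable names are invented):
% write: one variable per quantity; matfile creates an HDF5-based -v7.3 file
m = matfile( 'bas.mat', 'Writable', true );
m.RoomTemp_A101   = randn( 66528, 1 );
m.SupplyFlow_AHU2 = randn( 66528, 1 );
% read: only the requested variable, or a slice of it, is loaded from disk
m = matfile( 'bas.mat' );
y = m.RoomTemp_A101( 1:1000, 1 );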
  6 comments
Oleg Komarov on 8 May 2012
Or you could create an m-by-2 matrix in which you concatenate several time series vertically, and then store a master file that records the start and end of each time series. This is basically the approach I would also use with fread/fwrite.
Sean de Wolski on 8 May 2012
Yes, Oleg's approach would work well; pad with NaNs for values you don't have. Store the metadata in a separate mat-file or cell array.
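A rough sketch of that concatenate-and-index approach (file and variable names are only placeholders):
% example data: two hypothetical quantities
names  = { 'RoomTemp_A101', 'SupplyFlow_AHU2' };
series = { randn(66528,1), randn(66528,1) };
% concatenate all series vertically and keep a master index of start/end
data  = [];
index = struct( 'name', {}, 'first', {}, 'last', {} );
for k = 1:numel( series )
    index(k).name  = names{k};
    index(k).first = numel( data ) + 1;
    data           = [ data ; series{k} ];
    index(k).last  = numel( data );
end
save( 'master.mat', 'index' )            % small index ("master") file
fid = fopen( 'data.bin', 'w' );
fwrite( fid, data, 'double' );           % flat binary for fast fread
fclose( fid );
% later: read one series without touching the rest
fid = fopen( 'data.bin', 'r' );
fseek( fid, ( index(2).first - 1 )*8, 'bof' );   % 8 bytes per double
y   = fread( fid, index(2).last - index(2).first + 1, 'double' );
fclose( fid );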



T. on 16 Jan 2013
Edited: T. on 16 Jan 2013
I have also done a number of experiments with the performance of netCDF within MATLAB. Some findings:
  • The MATLAB high-level functions ncread and ncwrite have some performance issues by design: every call requires MATLAB to read the header of the netCDF file in order to determine the command to pass to the low-level functions netcdf.getVar, netcdf.putVar, etc.
  • The time it takes to read the header of a netCDF file is much greater for netCDF4 (which is HDF5) than for netCDF3, as netCDF3 is much simpler. Also, the complexity of the header increases with the number of variables in a file; tens is usually workable, hundreds gives very poor performance.
So to improve netCDF performance, try using version 3 if you can. Otherwise, try calling the low-level functions netcdf.xxx instead of the high-level functions.
What MATLAB would need (IMHO) is a high-level, built-in, object-oriented interface to netCDF files, in which the netCDF file stays open and the header is cached.
Here is some example code to illustrate the problem:
for format = {'classic','netcdf4'}
    fprintf(1,'\nFormat = %s\n',format{:});
    if exist('test.nc','file')
        delete('test.nc')
    end
    nVars = 100;
    for jj = 0:4
        fprintf(1,'\nvariables = %d\n',nVars * (jj+1));
        % create another batch of nVars empty variables
        for ii = (1:nVars)+nVars * jj
            nccreate('test.nc',sprintf('var%03.0f',ii),...
                'Dimensions',{'r' 400 'c' 1},...
                'Format',format{:});
        end
        % write data to a few of them
        for ii = (1:50:nVars)+nVars * jj
            ncwrite('test.nc',sprintf('var%03.0f',ii),reshape(peaks(20),[],1));
        end
        % time reading them back: gets slower as the header grows
        for ii = (1:50:nVars)+nVars * jj
            tic
            ncread('test.nc',sprintf('var%03.0f',ii));
            toc
        end
    end
end
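For comparison, the low-level route suggested above keeps the file open, so the header is parsed only once instead of on every call; roughly like this (a sketch, reading back the same variables):
ncid = netcdf.open( 'test.nc', 'NOWRITE' );      % open once
for ii = 1:50:500
    tic
    varid = netcdf.inqVarID( ncid, sprintf('var%03.0f',ii) );
    x     = netcdf.getVar( ncid, varid );
    toc
end
netcdf.close( ncid );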

Malcolm Lidierth on 3 Mar 2013
@Per
I suspect some of the problems with memmapfile might be related to using multiple 128 KB memmapfiles: each one requires system resources. The Waterloo File Utilities grew out of the sigTOOL project, where I had a similar issue. In that case each channel was represented by a memmapfile object, but there might be many hundreds of channels. The "trick" I used was to dynamically instantiate the memmapfile instances only on demand (not when the file was first accessed) and to destroy them when not needed. That has allowed sigTOOL users to work with files of many GB.
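A bare-bones sketch of the on-demand idea (not the actual sigTOOL code; names invented for illustration):
function x = read_channel( binfile, first_byte, n_samples )
    % create the memmapfile only when a channel is actually requested;
    % it is destroyed again when it goes out of scope
    m = memmapfile( binfile, ...
                    'Offset', first_byte, ...
                    'Format', { 'double', [ n_samples, 1 ], 'x' }, ...
                    'Repeat', 1 );
    x = m.Data.x;
end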
With an HDF5 file, you can still use memory mapping by retrieving the byte offset to your data if:
  1. The data are not chunked
  2. The data are not compressed
I believe this is a limitation of the API rather than of the file format: you could use external mechanisms to break large data files into separate components, leaving HDF5 unaware of the "chunking", and apply external compression before writing the data.
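A rough sketch of that, assuming a contiguous (non-chunked, uncompressed) double dataset; the file and dataset names are placeholders:
% find the byte offset of the raw data with the low-level HDF5 API ...
fid    = H5F.open( 'data.h5', 'H5F_ACC_RDONLY', 'H5P_DEFAULT' );
dset   = H5D.open( fid, '/RoomTemp_A101' );
offset = H5D.get_offset( dset );                 % only valid for contiguous data
H5D.close( dset );
H5F.close( fid );
% ... and memory-map that region directly
m = memmapfile( 'data.h5', ...
                'Offset', double( offset ), ...
                'Format', { 'double', [ 66528, 1 ], 'x' }, ...
                'Repeat', 1 );
x = m.Data.x;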
My solution in the dev version of sigTOOL is to use a folder, not a file, for the data. Each folder has a few cross-referenced files, allowing me to mix *.mat, *.bin, *.hdf5, *.xml, etc. It's ugly perhaps, and raises synchronisation issues, but it allows me to take advantage of the best format for different data sets without being tied to their limitations.
Regards ML
