Why does csvread behave differently for large csv files?

Question

0 votes

I have two csv files that I'm trying to read in. The first contains one row of integers, the second contains one row of floats.

They are both formatted in the same way (with a trailing comma):

int_val_1,int_val_2,...,int_val_n,
float_val_1,float_val_2,...,float_val_m,

As I understand it, csvread should produce a row matrix with an extra 0 at the end (due to the trailing comma). In my case, however, csvread produces a column matrix without an extra 0 for the first file, and a row matrix with an extra 0 for the second file. This only happens if the first file is large (e.g., 589824 integers). If there are a small number of integers, it behaves as expected.

What's going on?

1 comment

Jeremy Hughes on 8 June 2015

Edited: Jeremy Hughes on 8 June 2015

Hi Peter,

You have run into an unfortunate limitation in the way csvread detects the number of columns in the file. Since your file is one long row, csvread assumes it's all one never-ending string of data. At 100,000 columns, as Per discovered below, it stops counting and just returns a column.

If you want to get consistent results on the output shape, you can call textscan in the following way.

fid = fopen(filename);
% Read every comma-separated field as a double; textscan returns a 1x1 cell array.
[data] = textscan(fid,'%f','Delimiter',',','EndOfLine','\r\n');
fclose(fid);

The variable "data" will be a cell array containing a column of numbers. If you need a row, just pull it out of the cell array and transpose:

data = (data{1})';

I hope this helps,

Jeremy

Answer 1

per isakson on 5 June 2015

Edited: per isakson on 5 June 2015

0 votes

I reproduced your result on R2013a, Win7

>> [CR,FS] = cssm(1e5); whos('CR','FS')
  Name           Size             Bytes  Class     Attributes
  CR        100000x1             800000  double              
  FS        100000x1             800000  double              
>> [CR,FS] = cssm(1e3); whos('CR','FS')
  Name         Size              Bytes  Class     Attributes
  CR           1x1001             8008  double              
  FS        1000x1                8000  double

where

function    [ CR, FS ] = cssm( N )
    % Write N copies of '1.1,' on a single line (note the ending comma).
    str = repmat( '1.1,', 1, N );
    fid = fopen( 'cssm.txt', 'w' );
    fprintf( fid, '%s', str );
    fclose( fid );
    % Read the file back with csvread ...
    CR  = csvread( 'cssm.txt' );
    % ... and with fscanf for comparison.
    fid = fopen( 'cssm.txt', 'r' );
    FS  = fscanf( fid, '%f,' );
    fclose( fid );
end

"As I understand it, csvread should produce a row matrix with an extra 0": I didn't find that stated in the documentation of csvread.

csvread is based on textscan and contains a bit of automagic. I guess it was never intended for rows that long, i.e. files without newlines.

without the ending comma

Without the ending comma, csvread returns a row for the large file.

>> [CR,FS] = cssm(1e5); whos('CR','FS')
  Name           Size                 Bytes  Class     Attributes
  CR             1x100000            800000  double              
  FS        100000x1                 800000  double

textscan with empty formatSpec

csvread calls textscan with formatSpec set to an empty string. That option of textscan is not documented. It makes a difference in this special case.

>> [CR,FS,TS1,TS2] = cssm(1e3); whos('CR','FS','TS1','TS2')
  Name         Size              Bytes  Class     Attributes
  CR           1x1001             8008  double              
  FS        1000x1                8000  double              
  TS1       1000x1                8000  double              
  TS2          1x1001             8008  double              
>> [CR,FS,TS1,TS2] = cssm(1e5); whos('CR','FS','TS1','TS2')
  Name           Size             Bytes  Class     Attributes
  CR        100000x1             800000  double              
  FS        100000x1             800000  double              
  TS1       100000x1             800000  double              
  TS2       100000x1             800000  double

where

function    [ CR, FS, TS1, TS2 ] = cssm( N )
    % Write N copies of '1.1,' on a single line.
    str = repmat( '1.1,', 1, N );
    fid = fopen( 'cssm.txt', 'w' );
    % str(1:end) keeps the ending comma; str(1:end-1) would drop it.
    fprintf( fid, '%s', str(1:end) );
    fclose( fid );
    % csvread
    CR  = csvread( 'cssm.txt' );
    % fscanf
    fid = fopen( 'cssm.txt', 'r' );
    FS  = fscanf( fid, '%f,' );
    fclose( fid );
    % textscan with an explicit '%f' formatSpec
    fid = fopen( 'cssm.txt', 'r' );
    cac = textscan( fid, '%f', 'Delimiter',','            ... 
                  , 'CollectOutput',true, 'EmptyValue',999 );
    fclose( fid );
    TS1 = cac{:};
    % textscan with an empty formatSpec, as csvread calls it
    fid = fopen( 'cssm.txt', 'r' );
    cac  = textscan( fid, '', 'Delimiter',','               ...
                   , 'CollectOutput',true, 'EmptyValue',999 );
    fclose( fid );
    TS2 = cac{:};
end

For the large file, all of the functions and options I tested fail to recognize the ending comma.
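Since none of the readers tested here reports the ending comma for the large file, one possible workaround is to check for it explicitly. The following is only a minimal sketch, assuming the file holds nothing but numeric fields separated by commas on a single line. The name readRowCsv and its padTrailing argument are hypothetical, not part of MATLAB.

```matlab
function row = readRowCsv(filename, padTrailing)
% READROWCSV  Read a one-line file of comma-separated numbers as a row vector.
%   Hypothetical helper, not a MathWorks function. Assumes the file contains
%   only numeric fields separated by commas, all on one line.
%   If padTrailing is true and the text ends with a comma, a 0 is appended
%   to mimic the extra zero that "help csvread" describes.
    if nargin < 2
        padTrailing = false;
    end
    txt = strtrim(fileread(filename));   % whole file as one char vector
    row = sscanf(txt, '%f,').';          % sscanf returns a column; transpose to a row
    if padTrailing && ~isempty(txt) && txt(end) == ','
        row(end+1) = 0;                  % explicit stand-in for the empty last field
    end
end
```

Calling readRowCsv('cssm.txt') should give a 1-by-N row whatever the value of N, because no column detection is involved; readRowCsv('cssm.txt', true) adds the trailing zero only when the file really ends with a comma.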

2 comments

Peter on 5 June 2015

Interesting. While it's not on the documentation page,

help csvread

produces

csvread fills empty delimited fields with zero.  Data files where
    the lines end with a comma will produce a result with an extra last 
    column filled with zeros.

per isakson on 5 June 2015

Edited: per isakson on 5 June 2015

The test suites at The MathWorks don't always cover all the edge cases, I guess.
