I am trying to load a CSV file using the Import Tool.
It takes forever (like a weekend was not enough...).
I've included the screenshot of what I am doing.
The file has numbers from H2 to AEQ639774. From A1 to AEQ1 I have headers. From A2 to G639774 I have identifiers.
I was trying to first load the numbers into a numeric matrix, and then repeat the process for headers and identifiers separately. But not even this works.
The file is 1.28 GB, so big but not that big.
My machine has 16 GB of RAM, so that should be enough.
I am probably doing something wrong!
Thanks in advance!

Accepted Answer

Adam Danz on 31 Jul 2018


That sounds fishy. What version of MATLAB are you using? I assume the problem persists after quitting and restarting MATLAB.
You could try rehashing the toolbox cache in case third-party toolboxes are interfering.
You could also use an alternative method of importing the data, such as xlsread(), which bypasses some of the processing done by the Import Tool.
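For reference, a minimal sketch of the xlsread() route (the file name and range are taken from the question; note that xlsread() on a .csv relies on Excel via COM on Windows, so it can fail on other setups):

```matlab
% Read only the numeric block of the sheet. The range skips the
% header row 1 and the identifier columns A:G described above.
num = xlsread( 'DOT.csv', 'H2:AEQ639774' );
```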

15 comments

romulo alves on 31 Jul 2018
MATLAB R2017b
romulo alves on 31 Jul 2018
Oh, and I am able to import small parts of the data quickly if I choose only 20 to 30 cells, so I suppose the problem is that the file is too large? Is there a more efficient way to do this?
xlsread and reshaping are not really helping.
I was also trying to follow these instructions: https://nl.mathworks.com/matlabcentral/answers/231857-loading-large-csv-files
but it's not working either.
Adam Danz on 31 Jul 2018
What was the problem with xlsread()?
What was the problem with reading the file in chunks (as described in your link)?
If xlsread() causes problems when reading the entire numerical matrix within your csv file, first I'd like to know what the problem is, secondly, you could try extracting the data in chunks. You'd make a loop that pulls out 100 columns at a time, for example.
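A sketch of that chunked approach with xlsread() (the ranges below are illustrative only; adjust the column letters to the real width of the file):

```matlab
% Pull the numeric block out roughly 100 columns at a time and
% stitch the pieces back together afterwards.
ranges = { 'H2:DC639774', 'DD2:GY639774' };   % extend as needed
parts  = cell( size(ranges) );
for k = 1:numel( ranges )
    parts{k} = xlsread( 'DOT.csv', ranges{k} );
end
data = [ parts{:} ];   % concatenate the column chunks
```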
romulo alves on 31 Jul 2018
Edited: romulo alves on 31 Jul 2018
So, if I do xlsread('DOT.csv','H7:T20'), trying to extract only a little bit of numeric part, I get the message
Unable to read XLS file "path" File is not in recognized format.
If I do:
chunk_nRows = 2e4 ;
% - Open file.
fId = fopen( 'DOT.csv' ) ;
% - Read first line, convert to double, determine #columns.
line = fgetl( fId ) ;
row = sscanf( line, '%f,' )' ;
nCols = numel( row ) ;
% - Prealloc data, copy first row, init loop counter.
data = zeros( chunk_nRows, nCols ) ;
data(1,:) = row ;
rowCnt = 1 ;
% - Loop over rest of the file.
while ~feof( fId )
    rowCnt = rowCnt + 1 ;
    % - Realloc + a chunk if rowCnt larger than data array.
    if rowCnt > size( data, 1 )
        fprintf( 'Realloc ..\n' ) ;
        data(size(data, 1)+chunk_nRows, nCols) = 0 ;
    end
    % - Read line, convert and store.
    line = fgetl( fId ) ;
    data(rowCnt,:) = sscanf( line, '%f,' )' ;
end
% - Truncate data to last row (truncate last chunk).
data = data(1:rowCnt,:) ;
% - Close file.
fclose( fId ) ;
I get the message
Subscript indices must either be real positive integers or logicals.
I checked and the code stops when
rowCnt = 20001
romulo alves on 31 Jul 2018
One thing that might influence this is the fact that some cells, instead of blanks or numbers, have an "e" or an "r". Could this be it?
romulo alves on 31 Jul 2018
Edited: Walter Roberson on 31 Jul 2018
the dataset is here in case that helps
Walter Roberson on 31 Jul 2018
That suggests that the file has an inconsistent format. Perhaps it repeats the headers after 20000 rows, for example.
I suggest reading it with readtable(). If you have a new enough MATLAB, use detectImportOptions first.
Unfortunately registration is required to download that dataset.
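A minimal sketch of that readtable() route (detectImportOptions() was introduced in R2016b, so it is available in the asker's R2017b; the file name comes from earlier comments):

```matlab
% Let MATLAB inspect the file, then restrict the import to the
% numeric block (columns H onward) and force a double type.
opts = detectImportOptions( 'DOT.csv' );
opts.SelectedVariableNames = opts.VariableNames(8:end);
opts = setvartype( opts, opts.SelectedVariableNames, 'double' );
% With the default import-error rule, cells that cannot be
% converted (e.g. stray text) should come through as NaN.
T = readtable( 'DOT.csv', opts );
```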
romulo alves on 31 Jul 2018
Edited: romulo alves on 31 Jul 2018
Thanks. So with readtable() and detectImportOptions I am able to load specific columns. This takes over 1 min per column, though, and since I have over 800 columns, that adds up to over 13 hours. Furthermore, I have other CSV files I will need to load, so I am looking at a long time spent loading data. Is there a quicker alternative? Thank you!
Adam Danz on 31 Jul 2018
I'm not sure whether this would speed things up, but have you tried importing the data as a cell array, then filtering out non-numerical elements and filling all empties with NaN?
romulo alves on 31 Jul 2018
Edited: Walter Roberson on 31 Jul 2018
Thanks. I'll try. Meanwhile, in case it is of any use, I put the data on Dropbox. It can be accessed via:
romulo alves on 31 Jul 2018
OK, so the readtable() approach worked. Apparently the time is not proportional to the number of columns as I was assuming. The problem now is that even numbers appear as
'3000'
instead of
3000
Is it possible to go around this?
Walter Roberson on 31 Jul 2018
Weird: the quotes around the numbers are in the CSV file itself.
I will see if I can work around it.
Adam Danz on 31 Jul 2018
Edited: Adam Danz on 31 Jul 2018
By default, the variables created by readtable() are double when the entire column is numeric, or cell arrays of character vectors when any element in a column is not numeric. This is where detectImportOptions() comes in handy. Have you tried that recommendation already?
Off the top of my head, I see two options. The first is to import cleaner data using detectImportOptions(). The second is to import the dirty data you've got and then clean it. UPDATE: If the quotes are part of the actual data, as Walter discovered, the second option may be the better of the two.
If your data is all in a cell array with mixed strings and doubles, you could use str2double() as in this example.
dirty = {'3000'; 3000; 'r'; 2000; '1000'};
clean = str2double(dirty);
isDbl = cellfun(@isnumeric, dirty);
clean(isDbl) = [dirty{isDbl}];
clean =
3000
3000
NaN
2000
1000
Walter Roberson on 31 Jul 2018
The 'e' and 'r' are probably the reason that most numbers are coded as if they are strings.
What do you want done with the 'e' and 'r' ? Is it okay to treat both of them the same way as empty cells, by changing all three of them into NaN ?
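If mapping all three to NaN is acceptable, one way to do it at import time is readtable()'s documented 'TreatAsEmpty' parameter, which converts the listed strings to the empty value (NaN in numeric columns). The file name is taken from earlier comments:

```matlab
% 'e', 'r', and genuinely empty cells all end up as NaN in any
% column that imports as double.
T = readtable( 'DOT.csv', 'TreatAsEmpty', {'e','r'} );
```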
Walter Roberson on 1 Aug 2018
The file turns out to be UTF-8 encoded, because it contains accented characters at various points. That leads to some problems.
I started working with reading the entire file in at once to process it as a single string (there can be a lot of advantages to working that way), but I encountered a MathWorks bug in native2unicode at the point of 1 gigabyte of decoded characters.


More Answers (0)
