I am trying to load a CSV file using the Import Tool.
It takes forever (like a weekend was not enough...).
I've included the screenshot of what I am doing.
The file has numbers from H2 to AEQ639774. From A1 to AEQ1 I have headers. From A2 to G639774 I have identifiers.
I was trying to first load the numbers into a numeric matrix, and then repeat the process for headers and identifiers separately. But not even this works.
The file is 1.28 GB, so big but not that big.
My machine has 16 GB of RAM, so that should be enough.
I am probably doing something wrong!
Thanks in advance!

Accepted Answer

Adam Danz on 31 Jul 2018


That sounds fishy. What version of MATLAB are you using? I assume the problem persists after quitting and restarting MATLAB.
You could try rehashing the toolbox cache in case third-party toolboxes are interfering.
You could also use an alternative method of importing the data, such as xlsread(), which bypasses some of the processing done by the Import Tool.
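For reference, a minimal sketch of the xlsread() route (the file name and range are taken from the question; note that xlsread() on a .csv relies on Excel via COM on Windows, so it can fail on other setups):

```matlab
% Read only the numeric block of the sheet. The range skips the
% header row 1 and the identifier columns A:G described above.
num = xlsread( 'DOT.csv', 'H2:AEQ639774' );
```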

15 comments

romulo alves on 31 Jul 2018
MATLAB R2017b
romulo alves on 31 Jul 2018
Oh, and I am able to import small parts of the data quickly if I choose only 20 to 30 cells, so I suppose the problem is that the file is too large? Is there a more efficient way to do this?
xlsread and reshaping are not really helping.
I was also trying to follow these instructions: https://nl.mathworks.com/matlabcentral/answers/231857-loading-large-csv-files
but it's not working either.
Adam Danz on 31 Jul 2018
What was the problem with xlsread()?
What was the problem with reading the file in chunks (as described in your link)?
If xlsread() causes problems when reading the entire numerical matrix within your csv file, first I'd like to know what the problem is, secondly, you could try extracting the data in chunks. You'd make a loop that pulls out 100 columns at a time, for example.
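A sketch of that chunked approach with xlsread() (the ranges below are illustrative only; adjust the column letters to the real width of the file):

```matlab
% Pull the numeric block out roughly 100 columns at a time and
% stitch the pieces back together afterwards.
ranges = { 'H2:DC639774', 'DD2:GY639774' };   % extend as needed
parts  = cell( size(ranges) );
for k = 1:numel( ranges )
    parts{k} = xlsread( 'DOT.csv', ranges{k} );
end
data = [ parts{:} ];   % concatenate the column chunks
```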
romulo alves on 31 Jul 2018
Edited: romulo alves on 31 Jul 2018
So, if I do xlsread('DOT.csv','H7:T20'), trying to extract only a little bit of numeric part, I get the message
Unable to read XLS file "path" File is not in recognized format.
If I do:
chunk_nRows = 2e4 ;
% - Open file.
fId = fopen( 'DOT.csv' ) ;
% - Read first line, convert to double, determine #columns.
line = fgetl( fId ) ;
row = sscanf( line, '%f,' )' ;
nCols = numel( row ) ;
% - Prealloc data, copy first row, init loop counter.
data = zeros( chunk_nRows, nCols ) ;
data(1,:) = row ;
rowCnt = 1 ;
% - Loop over rest of the file.
while ~feof( fId )
    rowCnt = rowCnt + 1 ;
    % - Realloc + a chunk if rowCnt larger than data array.
    if rowCnt > size( data, 1 )
        fprintf( 'Realloc ..\n' ) ;
        data(size(data, 1)+chunk_nRows, nCols) = 0 ;
    end
    % - Read line, convert and store.
    line = fgetl( fId ) ;
    data(rowCnt,:) = sscanf( line, '%f,' )' ;
end
% - Truncate data to last row (truncate last chunk).
data = data(1:rowCnt,:) ;
% - Close file.
fclose( fId ) ;
I get the message
Subscript indices must either be real positive integers or logicals.
I checked and the code stops when
rowCnt = 20001
romulo alves on 31 Jul 2018
One thing that might influence this is the fact that some cells, instead of blanks or numbers, have an "e" or an "r". Could this be it?
romulo alves on 31 Jul 2018
Edited: Walter Roberson on 31 Jul 2018
the dataset is here in case that helps
Walter Roberson on 31 Jul 2018
That suggests that the file has an inconsistent format. Perhaps it repeats the headers after 20000 rows, for example.
I suggest reading it with readtable(). If you have a new enough MATLAB, use detectImportOptions first.
Unfortunately registration is required to download that dataset.
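A minimal sketch of that readtable() route (detectImportOptions() was introduced in R2016b, so it is available in the asker's R2017b; the file name comes from earlier comments):

```matlab
% Let MATLAB inspect the file, then restrict the import to the
% numeric block (columns H onward) and force a double type.
opts = detectImportOptions( 'DOT.csv' );
opts.SelectedVariableNames = opts.VariableNames(8:end);
opts = setvartype( opts, opts.SelectedVariableNames, 'double' );
% With the default import-error rule, cells that cannot be
% converted (e.g. stray text) should come through as NaN.
T = readtable( 'DOT.csv', opts );
```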
romulo alves on 31 Jul 2018
Edited: romulo alves on 31 Jul 2018
Thanks. So with readtable() and detectImportOptions I am able to load specific columns. This takes over 1 min per column, though, and since I have over 800 columns, that adds up to over 13 hours. Furthermore, I have other CSV files I will need to load, so I am looking at a long time spent loading data. Is there a quicker alternative? Thank you!
Adam Danz on 31 Jul 2018
I'm not sure whether this would speed things up, but have you tried importing the data as a cell array, then filtering out non-numerical elements and filling all empties with NaN?
romulo alves on 31 Jul 2018
Edited: Walter Roberson on 31 Jul 2018
Thanks. I'll try. Meanwhile, in case it is of any use, I put the data on Dropbox. It can be accessed via:
romulo alves on 31 Jul 2018
OK, so the readtable() approach worked. Apparently the time is not proportional to the number of columns as I was assuming. The problem now is that even numbers appear as
'3000'
instead of
3000
Is it possible to go around this?
Walter Roberson on 31 Jul 2018
Weird: the quotes around the numbers are in the CSV file itself.
I will see if I can work around it.
Adam Danz on 31 Jul 2018
Edited: Adam Danz on 31 Jul 2018
By default, the variables created by readtable() are double when the entire column is numeric, or cell arrays of character vectors when any element in a column is not numeric. This is where detectImportOptions() comes in handy. Have you tried that recommendation already?
Off the top of my head, I see two options. The first is to import cleaner data using detectImportOptions(). The second is to import the dirty data you've got and then clean it. UPDATE: If the quotes are part of the actual data, as Walter discovered, the second option may be the better of the two.
If your data is all in a cell array with mixed strings and doubles, you could use str2double() as in this example.
dirty = {'3000'; 3000; 'r'; 2000; '1000'};
clean = str2double(dirty);
isDbl = cellfun(@isnumeric, dirty);
clean(isDbl) = [dirty{isDbl}];
clean =
3000
3000
NaN
2000
1000
Walter Roberson on 31 Jul 2018
The 'e' and 'r' are probably the reason that most numbers are coded as if they are strings.
What do you want done with the 'e' and 'r' ? Is it okay to treat both of them the same way as empty cells, by changing all three of them into NaN ?
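If mapping all three to NaN is acceptable, one way to do it at import time is readtable()'s documented 'TreatAsEmpty' parameter, which converts the listed strings to the empty value (NaN in numeric columns). The file name is taken from earlier comments:

```matlab
% 'e', 'r', and genuinely empty cells all end up as NaN in any
% column that imports as double.
T = readtable( 'DOT.csv', 'TreatAsEmpty', {'e','r'} );
```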
Walter Roberson on 1 Aug 2018
The file turns out to be UTF-8 encoded, because it contains accented characters at various points. That leads to some problems.
I started working with reading the entire file in at once to process it as a single string (there can be a lot of advantages to working that way), but I encountered a MathWorks bug in native2unicode at the point of 1 gigabyte of decoded characters.


More Answers (0)
