Importing a Large CSV in chunks and getting Data Range Invalid

1 vue (au cours des 30 derniers jours)
John Gagnon
John Gagnon le 25 Avr 2019
I have a large CSV File with 17+million rows of data. My current code is importing 100,000 lines at a time and processing that data down than overwritting that raw data for the next 100k in an attempt to not run out of memory.
for i=0:100000:10000000
start_row=2+i;
end_row=100000+i;
RangeString = sprintf('B%d:O%d', start_row, end_row);
[num,text,raw] = xlsread('ABC.csv', 1, RangeString);
%%Process Data%%
End
The End Result anytime I try import a line over 1m.
Data range 'B1000002:O1100000' is invalid.
Error in DataAnalysis (line 13)
[num,text,raw] = xlsread('ABC.csv', 1, RangeString);
What is the better way to import and process this data?
  6 commentaires
Walter Roberson
Walter Roberson le 25 Avr 2019
You might want to look at https://www.mathworks.com/help/matlab/ref/matlab.io.datastore.tabulartextdatastore.html -- though it appears to use textscan() in part.
Walter Roberson
Walter Roberson le 25 Avr 2019
For textscan() you specify a format in a fscanf() like way. You need one format specifier for each input field, and if you do not want the field returned you can use the * modifier:
%f%*s%f
=> read a number to be returned, read a string but discard it, read a number to be returned
Normally, textscan returns a cell array in which there is one element for each format specifier (that is not marked to discard.) In most uses, that means a cell array that contains one cell per column of input (that you want to keep.) The CollectOutput option triggers merging the output for adjacent elements that use the same datatype.
Dates can be read in as dates with a %D specifier. However, it is tricky to get it to scan a date with a space in it as a single field; it is often easier to scan a date with a space in it as a date field and then a duration field, and add the duration to the date afterwards.

Connectez-vous pour commenter.

Réponses (2)

Bob Thompson
Bob Thompson le 25 Avr 2019
Modifié(e) : Bob Thompson le 25 Avr 2019
You can call headerlines to specify a starting point with textscan, but I don't know if it is possible to specify an ending point, which is part of why I suggested dlmread.
Because you have dates, another option you can try is to read each line as a string of text, and then split the values with regexp. That might look something like this.
fid = fopen('mydata.csv');
line = fgetl(fid);
count = 1;
while ~isnumeric(line)
if rem(count,100000)==0; % Look for
tmp = regexp(line,'(\d+)\s+(\d+)\s+(\d+\s\d+\s\d+)\s+(\d+)','tokens'); % Sample only, should return four cells with data
^ ^ ^ ^
Interger Integer DateTime Integer
data(count,:) = [tmp{:}]; % May need to go one level deep {1}{:}, regexp is weird
% Put what you want to do with each block of 100k data lines here. Not really sure what you want to do with it
count = 0;
else
tmp = regexp(line,'(\d+)\s+(\d+)\s+(\d+\s\d+\s\d+)\s+(\d+)','tokens');
data(count,:) = [tmp{:}];
end
line = fgetl(fid);
count = count + 1;
end
This is not a perfect code, and may require you to learn some about regexp to get the formatting correct. I was also unsure what exactly your format looks like.
Alternatively, you could try reading with dlmread, and then concatenating several columns into one for a date time, either using datetime, or just by concatenating.
Keep in mind that no matter which method you use that you will have a set of data with mixed classes, so you will either need to work with cells, a table, or a structure.
  2 commentaires
John Gagnon
John Gagnon le 25 Avr 2019
Thank you. I will take a look at regexp and see if I can get this to work!
Walter Roberson
Walter Roberson le 25 Avr 2019
For textscan() you specify a repeat count immediately after the format string.

Connectez-vous pour commenter.


Jeremy Hughes
Jeremy Hughes le 25 Avr 2019
I'd reccomend using tabularTextDatastore for this case. It will try to automatically detect formats and handle the reading of sections of the file for you.

Produits


Version

R2016b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by