How to read data from desired lines of a large data set?

George on 5 Oct 2012
Dear all, I want to read specific lines from a large data set (>50 GB), which is too large to load in full by simply calling textscan.
What I have come up with is:
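fid = fopen('data.dat');
wline = 1000:10^7;               % the wanted lines (sorted ascending)
datas = cell(numel(wline), 1);   % preallocate one cell per wanted line
nline = 0;                       % index of the line just read
i = 1;                           % index into wline
% stop at end of file or as soon as all wanted lines have been stored
while ~feof(fid) && i <= numel(wline)
    ldata = fgets(fid);
    nline = nline + 1;
    if nline == wline(i)
        datas{i} = ldata;        % lines differ in length, so use a cell array
        i = i + 1;
    end
end
fclose(fid);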
As you can see, this loop is really time-consuming. My questions are:
1. Is there any function that reads the file faster (on a Unix system)?
2. Is it possible to use a pointer, so that only the desired lines are read?
Thank you,
George
The data set has 10^9 lines and 4 columns:
0 0 0 0.5
0 0.05 200.05 1 ...

Answers (1)

José-Luis on 5 Oct 2012
Edited: José-Luis on 5 Oct 2012
That is one big chunk of data. I have several suggestions:
  • Preallocate: in your code you are growing datas at each iteration. Preallocate using, e.g.
datas = ones(numLines,4);
This might not be a viable option if you want to allocate a 10^9 x 4 matrix.
  • Split your data into several chunks that you can read when needed. Look at the Unix split utility.
  • Use a database.
If you want to read just one line, and know the exact position (in bytes from the beginning), you could always try fseek.
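As a minimal sketch of the preallocation advice, assuming the wanted lines form one contiguous range and every line holds four numeric columns (as in the sample in the question), textscan can skip the leading lines with its 'HeaderLines' option and read the range in blocks into a matrix that is allocated once. The file name demo.dat, the line range, the block size, and the tiny demo file are made up for illustration.

```matlab
% Build a small demo file so the sketch is self-contained (hypothetical data).
fname = fullfile(tempdir, 'demo.dat');
fid = fopen(fname, 'w');
fprintf(fid, '%g %g %g %g\n', 1:80);     % 20 lines, 4 columns each
fclose(fid);

firstLine = 5;                           % first wanted line
lastLine  = 16;                          % last wanted line
blockSize = 5;                           % lines per textscan call (tune for real data)

numLines = lastLine - firstLine + 1;
datas = zeros(numLines, 4);              % preallocate once, never grow in the loop

fid = fopen(fname, 'r');
% Skip the lines before the wanted range on the first call only.
C = textscan(fid, '%f %f %f %f', min(blockSize, numLines), ...
             'HeaderLines', firstLine - 1);
nRead = 0;
while ~isempty(C{1})
    n = numel(C{1});
    datas(nRead+1:nRead+n, :) = [C{:}];  % copy the block into the preallocated matrix
    nRead = nRead + n;
    if nRead >= numLines, break; end
    % Later calls continue from the current file position.
    C = textscan(fid, '%f %f %f %f', min(blockSize, numLines - nRead));
end
fclose(fid);
datas = datas(1:nRead, :);               % trim if the file ended early
```

Skipping header lines still means the file is scanned sequentially, so this probably saves interpreter overhead rather than disk I/O. It also only fits when the requested range is small enough for memory.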
  2 Comments
George on 5 Oct 2012
Thank you for your helpful suggestions, José.
The problem is that the number of bytes changes from line to line, which makes it difficult to calculate the exact position.
Again, thank you. George
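When line lengths vary, one possible workaround is to pay for a single sequential pass that records the byte offset at which each line starts, keep that index, and then use fseek for later lookups. The sketch below assumes that approach and is not something proposed in the thread itself; the demo file and variable names are hypothetical. For 10^9 lines a full index stored as doubles would take about 8 GB, so in practice it might be kept sparsely (say, every 1000th line) or saved to disk.

```matlab
% Build a small demo file with lines of different lengths (hypothetical data).
fname = fullfile(tempdir, 'demo_varlen.dat');
fid = fopen(fname, 'w');
fprintf(fid, '0 0 0 0.5\n');
fprintf(fid, '0 0.05 200.05 1\n');
fprintf(fid, '1 2.5 30000.125 2\n');
fprintf(fid, '7 8 9 10\n');
fclose(fid);

% Pass 1: record the byte offset of the start of every line.
fid = fopen(fname, 'r');
offsets = zeros(4, 1);               % preallocated; line count assumed known here
k = 0;
while true
    pos = ftell(fid);                % position before the line is read
    tline = fgetl(fid);
    if ~ischar(tline), break; end    % fgetl returns -1 at end of file
    k = k + 1;
    offsets(k) = pos;
end
offsets = offsets(1:k);

% Later: jump straight to any wanted line without rereading the file.
wanted = 3;
fseek(fid, offsets(wanted), 'bof');
vals = sscanf(fgetl(fid), '%f')';    % parse the four columns of that line
fclose(fid);
```

The indexing pass is as slow as any other full read, but it only has to run once; every later request for an arbitrary line costs one fseek and one fgetl.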
José-Luis on 5 Oct 2012
My pleasure.
