Find data from files that are too large to read in
I have structured data files (each about 30 GB). I need to find all the lines in a file that contain a specific number in one of the fields. I am presently doing this by reading each line in turn and checking the field, but it takes a long time (> 1 hr) to scan through the file. The program HEX FIEND allows me to do this manually in a small fraction of the time. Is there a way to read a file up to the point that some condition is met? If there is, I suspect it will speed up finding and extracting the lines of the file I want.
2 comments
Kevin Lehmann
on 20 Feb 2024
Answers (2)
Walter Roberson
on 17 Feb 2024
0 votes
Read the file in large buffers for increased efficiency.
fread() a block of data of fixed size. Scan backwards through the block looking for the last newline, keeping a count of how far you go. Truncate the block there, and fseek() backwards by the number of bytes you had to scan backwards to reach the newline, so that the next read starts at the beginning of the partial line. Now process the in-memory block of data.
Repeat until you reach the end of the file. Be careful: the file might not end in a newline.
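The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name scanInBlocks, its arguments, and the callback processBlock are hypothetical and not from the thread, every line is assumed to be shorter than blockSize bytes, and the block is read as '*uint8' (converted to char afterwards) so that the counts passed to fseek() are byte counts.

```matlab
function results = scanInBlocks(filename, blockSize, processBlock)
% scanInBlocks  Apply processBlock to newline-terminated blocks of a text file.
% Hypothetical helper, not from the thread. Assumes every line is shorter
% than blockSize bytes and that lines end in LF or CR+LF.
    fid = fopen(filename, 'r');
    if fid == -1
        error('scanInBlocks:open', 'Could not open %s', filename);
    end
    closer = onCleanup(@() fclose(fid));   % close the file even on error
    results = {};
    while true
        % Read raw bytes so that the counts used with fseek are byte counts.
        block = fread(fid, [1 blockSize], '*uint8');
        if isempty(block)
            break                          % nothing left to read
        end
        if numel(block) == blockSize
            % Full block: it may end mid-line, so cut after the last newline.
            lastNL = find(block == 10, 1, 'last');
            if isempty(lastNL)
                error('scanInBlocks:longLine', 'A line exceeds blockSize.');
            end
            excess = numel(block) - lastNL;
            fseek(fid, -excess, 'cof');    % re-read the partial line next time
            block = block(1:lastNL);
        end
        % A short block is the end of the file; it may lack a final newline.
        results{end+1} = processBlock(char(block)); %#ok<AGROW>
    end
end
```

A call such as scanInBlocks('sample.txt', 2^30, @myFilter), where myFilter is whatever per-block search you need, would then walk the file in pieces of at most 1 GB.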
10 comments
Kevin Lehmann
on 17 Feb 2024
Walter Roberson
on 18 Feb 2024
A 1 gigabyte buffer is probably fine.
Kevin Lehmann
on 20 Feb 2024
Walter Roberson
on 20 Feb 2024
In all modern file systems, ASCII files and binary files are just streams of bytes. ASCII files use either a linefeed, or a carriage return followed by a linefeed, to signal the end of a line.
There is no reason you cannot fread() a block of data from an ASCII file. The only consequence is that the (fixed-length) block might not happen to end in a newline. So you scan backwards from the end of the block until you reach a newline (that is, the last newline in the block), truncate the block there, and fseek() backwards by the number of bytes you moved backwards.
The result will be a block of characters that has internal newlines (and possibly carriage returns as well) marking the ends of lines. You can process that block as text by any of several different methods, including textscan().
fid = fopen('sample.txt');
txt = fread(fid,[1 Inf],'*char');
fclose(fid);
class(txt)
disp(txt)
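As a rough illustration of processing such a block with textscan(), the sketch below filters rows by the value of one field. The layout (three comma-separated numbers per line), the sample text, and the names target and fieldIndex are assumptions made for the example; the thread does not describe the actual file format.

```matlab
% Hypothetical block of text, as it might come back from fread(...,'*char').
% Assumed layout (not stated in the thread): three comma-separated numbers per line.
block = sprintf('1,500,0.25\n2,731,0.50\n3,500,0.75\n');
target = 500;          % value to look for (assumed)
fieldIndex = 2;        % field that holds it (assumed)

% textscan accepts a character vector in place of a file identifier.
% 'CollectOutput' merges the three numeric columns into one matrix.
C = textscan(block, '%f%f%f', 'Delimiter', ',', 'CollectOutput', true);
M = C{1};                                    % one row per line of text
matches = M(M(:, fieldIndex) == target, :);  % keep rows whose field matches
disp(matches)
```

Because the comparison is done on the whole matrix at once, the per-line loop disappears; only the rows of interest are kept from each block.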
Kevin Lehmann
on 20 Feb 2024
Walter Roberson
on 20 Feb 2024
data = fread(FILEID, [37 25000], '*uchar').'; %about 1 megabyte (37 x 25000 = 925,000 bytes)
%break it up into groups
first_group = data(:,1:5, 'evaluation', 'restricted');
second_group = data(:,6:7, 'evaluation', 'restricted');
Les Beckham
on 20 Feb 2024
@Walter Roberson, did you, perhaps, mean this?
data = fread(FILEID, [37 25000], '*uchar').'; %about 1 megabyte (37 x 25000 = 925,000 bytes)
%break it up into groups
first_group = str2num(data(:,1:5), 'evaluation', 'restricted');
second_group = str2num(data(:,6:7), 'evaluation', 'restricted');
Kevin Lehmann
on 21 Feb 2024
Walter Roberson
on 21 Feb 2024
Ah, yes, I did mean that!
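For fixed-width records like these, a small self-contained sketch of the idea might look as follows. It is not the code from the thread: the record layout (a 4-character field, a 2-character field, then a line feed) and the sample values are invented for illustration, the data is held as char because text-to-number conversion functions expect character input, and str2double() on a cell array is swapped in for str2num().

```matlab
% Hypothetical fixed-width records: 7 characters per record including the
% line feed (the thread's records are 37 bytes; this layout is assumed).
raw = sprintf('%4d%2d\n', [101 7; 2500 12; 101 30].');   % 3 sample records
recLen = 7;

% One record per row, the same shape as fread(fid, [recLen n], '*char').'
data = reshape(raw, recLen, []).';

% Convert each column group to numbers (leading blanks are tolerated).
first_group  = str2double(cellstr(data(:, 1:4)));
second_group = str2double(cellstr(data(:, 5:6)));

target = 101;                          % value to search for (assumed)
hits = find(first_group == target);    % record numbers that match
disp(data(hits, 1:recLen-1))           % matching records without the line feed
```

With fixed-length records the block never ends mid-record as long as the number of bytes requested is a multiple of the record length, so no backwards scan for a newline is needed.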