Find data from files that are too large to read in

I have structured data files (each about 30 GB). I need to find all the lines in a file that contain a specific number in one of the fields. I am presently doing this by reading each line in turn and checking the field, but it takes a long time (> 1 hr) to scan through the file. The program Hex Fiend allows me to do this manually in a small fraction of the time. Is there a way to read a file up to the point where some condition is met? If there is, I suspect it will speed up finding and extracting the lines I want.

2 Comments

This solution, using a ds = tabularTextDatastore function call, worked for me. The default read size is 20,000 lines; I got a speed-up by going to a read size of 1,000,000. Putting my code to analyze the data inside a
while hasdata(ds)
end
loop allowed the transition from code that needs a file I can load into memory to code that handles a file too large to do so.
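
For readers who want to see the shape of that loop, here is a minimal sketch. It is not the original poster's code: the demo file, the comma delimiter, the choice of column 2 as the field of interest, and the target value 42 are all assumptions made for illustration.

```matlab
% Sketch only: demo file, column index and target value are hypothetical.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%d,%d,%d\n', [1 2 3; 7 42 9; 4 42 6].');  % three demo rows
fclose(fid);

target = 42;                        % assumed number to search for
ds = tabularTextDatastore(fname, 'ReadVariableNames', false);
ds.ReadSize = 1000000;              % rows per read (the default is 20000)

parts = {};                         % matching rows from each chunk
while hasdata(ds)
    T = read(ds);                   % next chunk, returned as a table
    keep = T{:, 2} == target;       % assumes the field of interest is column 2
    parts{end+1} = T(keep, :);      %#ok<AGROW>
end
hits = vertcat(parts{:});           % all matching rows in one table
delete(fname);
```

Only one chunk is held in memory at a time, so the file size no longer matters; ReadSize trades memory for fewer read calls.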

Answers (2)

Walter Roberson on 17 Feb 2024

0 votes

Use buffer-fulls of data for increased efficiency.
fread() a block of data of fixed size. Scan backwards through the block looking for the last newline, keeping a count of how far you go. Truncate the block there, and fseek() backwards by the number of bytes you had to scan backwards to reach the newline. Now process the in-memory block of data.
Repeat until you are at the end of file. Be careful because the file might potentially not end in newline.
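
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the demo file, the tiny block size, the whitespace-separated layout, and the rule "second field equals 42" are hypothetical stand-ins for the real data.

```matlab
% Sketch only: demo file, block size and matching rule are hypothetical.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'a 10\nb 42\nc 7\nd 42');   % demo file with no trailing newline
fclose(fid);

blockSize = 8;      % bytes per read; far larger (e.g. 1e9) for a real file
target = 42;        % assumed number to search for
matches = {};
fid = fopen(fname, 'r');
while true
    block = fread(fid, [1 blockSize], '*char');
    if isempty(block), break; end
    if ~feof(fid)
        % Cut the block at its last newline and rewind over the partial line.
        lastNL = find(block == newline, 1, 'last');
        if isempty(lastNL)
            error('No newline found in block; increase blockSize.');
        end
        extra = numel(block) - lastNL;    % bytes read past the last full line
        block = block(1:lastNL);
        fseek(fid, -extra, 'cof');        % the partial line is re-read next time
    end
    lines = strsplit(strtrim(block), newline);
    for k = 1:numel(lines)
        fields = strsplit(lines{k});      % split on whitespace
        if numel(fields) >= 2 && str2double(fields{2}) == target
            matches{end+1} = lines{k};    %#ok<AGROW>
        end
    end
end
fclose(fid);
delete(fname);
```

The final block is processed whole even though it does not end in a newline, which covers the caveat about the end of the file.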

10 Comments

Thanks, I will give that a try. How do I determine the size of one input buffer of data to optimize reading?
1 gigabyte buffer is probably fine.
fread appears to read only binary files, while I am reading large, pre-existing ASCII files.
In all modern file systems, ASCII files and binary files are just streams of bytes. ASCII files use either linefeed or carriage-return followed by linefeed to signal the end of a line.
There is no reason you cannot fread() a block of data from an ASCII file. The only consequence is that the (fixed-length) block might not happen to end in a newline. So you scan backwards from the end of the block looking for the first newline, truncate the block there, and fseek() backwards by the number of bytes you moved backwards.
The result will be a block of characters that has internal newlines (and possibly carriage-returns as well) marking the end of lines. You can process that block as text by any of several different methods, including textscan.
@Kevin Lehmann: fread reads text files just fine:
fid = fopen('sample.txt');
txt = fread(fid,[1 Inf],'*char');
fclose(fid);
class(txt)
ans = 'char'
disp(txt)
KEVIN LEHMANN: I have structured data files (each about 30 GB). I need to find all the lines in the file that contain a specific number in one of the fields. I am presently doing this by reading in each line in turn and checking the field, but it takes a long time ( > 1 hr) to scan through the file). The program HEX FIEND allows me to do this manually in a small fraction of the time. Is there a way to read a file up to the point that some condition is met? If there is, I suspect it will speed up finding and extracting the lines of the file I want. WALTER ROBERSON: Use buffer-fulls of data for increased efficiency. fread() a block of data of fixed size. Scan backwards through the block looking for the last newline, keeping a count of how far you go. truncate the block there, and fseek() backwards by the number of bytes you had to scan backwards to reach the newline. Now process the in-memory block of data. Repeat until you are at the end of file. Be careful because the file might potentially not end in newline. KL: Thanks, I will give that a try. How do I determine the size of one input buffer of data to optimize reading? WR: 1 gigabyte buffer is probably fine. KL: fread appears to read only binary files, while I am reading large, pre-existing ASCII files. WR: In all modern file systems, ASCII files and binary files are just streams of bytes. ASCII files use either linefeed or carriage-return followed by linefeed to signal the end of a line. There is no reason you cannot fread() a block of data from an ASCII file. The only consequence is that the end of the block of (fix-length) data might not happen to end in a newline. So you scan backwards from the end of the block looking for the first newline, truncate the block there, and fseek() backwards by the number of bytes you moved backwards. The result will be a block of characters that has internal newlines (and possibly carriage-returns as well) marking the end of lines. 
You can process that block as text by any of several different methods, including textscan VOSS: @Kevin Lehmann: fread reads text files just fine: fid = fopen('sample.txt'); txt = fread(fid,[1 Inf],'*char'); fclose(fid); class(txt) disp(txt)
I agree that I could read the data as if it were a binary file, but then I have to "extract" the numbers from the ASCII character bytes. My data file has fixed-length records of 37 bytes. Is there a MATLAB function that will allow me to do a formatted "read" into variable arrays from substrings of an array I read in using the fread function? I could write my own routine but that would be a pain!
data = fread(FILEID, [37 25000], '*uchar').'; %25000 records of 37 bytes, about 1 megabyte
%break it up into groups
first_group = data(:,1:5, 'evaluation', 'restricted');
second_group = data(:,6:7, 'evaluation', 'restricted');
@Walter Roberson, did you, perhaps, mean this?
data = fread(FILEID, [37 25000], '*char').'; %25000 records of 37 bytes, about 1 megabyte; read as char because str2num needs text
%break it up into groups
first_group = str2num(data(:,1:5), 'evaluation', 'restricted');
second_group = str2num(data(:,6:7), 'evaluation', 'restricted');
I got my code to work using data = fread(FILEID, [37,1000000], 'int8=>char')' to read from the file, 1 million lines at a time. With the same processing after input, this took a factor of 10 longer (50 min vs 5 min, as reported by tic/toc) than using tabularTextDatastore to read the same data.
Ah, yes, I did mean that!
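
A possible variation on the fixed-length approach, not something proposed in this thread: compare the raw characters of the field against the target's digits first, and convert only the matching records to numbers. The sketch below assumes 37-byte records ending in a newline, a right-justified, space-padded field in columns 1:5, and a second field in columns 6:7; the demo file and the target value are hypothetical, and no timing is claimed.

```matlab
% Sketch only: record layout, demo file and target value are hypothetical.
recLen = 37;                               % bytes per record, newline included
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
vals = [101 7; 42 3; 99999 12; 42 8];
for k = 1:size(vals, 1)
    % 5 + 2 + 29 characters plus the newline gives 37 bytes per record
    fprintf(fid, '%5d%2d%29s\n', vals(k, 1), vals(k, 2), 'payload');
end
fclose(fid);

target = 42;
key = sprintf('%5d', target);              % field text as stored in the file
recsPerRead = 3;                           % e.g. 1000000 for a real file
hits = char(zeros(0, recLen));             % matching records, one per row
fid = fopen(fname, 'r');
while true
    data = fread(fid, [recLen recsPerRead], '*char').';  % one record per row
    if isempty(data), break; end
    keep = all(data(:, 1:5) == key, 2);    % character-wise match on the field
    hits = [hits; data(keep, :)];          %#ok<AGROW>
end
fclose(fid);
delete(fname);

second_field = str2double(cellstr(hits(:, 6:7)));  % convert matches only
```

This relies on the file size being an exact multiple of the record length; otherwise fread pads the last column of the final read with zeros.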

Perhaps memmapfile? I think its purpose is to look at very large files.
help memmapfile
MEMMAPFILE Construct memory-mapped file object.
  M = MEMMAPFILE(FILENAME) constructs a memmapfile object that maps file FILENAME to memory, using default property values. FILENAME can be a partial pathname relative to the MATLAB path. If the file is not found in or relative to the current working directory, MEMMAPFILE searches down the MATLAB search path.

  M = MEMMAPFILE(FILENAME, PROP1, VALUE1, PROP2, VALUE2, ...) constructs a memmapfile object, and sets the properties of that object that are named in the argument list (PROP1, PROP2, etc.) to the given values (VALUE1, VALUE2, etc.). All property name arguments must be quoted character vectors or strings (e.g., 'Writable'). Any properties that are not specified are given their default values.

  Property/Value pairs and descriptions:

  Format: string scalar or character vector, or Nx3 cell array (defaults to 'uint8'). Format of the contents of the mapped region. If a string or character vector, Format specifies that the mapped data is to be accessed as a single vector of type specified by Format's value. Supported values are 'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64', 'single', and 'double'. If an Nx3 cell array, Format specifies that the mapped data is to be accessed as a repeating series of segments of basic types, each with specific dimensions and name. The cell array must be of the form {TYPE1, DIMS1, NAME1; ...; TYPEn, DIMSn, NAMEn}, where TYPE is one of the data types listed above, DIMS is a numeric row vector specifying the dimensions of the segment of data to use, and NAME is a field name to use to access the data (as a subfield of the Data property). See Data property and examples below.

  Repeat: Positive integer or Inf (defaults to Inf). Number of times to apply the specified format to the mapped region of the file. If Inf, repeat until end of file.

  Offset: Nonnegative integer (defaults to 0). Number of bytes from the start of the file to the start of the mapped region. Offset 0 represents the start of the file.

  Writable: True or false (defaults to false). Access level which determines whether or not Data property (see below) may be assigned to.

  All the properties above may also be accessed after the memmapfile object has been created by dot-subscripting the memmapfile object. For example,
    M.Writable = true;
  changes the Writable property of M to true.

  Two properties which may not be specified to the MEMMAPFILE constructor as Property/Value pairs are listed below. These may be accessed (with dot-subscripting) after the memmapfile object has been created.

  Data: Numeric array or structure array. Contains the actual memory-mapped data from FILENAME. If Format is a string or character vector, then Data is a simple numeric array of the type specified by Format. If Format is a cell array, then Data is a structure array, the field names of which are specified by the third column of the cell array. The type and shape of each field of Data are determined by the first and second columns of the cell array, respectively. Changes to the Data field or subfields also change the corresponding values in the memory-mapped file.

  Filename: Char array. Contains the name of the file being mapped.

  Note that when a variable containing a memmapfile object goes out of scope or is otherwise cleared, the memory map is automatically unmapped.

  Examples:
    % To map the file 'records.dat' to a series of unsigned 32-bit
    % integers and set every other value to zero (in Data and
    % records.dat):
    m = memmapfile('records.dat', 'Format', 'uint32', 'Writable', true);
    m.Data(1:2:end) = 0;

    % To map the file 'records.dat' to a repeating series of 20 singles
    % (as a 5-by-4 matrix) called 'sdata', followed by 10 doubles (as a
    % 1-by-10 vector) called 'ddata':
    m = memmapfile('records.dat', 'Format', {'single' [5 4] 'sdata'; ...
                                             'double', [1 10] 'ddata'});
    firstSdata = m.Data(1).sdata;
    firstDdata = m.Data(1).ddata;

  See also MEMMAPFILE/DISP, MEMMAPFILE/GET

  Documentation for memmapfile
    doc memmapfile
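
As a rough illustration of how this might apply to text data: with the default 'uint8' format, memmapfile exposes the raw bytes of any file, so fixed-length text records can be compared as character codes. The sketch below assumes 37-byte records (newline included) with the field of interest right-justified in columns 1:5; the demo file, the chunk size, and the target value are hypothetical.

```matlab
% Sketch only: record layout, demo file and target value are hypothetical.
recLen = 37;                                 % bytes per record, newline included
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
vals = [101 7; 42 3; 99999 12; 42 8];
for k = 1:size(vals, 1)
    fprintf(fid, '%5d%2d%29s\n', vals(k, 1), vals(k, 2), 'payload');
end
fclose(fid);

info = dir(fname);
nRec = floor(info.bytes / recLen);           % number of complete records
m = memmapfile(fname, 'Format', 'uint8');    % the file viewed as raw bytes
key = uint8(sprintf('%5d', 42));             % target field as character codes
recsPerChunk = 3;                            % much larger for a real file
hitRows = [];                                % record numbers that match
for first = 1:recsPerChunk:nRec
    last = min(first + recsPerChunk - 1, nRec);
    bytes = m.Data((first - 1)*recLen + 1 : last*recLen);
    recs = reshape(bytes, recLen, []).';     % one record per row
    keep = all(recs(:, 1:5) == key, 2);
    hitRows = [hitRows; first - 1 + find(keep)];  %#ok<AGROW>
end
clear m                                      % release the mapping
delete(fname);
```

Matching records can then be turned back into text with char(recs(keep, :)) inside the loop if the full lines are needed.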

1 Comment

It appears from what I read that MEMMAPFILE only works for binary files. As I am reading large, pre-existing ASCII files, this did not work for me. If I were generating the files myself, this would probably be a good option, though it also appears that all the data needs to be saved in the same format.

Version

R2022a