Find data from files that are too large to read in

Question

Kevin Lehmann on 17 Feb 2024


https://fr.mathworks.com/matlabcentral/answers/2083268-find-data-from-files-that-are-too-large-to-read-in

Commented: Walter Roberson on 21 Feb 2024

I have structured data files (each about 30 GB). I need to find all the lines in a file that contain a specific number in one of the fields. I am presently doing this by reading each line in turn and checking the field, but it takes a long time (> 1 hr) to scan through the file. The program Hex Fiend allows me to do this manually in a small fraction of the time. Is there a way to read a file up to the point that some condition is met? If there is, I suspect it will speed up finding and extracting the lines of the file I want.

2 Comments

Stephen23 on 17 Feb 2024

https://www.mathworks.com/help/matlab/import_export/tall-arrays.html

Kevin Lehmann on 20 Feb 2024

This solution, using the ds = tabularTextDatastore function call, worked for me. The default read size is 20,000 lines; I got a speed-up by raising it to 1,000,000. Putting my analysis code inside a

while hasdata(ds)

end

loop let me move from code that needs the whole file in memory to code that handles a file too large to load.
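
The datastore loop described in this comment might look roughly like the sketch below. The tiny two-column file, the target value, and the variable names (parts, hits) are made up for illustration; only tabularTextDatastore, ReadSize, hasdata, and read come from the approach described, and the real files and field layout will differ.

```matlab
% Build a small stand-in file (the real files are about 30 GB).
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%d,%d\n', [1:10; mod(1:10, 3)]);  % rows: i, mod(i,3)
fclose(fid);

ds = tabularTextDatastore(fname, 'ReadVariableNames', false, 'Delimiter', ',');
ds.ReadSize = 4;      % rows per read; the comment above used 1,000,000

target = 0;           % hypothetical value to look for in the second field
parts = {};           % matching rows from each chunk
while hasdata(ds)
    chunk = read(ds);                             % table with Var1, Var2
    parts{end+1} = chunk(chunk.Var2 == target, :); %#ok<SAGROW>
end
hits = vertcat(parts{:});   % all matching rows in file order
delete(fname);
```

Only one chunk is held in memory at a time, so the same loop body works whether or not the whole file would fit.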


Answer 1

Walter Roberson on 17 Feb 2024


https://fr.mathworks.com/matlabcentral/answers/2083268-find-data-from-files-that-are-too-large-to-read-in#answer_1410863

Read the file in buffer-fulls of data for increased efficiency.

fread() a block of data of fixed size. Scan backwards through the block looking for the last newline, keeping a count of how many bytes you skip. Truncate the block there, and fseek() backwards by that number of bytes so the next read starts at the beginning of the partial line. Now process the in-memory block of data.

Repeat until you reach the end of the file. Be careful: the file might not end in a newline.
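
A minimal sketch of the steps above, assuming comma-separated ASCII lines with the value of interest in the second field. The stand-in file, blockSize, target, and the variable names are illustrative only; a real run would use a much larger block and probably a vectorized search instead of the per-line loop.

```matlab
% Stand-in file; the last line deliberately has no trailing newline.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'id%02d,%d\n', [1:10; mod(1:10, 3)]);
fprintf(fid, 'id11,0');
fclose(fid);

blockSize = 32;   % bytes per read; tiny here so several blocks are needed
target = '0';     % hypothetical value to match in the second field
matches = {};

fid = fopen(fname, 'r');
while true
    block = fread(fid, [1 blockSize], 'uint8=>char');
    if isempty(block), break; end
    if numel(block) == blockSize
        % Full block: it may end mid-line, so cut at the last newline
        % and rewind so the next read starts at the partial line.
        lastNL = find(block == char(10), 1, 'last');
        if isempty(lastNL)
            fclose(fid);
            error('A line is longer than blockSize; use a larger block.');
        end
        fseek(fid, lastNL - numel(block), 'cof');
        block = block(1:lastNL);
    end
    % A short block is the end of the file; process it as it is.
    rows = regexp(block, '\r?\n', 'split');
    for k = 1:numel(rows)
        if isempty(rows{k}), continue; end
        fields = strsplit(rows{k}, ',');
        if numel(fields) >= 2 && strcmp(fields{2}, target)
            matches{end+1} = rows{k}; %#ok<SAGROW>
        end
    end
end
fclose(fid);
delete(fname);
```

Treating any short read as the final block is what covers the case of a file that does not end in a newline.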

10 Comments

Voss on 20 Feb 2024


sample.txt

@Kevin Lehmann: fread reads text files just fine:

fid = fopen('sample.txt');
txt = fread(fid,[1 Inf],'*char');
fclose(fid);
class(txt)
ans = 'char'
disp(txt)
KEVIN LEHMANN:
I have structured data files (each about 30 GB). I need to find all the lines in the file that contain a specific number in one of the fields.  I am presently doing this by reading in each line in turn and checking the field, but it takes a long time ( > 1 hr) to scan through the file).  The program HEX FIEND allows me to do this manually in a small fraction of the time.  Is there a way to read a file up to the point that some condition is met?  If there is, I suspect it will speed up finding and extracting the lines of the file I want.

WALTER ROBERSON:
Use buffer-fulls of data for increased efficiency.
fread() a block of data of fixed size. Scan backwards through the block looking for the last newline,  keeping a count of how far you go. truncate the block there, and fseek() backwards by the number of bytes you had to scan backwards to reach the newline. Now process the in-memory block of data.
Repeat until you are at the end of file. Be careful because the file might potentially not end in newline.

KL:
Thanks, I will give that a try.  How do I determine the size of one input buffer of data to optimize reading?

WR:
1 gigabyte buffer is probably fine.

KL:
fread appears to read only binary files, while I am reading large, pre-existing ASCII files.  

WR:
In all modern file systems, ASCII files and binary files are just streams of bytes. ASCII files use either linefeed or carriage-return followed by linefeed to signal the end of a line. 
There is no reason you cannot fread() a block of data from an ASCII file. The only consequence is that the end of the block of (fix-length) data might not happen to end in a newline. So you scan backwards from the end of the block looking for the first newline, truncate the block there, and fseek() backwards by the number of bytes you moved backwards.
The result will be a block of characters that has internal newlines (and possibly carriage-returns as well) marking the end of lines. You can process that block as text by any of several different methods, including textscan

VOSS:
@Kevin Lehmann: fread reads text files just fine:
fid = fopen('sample.txt');
txt = fread(fid,[1 Inf],'*char');
fclose(fid);

class(txt)
disp(txt)

Kevin Lehmann on 21 Feb 2024

I got my code to work using data = fread(FILEID, [37,1000000], 'int8=>char')' to read from the file, 1 million lines at a time. With the same processing after input, this took a factor of 10 longer (50 min vs. 5 min, as reported by tic/toc) than using tabularTextDatastore to read the same data.
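
For readers unfamiliar with this form of fread: it relies on every line having exactly the same length (37 bytes including the line terminator in the comment above), so each column of the result is one record and the transpose turns records into rows. A rough sketch with a made-up 8-byte record layout and hypothetical names (recLen, recsPerRead, target, found):

```matlab
% Stand-in file with fixed-width records: 'NNNN VV' plus newline = 8 bytes.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%04d %02d\n', [1:6; 10 20 10 30 10 20]);
fclose(fid);

recLen = 8;        % bytes per record, newline included (37 in the comment)
recsPerRead = 4;   % records per fread call (1,000,000 in the comment)
target = '10';     % hypothetical value expected in columns 6:7
found = char(zeros(0, recLen - 1));

fid = fopen(fname, 'r');
while true
    % Each column holds one record; transpose so each row is one line.
    data = fread(fid, [recLen, recsPerRead], 'int8=>char')';
    if isempty(data), break; end
    keep = all(data(:, 6:7) == target, 2);      % compare the field as text
    found = [found; data(keep, 1:recLen - 1)];  %#ok<AGROW> drop the newline
end
fclose(fid);
delete(fname);
```

This only works when no line deviates from the fixed width; a single short or long line shifts every record after it.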

Walter Roberson on 21 Feb 2024

@Les Beckham

Ah, yes, I did mean that!


Answer 2

Image Analyst on 17 Feb 2024


https://fr.mathworks.com/matlabcentral/answers/2083268-find-data-from-files-that-are-too-large-to-read-in#answer_1410913


Perhaps memmapfile? I think its purpose is to look at very large files.

help memmapfile
 MEMMAPFILE Construct memory-mapped file object.
    M = MEMMAPFILE(FILENAME) constructs a memmapfile object that maps file FILENAME
    to memory, using default property values. FILENAME can be a partial pathname
    relative to the MATLAB path. If the file is not found in or relative to the
    current working directory, MEMMAPFILE searches down the MATLAB search path.
     
    M = MEMMAPFILE(FILENAME, PROP1, VALUE1, PROP2, VALUE2, ...) constructs
    a memmapfile object, and sets the properties of that object that are
    named in the argument list (PROP1, PROP2, etc.) to the given values
    (VALUE1, VALUE2, etc.). All property name arguments must be quoted
    character vectors or strings (e.g., 'Writable'). Any properties that
    are not specified are given their default values.
 
    Property/Value pairs and descriptions:
 
        Format: string scalar or character vector, or Nx3 cell array (defaults to 'uint8').
            Format of the contents of the mapped region.
 
            If a string or character vector, Format specifies that the
            mapped data is to be accessed as a single vector of type
            specified by Format's value. Supported values are 'int8',
            'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32',
            'uint64', 'single', and 'double'.
 
            If an Nx3 cell array, Format specifies that the mapped data is
            to be accessed as a repeating series of segments of basic
            types, each with specific dimensions and name. The cell array
            must be of the form {TYPE1, DIMS1, NAME1; ...; TYPEn, DIMSn,
            NAMEn}, where TYPE is one of the data types listed above, DIMS
            is a numeric row vector specifying the dimensions of the
            segment of data to use, and NAME is a field name to use to
            access the data (as a subfield of the Data property). See Data
            property and examples below.
 
        Repeat: Positive integer or Inf (defaults to Inf).
            Number of times to apply the specified format to the mapped region of the
            file. If Inf, repeat until end of file.
 
        Offset: Nonnegative integer (defaults to 0).
            Number of bytes from the start of the file to the start of the mapped
            region. Offset 0 represents the start of the file.
 
        Writable: True or false (defaults to false).
            Access level which determines whether or not Data property (see below)
            may be assigned to.
 
    All the properties above may also be accessed after the memmapfile object has
    been created by dot-subscripting the memmapfile object. For example,
 
        M.Writable = true;
  
    changes the Writable property of M to true.
 
    Two properties which may not be specified to the MEMMAPFILE constructor as
    Property/Value pairs are listed below. These may be accessed (with
    dot-subscripting) after the memmapfile object has been created.
 
        Data: Numeric array or structure array.
            Contains the actual memory-mapped data from FILENAME. If Format
            is a string or character vector, then Data is a simple numeric
            array of the type specified by Format. If Format is a cell
            array, then Data is a structure array, the field names of which
            are specified by the third column of the cell array. The type
            and shape of each field of Data are determined by the first and
            second columns of the cell array, respectively. Changes to the
            Data field or subfields also change the corresponding values in
            the memory-mapped file.
 
        Filename: Char array.
            Contains the name of the file being mapped.
 
    Note that when a variable containing a memmapfile object goes out of scope or is
    otherwise cleared, the memory map is automatically unmapped.
 
    Examples:
        % To map the file 'records.dat' to a series of unsigned 32-bit
        % integers and set every other value to zero (in Data and
        % records.dat):
        m = memmapfile('records.dat', 'Format', 'uint32', 'Writable', true);
        m.Data(1:2:end) = 0;
 
        % To map the file 'records.dat' to a repeating series of 20 singles
        % (as a 5-by-4 matrix) called 'sdata', followed by 10 doubles (as a
        % 1-by-10 vector) called 'ddata':
        m = memmapfile('records.dat', 'Format', {'single' [5 4] 'sdata'; ...
                                                 'double', [1 10] 'ddata'});
        firstSdata = m.Data(1).sdata; firstDdata = m.Data(1).ddata;
 
    See also MEMMAPFILE/DISP, MEMMAPFILE/GET

    Documentation for memmapfile
       doc memmapfile

1 Comment

Kevin Lehmann on 20 Feb 2024

From what I read, memmapfile appears to work only for binary files. As I am reading large, pre-existing ASCII files, this did not work for me. If I were generating the files myself, this would probably be a good option, though it also appears that all the data needs to be saved in the same format.
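
For what it is worth, memmapfile maps raw bytes, so an ASCII file can be mapped with the default 'uint8' format and slices of it converted with char(). Whether that is practical for a 30 GB file was not tested in this thread; the sketch below uses a made-up five-line file and hypothetical variable names only to show the mechanics.

```matlab
% Small stand-in text file.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'id%02d,%d\n', [1:5; 7 8 7 9 7]);
fclose(fid);

m = memmapfile(fname, 'Format', 'uint8');   % Data is a column of bytes
nl = find(m.Data == 10);                    % positions of the line feeds
% (for a very large file, search slices of m.Data rather than all of it)

firstLine = char(m.Data(1:nl(1) - 1)');             % bytes before 1st newline
thirdLine = char(m.Data(nl(2) + 1 : nl(3) - 1)');   % between 2nd and 3rd

clear m          % unmap before deleting the file
delete(fname);
```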
