Read a big file with mixed data types using datastore

I have a file that is 300 GB in size. A sample of it can be found in the attached file. I've read that the best way to handle files like this is to read them into a datastore.
As you can see, the first two lines are character data, while the following lines are a mix of floats and integers. Is it possible to read them with predefined data types? I know from fscanf that you can specify the data type, but when I use datastore it interprets every line as a string.

Answers (1)

ds = datastore('./*.txt', 'Type','tabulartext', 'NumHeaderLines',2, 'TextscanFormats',repmat("%f",1,5));
T = preview(ds)
T = 8x5 table
    Var1     Var2       Var3      Var4      Var5
    ____    _______    _______   _______   ______
    192           0          0         0      NaN
    108     0.21721          0         0      NaN
    108           0    0.21721         0      NaN
    108           0          0   0.21721      NaN
      8           0      17.09    2.3461   1.2766
      8           0     21.968    21.103   17.839
      8           0     14.849    17.511   11.303
      8           0     22.723    23.318   13.066

7 comments

Is it also possible to store the first two lines as characters? I want to use '-Quickstep-' and 'SPIN DENSITY' again to print them into a new file.
fid = fopen('test.txt','rt');
hd1 = fgetl(fid)
hd1 = '-Quickstep-'
hd2 = fgetl(fid)
hd2 = 'SPIN DENSITY'
fclose(fid);
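The two header lines read above can then be written back out to a new file. A minimal sketch (the output filename 'out.txt' is an assumption):

```matlab
% Read the two header lines from the original file, as above.
fid = fopen('test.txt','rt');
hd1 = fgetl(fid);                        % '-Quickstep-'
hd2 = fgetl(fid);                        % 'SPIN DENSITY'
fclose(fid);

% Write them into a new file ('out.txt' is a hypothetical name),
% followed by whatever processed data you want to append.
fidOut = fopen('out.txt','wt');
fprintf(fidOut, '%s\n%s\n', hd1, hd2);   % reproduce the headers
% ... write the processed numeric data here, e.g. with fprintf ...
fclose(fidOut);
```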
Thanks, but I can't use fopen because my RAM is smaller than the file.
Stephen23 on 29 Nov 2024
Edited: Stephen23 on 29 Nov 2024
FOPEN does not read a file into RAM.
Of course the details are likely more nuanced than that, possibly a small part of the file is loaded and other parts in virtual memory. But in any case, I doubt that there is any implementation of FOPEN in any language that would load an entire file when FOPEN is called. That would be a terrible way to implement FOPEN.
"I can't use fopen because my RAM is smaller than the file."
Replace
fid = fopen('test.txt','rt');
with
fid = fopen('test.txt','rt','n','US-ASCII');
The fact that you supplied the text encoding will keep the first fgetl() from scanning through the file trying to guess the file encoding. It will just leave the file positioned at the beginning, ready to read piece by piece. It will not need to buffer the file in memory.
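To make the point concrete, here is a sketch of reading such a file piece by piece with fopen, never holding more than one block in memory (the block size of 10000 values and the filename are assumptions):

```matlab
% Open with an explicit encoding so fopen does not try to auto-detect it.
fid = fopen('test.txt','rt','n','US-ASCII');
hd1 = fgetl(fid);                      % header line 1
hd2 = fgetl(fid);                      % header line 2
while ~feof(fid)
    % Read up to 10000 numeric values at a time; only this block
    % is held in memory, not the whole 300 GB file.
    block = fscanf(fid, '%f', 10000);
    % ... process block here ...
end
fclose(fid);
```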
Sy Dat on 18 Dec 2024
Ah okay, so fopen is able to read a 300 GB file? Did I misunderstand the documentation? Is there a case where you would actually use datastore over fopen?
I do not know what documentation you are referring to.
The documentation for fopen() says "If you do not specify an encoding scheme when opening a file for reading, fopen uses auto character-set detection to determine the encoding." Details about the auto-detection are left unspecified, so hypothetically it might have to scan through the entire file (just in case somewhere in the file there are some utf8 sequences). But no auto-detection is done if you specify a text encoding.
datastore is good for processing lots of line-oriented data, as datastore can automatically break line-oriented files up into chunks for processing. But the processing would have to be such that it makes sense to do the task in chunks -- for example, if the processing required calculating the standard deviation of the first column of data, then all of the data would have to be read in first.
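That chunked pattern can be sketched as follows, using the same format spec as the answer above to compute the mean of the first column without ever holding the whole file in memory (the filename is an assumption; a running mean, unlike a standard deviation, needs only per-chunk sums):

```matlab
ds = datastore('test.txt', 'Type','tabulartext', ...
    'NumHeaderLines',2, 'TextscanFormats',repmat("%f",1,5));
total = 0;
n = 0;
while hasdata(ds)
    T = read(ds);                  % one chunk of rows at a time
    total = total + sum(T.Var1);   % accumulate per-chunk sums
    n = n + height(T);
end
meanCol1 = total / n;              % mean of column 1, computed in chunks
```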


Question asked: 25 Nov 2024
Last commented: 19 Dec 2024
