Please help me create a tall array from a large binary file using fileDatastore without running out of memory.

I have a large data file (the particular file I am working with now is ~60 GB, though a few hundred GB is typical) that I want to create a tall array from. I am hoping this will allow me to quickly perform calculations on the data without loading it all into memory. The data is in a custom format, so it seems that I am stuck using fileDatastore with a custom read function.

Making the datastore is not a problem, but every time I try to load it I run out of pagefile memory (and I have already made my pagefile as big as possible on Windows 10). The issue seems to be that MATLAB requires temporarily loading the full datastore into memory before the tall array can be made. It would (supposedly) free up the memory after the file was completely read, but it never gets there. This is because I cannot find any way to tell fileDatastore to read only part of the data at once. The other types of datastore have a "ReadSize" property that seems to do this, but it is missing from fileDatastore's valid options. The ReadFcn I am using is set up to partially read the data correctly (I could easily tell it to read the next X values from the current position); I just don't know how to make fileDatastore pass along a second parameter with this information (the first parameter is the file name).

I imagine I could manually break the data up into separate datastores and then somehow combine them into the same tall array, but this 1) would be rather tedious to do every time I want to make a fileDatastore, and 2) I imagine it would negatively impact the deferred-evaluation feature, since (I'd guess) MATLAB would try to optimize reading the data from each small sub-datastore individually rather than optimizing for the whole data file. As such, I'd much rather find a way to do this from a single fileDatastore.
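Roughly, the setup looks something like this (readCustomFormat here is just a stand-in name for my actual read function, which only ever receives the file name):

% Sketch only: fileDatastore hands ReadFcn just the file name, so each read
% has to return the contents of one entire file.
fds = fileDatastore('D:\data\hugefile.bin', 'ReadFcn', @readCustomFormat);
tt = tall(fds);   % with a single 60 GB file, even the first read pulls the whole file into memory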
.
.
PS If any MathWorks staff see this - please suggest to the development team that they fix this. Granted, I am using my personal computer for this, not some cluster with a terabyte of RAM, but it is kind of ridiculous that a computer with an i7 + 16 GB of RAM and MATLAB's "latest and greatest big data solution" can't manage to deal with a ~60 GB file without crashing the computer... I can't imagine that it would take someone (who is familiar with the source code) more than a few hours to add an option of "pass this number to your read function to decide how much it should read at a given time" (or something similar).
  1 comment
Hatem Helal on 6 Dec 2018
Edited: Hatem Helal on 6 Dec 2018
How are your large binary files generated? It would also be worth evaluating whether you can modify that tool/process to instead create a folder full of smaller files that together represent your large dataset. For example, a folder with 60 files that are each ~1 GB can be trivially partitioned for parallel analysis. This is a widely used best practice for the storage/representation of large datasets and would let you comfortably analyze your data on your personal computer.
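Roughly, something like this would split a single flat binary file of doubles into ~1 GB chunks (the exact fread/fwrite calls would of course depend on your custom record layout):

% Rough sketch: split one large binary file into ~1 GB pieces.
% Assumes the file is a flat stream of doubles; a real custom format
% would need its own record-aware reading logic.
valuesPerChunk = 1e9/8;                 % ~1 GB worth of double-precision values
fidIn = fopen('hugefile.bin', 'r');
k = 0;
while ~feof(fidIn)
    data = fread(fidIn, valuesPerChunk, 'double');
    if isempty(data)
        break
    end
    k = k + 1;
    fidOut = fopen(sprintf('chunk_%04d.bin', k), 'w');
    fwrite(fidOut, data, 'double');
    fclose(fidOut);
end
fclose(fidIn);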


Accepted Answer

Edric Ellis on 10 Jul 2017
In R2017a, fileDatastore is currently restricted to reading entire files at a time. This is a known limitation of the current implementation, and this is definitely something we hope to be able to address in a future release of MATLAB. For now, unfortunately the only workaround is to split your data into multiple files so that each file can be loaded without running out of memory. You can use a single fileDatastore instance with multiple data files, as shown in the first example on the fileDatastore reference page.
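For example, something along these lines (readCustomFormat is a placeholder for your read function, which should load one whole file and return its data):

% One fileDatastore over a whole folder of smaller files.
fds = fileDatastore(fullfile('D:','data','chunks'), ...
    'ReadFcn', @readCustomFormat, ...
    'FileExtensions', '.bin');
tt = tall(fds);   % each read returns one file; in R2017a this gives a tall cell array
% operations on tt are deferred until you call gather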
  1 comment
Anthony Barone on 11 Jul 2017
Edric,
I appreciate the answer, though I am admittedly disappointed by it. I do hope that this gets implemented in an upcoming release.
I had also considered splitting up the data, resaving it, and loading it, but to be honest I don't think that would be worthwhile. Part of this is the inconvenience of having a second copy of datasets that is effectively useless outside of MATLAB (this isn't a huge issue for my current ~60 GB file, though this is a trial run... when in full production some of the datasets it will use could easily be 10-20x this size). However, a larger part is my feeling that if something as fundamental as loading data can't be done without these kinds of modifications and workarounds, I can only assume that this project has been put on the "back burner" and as such is really not ready for full production usage. I can't really rely on the hope that there wouldn't be any more issues, and by the time I am able to experiment with it further in my free time to verify this, I imagine 2017b will already be out.
At any rate, I very much appreciate the definitive answer. I will keep an eye out in future releases to see if this feature has matured a bit more.


More Answers (1)

Hatem Helal on 6 Dec 2018
I think this problem could be nicely solved by implementing a custom datastore.
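As a rough sketch (the class and property names here are made up for illustration, and it assumes a flat binary file of doubles), subclassing matlab.io.Datastore lets you control exactly how much is read per chunk:

% Rough sketch of a chunked custom datastore (all names hypothetical).
% Assumes the file is a flat stream of doubles, read ChunkSize values at a time.
classdef ChunkedBinaryDatastore < matlab.io.Datastore
    properties
        FileName
        ChunkSize = 1e6      % number of values returned per read
    end
    properties (Access = private)
        FileID = -1
        BytesRead = 0
        TotalBytes
    end
    methods
        function ds = ChunkedBinaryDatastore(fileName)
            ds.FileName = fileName;
            info = dir(fileName);
            ds.TotalBytes = info.bytes;
            reset(ds);
        end
        function tf = hasdata(ds)
            tf = ds.BytesRead < ds.TotalBytes;
        end
        function [data, info] = read(ds)
            data = fread(ds.FileID, ds.ChunkSize, 'double');
            ds.BytesRead = ds.BytesRead + 8*numel(data);
            info = struct('FileName', ds.FileName, 'Offset', ds.BytesRead);
        end
        function reset(ds)
            if ds.FileID > 0
                fclose(ds.FileID);
            end
            ds.FileID = fopen(ds.FileName, 'r');
            ds.BytesRead = 0;
        end
    end
    methods (Hidden = true)
        function frac = progress(ds)
            frac = ds.BytesRead / ds.TotalBytes;
        end
    end
end

With something like this in place, ds = ChunkedBinaryDatastore('hugefile.bin'); tt = tall(ds); should stream the file ChunkSize values at a time instead of loading it all at once; additionally inheriting from matlab.io.datastore.Partitionable is what would let tall/parallel evaluation split the reads across workers.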
