Please help me create a tall array from a large binary file using fileDatastore without running out of memory.

I have a large data file (the particular file I am working with now is ~60 GB, though a few hundred GB is typical) that I want to create a tall array from. I am hoping this will allow me to quickly perform calculations on the data without loading it all into memory. The data is in a custom format, so it seems that I am stuck using fileDatastore with a custom read function.

Making the datastore is not a problem, but every time I try to load it I run out of pagefile memory (and I have already made my pagefile as big as possible on Windows 10). The issue seems to be that MATLAB requires temporarily loading the full datastore into memory before the tall array can be made. It would (supposedly) free up the memory after the file was completely read, but it never gets there. This is because I cannot find any way to tell fileDatastore to read only part of the data at once. The other types of datastore have a "ReadSize" property that seems to do this, but it is missing from fileDatastore's valid options. The ReadFcn I am using is set up to partially read the data correctly (I could easily tell it to read the next X values from the current position); I just don't know how to make fileDatastore pass along a second parameter with this information (the first parameter is the file name).

I imagine I could manually break the data up into separate datastores and then somehow combine them into the same tall array, but this 1) would be rather tedious to do every time I want to make a fileDatastore, and 2) I imagine it would negatively impact the deferred-evaluation feature, since (I'd guess) MATLAB would try to optimize reading the data from each small sub-datastore individually rather than optimizing for the whole data file. As such, I'd much rather find a way to do this from a single fileDatastore.
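Roughly, the setup looks something like this (readCustomFormat here is just a stand-in name for my actual read function, which only ever receives the file name):

% Sketch only: fileDatastore hands ReadFcn just the file name, so each read
% has to return the contents of one entire file.
fds = fileDatastore('D:\data\hugefile.bin', 'ReadFcn', @readCustomFormat);
tt = tall(fds);   % with a single 60 GB file, even the first read pulls the whole file into memory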
.
.
PS If any MathWorks staff see this - please suggest to the development team that they fix this. Granted, I am using my personal computer for this, not some cluster with a terabyte of RAM, but it is kind of ridiculous that a computer with an i7 + 16 GB of RAM and MATLAB's "latest and greatest big data solution" can't manage to deal with a ~60 GB file without crashing the computer... I can't imagine that it would take someone (who is familiar with the source code) more than a few hours to add an option of "pass this number to your read function to decide how much it should read at a given time" (or something similar).
  1 comment
Hatem Helal on 6 Dec 2018
Edited: Hatem Helal on 6 Dec 2018
How are your large binary files generated? It would also be worth evaluating whether you can modify that tool/process to instead create a folder full of smaller files that together represent your large dataset. For example, a folder with 60 files that are each ~1 GB can be trivially partitioned for parallel analysis. This is a widely used best practice for the storage/representation of large datasets and would let you comfortably analyze your data on your personal computer.
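Roughly, something like this would split a single flat binary file of doubles into ~1 GB chunks (the exact fread/fwrite calls would of course depend on your custom record layout):

% Rough sketch: split one large binary file into ~1 GB pieces.
% Assumes the file is a flat stream of doubles; a real custom format
% would need its own record-aware reading logic.
valuesPerChunk = 1e9/8;                 % ~1 GB worth of double-precision values
fidIn = fopen('hugefile.bin', 'r');
k = 0;
while ~feof(fidIn)
    data = fread(fidIn, valuesPerChunk, 'double');
    if isempty(data)
        break
    end
    k = k + 1;
    fidOut = fopen(sprintf('chunk_%04d.bin', k), 'w');
    fwrite(fidOut, data, 'double');
    fclose(fidOut);
end
fclose(fidIn);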


Accepted Answer

Edric Ellis on 10 Jul 2017
In R2017a, fileDatastore is currently restricted to reading entire files at a time. This is a known limitation of the current implementation, and this is definitely something we hope to be able to address in a future release of MATLAB. For now, unfortunately the only workaround is to split your data into multiple files so that each file can be loaded without running out of memory. You can use a single fileDatastore instance with multiple data files, as shown in the first example on the fileDatastore reference page.
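For example, something along these lines (readCustomFormat is a placeholder for your read function, which should load one whole file and return its data):

% One fileDatastore over a whole folder of smaller files.
fds = fileDatastore(fullfile('D:','data','chunks'), ...
    'ReadFcn', @readCustomFormat, ...
    'FileExtensions', '.bin');
tt = tall(fds);   % each read returns one file; in R2017a this gives a tall cell array
% operations on tt are deferred until you call gather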
  1 comment
Anthony Barone on 11 Jul 2017
Edric,
I appreciate the answer, though I am admittedly disappointed by it. I do hope that this gets implemented in an upcoming release.
I had also considered splitting up the data, resaving it, and loading it, but to be honest I don't think that would be worthwhile. Part of this is the inconvenience of having a second copy of datasets that is effectively useless outside of MATLAB (this isn't a huge issue for my current ~60 GB file, though this is a trial run... when in full production some of the datasets it will use could easily be 10-20x this size). However, a larger part is my feeling that if something as fundamental as loading data can't be done without these kinds of modifications and workarounds, I can only assume that this project has been put on the "back burner" and as such is really not ready for full production usage. I can't really rely on the hope that there wouldn't be any more issues, and by the time I am able to experiment with it further in my free time to verify this, I imagine 2017b will already be out.
At any rate, I very much appreciate the definitive answer. I will keep an eye out in future releases to see if this feature has matured a bit more.


More Answers (1)

Hatem Helal on 6 Dec 2018
I think this problem could be nicely solved by implementing a custom datastore.
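As a rough sketch (the class and property names here are made up for illustration, and it assumes a flat binary file of doubles), subclassing matlab.io.Datastore lets you control exactly how much is read per chunk:

% Rough sketch of a chunked custom datastore (all names hypothetical).
% Assumes the file is a flat stream of doubles, read ChunkSize values at a time.
classdef ChunkedBinaryDatastore < matlab.io.Datastore
    properties
        FileName
        ChunkSize = 1e6      % number of values returned per read
    end
    properties (Access = private)
        FileID = -1
        BytesRead = 0
        TotalBytes
    end
    methods
        function ds = ChunkedBinaryDatastore(fileName)
            ds.FileName = fileName;
            info = dir(fileName);
            ds.TotalBytes = info.bytes;
            reset(ds);
        end
        function tf = hasdata(ds)
            tf = ds.BytesRead < ds.TotalBytes;
        end
        function [data, info] = read(ds)
            data = fread(ds.FileID, ds.ChunkSize, 'double');
            ds.BytesRead = ds.BytesRead + 8*numel(data);
            info = struct('FileName', ds.FileName, 'Offset', ds.BytesRead);
        end
        function reset(ds)
            if ds.FileID > 0
                fclose(ds.FileID);
            end
            ds.FileID = fopen(ds.FileName, 'r');
            ds.BytesRead = 0;
        end
    end
    methods (Hidden = true)
        function frac = progress(ds)
            frac = ds.BytesRead / ds.TotalBytes;
        end
    end
end

With something like this in place, ds = ChunkedBinaryDatastore('hugefile.bin'); tt = tall(ds); should stream the file ChunkSize values at a time instead of loading it all at once; additionally inheriting from matlab.io.datastore.Partitionable is what would let tall/parallel evaluation split the reads across workers.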
