How can you set up very large tall arrays without running into swap/page-file issues?

I recently tried setting up a tall array, using the following (approximate) code:
ds=fileDatastore(filename,'ReadFcn',@mydataload);
dataTall=tall(ds);
The data file type is not one that MATLAB can natively handle, so I am using my own data reading function.
  • The data is binary IBM floats, and represents a 2D array where the first few bytes in every column are a header and the rest is data
  • I generally set up the loading to load a few columns at a time, but I wasn't sure how to get fileDatastore to do this, so it is set to load the whole thing (headers are skipped over during the read process)
I have set up the page file for my system to be as large as possible (3x the RAM; I ran this test on Windows, though I will also be running things on Linux / CentOS). Unfortunately, while I have enough RAM to make this work, I get an "out of pagefiles" error that forces a system reboot before the "tall(ds)" command is finished running.
Can someone please tell me what I am doing wrong, and how to fix this? I REALLY hope that TMW didn't decide to make this "big data" inspired function such that it was limited by page-file space, since that only gives a 100-200% maximum capacity boost versus having everything in RAM. I mean, a 2-3x improvement is better than nothing, but it doesn't even come close to being a feasible solution for most "big data" analysis...
Thank you in advance!

Answers (2)

Hatem Helal on 6 Dec 2018
I think this problem could be nicely solved by implementing a custom datastore. The main idea is that your datastore will need to know how to incrementally/partially read your large binary file. You'll need to consider how to partition reading these files if you are looking to use tall arrays with Parallel Computing Toolbox. A typical strategy is to partition based on byte offsets. This makes it easier to implement the partition method of the matlab.io.datastore.Partitionable interface but requires that your reader knows how to seek to the first complete record/row of your dataset.
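A minimal sketch of such a datastore follows, assuming a release that provides matlab.io.Datastore and matlab.io.datastore.Partitionable. The class name, the layout parameters (file header size, column header size, samples per column) and the converter handle are all hypothetical and not taken from the original post. Partitioning is done on column indices, which map directly to byte offsets because every column has the same size. Each read returns one row per file column so that successive blocks stack vertically, as tall arrays expect.

```matlab
% IbmColumnDatastore.m (hypothetical name; a classdef must live in its own file)
classdef IbmColumnDatastore < matlab.io.Datastore & matlab.io.datastore.Partitionable
    properties
        FileName            % path to the binary file
        FileHeaderBytes     % assumed: bytes to skip at the start of the file
        ColHeaderBytes      % assumed: header bytes in front of each column
        SamplesPerCol       % assumed: number of 4-byte values per column
        ConvertFcn          % your own IBM-to-IEEE converter (uint32 in, double out)
        ColsPerRead = 1000  % columns returned by each call to read
    end
    properties (Access = private)
        FirstCol  % first column owned by this (sub)datastore, 1-based
        LastCol   % last column owned by this (sub)datastore
        NextCol   % next column that read will return
    end
    methods
        function ds = IbmColumnDatastore(fileName, fileHeaderBytes, ...
                colHeaderBytes, samplesPerCol, convertFcn)
            ds.FileName = fileName;
            ds.FileHeaderBytes = fileHeaderBytes;
            ds.ColHeaderBytes = colHeaderBytes;
            ds.SamplesPerCol = samplesPerCol;
            ds.ConvertFcn = convertFcn;
            info = dir(fileName);
            ds.FirstCol = 1;
            ds.LastCol = floor((info.bytes - fileHeaderBytes) / colBytes(ds));
            reset(ds);
        end
        function tf = hasdata(ds)
            tf = ds.NextCol <= ds.LastCol;
        end
        function [data, info] = read(ds)
            if ~hasdata(ds)
                error('IbmColumnDatastore:noMoreData', 'No more data to read.');
            end
            n = min(ds.ColsPerRead, ds.LastCol - ds.NextCol + 1);
            fid = fopen(ds.FileName, 'r', 'ieee-be');  % IBM words are big-endian
            closer = onCleanup(@() fclose(fid));
            % Seek past the file header, earlier columns, and this column's header.
            offset = ds.FileHeaderBytes + (ds.NextCol - 1) * colBytes(ds) ...
                + ds.ColHeaderBytes;
            fseek(fid, offset, 'bof');
            % Read SamplesPerCol words, then skip the next column header, n times.
            prec = sprintf('%d*uint32=>uint32', ds.SamplesPerCol);
            raw = fread(fid, [ds.SamplesPerCol, n], prec, ds.ColHeaderBytes);
            vals = ds.ConvertFcn(raw);
            data = vals.';  % one row per file column
            info = struct('FirstColumn', ds.NextCol, 'NumColumns', n);
            ds.NextCol = ds.NextCol + n;
        end
        function reset(ds)
            ds.NextCol = ds.FirstCol;
        end
        function frac = progress(ds)
            total = max(1, ds.LastCol - ds.FirstCol + 1);
            frac = (ds.NextCol - ds.FirstCol) / total;
        end
        function sub = partition(ds, n, index)
            % Give partition number index a contiguous range of columns.
            sub = copy(ds);
            total = ds.LastCol - ds.FirstCol + 1;
            edges = ds.FirstCol + floor((0:n) * total / n);
            sub.FirstCol = edges(index);
            sub.LastCol = edges(index + 1) - 1;
            reset(sub);
        end
    end
    methods (Access = protected)
        function n = maxpartitions(ds)
            n = ds.LastCol - ds.FirstCol + 1;
        end
    end
end

function b = colBytes(ds)
% Size in bytes of one column, header included.
b = ds.ColHeaderBytes + 4 * ds.SamplesPerCol;
end
```

With this in place, something like tall(IbmColumnDatastore(filename, 3600, 240, 1500, @myIbm2Ieee)) would build the tall array without loading the whole file; the three numbers and myIbm2Ieee are placeholders for your own layout and converter.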

Edric Ellis on 4 Jul 2017
Firstly, tall arrays are definitely not required to fit into RAM, or swap space, or anything like that. It is perfectly possible to use tall arrays with collections of data that are 100s or 1000s of GB in size. Tall arrays are processed out-of-core by reading in files or portions of files at a time.
I suspect the problem in your case is that fileDatastore is designed to read whole files at a time (whereas the datastore instances used with e.g. tabular text files know how to read portions of files at a time). So, if you have just a single huge file that you're using with fileDatastore, then this is likely to be the cause of your difficulty. If you can partition your input data file somehow and then use that with your fileDatastore, things should work better.
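One way to partition the input, sketched under assumptions: copy whole columns (still as raw bytes, column header included) into numbered part files and point fileDatastore at the list. The function name and the layout arguments are hypothetical, and this approach duplicates the data on disk.

```matlab
function parts = splitColumns(inFile, outDir, fileHeaderBytes, colBytes, colsPerPart)
% Copy whole columns (column header plus samples, as raw bytes) into
% numbered part files. All arguments are hypothetical layout parameters:
% colBytes is the size of one column including its header.
fin = fopen(inFile, 'r');
closeIn = onCleanup(@() fclose(fin));
fseek(fin, fileHeaderBytes, 'bof');   % drop the file-level header
parts = {};
while true
    raw = fread(fin, colBytes * colsPerPart, '*uint8');
    if isempty(raw)
        break;                        % reached the end of the input file
    end
    parts{end+1} = fullfile(outDir, ...
        sprintf('part_%05d.bin', numel(parts) + 1)); %#ok<AGROW>
    fout = fopen(parts{end}, 'w');
    fwrite(fout, raw, 'uint8');
    fclose(fout);
end
end
```

The resulting list can then be passed as fileDatastore(parts, 'ReadFcn', @myPartLoad), where myPartLoad is a placeholder for a reader that expects a part file with no file-level header. If your release supports 'UniformRead', setting it to true should make tall(ds) produce a numeric tall array rather than a tall cell array.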
  1 comment
Anthony Barone on 6 Jul 2017
Thank you for the response Edric. I had suspected the issue was due to the reading function.
With regards to using another datastore type - The data is a single binary file that is split up into a massive 2D array. There is a short header at the start of the file and at the beginning of each column of data. Unfortunately, the data is in IBM floating point format. I wrote/found some code that performs the IEEE2IBM and IBM2IEEE conversion (quite quickly if I do say so myself), so I can load the data just fine, but unless one of the other datastore types can be made to accept IBM floats as an input, I am stuck using fileDatastore.
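For readers without such a converter, here is a minimal sketch of the IBM-to-IEEE direction. It is not the poster's code, and the function name is hypothetical. It assumes 32-bit IBM hexadecimal floats that were read as big-endian uint32 words: 1 sign bit, a 7-bit excess-64 exponent in base 16, and a 24-bit fraction with no hidden bit.

```matlab
function x = ibm2double(u)
% u: uint32 array of raw IBM single-precision words (read as big-endian).
u = uint32(u);
signBit  = double(bitshift(u, -31));                       % top bit
expo     = double(bitand(bitshift(u, -24), uint32(127)));  % 7 bits, excess-64
fraction = double(bitand(u, uint32(16777215)));            % low 24 bits
% value = (-1)^sign * (fraction / 2^24) * 16^(exponent - 64)
x = (1 - 2 * signBit) .* (fraction ./ 2^24) .* 16 .^ (expo - 64);
end
```

For example, the word 0xC276A000 decodes to -118.625 and 0x41100000 decodes to 1.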
With regards to splitting up the file - my code already has the ability to load a single column of data at a time (or a few columns or the whole thing, you just tell it the column indices to fetch from the data array). The issue is getting fileDatastore to use it. The only documentation I could find only shows reading the whole data file at once. Is there some way to put the fileDatastore call in a loop and have the data just added to the end of the datastore automatically? I didn't find any examples doing this, but I haven't had the time to experiment with trying it out myself, so I don't know if it is as easy as I just described.
While I have you, I had a quick question about tall arrays. I'm mainly trying to get this to work because the delayed execution feature sounds useful. My question is what exactly does and doesn't get optimized before runtime, though. It says it optimizes to reduce how many times the data needs to be read, but will it do things like auto-vectorization (by that I mean more than the JIT compiler already does, since it has more time to prepare the optimization)? Is there any way to control how hard it tries to optimize? It's mostly curiosity, but I also want to know what I can and can't expect it to do for me.
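Later MATLAB releases document a 'ReadMode','partialfile' option for fileDatastore, in which the read function is called repeatedly for the same file and carries its position along in a userdata argument. The sketch below assumes that option is available in your release. The function name, the layout struct and its field names are hypothetical, and convertFcn stands for your own IBM-to-IEEE converter.

```matlab
function [data, userdata, done] = readSomeColumns(filename, userdata, layout)
% Read function for fileDatastore in 'partialfile' mode (a sketch).
% layout fields (all hypothetical): fileHeaderBytes, colHeaderBytes,
% samplesPerCol, colsPerRead, convertFcn.
colBytes = layout.colHeaderBytes + 4 * layout.samplesPerCol;
if isempty(userdata)
    % First call for this file: work out how many columns it holds.
    info = dir(filename);
    userdata = struct('nextCol', 1, ...
        'numCols', floor((info.bytes - layout.fileHeaderBytes) / colBytes));
end
n = min(layout.colsPerRead, userdata.numCols - userdata.nextCol + 1);
fid = fopen(filename, 'r', 'ieee-be');   % IBM words are big-endian
closer = onCleanup(@() fclose(fid));
% Seek past the file header, earlier columns, and this column's header.
offset = layout.fileHeaderBytes + (userdata.nextCol - 1) * colBytes ...
    + layout.colHeaderBytes;
fseek(fid, offset, 'bof');
% Read samplesPerCol words, then skip the next column header, n times.
prec = sprintf('%d*uint32=>uint32', layout.samplesPerCol);
raw = fread(fid, [layout.samplesPerCol, n], prec, layout.colHeaderBytes);
vals = layout.convertFcn(raw);
data = vals.';                           % one row per file column
userdata.nextCol = userdata.nextCol + n;
done = userdata.nextCol > userdata.numCols;
end
```

It would be wired up roughly as fileDatastore(filename, 'ReadMode', 'partialfile', 'UniformRead', true, 'ReadFcn', @(f, u) readSomeColumns(f, u, layout)); whether an anonymous wrapper is accepted there is an assumption worth checking against the fileDatastore documentation for your release.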
Thanks!
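On the deferred-evaluation question, a small sketch of the documented behaviour: operations on a tall array are queued until gather is called, and asking for several results in a single gather call lets them share a pass over the data. The in-memory tall array below is only a stand-in for tall(ds); the sketch makes no claim about auto-vectorization or about any tuning controls.

```matlab
t = tall(rand(1e4, 3));     % in-memory stand-in for tall(ds)
lo  = min(t, [], 1);        % deferred: nothing is evaluated yet
hi  = max(t, [], 1);        % deferred
avg = mean(t, 1);           % deferred
% Requesting all three in one gather call lets them be computed together
% instead of reading the data once per result.
[lo, hi, avg] = gather(lo, hi, avg);
```

Calling gather separately on each result would instead trigger a separate evaluation for each one.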
