How can I save multiple variables simultaneously

abdul haj on 28 Oct 2020
Commented: Walter Roberson on 28 Oct 2020
Hello,
I have a piece of code that runs in a loop, and several large variables that I want to save after each iteration.
Because these variables are large, saving them one after another takes a huge amount of time.
My question is whether there is a way to make MATLAB save these variables simultaneously, in parallel, so that I can save some time.
Alternatively, is there a way to make MATLAB continue with the rest of the code while the variables are being saved?
Thanks in advance

Answers (1)

Walter Roberson on 28 Oct 2020
If you are saving variables into traditional .mat files (-v7 and earlier), then MATLAB first collects the list of names of the variables to be saved. It then starts with the first one in the list and serializes that variable into memory. Next it compresses the serialized variable in memory using PKZIP-type algorithms, and writes whichever version is smaller to the file -- typically the compressed one, unless the data does not compress at all.
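As a side note, if you suspect that this in-memory compression step, rather than the disk, is where the time goes, you can measure it directly. A minimal sketch, where A is just a hypothetical stand-in for one of your large arrays (note that the '-nocompression' flag is only accepted together with '-v7.3'):

A = rand(5000);                                                            % roughly 200 MB of doubles as a stand-in
tic; save('with_compression.mat', 'A', '-v7.3'); toc                       % serialize + compress + write
tic; save('without_compression.mat', 'A', '-v7.3', '-nocompression'); toc  % serialize + write only

If the two times are similar, compression is not the bottleneck and raw disk throughput is the part that matters.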
The save procedure cannot save two variables to the file at the same time because it does not know the size on disk that will be required until the compression in memory is finished.
The file format requires that all the information for one variable be consecutive in the file. The format does not keep a directory of which disk block corresponds to which variable, so it is not possible to have two different save processes writing to the file at the same time, alternating blocks or anything like that.
I might not have every step exactly right. For example, in theory, for the last variable in the file MATLAB could write the uncompressed version to disk first, compress in memory, and then replace it with the compressed version. That is unlikely to be what happens, however, because it would require updating blocks on disk twice, and disk access is by far the slowest part of the process (compressing in memory is much faster than writing to disk).
So much for the difficulty of updating a single file with multiple variables, which effectively has to happen sequentially. What about writing to different files?
Well, as mentioned there are two in-memory phases (serialization and compression) and one disk phase. During the two in-memory phases the disk is idle, and could in principle be used by a different process that happens to be ready to write. However, once two processes are both ready to write to the same drive, they end up fighting for drive access. The operating system and the disk controller mediate that to some degree, but except for very high-end drives there is only one physical channel to any given drive, and while it is busy servicing one process it cannot service another. You get better performance if the two processes write to different drives, even if the drives are connected to the same controller. But again, there is only one path to the controller and that path can saturate, so you really only get full performance with different drives on different controllers.
Because there is some opportunity to use the drive while a process is busy doing something else, the highest throughput is typically achieved with two processes working against the same drive (or, better still, with the controller spreading the workload across multiple drives in a RAID configuration).
For MATLAB purposes, though, that logic about two processes accessing the drive at the same time effectively requires separate processes: there is no easy way to tell MATLAB to just do the save asynchronously. You can, however, use the Parallel Computing Toolbox, which does use separate processes. There is overhead in sending the variables to the other process so that it can write them to disk, and depending on how the hardware is arranged, it is common for this approach to be slower overall because of the cost of transferring the data between processes. But when I say "common", I should be clear that with tuned drives and OS and a careful choice of destinations, there can be benefits to working this way.
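For what it is worth, here is a minimal sketch of that pattern using parfeval from the Parallel Computing Toolbox, so the client session keeps computing while a worker writes the file. The names nIterations, expensiveComputation, and saveOnWorker are hypothetical placeholders for your own loop, and copying bigVar to the worker is exactly the transfer cost described above:

pool = gcp();                                    % start or reuse a process-based parallel pool
for k = 1:nIterations                            % nIterations is a placeholder
    bigVar = expensiveComputation(k);            % placeholder for your loop body
    % hand the variable to a worker and keep going; this copy is the transfer overhead
    f(k) = parfeval(pool, @saveOnWorker, 0, sprintf('iter_%03d.mat', k), bigVar);
    % ... the rest of the iteration runs here while the worker writes ...
end
wait(f);                                         % block only at the end, until every save has finished

function saveOnWorker(fname, data)
    save(fname, 'data', '-v7.3');                % executes on the worker process
end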
But first do the obvious optimizations, such as writing to an SSD -- keeping in mind that the files can always be migrated from the SSD to traditional drives once the run is over.
Secondly, if you have reason to believe that your variables are not compressible, and if they are numeric (or text), then consider using fopen/fwrite/fclose to write them to individual files instead of save, which removes the processing overhead. In some cases, variables do not compress well at PKZIP level 3 (which MATLAB uses) but do compress at level 9 (which a good standalone compressor can use), so it can sometimes be useful to write raw data to disk and compress it later when it is time to archive. But keep in mind that disk can be a lot slower than processing in memory, so the extra overhead of compressing in memory can be well worth it because it leads to much less disk activity.
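A minimal sketch of that raw-write approach for a (hypothetical) 2-D double matrix A; a raw binary file has no header, so the dimensions are stored by hand:

fid = fopen('block_001.bin', 'w');
fwrite(fid, size(A), 'double');                  % record the dimensions first
fwrite(fid, A, 'double');                        % then the raw data, with no compression step
fclose(fid);

% reading it back later
fid = fopen('block_001.bin', 'r');
sz  = fread(fid, 2, 'double');
A2  = reshape(fread(fid, prod(sz), 'double'), sz(1), sz(2));
fclose(fid);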
2 comments
abdul haj on 28 Oct 2020
Thanks for your detailed answer!!
Each variable that I want to save goes into its own separate .mat file.
The variables are above 10 GB each (sometimes above 20 GB), so they are saved as -v7.3 .mat files.
I am already using an SSD (Samsung SSD 970 EVO Plus), and it is the only drive I'm using right now.
Is there a better idea than using a parfor loop that saves the variables in parallel?
Walter Roberson on 28 Oct 2020
Unfortunately I am not nearly as familiar with the internal organization of the HDF5 files that -v7.3 uses. I believe they are better able to handle updates.
I see that some amount of thread safety can be compiled into the library; https://support.hdfgroup.org/HDF5/faq/threadsafe.html (possibly old material, says that access is serialized) and https://support.hdfgroup.org/HDF5/doc/TechNotes/ThreadSafeLibrary.html (more recent technical notes, which do not actually contradict the possibility that access is serialized).
The HDF5 libraries can be much, much slower than MATLAB's -v7 format (but of course -v7 does not permit large variables).
My understanding is that the HDF5 libraries are especially challenged by arrays of structured information (such as a non-scalar struct array or a cell array) -- challenged in the sense of performance, as there is no clean mapping onto the lower-level HDF5 facilities, so MATLAB ends up having to create internal variables and refer to them, and similar nonsense.
Because of this, I would recommend using only numeric arrays if practical -- and there might be a notable performance benefit to splitting structured datatypes up into separate variables.
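A minimal sketch of that splitting, using the '-struct' option of save so that each field of a (hypothetical) scalar struct named results ends up as its own plain numeric variable in the -v7.3 file:

% results is a hypothetical scalar struct holding large numeric fields
results.signal   = randn(1e4, 1e3);
results.spectrum = randn(1e4, 1e3);

% '-struct' writes each field as a separate top-level variable,
% which keeps the on-disk layout as plain numeric HDF5 datasets
save('iteration_001.mat', '-struct', 'results', '-v7.3');

% later, a single field can be loaded back without touching the others
S = load('iteration_001.mat', 'spectrum');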
