Using parfor loop to restructure large dataset

Mackenzie Dughi on 9 Mar 2021
Answered: Alvaro on 19 Jan 2023
I'm working on restructuring particle trajectory data to be used in our analysis. This is a 24 GB file that contains each particle's ID, position, time, and status. The structure currently looks like the following:
I would like to restructure the dataset so that I have a "traces" structure that has "coos" cells that are indexed by the particle ID. In these cells I want to put each trajectory's PositionX, PositionY, PositionZ, Time, and Status.
I wanted to use a parfor loop to cut down on computational time but I keep running into the issue of broadcast variables in the following code:
numID = length(unique(pt.ID));
coos = cell(1, numID);   % preallocate one cell per particle ID
parfor i = 1:numID
    indx = find(pt.ID(:)==i);   % rows belonging to particle i
    x = pt.PositionX(indx); y = pt.PositionY(indx); z = pt.PositionZ(indx);
    t = pt.Time(indx); status = pt.Status(indx);
    coos{i}(:,1) = x;
    coos{i}(:,2) = y;
    coos{i}(:,3) = z;
    coos{i}(:,4) = t;
    coos{i}(:,5) = status;
end
where "pt.ID", in the first line within the parfor loop, is a broadcast variable. If I understand correctly, broadcast variables are variables that every worker needs access to, and sending them to all workers hinders computation time. Is there any kind of workaround that allows me to find the indices of each trajectory and populate my traces structure without running into this issue? Additionally, since I'm dealing with such a large dataset, is there any way to increase the 2 GB limit on arrays that can be passed into parfor?
1 Comment
Mohammad Sami on 10 Mar 2021
I don't think it will really speed up the restructuring. Doing it this way will likely create a lot of overhead from moving data between the processes.
Also, you don't need to call the find function here. Logical indexing should be enough:
indx = pt.ID==i;
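
To illustrate the point about find, here is a minimal sketch on toy data. The vectors ID and PositionX are made-up stand-ins for pt.ID and pt.PositionX, assumed to be numeric columns of equal length.

```matlab
% Toy stand-ins for pt.ID and pt.PositionX (assumed numeric column vectors).
ID        = [1; 2; 1; 3; 2; 1];
PositionX = [0.1; 0.2; 0.3; 0.4; 0.5; 0.6];

i = 1;
maskIdx = ID == i;          % logical mask, same length as ID
findIdx = find(ID == i);    % explicit row numbers of the matches

% Both forms select exactly the same rows.
xFromMask = PositionX(maskIdx);
xFromFind = PositionX(findIdx);
```

The mask skips the extra pass that converts matches into row numbers, but it does not change the fact that the whole ID vector is scanned once per particle.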


Answers (1)

Alvaro on 19 Jan 2023
I agree with Mohammad that I would not expect parallelization to significantly speed up the rearranging of this struct since you are broadcasting not just pt.ID but the entire pt struct to each worker.
If you still wish to use parallelization in a more efficient manner, you might be able to extract the data from your struct and slice it so that each worker only operates with a subset of the data.
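
One possible reading of the slicing suggestion, as a rough sketch rather than a definitive recipe: copy the fields into a plain numeric matrix, sort it once by ID, and cut it into per-particle blocks with mat2cell. The resulting cell array is indexed only by the loop variable inside parfor, so it is sliced and each worker receives only its own trajectories. The toy pt struct, the assumption that IDs are the integers 1..numID with no gaps, a numeric Status field, and the path-length step inside the loop are all illustrative assumptions, not details from the original data.

```matlab
% Toy stand-in for the real pt struct (assumption: numeric column fields,
% IDs are the integers 1..numID with no gaps, as the question's loop implies).
pt.ID        = [2; 1; 2; 3; 1; 2];
pt.PositionX = [10; 1; 11; 20; 2; 12];
pt.PositionY = zeros(6, 1);
pt.PositionZ = zeros(6, 1);
pt.Time      = [0; 0; 1; 0; 1; 2];
pt.Status    = ones(6, 1);

% Extract once into a matrix: columns are [ID x y z t status].
data = [pt.ID, pt.PositionX, pt.PositionY, pt.PositionZ, pt.Time, pt.Status];

% Sort by ID, then by time, so each trajectory is a contiguous block of rows.
data = sortrows(data, [1 5]);

% Count rows per particle, then split the matrix into one cell per ID.
numID  = max(data(:, 1));
counts = accumarray(data(:, 1), 1, [numID 1]);
coos   = mat2cell(data(:, 2:6), counts, 5).';   % 1-by-numID cell array

% coos is indexed only by i, so it is sliced: each iteration needs only coos{i}.
% The path length is a hypothetical per-trajectory computation.
pathLen = zeros(1, numID);
parfor i = 1:numID
    xyz = coos{i}(:, 1:3);
    pathLen(i) = sum(sqrt(sum(diff(xyz, 1, 1).^2, 2)));
end
```

Note that the sort and mat2cell steps already produce the requested coos cells without any loop over IDs, so parfor only pays off if there is further per-trajectory work to do afterwards.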
Alternatively, if you want to keep your struct intact, you could write a function that performs the manipulations on the struct and pass it to structfun, which supports ThreadPool. Then use num2cell on the resulting array.
If the problem is that you want to keep using MATLAB while rearranging that struct, then you could consider running a batch job on your local computer, or you could send it to a cluster if you have access to one.
Also, upgrading from MATLAB R2013a should remove the 2 GB transfer limit for broadcast variables.
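
A loose sketch of the structfun idea, assuming pt is a scalar struct whose fields are equal-length numeric columns. structfun applies the same function to every field; here that function keeps the rows of a single particle. The toy data and the variable names mask, sub, cols, and block are made up for illustration, and the num2cell step is left out because the answer does not spell out how it should be applied.

```matlab
% Toy scalar struct standing in for pt (assumed equal-length numeric columns).
pt.ID        = [1; 2; 1];
pt.PositionX = [0.1; 0.2; 0.3];
pt.Time      = [0; 0; 1];

% Rows belonging to particle 1.
mask = pt.ID == 1;

% Apply the same row selection to every field. With 'UniformOutput', false
% the result is a struct with the same field names as pt.
sub = structfun(@(f) f(mask), pt, 'UniformOutput', false);

% Collect the selected columns side by side, in field order.
cols  = struct2cell(sub);   % one cell entry per field
block = [cols{:}];          % rows = samples, columns = ID, PositionX, Time
```

This keeps pt untouched and yields one matrix per particle, which could then be stored in the corresponding coos cell.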

