Open parallel pool within workers / Hierarchical parallel runs

Good day,
I have a numerical simulation that runs in parallel with multiple workers. Now I'm trying to optimize the input parameters by running this simulation with the genetic algorithm (ga) tool. So far, so good. This works.
The thing is that if I try to run the genetic algorithm in parallel using:
options = optimoptions('ga','UseParallel',true);
there's the following error:
A parallel pool cannot be started from a worker, only from a client MATLAB.
If I comment out the parpool call inside the objective function (in the simulation) I see that the ga is effectively running multiple times, but I can also see that the objective function is running only in one worker/core.
Is it possible to assign workers to the ga (for example 5) and then assign 8 workers to each ga run (meaning a total of 5×8 = 40 cores)? This would give 5 ga runs in parallel, with each run using 8 workers to solve the numerical simulation.
Thank you for your help in advance!
Best
Sebastian

Accepted Answer

Matt J
Matt J on 26 Jul 2022
Edited: Matt J on 27 Jul 2022
So, first of all, you shouldn't be calling parpool inside your fitness function. That should happen before the optimization starts.
You cannot open hierarchical pools, but if you set UseVectorized=true (instead of UseParallel),
options = optimoptions('ga','UseVectorized',true);
then you can split both the population members and other parallel tasks performed on them across a single pool inside your fitness function. It could look like the following, as an example:
parpool(40)
x = ga(@fitnessFcn, ..., options)

function f = fitnessFcn(X)
    numTasks = 8;
    numPop = size(X,1);
    [I,J] = ndgrid(1:numPop, 1:numTasks);
    contribution = zeros(size(I));
    parfor n = 1:numel(I)
        i = I(n);      % index of the i-th population member
        j = J(n);      % index of the j-th task
        x = X(i,:);    % i-th population member
        contribution(n) = fitnessSubfunction(x,j); % fitness value of combination (i,j)
    end
    f = sum(contribution,2);
end

7 comments

To expand slightly: UseVectorized is not the same as UseParallel.
UseVectorized tells ga to pass an array of points to the function in a single call. If the function happens to be written using vectorization and the arrays are large enough, then MATLAB will potentially use its high-performance multithreaded libraries to evaluate it. But remember that those are not used for "small" calculations, since it takes resources to set up the calls to the libraries.
Also, by default each parallel worker is allocated only a single core; you can configure more cores per worker.
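As a hedged sketch of that configuration (the profile name is an assumption; the default local profile is called 'Processes' in recent releases and 'local' in older ones), the NumThreads property of the cluster object controls how many computational threads each worker gets:

```matlab
% Sketch, not from the thread: give each worker several computational
% threads before opening the pool. The profile name is an assumption.
c = parcluster('Processes');   % default local profile ('local' in older releases)
c.NumThreads = 4;              % 4 threads per worker instead of the default 1
p = parpool(c, 8);             % 8 workers x 4 threads = 32 cores in use
```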
Thank you very much @Matt J and @Walter Roberson for your replies.
I think I understand your replies. Then, to use 'UseVectorized' I would need to adapt the fitness function to be able to handle more than one point at a time, right? It would receive a set of points (a population) and then the function itself would need to handle this. This could work, but it somehow seems more laborious.
Regarding what @Walter Roberson says about assigning more cores per worker: if this is possible, can I do Looping Over a Distributed Range using more cores? The fitness function does:
spmd
    for N_count_1 = drange(1:N_rays)
        % here a ray tracer traces rays until absorption
    end
    labBarrier
    absorption_counter = gplus(absorption_counter);
end
Matt J
Matt J on 27 Jul 2022
Edited: Matt J on 27 Jul 2022
Then, to use 'UseVectorized' I would need to adapt the fitness function to be able to handle more than one point at a time, right?
Yes, but that mostly means writing a loop that calls your original fitness function (the fitnessSubfunction in my posted example).
Can I do Looping Over a Distributed Range using more cores? The fitness function does:
With the scheme I outlined, a for...drange loop should also work. However, it's usually recommended that you use parfor loops, since they have more optimized load balancing.
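As an illustrative sketch of that recommendation (traceRay and N_rays are hypothetical stand-ins for the actual ray tracer), the drange/gplus pattern collapses into a parfor with a reduction variable:

```matlab
% Sketch: parfor treats absorption_counter as a reduction variable, so the
% labBarrier/gplus bookkeeping disappears. traceRay is a hypothetical
% stand-in that returns the absorption contribution of one ray.
N_rays = 1e5;
absorption_counter = 0;
parfor n = 1:N_rays
    absorption_counter = absorption_counter + traceRay(n);
end
```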
Regarding what @Walter Roberson says about assigning more cores per worker.
That was just a side note to what we've been discussing. I don't think assigning multiple cores to a worker means that the worker can create its own pool; the error message you posted was pretty clear about that. It just means that MATLAB can apply more of the internal multithreading of operations that it always uses when the pool size is 1.
On the matter of cores per worker:
Suppose that you have a large 2d array and you use sum() of the array, intending to sum along the rows (giving back a row vector.) For a sufficiently large array, MATLAB would normally call into a high-performance mathematical library to do the work, and that library would automatically divide the summation into pieces, one piece per available core; each core would do part of the summation. With sufficient cores it might even divide the rows into sections, so that one core might (for example) do the first half of each column and another core might do the second half of each column, and then a post-processing pass would merge the two results together. This can result in slightly different results than if the complete column were processed in sequence.
The default configuration for parpool gives one core to each worker, so the high performance routines might end up called, but one core ends up doing all of the work.
You can edit the cluster profile to give more cores per worker (at the expense of having fewer available workers).
More cores per workers helps for mathematical operations for "sufficiently large" tasks. Operations such as eigenvalues, or the \ operator can potentially be done in pieces on different cores, with shorter elapsed time. But when the tasks are too small, invoking the high-performance libraries would have too much overhead and sequential routines are used instead. And of course, the libraries only cover some of the potential operations. MATLAB can sometimes analyze loops and see that the loop pattern matches common tasks that can be done with higher performance, but in other cases loops end up needing to be done sequentially, using only one core.
When you use 'UseVectorized' then a population is passed to your cost function. If your operations are mathematical in nature, you might well be able to calculate the cost in a vectorized manner, just by paying a small bit of attention to the orientation of the data and taking advantage of implicit expansion. Calculations that at first look like they need a loop can sometimes be handled for the entire population at the same time by popping into one extra dimension.
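As a toy sketch of that implicit-expansion idea (the weights and the quadratic form are purely illustrative, not the ray tracer):

```matlab
function f = vectorizedFitness(X)
% X is numPop-by-nvars: one population member per row. Implicit expansion
% against the 1-by-nvars weight vector evaluates the whole population in
% one pass of array arithmetic, with no loop over members.
w = 1:size(X,2);                 % illustrative per-variable weights
f = sum(w .* (X - 1).^2, 2);     % numPop-by-1 vector of fitness values
end
```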
If you find that you end up handling 'UseVectorized' by looping or arrayfun calling your original function one population member at a time, then you should not be using 'UseVectorized'.
If your calculations are not working on larger arrays, or your calculations are not doing more complex operations such as eigenvalues or matrix multiplications, then you should probably not be allocating multiple cores for each worker. If you just have long expressions like lots of scalar multiplications and trig calls on scalars and so on, then that is not something that automatic allocation to multiple cores can execute more quickly.
Let me start by thanking both of you for your time and your clear and complete answers! :)
Before I implement this, I am trying to figure out how something like this works.
The fitness function is a ray tracer, meaning there's a massive amount of simple math calculations (additions, multiplications, and accessing and reading values in a large 3D matrix). The calculations are also mostly on scalar doubles. The array is only accessed to check a value (a voxel value, more precisely), so I don't think MATLAB will call any high-performance library. Also, I don't see an option to add a dimension and "vectorize" the ray tracer, allowing it to work with more than one sizeable 3D matrix at a time (the ga input parameter is either a large matrix that represents the scene/structure being ray traced or a parametrized form of this: 8-10 parameters defining the shape of a large matrix). So I don't know how passing a population to the ray tracer could work, apart from running the different scenes/structures in a standard loop, which would run them in series, and then I don't see the benefit.
On the other hand, the fitness function right now is able to trace more than one ray at a time using parpool. If I'm using 8 workers, 8 rays are traced. When the 8 rays have been fully traced, a value is updated for all workers and then the next 8 rays are traced. So the loops are not independent, which is why I don't use parfor but the spmd + drange + labBarrier + gplus configuration. In this sense, I don't see how @Matt J's example can work, because I cannot run a different fitnessSubfunction(x,j) for each j-th task, as these tasks are dependent. I actually see a case where I can completely decouple the loops (making them independent), but then I would lose some functionality. If there's no other way, I would pursue this.
So I cannot see a way to "parallelize" this in any other form than running the same input, one element of the population, and tracing parallel rays within each fitness function call.
Maybe I did not fully understand your comments, but up to now my take-home message would be that if I decouple the loops (making them fully independent), I can run in parallel by:
options = optimoptions('ga','UseVectorized',true);
And running the different (population member, task) combinations in a parpool called within the fitness function. I hope I caught it :s
Again, thank you very much!
Sebastian
If the output for one population member depends upon which group it is in, or upon what has already been calculated for other population members, then ga() is not a suitable program. ga() depends upon the calculations for population members being independent. The output for one population member must be decoupled from the output for any other population member. If ga() re-evaluates a population member later, it must get the same result it got before (so no randomness either.)
UseVectorized would be for cases where you were somehow able to do the ray tracing for several population members in parallel. Those "massive amounts of simple math calculations" can potentially be done with respect to more than one population member at a time.
But that depends on what it means to mutate the parameters. Given two different population members, is the same amount of work done for the calculations? For example, if one of the parameters controls the "size" of an object, do you keep the same relative calculation resolution anyway, or is the calculation resolution absolute? Like size 2 vs size 3, with a fixed calculation resolution of 0.01, so array size 200 in the first case and 300 in the second? Or fixed at (say) 100 points in the dimension, so absolute resolution 2/100 in one case and 3/100 in the other?
If the different parameters are used to generate a 3D array, and the array is always the same size, then perhaps you can make it a 4D array, with one 3D slice for each population member? If, though, the 3D arrays are not always the same size for different population members, then vectorization is not going to work.
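A minimal sketch of that stacking, assuming a hypothetical buildScene helper that maps one parameter row to a fixed-size nx-by-ny-by-nz scene:

```matlab
% Sketch: one 3D slice per population member in a 4D array. buildScene is
% a hypothetical helper, and nx, ny, nz are the fixed scene dimensions.
numPop = size(X,1);
scenes = zeros(nx, ny, nz, numPop);
for i = 1:numPop
    scenes(:,:,:,i) = buildScene(X(i,:));
end
% vectorized operations can then act along dimension 4 for all members
```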
Sebastian Sas Brunser
Sebastian Sas Brunser on 28 Jul 2022
Edited: Sebastian Sas Brunser on 28 Jul 2022
The output for one population member must be decoupled from the output for any other population member.
This is the case. The outputs for different population members are completely independent of each other. They are also not random: the same member will produce (within a small statistical error) the same result. However, each "task" as defined by @Matt J will not have independent results. This is the "second level of parallelization" I was looking for. Each ray traced might depend on the others. I can decouple this, but there's still a shared number that needs to be updated at the end. This shared value I can't keep per ray (>1E5 rays) and then sum; I need to have it per worker and then sum it up.
UseVectorized would be for cases where you were somehow able to do the ray tracing for several population members in parallel.
I understand. Maybe I should just run the fitness function in a single worker and parallelize only the ga() function ('UseParallel', true). This was the original case. I can run up to 40 population members in parallel, but each fitness function will run in a single worker. Good enough.
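That fallback is the standard setup; a minimal sketch, with nvars, lb, ub, and fitnessFcn standing in for the actual problem:

```matlab
% Sketch: open the pool once on the client, then ga farms out one
% population member per worker; fitnessFcn itself contains no parpool call.
parpool(40);
options = optimoptions('ga','UseParallel',true);
[x,fval] = ga(@fitnessFcn, nvars, [],[],[],[], lb, ub, [], options);
```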
If the different parameters are used to generate a 3D array, and the array is always the same size, then perhaps you can make it a 4d array, one 3D slice for each population member ?
Yes, this could work. The resolution is fixed, however the amount of calculations change. Different population members will have different calculation times.
Thank you very much. I will try both approaches:
Trying to vectorize the fitness function, and
Just parallelizing the ga() and running the fitness function in one worker.
Thank you very much.


More Answers (0)
