Parfor loop is slow

Letizia Gambacorta on 9 May 2023
The following loop runs slower when I use parfor than with a plain for loop. What is the reason?
numCores = 4; % Specify the number of cores to be used
maxNumCompThreads(numCores); % Set the maximum number of computational threads
.........
% X, Y, Z are matrices
Complex_data = zeros(Ns, n_iter);
iterationTimes = zeros(n_iter, 1); % column vector; zeros(n_iter) would allocate an n_iter-by-n_iter matrix
for i = 0:n_iter-1
    tIter = tic; % time each iteration separately
    Z = Z_full(:, i*dec_samples+1 : Ny+i*dec_samples);
    [la, lb, Ux, Uy, Uz, R] = surfaspect(X, Y, Z, x0, y0, z0); % contains only vectorized operations
    % Simulated echo spectra for all filters
    E = facet2ndinterp(la, lb, Ux, Uy, Uz, R, fband + f0); % contains only vectorized operations; fband is a 1x1xn vector and E an m x p x n array
    espectr = sum(sum(E));
    espectr = espectr(:);
    espectr = [espectr(1:Nf2); zeros(length(f) - Nf, 1); espectr(end-Nf2+2:end)];
    Complex_data(:, i+1) = ifft(espectr);
    % Store the execution time of the current iteration
    iterationTimes(i+1) = toc(tIter);
    fprintf('Sounding %d/%d completed. Execution time: %.2f s\n', i+1, n_iter, iterationTimes(i+1));
end

Accepted Answer

Walter Roberson on 9 May 2023
Imagine that you have a team of four people, and you have four moderately heavy objects to move.
Imagine that you first assign the four people to a single team, and the four people each pick up a corner of one of the objects and carry it together, then go back for the second object and carry it together, and so on.
Now imagine that you then switch to assigning one person per object, and each person struggles to move their object alone but eventually gets it to the other side, so that all of the objects arrive after one trip each.
Now, which was faster? The case where the four worked together but had to go back several times, or the case where each struggled to do the work but only one trip each was made?
The answer is not obvious. In the real world, the time taken to walk back would matter; in computers, getting back to the beginning is almost but not quite instant (like sliding down a fire pole: fairly fast, but not truly immediate). If each team member can haul an object alone at 1/4 or more of the speed of the four together, then working individually would be faster. But hauling a heavy object by yourself can be a lot slower than having a team haul it with you, so working together can sometimes be much faster.
Now imagine slightly more detail: at the beginning, you, the supervisor, have to tell the team members what to do. When they work together you can instruct the whole group once, but when they work separately you have to instruct them one at a time. Giving detailed instructions takes a while, so if the actual hauling is fast, the biggest delay can end up being the time required to instruct each person one by one. In computers this happens fairly often: the overhead of giving each worker detailed instructions one by one ends up making things much slower than having them work together.
2 Comments
Letizia Gambacorta on 9 May 2023
That's amazing! Thanks for your clear answer. So you are saying that the instructing time is most likely what makes the parfor loop slower (and this makes sense to me, as the computation takes ages to start). But it seems to me that there is not much I can do in the code to solve this problem, right? Or can I change the syntax somehow?
Walter Roberson on 9 May 2023
[ la, lb, Ux, Uy, Uz, R ] = surfaspect( X, Y, Z, x0, y0, z0); % contains only vectorial operations,
When you have vector operations on "large enough" arrays, then the work is split between the available cores. Depending on the operations, sometimes a highly efficient multi-threaded library is called.
If you are not dealing with "large enough" arrays, or the operations are not ones supported by the efficient library, then the code might end up looping internally on a single core. Depending on the exact operations, MATLAB might be able to take advantage of streaming operations to, for example, request four additions in a single instruction. That can be faster than four separate additions, because four separate additions would require decoding four instructions instead of one, and the streaming version can take advantage of internal knowledge of exactly when different portions of the core become available. Whether or not MATLAB is able to stream instructions on a single core, the "not large enough" situation is like having personal-sized boxes to carry: having more workers doesn't help, and the overhead of instructing each worker separately might be worth it in such a case.
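One way to check whether the automatic multithreading is already doing the work is to time the same vectorized expression with one computational thread and then with the default number. This is only a rough sketch: the array size N and the test expression are arbitrary assumptions, not taken from the posted code, and the measured ratio will depend on the machine.

```matlab
% Assumption: N and the test expression are illustrative placeholders.
N = 4e6;
A = rand(N, 1);
B = rand(N, 1);
work = @() sum(sqrt(A) .* exp(-B));   % element-wise, vectorized work

nDefault = maxNumCompThreads(1);      % limit to one thread; returns the previous setting
tSingle = timeit(work);

maxNumCompThreads(nDefault);          % restore the previous setting
tMulti = timeit(work);

fprintf('1 thread: %.4f s, %d threads: %.4f s\n', tSingle, nDefault, tMulti);
```

If the two times are close, the operation is probably not being split across cores at this array size.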
The requirements to send the data and instructions to each worker can add up to a lot. You have to have each worker doing a lot relative to the cost of data and instructions and returning results, before using multiple parpool workers becomes more efficient even in the case it would seem "obvious" that more workers should help.
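A simple way to see this overhead is to time the same small amount of work in a for loop and in a parfor loop. The sketch below assumes Parallel Computing Toolbox is available; n_iter and the per-iteration work are placeholders rather than the poster's functions. If no pool is open, the first parfor call may also include the pool start-up time, so it can be worth running the comparison twice.

```matlab
% Assumption: n_iter and the per-iteration work are placeholders.
n_iter = 100;

out = zeros(1, n_iter);
tic
for k = 1:n_iter
    out(k) = sum(rand(1, 1000));      % small amount of work per iteration
end
tFor = toc;

outPar = zeros(1, n_iter);
tic
parfor k = 1:n_iter
    outPar(k) = sum(rand(1, 1000));   % same work, dispatched to pool workers
end
tParfor = toc;

fprintf('for: %.4f s, parfor: %.4f s\n', tFor, tParfor);
```

With so little work per iteration, the parfor version would typically not be expected to win; increasing the work per iteration shifts the balance.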
The high performance libraries, working on "large enough" arrays already know how to split up adding two arrays between the available cores -- even sum(ARRAY) gets split up, each core getting a fraction of the array to create partial sums from and then a final pass to add the partial sums. So the automatic multi-core use is more than would seem obvious at first.
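The partial-sum idea can be written out by hand as follows. This only illustrates the splitting scheme described above and is not a claim about how MATLAB implements sum internally; nChunks stands in for the number of cores, and the array size is arbitrary.

```matlab
% Assumption: nChunks stands in for the number of cores; sizes are arbitrary.
x = rand(1, 1e6);
nChunks = 4;
edges = round(linspace(0, numel(x), nChunks + 1));  % chunk boundaries

partial = zeros(1, nChunks);
for c = 1:nChunks
    partial(c) = sum(x(edges(c)+1 : edges(c+1)));   % one partial sum per chunk
end
total = sum(partial);                               % final pass over the partial sums

% The result agrees with sum(x) up to floating-point rounding
relErr = abs(total - sum(x)) / abs(sum(x));
```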
In some cases, you can get significant performance improvements by using backgroundPool and parfeval instead of parfor (which defaults to using parpool which uses multiple processes.) Background pools can share memory in some cases, reducing the overhead of transferring data.
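A minimal sketch of the parfeval approach, assuming a MATLAB release that provides backgroundPool (R2021b or later). The data and the use of norm as the per-column task are placeholders; the posted functions surfaspect and facet2ndinterp would first need to be checked for thread-based support.

```matlab
% Assumption: the data and the per-column task (norm) are placeholders.
n_iter = 8;
data = rand(1000, n_iter);

% Queue one task per column on the thread-based background pool
for k = 1:n_iter
    futures(k) = parfeval(backgroundPool, @norm, 1, data(:, k));
end

% Block until all tasks finish; scalar outputs are concatenated vertically
results = fetchOutputs(futures);
```

Here results is an n_iter-by-1 column holding one value per queued task, in the order of the futures array.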
