Why do workers keep aborting during parallel computation on a cluster?
I keep getting the warning
Warning: A worker aborted during execution of the parfor loop. The parfor loop will now run again on the remaining workers.
In distcomp/remoteparfor/handleIntervalErrorResult (line 245)
In distcomp/remoteparfor/getCompleteIntervals (line 392)
In parallel_function>distributed_execution (line 741)
In parallel_function (line 573)
In fuction_pa1 (line 100)]
when I run a simulation that contains a parfor loop on the cluster. I noticed that workers abort execution one after another, and this seems to happen more often on the cluster than on my PC.
I would like to know the cause of this issue. Is there a way to avoid it?
Thanks.
19 comments
Mario Malic
on 7 Dec 2020
What kind of simulation?
Muh Alam
on 7 Dec 2020
Kojiro Saito
on 8 Dec 2020
matlab_crash_dump files might be stored in the JobStorageLocation of the parallel workers.
c=parcluster();
c.JobStorageLocation
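To make that check concrete, here is a minimal sketch that searches the storage folder for crash dumps. It assumes JobStorageLocation is a single folder path (it can be configured differently on some setups) and relies on the recursive `**` wildcard of `dir`, available since R2016b.

```matlab
% Look for crash dump files under the job storage folder of the default profile.
c = parcluster();
storage = c.JobStorageLocation;   % assumed to be one folder path

% Recursive search for any file whose name starts with matlab_crash_dump
dumps = dir(fullfile(storage, '**', 'matlab_crash_dump*'));

if isempty(dumps)
    fprintf('No crash dump files found under %s\n', storage);
end
for k = 1:numel(dumps)
    fprintf('%s  (%s)\n', fullfile(dumps(k).folder, dumps(k).name), dumps(k).date);
end
```

If nothing is listed, that only means no dump was written there. It does not rule out a crash.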
Muh Alam
on 9 Dec 2020
Kojiro Saito
on 9 Dec 2020
Does your code have file I/O? For example, save.
Parallel workers might crash if multiple workers try to write to the same file.
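One common way to avoid that kind of collision is to give every iteration its own file. The sketch below is only an illustration: the helper name `saveResult`, the file naming scheme, and the use of `tempdir` as output folder are all hypothetical. The helper function is needed because calling `save` directly in the body of a parfor loop is not allowed.

```matlab
% Each iteration writes to a file of its own, so no two workers share a file.
outDir = tempdir;   % hypothetical output folder

parfor n = 1:100
    result = n^2;   % placeholder for the real computation
    fname = fullfile(outDir, sprintf('result_%03d.mat', n));
    saveResult(fname, result);   % save goes through a helper function
end

function saveResult(fname, result)
    % Runs in its own workspace, so save is permitted here.
    save(fname, 'result');
end
```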
Muh Alam
on 9 Dec 2020
Kojiro Saito
on 10 Dec 2020
No, I meant save inside the parfor loop. Since you're using save after the parfor loop, it's safe.
Did you try changing SpmdEnabled option to false?
parpool('SpmdEnabled', false);
parfor n=1:100
% parallel codes
end
Muh Alam
on 10 Dec 2020
Kojiro Saito
on 10 Dec 2020
OK. Does this occur if you request fewer workers?
Such as,
parpool(2, 'SpmdEnabled', false);
parfor n=1:100
% parallel codes
end
Muh Alam
on 10 Dec 2020
Kojiro Saito
on 11 Dec 2020
Does your cluster have enough resources?
On Linux, running
ulimit -a
in a terminal shows the resource limits (max processes, etc.).
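The limits a login shell reports may differ from those the workers actually run under, so it can help to ask the workers themselves. This is a rough sketch, assuming a Linux cluster and an open parallel pool. It uses `parfevalOnAll` to run the same shell command once on every worker.

```matlab
% Run "ulimit -a" on every worker of the pool and print what each one reports.
pool = gcp();   % current pool (may start one, depending on preferences)

f = parfevalOnAll(pool, @system, 2, 'ulimit -a');
[status, report] = fetchOutputs(f, 'UniformOutput', false);

for k = 1:numel(report)
    fprintf('--- result %d (exit status %d) ---\n%s\n', k, status{k}, report{k});
end
```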
Muh Alam
on 14 Dec 2020
Muh Alam
on 3 Feb 2021
Kojiro Saito
on 3 Feb 2021
I don't think so. I think it is an ordinary script.
Are you able to check the SLURM's log file?
Kojiro Saito
on 4 Feb 2021
I understand. It was related to a memory error. As you mentioned, increasing the allocated memory, for example with "--mem-per-cpu=2G" in the sbatch options, should solve it.
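When choosing a value for --mem-per-cpu, a rough idea of how much memory one iteration needs can be useful. A minimal sketch using `whos` follows. The array `A` is a hypothetical stand-in for the largest variable a worker holds, and the real footprint will be higher because of temporaries and MATLAB's own overhead.

```matlab
% Estimate the size of one large variable as a lower bound on per-worker memory.
A = zeros(1000);        % hypothetical 1000-by-1000 double array (8,000,000 bytes)
info = whos('A');       % struct with name, size, bytes, class, ...
fprintf('A uses %.2f MiB\n', info.bytes / 2^20);
```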
Muh Alam
on 6 Feb 2021
Kojiro Saito
on 7 Feb 2021
A heterogeneous cluster could be the cause. This link describes the system requirements of Parallel Server, not Parallel Computing Toolbox, but it makes an important point:
"Parallel processing constructs that work on the infrastructure enabled by parpool—parfor, parfeval spmd, distributed arrays, and message passing functions—cannot be used on a heterogeneous cluster configuration. The underlying MPI infrastructure requires that all cluster computers have matching word sizes and processor endianness."
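One way to check whether the workers actually match is to call `computer` on each of them and compare the results. This is a sketch only, assuming an open parallel pool. It compares platform string, maximum array size, and endianness, which is a hint rather than a full compatibility test.

```matlab
% Ask every worker for its platform, max array size, and byte order.
pool = gcp();   % current pool (may start one, depending on preferences)

f = parfevalOnAll(pool, @computer, 3);
[arch, maxSize, endian] = fetchOutputs(f, 'UniformOutput', false);

% More than one distinct value in any of these suggests a mixed cluster.
disp(unique(arch));
disp(unique([maxSize{:}]));
disp(unique(endian));
```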
Muh Alam
on 8 Feb 2021
Accepted Answer
More Answers (0)