Parfor with GPUs crashes

Anton Baranikov on 27 Feb 2023
Commented: Raymond Norris on 14 Mar 2023
Hello, everybody!
I have code that runs on GPUs. I would like to run it in parallel for different settings, i.e. code(setting=1), code(setting=2), code(setting=3), etc. To do that, I am using a parfor loop on a Linux-based high-performance computing (HPC) cluster:
parfor i = 1:N
    code(setting=i)
end
However, it often crashes, especially when the number of workers N is large (more than 4-5). Typically, MATLAB shuts down with a "Bus error" or "Fatal error" message in the terminal.
My general procedure is as follows. First, I request the necessary resources: N workers with sufficient memory and one GPU per worker. Then I check that I do have a GPU per worker with:
spmd
    gpuDeviceCount
end
After that, I initialize the parpool with:
c=parcluster;
c.NumWorkers=N;
parpool(N)
And then I run my code. Note that an individual job with one GPU (without the parfor loop) works perfectly. It also almost always works with 2-3 workers in parallel.
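The per-worker GPU check described above could be extended to report which host and which device each worker actually ends up on. The following is only a minimal sketch, not a diagnosis of the crash: `pickDevice` is a hypothetical helper introduced here for illustration, the pool is assumed to be open already, and `spmdIndex` assumes R2022b or later (older releases use `labindex`).

```matlab
% Hypothetical helper: map a 1-based worker index onto one of nGpus devices.
pickDevice = @(workerIdx, nGpus) mod(workerIdx - 1, nGpus) + 1;

% Only run the check when a GPU is usable and a pool is already open.
if canUseGPU() && ~isempty(gcp('nocreate'))
    spmd
        nGpus = gpuDeviceCount;   % GPUs visible to this worker
        % Select a device explicitly. Note: gpuDevice(idx) resets the
        % device, so call it before creating any gpuArray data.
        dev = gpuDevice(pickDevice(spmdIndex, nGpus));
        [~, host] = system('hostname');
        fprintf('Worker %d on %s uses GPU %d (%s), %.1f GB free\n', ...
            spmdIndex, strtrim(host), dev.Index, dev.Name, ...
            dev.AvailableMemory / 2^30);
    end
end
```

If several workers print the same host and the same GPU index, they are sharing one device and its memory, which may be worth ruling out before looking elsewhere.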
3 comments
Anton Baranikov on 27 Feb 2023
@Raymond Norris, this is the command I use e.g. for 5 workers:
qsub -I -X -lselect=5:ncpus=4:ngpus=1:mem=20gb,software=matlab
Yes, I use a local pool. I also tried creating a PBS Pro profile, but the outcome was the same.
Raymond Norris on 14 Mar 2023
This requests 5 chunks, each with 4 cores and 1 GPU, but it doesn't ensure that the 5 chunks are on the same node. I also wonder why you're requesting 5 chunks. If you're running a local pool, you only need 1 chunk. Try the following:
qsub -I -X -l select=1:ncpus=4:ngpus=1:mem=20gb,software=matlab
Then, in MATLAB, run:
pctconfig('preservejobs',true);
setenv('MDCE_DEBUG','true')
local = parcluster("local");
pool = local.parpool(4);
% Run your parallel code
If or when the pool crashes, retrieve the debug log:
local.getDebugLog(local.Jobs(end))
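On the point that several chunks are not guaranteed to land on one node, a quick check from inside the interactive MATLAB session could look like the sketch below. It assumes the scheduler exports the `PBS_NODEFILE` environment variable, as PBS normally does inside a job; `countNodes` is a hypothetical helper introduced here, and `readlines` assumes R2020b or later.

```matlab
% Path to the node list that PBS writes for the job (empty outside PBS).
nodeFile = getenv('PBS_NODEFILE');

% Hypothetical helper: count distinct, non-empty host names in a string array.
countNodes = @(hosts) numel(unique(strtrim(hosts(strlength(strtrim(hosts)) > 0))));

if ~isempty(nodeFile) && isfile(nodeFile)
    hosts = readlines(nodeFile);   % one host name per line
    nNodes = countNodes(hosts);
    fprintf('PBS allocated %d distinct node(s).\n', nNodes);
    if nNodes > 1
        warning('Chunks span several nodes. A local pool only uses this node.');
    end
else
    disp('PBS_NODEFILE is not set or not readable. Is this a PBS job?');
end
```

If more than one node is reported, the GPUs requested for the other chunks are not visible to workers in a local pool, which only runs on the node hosting the MATLAB session.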


Answers (0)
