Problems training a network with multiple GPUs
I am using the MATLAB Neural Network Toolbox to train my own neural network. It runs perfectly on both my personal computer and my university's HPC (high-performance computing) cluster when I set the 'ExecutionEnvironment' option in trainingOptions to 'gpu'. Since the university cluster can give me more than one GPU, I modified my program to set 'ExecutionEnvironment' to 'parallel' and tested it on both my personal computer and the HPC cluster. It works well on my personal computer, but on the university cluster it throws the following error:
Error using trainNetwork (line 154) The parallel pool that SPMD was using has been shut down.
Error in TrainMyUnet (line 19) [net, info] = trainNetwork(trainSet, trainLabel, myUnet, options);
Error in tarinTask (line 10) TrainMyUnet;
Caused by: Error using nnet.internal.cnn.ParallelTrainer/train (line 67) The parallel pool that SPMD was using has been shut down.
The client lost connection to worker 2. This might be due to network problems, or the interactive communicating job might have errored.
Furthermore, my university has several HPC clusters, so I tested the code on another GPU-equipped cluster. This time the error is different, but the code still does not work:
trainNet
Starting parallel pool (parpool) using the 'local' profile ... connected to 2 workers.
========================================================================================
|  Epoch  |  Iteration  |  Time Elapsed  |  Mini-batch  |  Mini-batch  |  Base Learning  |
|         |             |   (hh:mm:ss)   |     RMSE     |     Loss     |      Rate       |
========================================================================================
Error using trainNetwork (line 154)
The NCCL library failed to initialize, with error 'unhandled cuda error'.
Error in TrainMyUnet (line 19) [net, info] = trainNetwork(trainSet, trainLabel, myUnet, options);
Error in tarinTask (line 15) TrainMyUnet;
Caused by: Error using nnet.internal.cnn.ParallelTrainer/train (line 67) Error detected on workers 1 2. Error using gpuArray/gop>iNcclReduce (line 305) The NCCL library failed to initialize, with error 'unhandled cuda error'.
Can anyone help with this problem? P.S.: The code runs perfectly on my personal computer. I am using MATLAB R2018a. I also checked the code at nnet.internal.cnn.ParallelTrainer/train (line 67); it is just a simple construct: spmd, some code, end.
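For context, the options setup described above would look roughly like this. It is only a minimal sketch: the solver and the hyperparameter values are placeholders, not the ones from the actual script, and it assumes the toolbox that provides trainingOptions is installed.

```matlab
% Sketch only: 'sgdm' and the numeric values below are illustrative placeholders.
options = trainingOptions('sgdm', ...
    'ExecutionEnvironment', 'parallel', ...  % was 'gpu' for the single-GPU runs
    'MiniBatchSize', 16, ...
    'MaxEpochs', 10, ...
    'Verbose', true);

% The training call itself is unchanged from the single-GPU version:
% [net, info] = trainNetwork(trainSet, trainLabel, myUnet, options);
```

Only the 'ExecutionEnvironment' value differs between the working single-GPU run and the failing parallel run.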
12 comments
Yang Gao
on 26 Jul 2018
Edited: Walter Roberson
on 27 Jul 2018
Yang Gao
on 26 Jul 2018
Edited: Walter Roberson
on 27 Jul 2018
Yang Gao
on 26 Jul 2018
Edited: Walter Roberson
on 27 Jul 2018
Joss Knight
on 26 Jul 2018
Edited: Joss Knight
on 26 Jul 2018
This looks bad. You may need to call tech support.
On the face of it, your university cluster is not configured to allow GPU communication, in some way that the GPU driver cannot detect. Try a few things. Start a pool on your cluster, then run
spmd, gpuDevice, end
Next try
spmd, gplus(gpuArray.ones(2), 'gpuArray'), end
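The two checks above can be combined into one script that reports which step fails. This is a rough sketch, assuming Parallel Computing Toolbox is available; the pool size of 2 and the use of the default profile are illustrative choices, so on a cluster you would open the pool with your own cluster profile instead.

```matlab
% Sketch: run both GPU checks and report where things break.
nGpus = gpuDeviceCount;
fprintf('GPUs visible to the client: %d\n', nGpus);

if nGpus >= 2
    pool = gcp('nocreate');          % reuse an open pool if there is one
    if isempty(pool)
        pool = parpool(2);           % assumption: default profile, 2 workers
    end

    % Step 1: can every worker select a GPU?
    spmd
        d = gpuDevice;
        fprintf('Worker %d selected %s (device index %d)\n', ...
            labindex, d.Name, d.Index);
    end

    % Step 2: can the workers reduce a gpuArray across GPUs?
    try
        spmd
            total = gplus(gpuArray.ones(2), 'gpuArray');
        end
        disp('GPU-to-GPU reduction succeeded.');
    catch err
        fprintf('GPU-to-GPU reduction failed: %s\n', err.message);
    end
else
    disp('Fewer than two GPUs visible; skipping the multi-GPU checks.');
end
```

If step 1 works and step 2 fails, the problem is in the communication between GPUs rather than in GPU selection.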
Yang Gao
on 27 Jul 2018
Yang Gao
on 27 Jul 2018
Edited: Walter Roberson
on 27 Jul 2018
Yang Gao
on 27 Jul 2018
Edited: Walter Roberson
on 27 Jul 2018
Walter Roberson
on 27 Jul 2018
Yang Gao comments to Joss Knight:
This code fails with an error on one cluster but runs fine on another GPU cluster.
Yang Gao
on 27 Jul 2018
Joss Knight
on 27 Jul 2018
There's not much to suggest that isn't quite a lot of work. It seems the NVIDIA library we use to communicate between GPUs is either crashing or erroring on this cluster. This could be an OS configuration issue (perhaps it's an unsupported operating system, or some sort of virtualized system), or a GPU configuration issue (a broken driver, hardware setup or virtualization issue). The only way to tackle this is to go through MathWorks tech support, and you will need some system information about the cluster. A good place to start would be to try some older NVIDIA drivers on that system (perhaps starting with the 388 drivers and working upwards).
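The system information mentioned above could be gathered on a cluster node with something like the following. It is only a sketch: the report struct and its field names are made up for illustration, and calling nvidia-smi assumes that tool is installed and on the system path.

```matlab
% Sketch: collect basic MATLAB, OS and GPU details to pass on to tech support.
report = struct();                       % hypothetical container, not a MathWorks format
report.release  = version('-release');   % e.g. '2018a'
report.platform = computer;              % e.g. 'GLNXA64'
report.gpuCount = gpuDeviceCount;
disp(report);

for k = 1:report.gpuCount
    d = gpuDevice(k);   % note: selecting a device clears gpuArray data held on it
    fprintf('GPU %d: %s, driver %g, toolkit %g, compute capability %s\n', ...
        k, d.Name, d.DriverVersion, d.ToolkitVersion, d.ComputeCapability);
end

% Assumption: nvidia-smi is available; skip its output if the call fails.
[status, smiOutput] = system('nvidia-smi');
if status == 0
    disp(smiOutput);
end
```

Running this on both the working and the failing cluster makes differences in driver version or GPU model easy to spot.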
We can talk about how to work around your problem, by disabling NCCL. This should fix the issue but it will make parallel training slower.
A way that should work is to shadow the gpuArray method gop with a version that removes the classname input. Start by opening gpuArray/gop
edit gpuArray/gop
Then use Save As to save it to a new location; somewhere on your path (or local folder) in a directory called @gpuArray.
Now modify line 47 where it says
if ~strcmp(classname, 'gpuArray')
replacing it with
if true
To check MATLAB sees your new version, try
clear classes
which gpuArray/gop
Now start the pool in your cluster, and add this new version of gop:
addAttachedFiles(gcp, 'gpuArray/gop');
Now run, and you should find everything works. If not, we can try shadowing the ParallelTrainer class in a similar way to stop it from trying to use NCCL.
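The file handling in the steps above can be scripted roughly as follows. Treat it as a sketch under assumptions: the variable names are illustrative, the copy goes into the current folder, and the edit to the classname check still has to be made by hand because the exact line may differ between releases.

```matlab
% Sketch: create a local @gpuArray folder holding a private copy of gop.m.
shadowDir  = fullfile(pwd, '@gpuArray');
shadowFile = fullfile(shadowDir, 'gop.m');

if ~exist(shadowFile, 'file')
    original = which('gpuArray/gop');    % shipped implementation
    if ~exist(shadowDir, 'dir')
        mkdir(shadowDir);
    end
    copyfile(original, shadowFile);
    fileattrib(shadowFile, '+w');        % installed files are often read-only
end

% Manual step: open shadowFile and change the classname check as described
% above, then run "clear classes" at the prompt. Note that "clear classes"
% also clears workspace variables, so it is left out of this script.

disp(which('gpuArray/gop'));             % should now point at the local copy

% Attach the shadowed method to an already open pool, as in the step above.
pool = gcp('nocreate');
if ~isempty(pool)
    addAttachedFiles(pool, 'gpuArray/gop');
end
```

If which still reports the installed file, the local @gpuArray folder is not in the current folder or on the path, and the workers will not pick up the change either.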
Of course, a better solution than this is to use your V100 cluster instead - this will have much better performance anyway.
Sorry about this. We actually test internally on K80 multi-gpu devices, and we are running on the 9.2 drivers ourselves, so something unusual or non-standard is going on.
Yang Gao
on 1 Aug 2018
Gökalp
on 16 Nov 2020
I also got a similar error (only the line number, 96, is different).
A few weeks ago I ran similar code with multiple GPUs, but this time I got this error:
Error in unetDalak_gt (line 456)
net = trainNetwork(ds,lgraph,options);
Caused by:
Error using nnet.internal.cnn.ParallelTrainer/train (line 96)
The parallel pool that SPMD was using has been shut down.
I also tried the code on my colleague's computer (a single GPU, Windows 10) and it worked perfectly. We are both using MATLAB R2020b.
The details of my computer's GPU are given below:

I tried gop.m without any modification (/usr/local/MATLAB/R2020b/toolbox/parallel/gpu/@gpuArray),
and I also followed @Joss Knight's steps (/home/medi/Desktop/MATLAB/@gpuArray), but nothing changed.
Can you help me solve this error, please?
Answers (0)