Problems training a network with multiple GPUs

I am using the MATLAB Neural Network Toolbox to train my own neural network. It runs perfectly both on my personal computer and on my university's HPC (high-performance computing) cluster when I set the 'ExecutionEnvironment' option in trainingOptions to 'gpu'. Since the HPC cluster provides more than one GPU, I modified my program and set 'ExecutionEnvironment' to 'parallel', then tested the code on both machines again. It still works well on my personal computer, but on the school cluster the following error is thrown:
Error using trainNetwork (line 154) The parallel pool that SPMD was using has been shut down.
Error in TrainMyUnet (line 19) [net, info] = trainNetwork(trainSet, trainLabel, myUnet, options);
Error in tarinTask (line 10) TrainMyUnet;
Caused by: Error using nnet.internal.cnn.ParallelTrainer/train (line 67) The parallel pool that SPMD was using has been shut down.
The client lost connection to worker 2. This might be due to network problems, or the interactive communicating job might have errored.
My university has several HPC clusters, so I also tested the code on another HPC cluster equipped with GPUs. This time the error is different, but the code still does not work:
trainNet
Starting parallel pool (parpool) using the 'local' profile ... connected to 2 workers.
|========================================================================================|
| Epoch | Iteration | Time Elapsed | Mini-batch | Mini-batch | Base Learning |
| | | (hh:mm:ss) | RMSE | Loss | Rate |
|========================================================================================|
Error using trainNetwork (line 154)
The NCCL library failed to initialize, with error 'unhandled cuda error'.
Error in TrainMyUnet (line 19) [net, info] = trainNetwork(trainSet, trainLabel, myUnet, options);
Error in tarinTask (line 15) TrainMyUnet;
Caused by: Error using nnet.internal.cnn.ParallelTrainer/train (line 67) Error detected on workers 1 2. Error using gpuArray/gop>iNcclReduce (line 305) The NCCL library failed to initialize, with error 'unhandled cuda error'.
Can anyone help with this problem? P.S. The code runs perfectly on my personal computer. I am using MATLAB R2018a, and I also checked the code at nnet.internal.cnn.ParallelTrainer/train (line 67); it is just a simple construct: spmd, some code, end.

12 comments

Yang Gao on 26 Jul 2018
Edited: Walter Roberson on 27 Jul 2018
The complete error information from the first cluster:
Starting parallel pool (parpool) using the 'local' profile ...
Preserving jobs with IDs: 1 because they contain crash dump files.
You can use 'delete(myCluster.Jobs)' to remove all jobs created with profile local. To create 'myCluster' use 'myCluster = parcluster('local')'.
connected to 4 workers.
|========================================================================================|
| Epoch | Iteration | Time Elapsed | Mini-batch | Mini-batch | Base Learning |
| | | (hh:mm:ss) | RMSE | Loss | Rate |
|========================================================================================|
Error using trainNetwork (line 154)
The parallel pool that SPMD was using has been shut down.
Error in TrainMyUnet (line 19)
[net, info] = trainNetwork(trainSet, trainLabel, myUnet, options);
Error in tarinTask (line 10)
TrainMyUnet;
Caused by:
Error using nnet.internal.cnn.ParallelTrainer/train (line 67)
The parallel pool that SPMD was using has been shut down.
The client lost connection to worker 2. This might be due to network problems,
or the interactive communicating job might have errored.
The GPUs in this cluster are Tesla K80.
Yang Gao on 26 Jul 2018
Edited: Walter Roberson on 27 Jul 2018
The complete error information from the second cluster:
Starting parallel pool (parpool) using the 'local' profile ...
connected to 2 workers.
|========================================================================================|
| Epoch | Iteration | Time Elapsed | Mini-batch | Mini-batch | Base Learning |
| | | (hh:mm:ss) | RMSE | Loss | Rate |
|========================================================================================|
Error using trainNetwork (line 154)
The NCCL library failed to initialize, with error 'unhandled cuda error'.
Error in TrainMyUnet (line 19)
[net, info] = trainNetwork(trainSet, trainLabel, myUnet, options);
Error in tarinTask (line 15)
TrainMyUnet;
Caused by:
Error using nnet.internal.cnn.ParallelTrainer/train (line 67)
Error detected on workers 1 2.
Error using gpuArray/gop>iNcclReduce (line 305)
The NCCL library failed to initialize, with error 'unhandled cuda
error'.
The GPUs in this cluster are Tesla V100.
Yang Gao on 26 Jul 2018
Edited: Walter Roberson on 27 Jul 2018
Here is my training code:
%% initialize my U-net
myUnet = createUnet([256,256,1]);
%% training set data
%% training options
initialLearningRate = 0.001;
maxEpochs = 2000;
minibatchSize = 32;
l2reg = 0.000000;
options = trainingOptions('rmsprop',...
    'InitialLearnRate',initialLearningRate,...
    'L2Regularization',l2reg,...
    'MaxEpochs',maxEpochs,...
    'MiniBatchSize',minibatchSize,...
    'VerboseFrequency',20,...
    'Shuffle','every-epoch',...
    'ExecutionEnvironment','parallel');
%% training function
[net, info] = trainNetwork(trainSet, trainLabel, myUnet, options);
I really need some help.
Joss Knight on 26 Jul 2018
Edited: Joss Knight on 26 Jul 2018
This looks bad. You may need to call tech support.
On the face of it, your university cluster is not configured to allow GPU communication, in some way that the GPU driver cannot detect. Try a few things. Start a pool on your cluster, then run
spmd, gpuDevice, end
Next try
spmd, gplus(gpuArray.ones(2), 'gpuArray'), end
Thanks for your response. I have tried your code, and the results are below:
Lab 1:
ans =
CUDADevice with properties:
Name: 'Tesla K80'
Index: 1
ComputeCapability: '3.7'
SupportsDouble: 1
DriverVersion: 9.2000
ToolkitVersion: 9
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 1.2800e+10
AvailableMemory: 1.2662e+10
MultiprocessorCount: 13
ClockRateKHz: 823500
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 0
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
Lab 2:
ans =
CUDADevice with properties:
Name: 'Tesla K80'
Index: 2
ComputeCapability: '3.7'
SupportsDouble: 1
DriverVersion: 9.2000
ToolkitVersion: 9
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 1.2800e+10
AvailableMemory: 1.2662e+10
MultiprocessorCount: 13
ClockRateKHz: 823500
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 0
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
Error using gpuHelp (line 5) The parallel pool that SPMD was using has been shut down.
The client lost connection to worker 1. This might be due to network problems, or the interactive communicating job might have errored.
It seems that the first statement works while the second one does not, throwing the error: the parallel pool that SPMD was using has been shut down.
Yang Gao on 27 Jul 2018
Edited: Walter Roberson on 27 Jul 2018
Below is the content of gpuHelp.m:
parpool(2);
spmd, gpuDevice, end;
spmd, gplus(gpuArray.ones(2),'gpuArray'), end;
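One further check that might narrow things down (my own suggestion, not something from the thread): run the same reduction on ordinary host arrays first. If the host-side gplus succeeds but the gpuArray version still kills the pool, the failure is specific to the GPU communication (NCCL) path rather than to worker communication in general. A minimal sketch, assuming the 'local' profile as above:

```matlab
% Diagnostic sketch: separate host-side from GPU-side communication.
if isempty(gcp('nocreate'))
    parpool(2);
end
spmd
    gpuDevice                                       % each worker should report a distinct GPU
end
spmd
    hostSum = gplus(ones(2));                       % reduction over host arrays (plain worker communication)
end
spmd
    gpuSum = gplus(gpuArray.ones(2), 'gpuArray');   % reduction over gpuArrays (GPU/NCCL path)
end
```

If the first two spmd blocks complete and only the third one crashes the pool, that points squarely at the GPU-to-GPU communication layer.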
Yang Gao on 27 Jul 2018
Edited: Walter Roberson on 27 Jul 2018
I have also tested the code on another HPC cluster equipped with Tesla V100 GPUs, and the result is below:
Starting parallel pool (parpool) using the 'local' profile ...
connected to 2 workers.
ans =
Pool with properties:
Connected: true
NumWorkers: 2
Cluster: local
AttachedFiles: {}
AutoAddClientPath: true
IdleTimeout: 30 minutes (30 minutes remaining)
SpmdEnabled: true
Lab 1:
ans =
CUDADevice with properties:
Name: 'Tesla V100-PCIE-16GB'
Index: 1
ComputeCapability: '7.0'
SupportsDouble: 1
DriverVersion: 9.1000
ToolkitVersion: 9
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 1.6946e+10
AvailableMemory: 1.6371e+10
MultiprocessorCount: 80
ClockRateKHz: 1380000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 0
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
Lab 2:
ans =
CUDADevice with properties:
Name: 'Tesla V100-PCIE-16GB'
Index: 2
ComputeCapability: '7.0'
SupportsDouble: 1
DriverVersion: 9.1000
ToolkitVersion: 9
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 1.6946e+10
AvailableMemory: 1.6371e+10
MultiprocessorCount: 80
ClockRateKHz: 1380000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 0
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
Lab 1:
ans =
2 2
2 2
Lab 2:
ans =
2 2
2 2
Yang Gao comments to Joss Knight:
This code errors on one cluster, while it performs well on the other GPU cluster.
Yang Gao on 27 Jul 2018
What should I do next to track down the source of the error?
There's not much to suggest that isn't quite a lot of work. It seems the NVIDIA library we use to communicate between GPUs is either crashing or erroring on this cluster. This could be an OS configuration issue (perhaps an unsupported operating system, or some sort of virtualized system), or a GPU configuration issue (a broken driver, hardware setup, or virtualization issue). The only way to tackle this is to go through MathWorks tech support, and you will need some system information about the cluster. A good place to start would be to try some older NVIDIA drivers on that system (perhaps starting with the 388 drivers and working upwards).
We can talk about how to work around your problem, by disabling NCCL. This should fix the issue but it will make parallel training slower.
A way that should work is to shadow the gpuArray method gop with a version that removes the classname input. Start by opening gpuArray/gop
edit gpuArray/gop
Then use Save As to save it to a new location; somewhere on your path (or local folder) in a directory called @gpuArray.
Now modify line 47 where it says
if ~strcmp(classname, 'gpuArray')
replacing it with
if true
To check MATLAB sees your new version, try
clear classes
which gpuArray/gop
Now start the pool in your cluster, and add this new version of gop:
addAttachedFiles(gcp, 'gpuArray/gop');
Now run, and you should find everything works. If not, we can try shadowing the ParallelTrainer class in a similar way to stop it from trying to use NCCL.
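For reference, the workaround Joss describes can be collected into one sketch. The line number and condition are as quoted for R2018a's gop.m and may differ in other releases; the save location is an example path of my choosing:

```matlab
% 1. Open the built-in implementation:
edit gpuArray/gop
%    ...and use Save As to copy it into a folder named @gpuArray
%    somewhere on your path, e.g. <your-folder>/@gpuArray/gop.m

% 2. In the copy, change the check around line 47 (R2018a) from
%        if ~strcmp(classname, 'gpuArray')
%    to
%        if true
%    so the NCCL code path is never taken.

% 3. Confirm MATLAB picks up the shadowed version:
clear classes
which gpuArray/gop          % should print the path to your copy

% 4. Attach it to the pool before training:
addAttachedFiles(gcp, 'gpuArray/gop');
```

The trade-off, as noted above, is that disabling NCCL makes parallel training slower, since inter-GPU reductions fall back to a non-NCCL path.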
Of course, a better solution than this is to use your V100 cluster instead - this will have much better performance anyway.
Sorry about this. We actually test internally on K80 multi-gpu devices, and we are running on the 9.2 drivers ourselves, so something unusual or non-standard is going on.
Yang Gao on 1 Aug 2018
Many thanks for your response, and sorry for responding so late; I was sick for the last two days. I will try your code. Thanks.
I also got a similar error (only the line number is 96):
A few weeks ago I ran similar code with multiple GPUs, but this time I got this error:
Error in unetDalak_gt (line 456)
net = trainNetwork(ds,lgraph,options);
Caused by:
Error using nnet.internal.cnn.ParallelTrainer/train (line 96)
The parallel pool that SPMD was using has been shut down.
I also tried the code on my colleague's computer (with a single GPU and Windows 10) and it worked perfectly; we are both using MATLAB R2020b.
The details of my computer's GPU are given below.
I tried gop.m without any modification (/usr/local/MATLAB/R2020b/toolbox/parallel/gpu/@gpuArray),
and I also followed the steps of @Joss Knight (/home/medi/Desktop/MATLAB/@gpuArray), but nothing changed.
Can you help me solve this error, please?
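One generic way to get more information out of NCCL itself, before opening a support case (my own suggestion, not from the thread): NCCL_DEBUG is a standard NCCL environment variable, and workers started with the 'local' profile normally inherit the client's environment. A sketch under those assumptions:

```matlab
% Sketch: ask NCCL for verbose diagnostics on the workers.
setenv('NCCL_DEBUG', 'INFO');   % standard NCCL variable; assumes workers inherit it
delete(gcp('nocreate'));        % restart the pool so the workers pick it up
parpool(2);
spmd
    gplus(gpuArray.ones(2), 'gpuArray');   % the reduction that triggers NCCL
end
```

The NCCL log lines printed to the worker output (visible in the command window or job logs) usually identify the failing CUDA call, which is useful information to pass to tech support.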


Answers (0)

Question asked: 26 Jul 2018
Last commented: 16 Nov 2020
