Cluster multi-GPU training error: Current pool is not local.

Hello,
I am trying to scale up onto a multi-GPU cluster for deep learning. I can run the model on a single GPU on the cluster with no issues, but when I try to switch to multiple GPUs I get this error:
Current pool is not local. Use 'delete(gcp)' to close parallel pool and run again.
My cluster submission function looks like this:
function job = submit_train_script()
cluster = parcluster();
% Build all sbatch arguments as one string. Assigning AdditionalSubmitArgs
% repeatedly overwrites it, so only the last assignment would take effect.
cluster.AdditionalProperties.AdditionalSubmitArgs = [ ...
    '--gres=gpu:4 ', ...                           % Request 4 GPUs with sbatch
    '--mail-type=ALL ', ...                        % Send me an email if anything happens
    '--mail-user=myemail@mydomain.ac.uk ', ...
    '--nodelist=Node002'];                         % Use Node002
% Submit the job, asking for 4 CPU workers, one for each GPU
job = cluster.batch('train_fun', ...
    "AutoAddClientPath",false, "CaptureDiary",true, ...
    "CurrentFolder",".", "Pool",4);
end
The training options are below. I request 4 GPUs and four CPU workers to match, then set the execution environment to "multi-gpu". This appears to be the recommended configuration for this type of work. I cannot work out what is causing this error.
% Iteration = Number of (files*cells) / Minibatchsize
options = trainingOptions("adam", ...
    ExecutionEnvironment="multi-gpu", ... % "cpu", "gpu" and "multi-gpu" options are available
    GradientThreshold=1, ...
    InitialLearnRate=0.001, ...
    MaxEpochs=50, ...
    MiniBatchSize=10, ... % 10 for a 16 GB card (previously 25)
    SequenceLength="longest", ...
    Shuffle="never", ...
    Verbose=0, ...
    Plots="training-progress");
net = trainNetwork(ds,layers,options);
Thanks in advance,
Christopher

Accepted Answer

Edric Ellis on 13 Jan 2023
I think you need to specify ExecutionEnvironment="parallel" for this situation. According to the trainingOptions reference page, "multi-gpu" is only for "multiple GPUs on one machine, using a local parallel pool based on your default cluster profile."
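As a minimal sketch of that change, assuming every other option from the question stays as it was, only the ExecutionEnvironment value differs. Whether the remaining options suit a non-interactive batch job has not been checked here.

```matlab
% Sketch only: the options from the question with the execution environment
% switched from "multi-gpu" to "parallel", so that training uses the pool
% opened by the batch job instead of requiring a local pool.
options = trainingOptions("adam", ...
    ExecutionEnvironment="parallel", ... % was "multi-gpu"
    GradientThreshold=1, ...
    InitialLearnRate=0.001, ...
    MaxEpochs=50, ...
    MiniBatchSize=10, ...
    SequenceLength="longest", ...
    Shuffle="never", ...
    Verbose=0, ...
    Plots="training-progress");
```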

2 Comments

Hi Edric,
That seems to work. I hadn't even considered the "parallel" option, as I believed that the batch submission would have made the parallel pool local with respect to the cluster. Lesson learned there, thank you!
One strange outcome is a new error (bearing in mind this code runs without error on a single GPU). The error relates to the 'eq' function, which I believe is a built-in sanity check for the == operator.
The only place the == operator is used in the entire submission is to identify any rows (within the cell variable fridges) that have labels and data I want to exclude. I can do this before I read in the data, but I was wondering whether there is anything obvious that would cause this to fail in "parallel" but not in "gpu"?
% Exclude labels that we don't care about
includeSet = {'N1_to_N2' 'N2_to_N1' 'N1_to_W' 'W_to_N1' 'N2_to_N3' 'N3_to_N2'};
for j = 1:length(fridges)
    % Generate index for where to keep the labels
    setidx(j) = sum(fridges{j,2} == includeSet);
end
% Remove labels that are not of interest
fridges(~setidx',:) = [];
Kind regards,
Christopher
I can't see quite why this would change behaviour. Do you have an error stack from the failure indicating this is where the problem is coming from? I would be wary of using == to compare char-vectors (single-quote "strings"). This performs an elementwise comparison of the characters, and can fail if the vectors aren't the same length. You might be better off using strcmp.
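A minimal sketch of the strcmp approach follows. The contents of fridges here are a hypothetical stand-in (the real variable is not shown in the thread), and the labels are assumed to be char vectors stored in column 2.

```matlab
% Hypothetical stand-in for the poster's data: column 2 holds each label
% as a char vector. 'REM_to_W' is an invented label that is not in includeSet.
fridges = {1, 'N1_to_N2'; 2, 'REM_to_W'; 3, 'W_to_N1'};
includeSet = {'N1_to_N2' 'N2_to_N1' 'N1_to_W' 'W_to_N1' 'N2_to_N3' 'N3_to_N2'};

% Preallocate a logical index with one entry per row.
keep = false(size(fridges,1), 1);
for j = 1:size(fridges,1)
    % strcmp compares whole char vectors and does not error when lengths
    % differ; any(...) is true if the label matches any entry of includeSet.
    keep(j) = any(strcmp(fridges{j,2}, includeSet));
end

% Remove the rows whose label is not in the include set.
fridges(~keep,:) = [];
```

If every label really is a char vector, the loop could probably be replaced by keep = ismember(fridges(:,2), includeSet), computed before any rows are deleted.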
