Parallel cluster profile validation fails because the plugin function 'independentSubmitFcn.m' errored

I get an error when validating my parallel cluster profile. The error message is:
Error Report: Job submission failed because the plugin function 'independentSubmitFcn.m' errored.
Caused by: Brace indexing is not supported for variables of this type.
I am using the matlab-slurm plugins provided in this GitHub repo. The error is confusing because the same cluster profile validated successfully several days ago.
Thanks for any reply!
2 comments
Damian Pietrus on 27 Oct 2023
A few questions before we do some troubleshooting:
  • Is the client that's submitting jobs on the cluster itself, or are you on a remote machine?
  • Have you made any edits to the plugin files themselves?
  • Does the error continue after restarting MATLAB?
We can try to manually submit the job to get more information from the log file. Please make sure that the Slurm cluster is set as your default from the "Parallel" drop-down menu, then try the following steps:
c=parcluster;
% Independent job
j=batch(c,@pwd,1,{});
If the job successfully submits, we can then wait for the job to finish before getting the log file. If the job does not submit, please let me know if the error message is the same as in your post or if it changed.
% If the job submitted, wait for it to finish
j.wait
% Get the log file for the independent job
c.getDebugLog(j.Tasks(1));
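The manual submission steps above can also be combined into a single script that captures the full error report if the submission itself fails. This is only a minimal sketch: it assumes the Slurm profile is already the default profile, and it uses the extended `MException` report to show the stack trace, which may help locate the line in the plugin scripts where the brace indexing error is raised.

```matlab
% Sketch: submit a trivial independent job and capture diagnostics.
% Assumption: the Slurm cluster profile is set as the default profile.
c = parcluster;

try
    % Same test job as above: run pwd on one worker, one output.
    j = batch(c, @pwd, 1, {});

    % Block until the job finishes, then show its final state.
    wait(j);
    disp(j.State);

    % Scheduler log for the single task of the independent job.
    disp(getDebugLog(c, j.Tasks(1)));
catch err
    % Print the full report, including causes and the call stack,
    % as plain text without hyperlinks.
    disp(getReport(err, 'extended', 'hyperlinks', 'off'));
end
```

If the submission fails, the printed report can be pasted into a reply together with the original error message.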
Wei Jianwen on 9 Nov 2023
Hi Damian,
Here is some additional information:
  • The client is a remote machine that submits Slurm jobs to the cluster. I need to enter a username and password when parpool is started.
  • I have not modified the plugin function 'independentSubmitFcn.m' mentioned in the error message.
  • I can start a parallel pool normally after restarting MATLAB, but the same error may occur again after several days.
  • I set the Slurm cluster as the default in the Cluster Profile Manager. Is that right?
Since restarting MATLAB fixes the problem, I haven't tried to submit jobs manually yet. I will write another comment on this post if I do.
Thanks for your reply! :D


Answers (1)

Damian Pietrus on 10 Nov 2023
Hey Wei,
Thanks for sending that additional information. MATLAB uses SSH to connect to and run commands on a remote cluster. When MATLAB is left open for a long period of time, that connection may end up breaking down for one reason or another. Once it's broken, any additional interactions with the cluster will fail until a new connection is established. To work around the issue you can restart MATLAB or you can try the following to see if it helps:
clear all force
c=parcluster;
% Interact with the cluster here. You can use the Job Monitor, submit a
% new job, etc.
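After re-creating the cluster object, one way to confirm that the cluster can be reached again is to run a small test job to completion. The sketch below is an illustration only, not part of the plugin scripts. It assumes the Slurm profile is the default profile and reuses the `@pwd` test job from the earlier comment.

```matlab
% Sketch: check that job submission works again after the workaround.
% Assumption: the Slurm cluster profile is set as the default profile.
c = parcluster;

% Submit a trivial independent job with one output argument.
j = batch(c, @pwd, 1, {});

% Wait for completion and report the final state.
wait(j);
disp(j.State);

% fetchOutputs returns a cell array; the first cell holds the
% working directory reported by the worker.
out = fetchOutputs(j);
disp(out{1});

% Remove the test job from the cluster's job storage.
delete(j);
```

If this submission fails with the same error, restarting MATLAB remains the fallback described above.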

Release

R2023a
