Job submission on a Slurm-based cluster: how do I submit multiple tasks as a single Slurm job?
Gangadhar Venkata Ramana
on 19 Sep 2023
Commented: Gangadhar Venkata Ramana
on 25 Sep 2023
c = parcluster;
c.AdditionalProperties.AccountName = '';
c.AdditionalProperties.MemUsage = '4gb';
c.AdditionalProperties.RequireExclusiveNode = false;
c.AdditionalProperties.WallTime = '05:00:00';
c.AdditionalProperties.ProcsPerNode = '';
c.AdditionalProperties.AdditionalSubmitArgs = '';
c.AdditionalProperties.QueueName = 'shared';
c.AdditionalProperties.QoS = '';
c.AdditionalProperties.AdditionalSubmitArgs = '--chdir=/scratch/';
c.AdditionalProperties.RemoteJobStorageLocation = '/scratch/';
k = createJob(c);
for i = 1:r_zones
    for j = 1:c_zones
        createTask(k,@com_sigma,3,{Tp,ds,T,dt,tsteps,i,j,rz,cz,forward});
    end
end
submit(k);
disp("All jobs submitted")
wait(k);
The above method creates a job submission script for every task and runs each task on its own core. For instance, if I have 50 tasks in my code, it creates 50 Slurm job submission scripts for 50 individual jobs on the cluster (similar to the output below):
[21ae91p@login06 pqc]$ squeue
JOBID       PARTITION  NAME   USER      ST  TIME  NODES  NODELIST(REASON)
1251791_50  shared     Job17  21ae91p0  R   0:13  1      cn049
1251791_49  shared     Job17  21ae91p0  R   0:14  1      cn010
1251791_48  shared     Job17  21ae91p0  R   0:14  1      cn009
1251791_47  shared     Job17  21ae91p0  R   0:14  1      cn009
1251791_46  shared     Job17  21ae91p0  R   0:14  1      cn009
.........................
...........................
This lets me run my program. However, the cluster limits me to a maximum of 100 jobs at a time (with 'n' cores per job). Because each of my jobs uses only one core, I can use at most 100 cores at a time.
I need an alternative in which the program submits only one job that runs multiple tasks (on multiple cores at the same time).
Any help would be greatly appreciated.
Accepted Answer
Damian Pietrus
on 19 Sep 2023
Edit: Changing from a comment to an answer and adding some additional information
As you've pointed out, what you currently have is a job array of many independent tasks that each run on one core. If you'd like to submit one job that uses multiple cores, you'll have to submit a communicating job that contains some sort of parallel construct. If you take a look at your @com_sigma function, does it have any code that can be converted into a parfor loop? Since each of your tasks is independent, you might be able to move the whole loop into com_sigma and then call it with the 'pool' argument in your job, with 50 being the number of workers selected for the pool:
job = c.batch(@com_sigma,3,{Tp,ds,T,dt,tsteps,i,j,rz,cz,forward}, 'pool', 50);
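One way to read the suggestion of moving the loop into the function, as a minimal sketch only: the wrapper name com_sigma_all, the choice to collect the three outputs in cell arrays, and passing r_zones and c_zones in place of i and j are all assumptions that do not appear in the original post. It also assumes com_sigma itself stays unchanged and that the zones really are independent of each other.

```matlab
function [A, B, C] = com_sigma_all(Tp, ds, T, dt, tsteps, r_zones, c_zones, rz, cz, forward)
% COM_SIGMA_ALL  Hypothetical wrapper (not from the original post): runs
% com_sigma for every (i,j) zone inside a single pool job.
%
% Assumed submission, with c configured as in the question:
%   job = batch(c, @com_sigma_all, 3, ...
%       {Tp,ds,T,dt,tsteps,r_zones,c_zones,rz,cz,forward}, 'Pool', 49);
%   wait(job);
%   out = fetchOutputs(job);   % 1-by-3 cell array: {A, B, C}

nZones = r_zones * c_zones;
A = cell(nZones, 1);
B = cell(nZones, 1);
C = cell(nZones, 1);

parfor t = 1:nZones
    % Map the linear index back to the (i,j) pair of the original nested loops.
    [i, j] = ind2sub([r_zones, c_zones], t);
    [s1, s2, s3] = com_sigma(Tp, ds, T, dt, tsteps, i, j, rz, cz, forward);
    A{t} = s1;
    B{t} = s2;
    C{t} = s3;
end

% Reshape so that A{i,j} holds the first output of zone (i,j), and so on.
A = reshape(A, r_zones, c_zones);
B = reshape(B, r_zones, c_zones);
C = reshape(C, r_zones, c_zones);
end
```

Note that a batch job with a pool of N workers occupies N+1 workers in total, because one extra worker runs the function itself. The sketch therefore asks for 49 pool workers to stay within 50 cores; adjust the number to whatever your scheduler limits allow.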
If you are not yet aware, you can prototype parfor loops with a local or cluster parpool before submitting a batch job.
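A rough sketch of what such prototyping could look like on a local pool. The loop body here is only a stand-in for one call to com_sigma, and the pool size of 4 is an arbitrary assumption. The profile name 'Processes' applies to R2022b and later; in earlier releases the local profile is named 'local'.

```matlab
% Open a small pool on the local machine (use 'local' before R2022b).
pool = parpool('Processes', 4);

n = 8;                    % small number of iterations, just for prototyping
result = zeros(n, 1);

tic
parfor t = 1:n
    % Stand-in workload; replace with the real per-zone computation.
    result(t) = sum(svd(rand(200)));
end
toc

% Shut the pool down once the prototype behaves as expected.
delete(pool);
```

Once the parfor loop gives the same results as the serial for loop on a small case, the same function can be submitted with batch and the 'pool' argument.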
I'd also like to suggest a few things about your list of c.AdditionalProperties. I usually recommend against changing the RemoteJobStorageLocation to just '/scratch', as this may interfere with other users. A more specific path that includes your username, such as '/scratch/$USER/matlab-jobs', can save you some headaches in the future.
Additionally, your c.AdditionalProperties.ProcsPerNode value doesn't need to be set to a character vector; simply having the following should be enough:
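As an illustration of a per-user storage location, the path can be built explicitly. This is a sketch under assumptions: it presumes MATLAB is running on the cluster itself, so that the USER environment variable holds your cluster username, and that the directory exists or may be created by you under /scratch.

```matlab
c = parcluster;

% Assumption: USER is the cluster username (true when running on a login node).
user = getenv('USER');

% Per-user folder, so job files from different users do not collide.
c.AdditionalProperties.RemoteJobStorageLocation = ['/scratch/' user '/matlab-jobs'];

% Optional: store the change in the profile so it persists across sessions.
saveProfile(c);
```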
c.AdditionalProperties.ProcsPerNode = 0;
Let me know if that helps!
3 Comments
Sam Marshalik
on 20 Sep 2023
It looks like you are specifying that your batch function has 3 outputs:
job = c.batch(@com_bridge,3,{Tp,ds,T,dt,tsteps,k,rz,cz,forward},'pool',35);
If, when fetching the results, the number of outputs does not match that number, you will see this error. My guess is either the job itself is failing at some point and no outputs are actually generated or there is some logic in your code that makes it so that the necessary output variables are not populated.
You can look at the diary of the job to see if there are any warnings/errors.
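A minimal sketch of checking a finished batch job before fetching its outputs. To keep the example self-contained it submits the built-in size function with two outputs as a stand-in for the real job; in practice job would be the object returned by your own batch call.

```matlab
c = parcluster;

% Stand-in job: size(zeros(2,3)) requested with 2 outputs.
job = batch(c, @size, 2, {zeros(2, 3)});
wait(job);

% Print anything the worker wrote to the command window, including warnings.
diary(job);
disp(job.State);

% A job can be 'finished' even though a task errored, so check each task.
failed = false;
for t = 1:numel(job.Tasks)
    msg = job.Tasks(t).ErrorMessage;
    if ~isempty(msg)
        failed = true;
        fprintf('Task %d failed: %s\n', t, msg);
    end
end

% Fetch outputs only when every task completed without an error.
if ~failed
    out = fetchOutputs(job);   % 1-by-2 cell array for this stand-in job
end

delete(job);                   % remove the job files from the storage location
```

If a task reports an error, fixing that error first is usually more productive than investigating the fetchOutputs failure itself.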