Why do I get MPI_Abort errors when trying to submit a parallel job?

Paul Zhang on 23 May 2014
Commented: Edric Ellis on 23 May 2014
The core of my job submission code is below:
jopt.email_notif = 0;
jopt.toggleleft = left_list(j);
jopt.toggleCausalDir = dir_list(k);
jopt.toggleChoice = choice(l);
jopt.od_number = od_list(i);
jopt.connectivity = 1;
sched = findResource('scheduler', 'configuration', 'NeuroEcon.local')
set(sched,'SubmitArguments', '-l walltime=0:20:00')
pjob = createParallelJob(sched);
set(pjob, 'FileDependencies', {'multiDCMset1.m'})
set(pjob, 'MaximumNumberOfWorkers', 1)
set(pjob, 'MinimumNumberOfWorkers', 1)
t = createTask(pjob, @multiDCMset1, 1, {jopt})
t_all{1,jj}=t; jj=jj+1;
submit(pjob);
---------------------------------------
The following is the error message I get in the job submission log, after the job finishes running. I don't understand the error or what could cause it. I do know that the same script runs fine on another person's computer. Do I need some specific settings to submit parallel jobs?
------------------
Node file: /opt/torque/aux//2075983.neuroecon.caltech.edu
Starting SMPD on compute-1-30 ...
ssh compute-1-30 "/opt/matlab//bin/mw_smpd" -s -phrase MATLAB -port 25983
All SMPDs launched
"/opt/matlab//bin/mw_mpiexec" -phrase MATLAB -port 25983 -l -n 1
-machinefile /opt/torque/aux//2075983.neuroecon.caltech.edu -genvlist
MDCE_DECODE_FUNCTION,MDCE_STORAGE_LOCATION,MDCE_STORAGE_CONSTRUCTOR,MDCE_JOB_LOCATION,MDCE_DEBUG
"/opt/matlab/bin/worker" -parallel
[0]which: no shopt in (/opt/matlab/bin:/usr/kerberos/bin:/usr/java/latest/bin:/opt/intel/itac/7.1/bin:/opt/intel/fce/10.1.018/bin:/opt/intel/idbe/10.1.018/bin:/opt/intel/cce/10.1.018/bin:/usr/local/bin:/bin:/usr/bin:/opt/ganglia/bin:/opt/ganglia/sbin:/opt/openmpi/bin/:/opt/maui/bin:/opt/torque/bin:/opt/torque/sbin:/opt/rocks/bin:/opt/rocks/sbin)
[0] < M A T L A B (R) >
[0] Copyright 1984-2009 The MathWorks, Inc.
[0] Version 7.8.0.347 (R2009a) 64-bit (glnxa64)
[0] February 12, 2009
[0]
[0] To get started, type one of these: helpwin, helpdesk, or demo.
[0] For product information, visit www.mathworks.com.
[0]
job aborted:
rank: node: exit code[: error message]
0: compute-1-30: -2: application called MPI_Abort(MPI_COMM_WORLD, 42) - process 0
Stopping SMPD on compute-1-30 ...
ssh compute-1-30 "/opt/matlab//bin/mw_smpd" -shutdown -phrase MATLAB -port 25983
Exiting with code: 42
1 comment
Edric Ellis on 23 May 2014
Is there any error in the task of the job? Check using:
pjob.Tasks(1).Error
or even
getReport(pjob.Tasks(1).Error)
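
As a rough sketch of the check above, the following waits for the job to finish and then prints the error report for every task. It assumes that pjob is the parallel job object from the question, that it is still in the workspace, and that the Error property is empty for a task that did not fail. This has not been run against this cluster.

```matlab
% Block until the scheduler reports the job as finished.
waitForState(pjob, 'finished');

% Look at each task in turn and print any recorded error.
tasks = get(pjob, 'Tasks');
for n = 1:numel(tasks)
    err = get(tasks(n), 'Error');   % MException recorded for this task
    if isempty(err)                 % assumption: empty when the task did not fail
        fprintf('Task %d: no error recorded\n', n);
    else
        fprintf('Task %d failed:\n%s\n', n, getReport(err));
    end
end
```

If a task does report an error, its message and stack usually say more about the cause than the MPI_Abort line in the scheduler log.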


Answers (0)
