What is the origin of this bus error?

Wouter on 1 Oct 2019
I had been running Monte Carlo simulations with parfor on a Linux cluster node for over a week when the job crashed at about 70% completion. The simulation is a time evolution, so the problem does not get progressively harder as it runs. I don't understand the crash report. Luckily I had saved some intermediate results, but I would like to know what went wrong before I try again. In principle, every piece of code in the script had already run on the same machine without trouble.
The error is the following:
[Warning: A worker aborted during execution of the parfor loop. The parfor loop
will now run again on the remaining workers.]
[> In parallel_function (line 599)
In seekGdeptransition_forcluster_Nrealdep (line 51)]
--------------------------------------------------------------------------------
Bus error detected at Sat Sep 28 05:55:53 2019 +0200
--------------------------------------------------------------------------------
Configuration:
Crash Decoding : Disabled - No sandbox or build area path
Crash Mode : continue (default)
Default Encoding : UTF-8
Deployed : false
GNU C Library : 2.17 stable
Graphics Driver : Unknown software
Java Version : Java 1.8.0_144-b01 with Oracle Corporation Java HotSpot(TM) 64-Bit Server VM mixed mode
MATLAB Architecture : glnxa64
MATLAB Entitlement ID : 815978
MATLAB Root : /ssoft/spack/external/MATLAB/R2018a
MATLAB Version : 9.4.0.813654 (R2018a)
OpenGL : software
Operating System : "Red Hat Enterprise Linux Server release 7.6 (Maipo)"
Process ID : 18832
Processor ID : x86 Family 6 Model 79 Stepping 1, GenuineIntel
Session Key : db19bbbe-1534-4337-b32d-f6c8548df595
Static TLS mitigation : Disabled: Unable to open display
Window System : No active display
Fault Count: 1
Abnormal termination
Register State (from fault):
RAX = 00002ac3ad3a2c40 RBX = 0000000000000000
RCX = 00002ac37e0e2d12 RDX = 0000000000000000
RSP = 00002ac3d650b878 RBP = 00002ac3d650b8e0
RSI = 0000000000000000 RDI = 00002ac3b2f1ef50
R8 = 00002ac3b2f1ef28 R9 = 0000000000000000
R10 = 00002ac3d650b8a0 R11 = 0000000000000000
R12 = 000000000000006e R13 = 00002ac3b2f1ef00
R14 = 00002ac3b2f1ef50 R15 = 00002ac3b2f1ef28
RIP = 00002ac3ac643fd0 EFL = 0000000000010202
CS = 0033 FS = 0000 GS = 0000
Stack Trace (from fault):
[ 0] 0x00002ac3ac643fd0 /ssoft/spack/external/MATLAB/R2018a/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so+02228176
[ 1] 0x00002ac3acd4cad0 /ssoft/spack/external/MATLAB/R2018a/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so+09603792
[ 2] 0x00002ac3acd0815e /ssoft/spack/external/MATLAB/R2018a/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so+09322846
[ 3] 0x00002ac3acd08726 /ssoft/spack/external/MATLAB/R2018a/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so+09324326
[ 4] 0x00002ac3ace96c01 /ssoft/spack/external/MATLAB/R2018a/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so+10955777
[ 5] 0x00002ac3ace9843e /ssoft/spack/external/MATLAB/R2018a/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so+10961982
[ 6] 0x00002ac3acd4e338 /ssoft/spack/external/MATLAB/R2018a/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so+09610040
[ 7] 0x00002ac37e0dedd5 /lib64/libpthread.so.0+00032213
[ 8] 0x00002ac37c86502d /lib64/libc.so.6+01040429 clone+00000109
[ 9] 0x0000000000000000 <unknown-module>+00000000
** This crash report has been saved to disk as /home/wverstra/matlab_crash_dump.18832-1 **
MATLAB is exiting because of fatal error
/var/spool/slurmd/job2941726/slurm_script: line 13: 18832 Killed matlab -nodisplay -r "seekGdeptransition_forcluster_Nrealdep(10,100);quit"
FINISHED at Sat Sep 28 05:55:54 CEST 2019
slurmstepd: error: Detected 2 oom-kill event(s) in step 2941726.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
Note that line 51 of the file "seekGdeptransition_forcluster_Nrealdep.m" is just:
parfor rr=1:Nreal

Accepted Answer

Daniel M on 19 Oct 2019
It looks like you are running too many workers and ran out of memory; the oom-kill events that slurmstepd reports at the end of your log point the same way. I've had this happen before, and limiting my parpool to a smaller size fixed it.
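One way to pick a smaller pool size is to budget memory per worker. The sketch below is only an illustration: the helper `max_workers` and the figures (32000 MB for the job, 6000 MB per worker, 8 CPUs) are assumptions, not values from the original job, so measure the peak memory of your own workers first. It then opens the pool explicitly with `parpool` before calling the function, so that `parfor` uses that pool instead of one with the default size.

```shell
#!/bin/bash
# Hypothetical helper: largest pool that fits both the memory budget and the CPU count.
# Arguments: total job memory (MB), estimated peak memory per worker (MB), CPUs available.
max_workers() {
    local total_mb=$1 per_worker_mb=$2 cpus=$3
    local by_mem=$(( total_mb / per_worker_mb ))   # integer division rounds down
    if [ "$by_mem" -lt 1 ]; then by_mem=1; fi      # always allow at least one worker
    if [ "$by_mem" -lt "$cpus" ]; then echo "$by_mem"; else echo "$cpus"; fi
}

# Assumed figures, for illustration only.
workers=$(max_workers 32000 6000 8)

# Open the pool with an explicit size, then run the simulation from the question.
matlab_cmd="parpool('local', ${workers}); seekGdeptransition_forcluster_Nrealdep(10,100); quit"

if command -v matlab >/dev/null 2>&1; then
    matlab -nodisplay -r "$matlab_cmd"
else
    echo "matlab not on PATH; would run: $matlab_cmd"
fi
```

With these assumed numbers the memory budget, not the CPU count, is the limit. Rounding down also leaves some headroom for the MATLAB client process, which needs memory of its own.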

More Answers (1)

Raymond Norris on 4 Jul 2020
Hi,
When you submit your Slurm job, you can specify the flag
--mem-per-cpu=<mem, usually in GB>
Try increasing that value. If you need more cores or nodes, look at MATLAB Parallel Server, which scales past a single node. Contact support@mathworks.com for more information on MATLAB Parallel Server or for help configuring your Slurm job.
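A batch script along these lines shows where the flag goes. It is a sketch under assumptions: the values `8` and `4G` are placeholders rather than recommendations, and the MATLAB call is the one from the log in the question. Sizing the pool from `SLURM_CPUS_PER_TASK` keeps the number of workers in step with what Slurm actually allocated.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8     # placeholder: CPUs for the MATLAB client and its workers
#SBATCH --mem-per-cpu=4G      # placeholder: memory per allocated CPU; raise it if the job is OOM-killed

# Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given; fall back to 1 elsewhere.
cpus="${SLURM_CPUS_PER_TASK:-1}"

# Size the pool to the allocation, then run the simulation from the question.
matlab_cmd="parpool('local', ${cpus}); seekGdeptransition_forcluster_Nrealdep(10,100); quit"

if command -v matlab >/dev/null 2>&1; then
    matlab -nodisplay -r "$matlab_cmd"
else
    echo "matlab not on PATH; would run: $matlab_cmd"
fi
```

With these placeholder values the job requests 8 x 4 GB = 32 GB in total. Because the limit is per CPU, raising `--mem-per-cpu` or lowering the pool size both increase the memory available to each worker.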
