Parpool consistently failing to initialize

Ross Volzer on 29 Aug 2019
I consistently run into problems getting parpool to initialize on Linux clusters. These systems typically have 39 to 128 idle cores and 76 GB to 4.5 TB of free RAM. Sometimes I can launch a parpool with 128 workers; other times I can't start one with as few as 4 workers. I've been using MATLAB R2019a and R2018b. Any ideas?
>> n=16; %number of workers you want
>> setlocal = parcluster('local');
>> setlocal.NumWorkers = n;
>> parpool(setlocal);
Starting parallel pool (parpool) using the 'local' profile ...
Warning: The system time zone setting, 'Navajo', does not specify a single time
zone unambiguously. It will be treated as 'America/Denver'. See the <a
href="matlab:doc('datetime.TimeZone')">datetime.TimeZone property</a> for
details about specifying time zones.
> In verifyTimeZone (line 34)
In datetime (line 543)
In parallel.internal.cluster.FileSerializer>iLoadDate (line 342)
In parallel.internal.cluster.FileSerializer/getFields (line 100)
In parallel.internal.cluster.CJSSupport/getProperties (line 260)
In parallel.internal.cluster.CJSSupport/getJobProperties (line 478)
In parallel.internal.cluster.CJSJobMixin/hGetProperty (line 85)
In parallel.internal.cluster.CJSJobMethods.setJobTerminalStateFromCluster (line 179)
In parallel.internal.cluster.CJSJobMixin/hSetTerminalStateFromCluster (line 116)
In parallel.cluster.CJSCluster/hGetJobState (line 401)
In parallel.internal.cluster.CJSJobMixin/getStateEnum (line 159)
In parallel.Job/get.StateEnum (line 238)
In parallel.Job/get.State (line 230)
In parallel.internal.customattr.CustomGetSet>iVectorisedGetHelper (line 128)
In parallel.internal.customattr.CustomGetSet>@(a,b,c)iVectorisedGetHelper(obj,a,b,c) (line 102)
In parallel.internal.customattr.CustomGetSet/doVectorisedGet (line 103)
In parallel.internal.customattr.CustomGetSet/hVectorisedGet (line 76)
In parallel.internal.customattr.GetSetImpl>iAccessProperties (line 322)
In parallel.internal.customattr.GetSetImpl>iGetAllPropertiesVec (line 264)
In parallel.internal.customattr.GetSetImpl.getImpl (line 133)
In parallel.internal.customattr.CustomGetSet>iHetFunGetFunction (line 154)
In parallel.internal.customattr.CustomGetSet>@(o)iHetFunGetFunction(o,props) (line 139)
In parallel.internal.cluster.hetfun (line 46)
In parallel.internal.customattr.CustomGetSet>iHetFunGetProperty (line 139)
In parallel.internal.customattr.CustomGetSet/get (line 38)
In parallel.internal.pool.InteractiveClient/pRemoveOldJobs (line 474)
In parallel.internal.pool.InteractiveClient/start (line 315)
In parallel.Pool>iStartClient (line 796)
In parallel.Pool.hBuildPool (line 585)
In parallel.internal.pool.doParpool (line 18)
In parallel.Cluster/parpool (line 71)
*** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: corrupted double-linked list: 0x00007f3c402e1bc0 ***
*** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: free(): corrupted unsorted chunks: 0x00007f3c402b1e00 ***
*** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: free(): corrupted unsorted chunks: 0x00007f3c40249390 ***
*** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: free(): corrupted unsorted chunks: 0x00007f3c40238380 ***
*** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: double free or corruption (!prev): 0x00007f3c40238380 ***
*** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: free(): corrupted unsorted chunks: 0x00007f3c40019530 ***
*** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: free(): corrupted unsorted chunks: 0x00007f3c40019110 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3211a75e5e]
/lib64/libc.so.6[0x3211a78cf0]
/usr/local/matlab/R2018b/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so(+0x5dccb9)[0x7f3c471bccb9]
/lib64/libc.so.6(exit+0xe2)[0x3211a35a02]
/usr/local/matlab/R2018b/bin/glnxa64/libtbb.so.2(+0x1cb1a)[0x7f3c6f2cbb1a]
/usr/local/matlab/R2018b/bin/glnxa64/libtbb.so.2(+0x1c5ce)[0x7f3c6f2cb5ce]
/usr/local/matlab/R2018b/bin/glnxa64/libtbb.so.2(+0x1c5a6)[0x7f3c6f2cb5a6]
/lib64/libpthread.so.0[0x3211e07aa1]
/lib64/libc.so.6(clone+0x6d)[0x3211ae8c4d]
======= Memory map: ========
00400000-0040e000 r-xp 00000000 00:23 419417053 /usr/local/matlab/R2018b/bin/glnxa64/MATLAB
0060d000-0060e000 r--p 0000d000 00:23 419417053 /usr/local/matlab/R2018b/bin/glnxa64/MATLAB
0060e000-0060f000 rw-p 0000e000 00:23 419417053 /usr/local/matlab/R2018b/bin/glnxa64/MATLAB
0206a000-0224a000 rw-p 00000000 00:00 0 [heap]
3211600000-3211620000 r-xp 00000000 08:03 14577410 /lib64/ld-2.12.so
3211820000-3211821000 r--p 00020000 08:03 14577410 /lib64/ld-2.12.so
3211821000-3211822000 rw-p 00021000 08:03 14577410 /lib64/ld-2.12.so
3211822000-3211823000 rw-p 00000000 00:00 0
3211a00000-3211b8b000 r-xp 00000000 08:03 14577415 /lib64/libc-2.12.so
3211b8b000-3211d8a000 ---p 0018b000 08:03 14577415 /lib64/libc-2.12.so
3211d8a000-3211d8e000 r--p 0018a000 08:03 14577415 /lib64/libc-2.12.so
3211d8e000-3211d90000 rw-p 0018e000 08:03 14577415 /lib64/libc-2.12.so
3211d90000-3211d94000 rw-p 00000000 00:00 0
3211e00000-3211e17000 r-xp 00000000 08:03 14577416 /lib64/libpthread-2.12.so
3211e17000-3212017000 ---p 00017000 08:03 14577416 /lib64/libpthread-2.12.so
3212017000-3212018000 r--p 00017000 08:03 14577416 /lib64/libpthread-2.12.so
3212018000-3212019000 rw-p 00018000 08:03 14577416 /lib64/libpthread-2.12.so
3212019000-321201d000 rw-p 00000000 00:00 0
3212200000-3212283000 r-xp 00000000 08:03 14577561 /lib64/libm-2.12.so
3212283000-3212482000 ---p 00083000 08:03 14577561 /lib64/libm-2.12.so
3212482000-3212483000 r--p 00082000 08:03 14577561 /lib64/libm-2.12.so
3212483000-3212484000 rw-p 00083000 08:03 14577561 /lib64/libm-2.12.so
3212600000-3212602000 r-xp 00000000 08:03 14577435 /lib64/libdl-2.12.so
3212602000-3212802000 ---p 00002000 08:03 14577435 /lib64/libdl-2.12.so
3212802000-3212803000 r--p 00002000 08:03 14577435 /lib64/libdl-2.12.so
3212803000-3212804000 rw-p 00003000 08:03 14577435 /lib64/libdl-2.12.so
3212a00000-3212a15000 r-xp 00000000 08:03 14577501 /lib64/libz.so.1.2.3
3212a15000-3212c14000 ---p 00015000 08:03 14577501 /lib64/libz.so.1.2.3
3212c14000-3212c15000 r--p 00014000 08:03 14577501 /lib64/libz.so.1.2.3
3212c15000-3212c16000 rw-p 00015000 08:03 14577501 /lib64/libz.so.1.2.3
3212e00000-3212e07000 r-xp 00000000 08:03 14577419 /lib64/librt-2.12.so
3212e07000-3213006000 ---p 00007000 08:03 14577419 /lib64/librt-2.12.so
3213006000-3213007000 r--p 00006000 08:03 14577419 /lib64/librt-2.12.so
3213007000-3213008000 rw-p 00007000 08:03 14577419 /lib64/librt-2.12.so
3214600000-3214602000 r-xp 00000000 08:03 6962157 /usr/lib64/libXau.so.6.0.0
3214602000-3214802000 ---p 00002000 08:03 6962157 /usr/lib64/libXau.so.6.0.0
3214802000-3214803000 rw-p 00002000 08:03 6962157 /usr/lib64/libXau.so.6.0.0
3214a00000-3214a24000 r-xp 00000000 08:03 6962578 /usr/lib64/libxcb.so.1.1.0
3214a24000-3214c24000 ---p 00024000 08:03 6962578 /usr/lib64/libxcb.so.1.1.0
3214c24000-3214c25000 rw-p 00024000 08:03 6962578 /usr/lib64/libxcb.so.1.1.0
3214e00000-3214f37000 r-xp 00000000 08:03 6962601 /usr/lib64/libX11.so.6.3.0
3214f37000-3215137000 ---p 00137000 08:03 6962601 /usr/lib64/libX11.so.6.3.0
3215137000-321513d000 rw-p 00137000 08:03 6962601 /usr/lib64/libX11.so.6.3.0
3215200000-3215211000 r-xp 00000000 08:03 6963060 /usr/lib64/libXext.so.6.4.0
3215211000-3215411000 ---p 00011000 08:03 6963060 /usr/lib64/libXext.so.6.4.0
3215411000-3215412000 rw-p 00011000 08:03 6963060 /usr/lib64/libXext.so.6.4.0
3218e00000-3218e04000 r-xp 00000000 08:03 14577556 /lib64/libuuid.so.1.3.0
3218e04000-3219003000 ---p 00004000 08:03 14577556 /lib64/libuuid.so.1.3.0
3219003000-3219004000 rw-p 00003000 08:03 14577556 /lib64/libuuid.so.1.3.0
3219200000-321920c000 r-xp 00000000 08:03 14578841 /lib64/libpam.so.0.82.2
321920c000-321940c000 ---p 0000c000 08:03 14578841 /lib64/libpam.so.0.82.2
321940c000-321940d000 r--p 0000c000 08:03 14578841 /lib64/libpam.so.0.82.2
321940d000-321940e000 rw-p 0000d000 08:03 14578841 /lib64/libpam.so.0.82.2
3219e00000-3219e07000 r-xp 00000000 08:03 6962154 /usr/lib64/libSM.so.6.0.1
3219e07000-321a007000 ---p 00007000 08:03 6962154 /usr/lib64/libSM.so.6.0.1
321a007000-321a008000 rw-p 00007000 08:03 6962154 /usr/lib64/libSM.so.6.0.1
321aa00000-321aa17000 r-xp 00000000 08:03 6961964 /usr/lib64/libICE.so.6.3.0
321aa17000-321ac17000 ---p 00017000 08:03 6961964 /usr/lib64/libICE.so.6.3.0
321ac17000-321ac18000 rw-p 00017000 08:03 6961964 /usr/lib64/libICE.so.6.3.0
321ac18000-321ac1c000 rw-p 00000000 00:00 0
3220200000-3220207000 r-xp 00000000 08:03 14577723 /lib64/libcrypt-2.12.so
3220207000-3220407000 ---p 00007000 08:03 14577723 /lib64/libcrypt-2.12.so
3220407000-3220408000 r--p 00007000 08:03 14577723 /lib64/libcrypt-2.12.so
3220408000-3220409000 rw-p 00008000 08:03 14577723 /lib64/libcrypt-2.12.so
*** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: free(): corrupted unsorted chunks: 0x00007f066c26c710 ***
*** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: free(): corrupted unsorted chunks: 0x00007f066c2f1da0 ***
*** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: free(): corrupted unsorted chunks: 0x00007f066c2a4ac0 ***
*** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: free(): corrupted unsorted chunks: 0x00007f066c2a1500 ***
*** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: double free or corruption (!prev): 0x00007f066c2a1500 ***
*** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: free(): corrupted unsorted chunks: 0x00007f066c2973b0 ***
*** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: double free or corruption (!prev): 0x00007f066c2a1500 ***
*** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: free(): corrupted unsorted chunks: 0x00007f066c245a70 ***
*** glibc detected *** *** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: double free or corruption (!prev): 0x00007f066c2a1500 ***
Error using parallel.Cluster/parpool (line 86)
Parallel pool failed to start with the following error. For more detailed
information, validate the profile 'local' in the Cluster Profile Manager.
Caused by:
Error using parallel.internal.pool.InteractiveClient>iThrowWithCause (line
676)
Failed to initialize the interactive session.
Error using
parallel.internal.pool.InteractiveClient>iThrowIfBadParallelJobStatus
(line 790)
The interactive communicating job failed with no message.
6 comments
Frank Schluenzen on 2 Dec 2020
Edited: Frank Schluenzen on 3 Dec 2020
I'm having the same problems with R2020b (and releases after R2019a) on CentOS 7. I also see the glibc errors, but validating the local parallel pool profile pointed more towards TBB threads running out of memory. The only places I could find references to heap settings were ~/.matlab/R2020b/matlab.prf and, indirectly, ~/.matlab/R2020b/toolbox_cache-9.9.0-34542918-glnxa64.xml.
So after fiddling a lot with stack and heap settings, I finally just removed ~/.matlab/R2020b/toolbox_cache-9.9.0-34542918-glnxa64.xml before starting MATLAB, and could then consistently use all 40+ available cores on several different machines. Disabling toolbox path caching in the MATLAB General preferences seemingly does the same and is of course the better choice. I have no idea why that helps, but it at least seems to work for R2020b on CentOS 7.
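The cache-file workaround described in this comment could be scripted roughly as follows. This is only a sketch: it assumes a bash-like shell, the function name `stash_toolbox_cache` is hypothetical, and the `toolbox_cache-*.xml` pattern is a guess generalized from the single file name quoted above. It renames the file rather than deleting it, so the change is easy to undo.

```shell
#!/usr/bin/env bash
# Hypothetical helper: move MATLAB's toolbox path cache file(s) out of the way
# instead of deleting them outright.
# $1 = preferences directory, e.g. "$HOME/.matlab/R2020b" (path from the comment).
# The file name pattern is an assumption based on the one file name quoted there.
stash_toolbox_cache() {
    prefdir="$1"
    for f in "$prefdir"/toolbox_cache-*.xml; do
        # If nothing matches, the pattern is left as a literal string; skip it.
        [ -e "$f" ] || continue
        mv -- "$f" "$f.bak"
        echo "moved $f -> $f.bak"
    done
}
```

It would be run before starting MATLAB, for example `stash_toolbox_cache "$HOME/.matlab/R2020b"`. Disabling toolbox path caching in the preferences, as the comment notes, is the cleaner route.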
Frank Schluenzen on 4 Dec 2020
MathWorks recommendation: ulimit -u 63536. This works with the toolbox path cache enabled.
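One way to apply that recommendation is to check and raise the per-user process limit in the shell that launches MATLAB. The sketch below assumes bash (for `ulimit -Su` and `ulimit -Hu`), uses hypothetical function names, and takes the value 63536 exactly as posted above. Raising the soft limit only succeeds if the hard limit allows it.

```shell
#!/usr/bin/env bash
# Sketch: make sure the per-user process limit is high enough before
# launching MATLAB. Function names are hypothetical.

# Succeeds (exit status 0) when limit $1 is numerically below the wanted
# value $2. "unlimited" is never too low.
limit_too_low() {
    [ "$1" != "unlimited" ] && [ "$1" -lt "$2" ]
}

# Raise the soft limit for this shell (and its child processes) to $1 if needed.
raise_nproc_limit() {
    want="$1"
    if ! limit_too_low "$(ulimit -Su)" "$want"; then
        return 0    # soft limit is already high enough
    fi
    if limit_too_low "$(ulimit -Hu)" "$want"; then
        echo "hard limit $(ulimit -Hu) is below $want; an administrator must raise it" >&2
        return 1
    fi
    ulimit -Su "$want"
}
```

Calling `raise_nproc_limit 63536` and then starting `matlab` from the same shell lets the MATLAB process and its workers inherit the raised limit.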

Answers (0)
