Major speed reduction (~50x) when doing multiple matrix multiply operations in a row. Is this a (major) bug? Any ideas for a fix?

The code I am running more-or-less implements the following
% PREALLOCATE. For the current tests I'm running: N=1200, M=2049, and L=256
dataCur = <some [N x N x M] array>
dataOut = zeros(L,N,N);
% START PRIMARY / OUTER LOOP
for kkOuter = 1:L
    % COMPUTE 'dataOut' SLICE FOR CURRENT OUTER LOOP ITERATION
    % NOTE: this is the same as doing "dataOut(kkOuter,:,:) = reshape(sum(dataCur,3),[1,N,N])"
    dataOut(kkOuter,:,:) = reshape((reshape(dataCur,[],M)*ones(M,1)).',[1,N,N]);
    % UPDATE 'dataCur' BY LOOPING THROUGH IT ONE SLICE AT A TIME USING AN INNER LOOP
    for kkInner = 1:M
        W = someFunction(kkInner); % this is a [N x N] array
        dataCur(:,:,kkInner) = transpose(W) * dataCur(:,:,kkInner) * W;
    end
end
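For timing experiments, a self-contained miniature of this loop might look like the following. The sizes and the random W are illustrative stand-ins (not the real data or someFunction), and the sum form from the comment replaces the reshape/ones trick:

```matlab
% Miniature, self-contained version of the loop above for timing tests.
% Sizes and the random W are hypothetical stand-ins, not the real workload.
N = 50; M = 20; L = 8;
dataCur = complex(rand(N,N,M), rand(N,N,M));
dataOut = zeros(L, N, N);
for kkOuter = 1:L
    % same result as the reshape/ones trick: sum over the 3rd dimension
    dataOut(kkOuter,:,:) = reshape(sum(dataCur,3), [1,N,N]);
    for kkInner = 1:M
        W = rand(N);                                  % stand-in for someFunction(kkInner)
        dataCur(:,:,kkInner) = W.' * dataCur(:,:,kkInner) * W;
    end
end
```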
The evolution of the code's efficiency is shown here (note: the x axis is actually inner loop iterations / 10, since I collected time data every 10th iteration of the inner loop).
Clearly something is wrong.
Note that there is a "shift" where it goes from a "slow mode" of ~0.5 iterations per second to a "fast mode" of ~20 iterations per second. This can be seen in the upper image. The lower image shows the overall average rate since the beginning of the current outer loop iteration, so you don't see the sudden shift there as much.
.
WHAT I'VE FIGURED OUT
The problem seems to be related to thread scheduling somehow or other. In that link, the first 2 images show it running in "fast mode" (on Linux Mint), and the last 3 show it running in "slow mode".
Unfortunately, that's about all I've got. I've ruled a few things out (see below), but have no idea why this consistently keeps happening.
.
WHAT I'VE RULED OUT
1) The OS. Unless Windows 10 and Linux Mint 19 both have this exact same issue.
2) MKL. Unless MKL 2017.0, 2018.0, 2018.3 and 2019.0 all have the exact same problem. 2018.3 and 2019.0 were used by launching MATLAB with the following command (this shows 2019.0 on Linux; using Windows or using 2018.3 is fairly analogous)
/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/bin/mklvars.sh intel64 ilp64 &&
export "BLAS_VERSION=/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin/libmkl_rt.so" &&
export "LAPACK_VERSION=/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin/libmkl_rt.so" &&
export "MKL_THREADING_LAYER=INTEL" &&
export "MKL_INTERFACE_LAYER=ILP64" &&
"/usr/local/MATLAB/$(ls /usr/local/MATLAB | sort -g | tail -n 1)/bin/matlab"
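One subtlety worth noting with launchers like this (a generic illustration, not specific to mklvars.sh): environment variables exported by a script run as a child process never reach the calling shell; the script must be sourced for its exports to take effect. A quick demonstration:

```shell
# Demo: exports from a child process do not propagate to the parent shell,
# while sourcing ('.') runs the script in the current shell so they do.
cat > /tmp/setvars_demo.sh <<'EOF'
export DEMO_VAR=from_script
EOF

sh /tmp/setvars_demo.sh        # runs in a child process; DEMO_VAR is NOT set here afterwards
echo "after sh: '${DEMO_VAR}'"

. /tmp/setvars_demo.sh         # sourced; DEMO_VAR IS set in this shell
echo "after source: '${DEMO_VAR}'"
```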
So, unless the processor is outright ignoring the requested thread scheduling for a while (which I don't know is even possible), MATLAB seems the most likely culprit.
3) Memory fragmentation. I thought this might be some weird physical vs virtual memory issue where MATLAB's virtual addresses were sequential but the corresponding physical addresses got more and more fragmented, but I've tried a) implementing the inner loop in a separate function, and b) forcing the data into new memory addresses at every outer loop iteration using the following at the start of each outer loop iteration:
clear dataOld;
dataOld = zeros(size(dataCur),'like',dataCur) + ones(1,1,class(dataCur)) - ones(1,1,class(dataCur));
dataOld = dataCur;
clear dataCur;
dataCur = zeros(size(dataOld),'like',dataOld);
If that doesn't force the data to be deep copied to new addresses then I don't know what would.
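One caveat with the snippet above (an editorial note, worth double-checking): plain assignment such as `dataOld = dataCur` is copy-on-write in MATLAB, so the two arrays share the same underlying memory until one of them is modified. A trivial indexed write is one common way to force an actual deep copy:

```matlab
% Copy-on-write caveat: after this assignment, dataOld and dataCur share
% the same underlying memory until one of them is written.
dataCur = rand(4);          % toy data for illustration
dataOld = dataCur;
% A trivial in-place write forces MATLAB to unshare (deep copy) the array:
dataOld(1) = dataOld(1);    % dataOld now has its own memory
```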
.
OTHER INFO
CPU is an i9-7940X. This has AVX-512, which MATLAB doesn't seem to understand. For example, running `version -blas` tells me the CNR branch is unknown (this usually tells you if it uses AVX/AVX2/SSE/etc.). MKL is definitely using AVX-512, but I figure this might possibly have something to do with it? (mainly because nothing else I try seems to have any effect)
.
Any ideas for a fix that don't involve "waiting for an undetermined number of MATLAB versions until it works right out-of-the-box" would be much appreciated.
Thanks in advance.
  2 comments
Anthony Barone on 5 Oct 2018
Edited: Anthony Barone on 5 Oct 2018
Happens on both R2018a and R2018b.
It is perhaps worth noting that the data are complex, so maybe it is something with the interleaved-complex API? If this seems even remotely possible I can install an older version and try it. I'd think interleaved storage would make things easier, since MKL uses interleaved complex values, but who knows...


Answers (1)

James Tursa on 5 Oct 2018
Edited: James Tursa on 5 Oct 2018
You should not store your slices as follows, since this forces each slice to be scattered in memory and does not make efficient use of the cache when accessing the slices. It also forces a deep data copy every time you access the slices:
dataOut(kkOuter,:,:)
Rather, you should be storing your slices this way, since this forces each slice to be contiguous in memory and makes efficient use of the cache when accessing the slices:
dataOut(:,:,kkOuter)
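A minimal sketch of the difference between the two layouts (with made-up sizes): since MATLAB stores arrays column-major, a slice in the last dimension is one contiguous block, while a slice in the first dimension is strided.

```matlab
% Compare writing slices along the first vs. the last dimension.
% Sizes are illustrative only, not the poster's actual N/L.
N = 500; L = 64;
A = zeros(L, N, N);   % slices A(k,:,:) are strided (stride L between elements)
B = zeros(N, N, L);   % slices B(:,:,k) are contiguous (N*N elements in a row)
S = rand(N, N);

tic
for k = 1:L
    A(k,:,:) = reshape(S, [1, N, N]);   % scattered writes
end
tFirst = toc;

tic
for k = 1:L
    B(:,:,k) = S;                       % one contiguous block write per slice
end
tLast = toc;

fprintf('first-dim slices: %.3fs, last-dim slices: %.3fs\n', tFirst, tLast);
```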
Complex data comments:
The BLAS complex matrix multiply routines require interleaved data. Since R2018a and later also use this format, there will be no data copy required for this particular reason. However, there may be a data copy required when accessing slices of a matrix (rather than the entire matrix). There are indications in other Answers threads that MATLAB may implement methods that do not require data copies in some circumstances when the slices are in the first two dimensions as recommended above, but the rules for this are not published to my knowledge. You could of course write mex code that calls the BLAS routines directly to do the matrix multiplies and avoid any data copying in this case, but I don't think any of the FEX submissions that do this are updated for R2018a and later yet (including my own submission).
  9 comments
Anthony Barone on 5 Oct 2018
dataCur is symmetric, but not Hermitian.
Basically, dataCur is recorded ground motion from a seismic wavefield where the (reordered) dimensions are [sourceLocationIndex, receiverLocationIndex, timeIndex]. The data are made symmetric in the source/receiver dimensions using an idea called reciprocity, which basically states that (in the high/infinite frequency limit) you record the same wavefield going from source@A-->receiver@B as from source@B-->receiver@A, since the path the wavefield takes is the same. This means data(a,b,c) and data(b,a,c) should be the same - if one is missing then the other uses the same value as the one you have; otherwise you average them.
Put another way, taking the transpose flips the positions in space, taking the conjugate flips the positions in time, and taking the Hermitian conjugate flips both. In this case I want the spatial positions flipped but I still want data from a wavefield moving forward in time.
Just to be 100% sure, I looked at the handful of data points where I actually had data in both data(a,b,:) and data(b,a,:). If these should be symmetric then their imaginary parts should subtract to zero. If these should be Hermitian then the imaginary parts should add to zero. This is what I found. It's real data (and not particularly good data) so things don't cancel perfectly, but the residuals are substantially lower for the transpose case than for the Hermitian conjugate case.
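The symmetrization and residual check described here could be sketched as follows (a reconstruction of the idea with toy data, not the actual pipeline):

```matlab
% Toy complex data, a hypothetical stand-in for the recorded wavefield.
data = complex(rand(6,6,4), rand(6,6,4));

% Enforce reciprocity: dataSym(a,b,t) == dataSym(b,a,t), using the plain
% spatial transpose (permute), NOT the conjugate transpose.
dataSym = 0.5 * (data + permute(data, [2 1 3]));

% Residual check on the symmetrized data: the transpose residual is ~0 by
% construction; the Hermitian residual is generally nonzero.
resT = max(abs(reshape(dataSym - permute(dataSym, [2 1 3]), [], 1)));
resH = max(abs(reshape(dataSym - conj(permute(dataSym, [2 1 3])), [], 1)));
```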
I know which MKL routines you are referring to, but unfortunately, I don't think MKL supports the accelerated BLAS routines you are talking about for symmetric complex values. It does for symmetric reals and Hermitian conjugates, but I don't think symmetric complexes will work. I'll double check though. Maybe 2019.0 added support for these...
.
As far as N and M: it depends, but my current test case is quite small by seismic data standards and has N=2000 and M=2049. This is probably standard for M (seismic data very frequently has 3000 time samples, so zero padding up to the next power of 2 and dropping the negative frequencies gives 2049). This is close to a lower limit for N though. Realistically, N will be as large as the machine can support. I only have 128 GB in the personal rig I'm currently running this on, but dual-socket Xeons with 512 GB-1 TB are fairly common, and there are clusters with significantly more than that. So, I don't know, anywhere in the range of N=1,000 to N=40,000?
Anthony Barone on 8 Oct 2018
So, I think I may have figured out the problem. I noticed that a fair number of the environment variables that `mklvars.sh intel64 mod ilp64` sets were being silently overwritten / removed during startup. If I re-set these using `setenv` things seem to be quite a bit more stable. I also added an environment variable that should dictate the threading affinity.
Specifically, I'm using the following command.
alias mymatlab='/opt/intel/compilers_and_libraries/linux/mkl/bin/mklvars.sh intel64 mod ilp64 verbose &&
export "BLAS_VERSION=/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin/libmkl_rt.so" &&
export "LAPACK_VERSION=/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin/libmkl_rt.so" &&
export "MKL_THREADING_LAYER=INTEL" &&
export "MKL_INTERFACE_LAYER=ILP64" &&
export "KMP_AFFINITY=granularity=fine,compact,1,0" &&
export "MKL_NUM_THREADS=14" &&
export "OMP_NUM_THREADS=14" &&
export &&
export "MKL_CBWT=AVX512" &&
cd ${HOME}/Documents/MATLAB &&
"/usr/local/MATLAB/$(ls /usr/local/MATLAB | sort -g | tail -n 1)/bin/matlab" -r "
setenv('\''LD_LIBRARY_PATH'\'',['\''/opt/intel/compilers_and_libraries_2019.0.117/linux/tbb/lib/intel64_lin/gcc4.7:/opt/intel/compilers_and_libraries_2019.0.117/linux/compiler/lib/intel64_lin:/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin'\'','\'':'\''*~isempty(getenv('\''LD_LIBRARY_PATH'\'')),getenv('\''LD_LIBRARY_PATH'\'')]);
setenv('\''LIBRARY_PATH'\'',['\''/opt/intel/compilers_and_libraries_2019.0.117/linux/tbb/lib/intel64_lin/gcc4.7:/opt/intel/compilers_and_libraries_2019.0.117/linux/compiler/lib/intel64_lin:/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin'\'','\'':'\''*~isempty(getenv('\''LIBRARY_PATH'\'')),getenv('\''LIBRARY_PATH'\'')]);
setenv('\''NLSPATH'\'',['\''/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin/locale/%l_%t/%N'\'','\'':'\''*~isempty(getenv('\''NLSPATH'\'')),getenv('\''NLSPATH'\'')]);
setenv('\''CPATH'\'',['\''/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/include/intel64_lin/ilp64:/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/include'\'','\'':'\''*~isempty(getenv('\''CPATH'\'')),getenv('\''CPATH'\'')]);
setenv('\''PKG_CONFIG_PATH'\'',['\''/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/bin/pkgconfig'\'','\'':'\''*~isempty(getenv('\''PKG_CONFIG_PATH'\'')),getenv('\''PKG_CONFIG_PATH'\'')]);
setenv('\''MKL_NUM_THREADS'\'','\''14'\'')"'
That said, I've made quite a few other changes to the code over the past few days. But it kind of makes sense that this might be the cause (since I doubt MATLAB's built-in versions of libraries like libiomp5.so know how to deal with AVX-512), though that's just a guess on my part.

