Diagnosing parallelization bottlenecks / Differences between Intel and AMD parallel computing performance?
I am trying to diagnose a parallel computing bottleneck that I've encountered on two different computers. Each worker within the ‘parfor’ loop is assigned one sparse array out of 101 total (each array’s ‘full’ size is approximately 50,000x250). Each worker: 1) converts the sparse array into a ‘full’ array, 2) convolves the array with a small Gaussian kernel (which is also passed to the worker), and 3) performs ICA using the ‘fast_ica’ function; the output is the independent component weight matrix.

Recently, I started working on a new computer with a substantially higher core count, but the code seems to hit a bottleneck on it, and adding workers beyond a certain point gives no further speedup. The old system has an Intel i7-8700K CPU (6 physical/12 logical cores) and runs MATLAB R2018b; the new system has an AMD Ryzen 9 5950X (16 physical/32 logical cores) and runs MATLAB R2020b. Both systems have 64 GB RAM, run Windows 10, and have hyper-threading/SMT enabled. To compare parallel performance across systems, I ran the same code on both computers using different numbers of workers:

The top row shows the results for the older Intel system and the bottom row for the newer AMD system. In the left column, the left axis shows the total execution time of each parfor run (measured with tic/toc before and after the loop), and the right axis shows the difference in execution time between N and N+1 workers (points near 0 mean N+1 workers gave no improvement over N workers). Vertical dotted lines mark the number of physical cores. The right column shows system resource utilization during each run.

What I noticed is that in both cases using more than about 8 or 9 workers does not improve performance. This is despite the fact that a) more RAM and CPU resources are being used, and b) 9 workers represent 150% of the physical cores (75% of the logical cores) on the Intel system but only 56% of the physical cores (28% of the logical cores) on the AMD system. Diminishing returns from multi-threading past the physical core count can’t be the explanation, since the ‘bottleneck’ occurs well below the physical core count of the AMD system (16) and well above that of the Intel system (6). To me the most interesting ‘clue’ is that the worker count at which improvement stops is about 8 or 9 on both systems; however, it's possible this is a coincidence, and I don’t know quite how to interpret it. So my questions are:
- Given the difference in CPU memory architecture, are there known differences in parallel computing performance between Intel and AMD Ryzen CPUs?
- Given that the few physical limitations that I looked at (RAM, CPU utilization, physical core count) do not seem to be the problem, what else is likely to be bottlenecking me here?
- How can I further diagnose the source of the bottleneck (in terms of potential answers to question 2, or more generally)?
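For readers who want to reproduce this kind of measurement, the worker-count sweep described above might be sketched roughly as follows. This is only an illustration under stated assumptions: the data are synthetic and much smaller than the real arrays, the kernel is a made-up 1-D Gaussian, and the ICA step is omitted because the signature of ‘fast_ica’ is not given in the question.

```matlab
% Sketch only: synthetic stand-ins for the real data, ICA step omitted.
nArrays = 8;                 % the question uses 101 arrays
sz = [5000 250];             % the question uses roughly 50,000 x 250
% For scale: one full 50,000 x 250 double array is 50000*250*8 = 1e8 bytes (~100 MB).

data = cell(1, nArrays);
for k = 1:nArrays
    data{k} = sprand(sz(1), sz(2), 0.05);   % random sparse placeholder
end

kernel = exp(-(-3:3)'.^2 / 2);              % hypothetical small Gaussian kernel
kernel = kernel / sum(kernel);

workerCounts = [1 2 4];                     % pool sizes to compare
elapsed = zeros(size(workerCounts));
for w = 1:numel(workerCounts)
    delete(gcp('nocreate'));                % close any existing pool
    parpool(workerCounts(w));               % open a pool of the requested size
    out = cell(1, nArrays);
    timer = tic;
    parfor k = 1:nArrays
        A = full(data{k});                      % step 1: sparse -> full
        out{k} = conv2(A, kernel, 'same');      % step 2: smooth with the kernel
        % step 3 (ICA) would go here
    end
    elapsed(w) = toc(timer);
end

% Time saved between consecutive pool sizes; values near 0 mean no benefit
gain = -diff(elapsed);
```

Pool startup is kept outside the tic/toc region so that only the parfor loop itself is timed.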
6 comments
Walter Roberson
on 17 Mar 2021
Did you ensure that the AMD system is using the fast code path? https://www.reddit.com/r/matlab/comments/frrhnv/matlab_r2020a_fixes_codepath_usage_on_amd_cpus/
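One rough way to look into this on each machine is to print the BLAS/LAPACK libraries MATLAB reports and time a dense matrix multiply. This is only a sketch: the matrix size is arbitrary, and the output does not by itself say which code path the library selected. It only gives a number that can be compared across the two systems.

```matlab
% Report the math libraries this MATLAB session is using
blasInfo = version('-blas');
lapackInfo = version('-lapack');
disp(blasInfo);
disp(lapackInfo);

% Rough dense-math benchmark (matrix size chosen arbitrarily)
n = 2000;
A = rand(n);
B = rand(n);
tMul = timeit(@() A*B);                  % median time of repeated runs
gflops = 2*n^3 / tMul / 1e9;             % a matrix multiply costs about 2*n^3 flops
fprintf('A*B with n = %d: %.3f s (%.1f GFLOP/s)\n', n, tMul, gflops);
```

A throughput far below what the hardware would suggest, relative to the Intel machine, would be a reason to look more closely at the linked post.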
Edric Ellis
on 17 Mar 2021
Hm, this definitely feels like you're hitting a resource limit somewhere. I'm no expert on CPU architectures, but I took a quick look at https://en.wikichip.org/wiki/amd/ryzen_9/5950x vs. https://en.wikichip.org/wiki/intel/core_i7/i7-8700k - and one thing that I notice is that the AMD has only moderately higher memory bandwidth than the Intel chip. So, it is possible that memory bandwidth is the limiting factor (I don't know of a way to prove that though). You could consider doing something a bit like this:
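spmd
    t = zeros(1,numlabs);
    for nw = 1:numlabs
        % Synchronize all workers so the timing starts together
        labBarrier();
        timer = tic();
        if labindex <= nw
            % Only run on the first nw workers
            for idx = 1:10
                performCalculation();
            end
        end
        % Wait for the busy workers to finish before stopping the clock
        labBarrier();
        t(nw) = toc(timer);
    end
end
% t is a Composite; display each worker's vector of timings
t{:}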
The aim here is to perform the timing without any of the parfor potential overheads getting in the way. I'm sort-of expecting the iteration time to increase the more workers are contending. The other thing you could try is using mpiprofile to run the MATLAB profiler on the workers to see if there's a particular part of your computation that is getting slower as more contention is involved. I.e. with a couple of different pool sizes, try:
mpiprofile on
spmd
    for idx = 1:10
        performCalculation();
    end
end
mpiprofile viewer
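The timings from the spmd timing loop above could be condensed into one number per active-worker count along these lines. This is a sketch: the values in tCell are made-up placeholders rather than measurements, and with a real run the matrix would come from the Composite t instead (assuming vertcat(t{:}) is used to collect it).

```matlab
% Placeholder timings (NOT measurements): one row vector per worker,
% entry nw = seconds for the round in which nw workers were active.
tCell = {[1.0 1.1 1.3], [1.0 1.1 1.3], [1.0 1.1 1.3]};
% With a real run, use the Composite instead: tAll = vertcat(t{:});
tAll = vertcat(tCell{:});            % rows: workers, columns: active-worker count

itersPerRound = 10;                  % matches the inner loop "for idx = 1:10"
wallPerRound = max(tAll, [], 1);     % slowest worker defines the round's wall time
perCall = wallPerRound / itersPerRound;

% Slowdown relative to a single active worker; values well above 1
% suggest the workers are contending for a shared resource.
slowdown = perCall / perCall(1);
disp(slowdown);
```

If slowdown stays near 1 as more workers become active, the calculation itself scales and the limit is more likely in parfor overheads such as data transfer.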
