Summing array elements seems to be slow on GPU
I am measuring the execution time of the following function on the CPU and on the GPU
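function funTestGPU(P,U,K,UN)
for k = 1:P
    H = exp(1i*K);              % element-wise complex exponential
    HU = U.*H;                  % element-wise product; U expands along dimension 3
    UN(k,:) = sum(HU,[1,3]);    % sum over dimensions 1 and 3
end
end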
where U and UN are complex arrays of size P×P and K is a complex array of size P×P×P. So in each iteration I perform an element-wise exp(), an element-wise multiplication of two arrays, and a summation of the elements of a 3D array along two dimensions.
I test the execution time on CPU and on GPU with the help of the following script
P = 200;
URe = 1/(sqrt(2))*rand(P);
UIm = 1/(sqrt(2))*rand(P);
KRe = 1/(sqrt(2))*rand(P,P,P);
KIm = 1/(sqrt(2))*rand(P,P,P);
% CPU
U = complex(URe, UIm);
K = complex(KRe, KIm);
UN = complex(zeros(P), zeros(P));
fcpu = @() funTestGPU(P,U,K,UN);
tcpu = timeit(fcpu);
disp(['CPU time: ',num2str(tcpu)])
% GPU
U = gpuArray(complex(URe, UIm));
K = gpuArray(complex(KRe, KIm));
UN = gpuArray(complex(zeros(P), zeros(P)));
fgpu = @() funTestGPU(P,U,K,UN);
tgpu = gputimeit(fgpu);
disp(['GPU time: ',num2str(tgpu)])
and I obtain the results
CPU time: 9.0315
GPU time: 3.3894
My concern is that if I remove the last operation from funTestGPU (the summation of array elements), I obtain the results
CPU time: 8.0185
GPU time: 0.0045631
So it looks like the summation is the most time-consuming operation on GPU. Is that an expected result?
I wrote analogous code in CuPy and in PyTorch, and there the summation does not seem to be the most time-consuming operation.
I use MATLAB R2019b. My graphics card is an NVIDIA GeForce GTX 1050 Ti (768 CUDA cores) and my processor is an AMD Ryzen 7 3700X (8 physical cores).
Accepted Answer
More Answers (1)
Joss Knight
on 27 Apr 2023
Moved: Matt J
on 27 Apr 2023
1 vote
Why are you recomputing H and HU inside the loop? They do not change. If you remove the sum, because the results are never used from the first (P-1) iterations, only the last computation of those values will actually take place.
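The point about loop-invariant work can be sketched as follows. This is only an illustration: funTestHoisted is a hypothetical name, not code from the thread, and it assumes (as in the function posted in the question) that neither H nor HU depends on k. It also returns UN, which the posted function does not.

```matlab
function UN = funTestHoisted(P,U,K,UN)
% Hypothetical variant of funTestGPU: the loop-invariant work is done once.
H   = exp(1i*K);          % P-by-P-by-P, does not depend on k
HU  = U.*H;               % U (P-by-P) expands along the third dimension
row = sum(HU,[1,3]);      % 1-by-P, summed over dimensions 1 and 3
for k = 1:P
    UN(k,:) = row;        % every row receives the same values
end
end
```

Called as UN = funTestHoisted(P,U,K,UN), this should give the same rows as the original loop with one exp, one multiplication and one sum instead of P of each. The same code should accept gpuArray inputs unchanged.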
6 Comments
Matt J
on 27 Apr 2023
Very strange. I wonder if it is wise to have this "optimization". Essentially, it causes the user's instructions to be disobeyed.
Joss Knight
on 27 Apr 2023
Most people using the GPU want every optimization they can get. For instance, strictly speaking a user who has written
C = A.'*B;
has requested that A be transposed, but in fact this never happens.
There is no way the user can see the underlying behaviour. All the instructions are recorded and if the user attempts to access the results of any operation it will be computed.
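One way to compare the two variants on a more equal footing is to time functions that return their result, so that neither computation can be dropped as unused. The sketch below is an assumption about how one might set this up, not something posted in the thread: the handle names and the reduced size P = 50 are illustrative, and it needs Parallel Computing Toolbox and a supported GPU.

```matlab
% Sketch only: requires Parallel Computing Toolbox and a supported GPU.
P = 50;                                    % smaller than the 200 used in the question
U = gpuArray(complex(rand(P), rand(P)));
K = gpuArray(complex(rand(P,P,P), rand(P,P,P)));

% Each handle returns its result, so the caller actually requests it.
elementwiseOnly = @() U .* exp(1i*K);
withSum         = @() sum(U .* exp(1i*K), [1,3]);

tElem = gputimeit(elementwiseOnly);        % exp and multiplication only
tSum  = gputimeit(withSum);                % exp, multiplication and sum
fprintf('element-wise only: %.4g s\nwith sum:          %.4g s\n', tElem, tSum);
```

If the explanation above holds, the gap between the two timings should be far smaller than the one reported in the question, because the element-wise work is no longer skipped.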
Damian Suski
on 27 Apr 2023
Edited: Damian Suski
on 27 Apr 2023
Joss Knight
on 27 Apr 2023
Yes. You will get better performance from computing the result in a single sum, but you will probably run out of memory so would have to do it in batches:
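function funTestGPU(P,U,K,UN)
% Collect the product from every iteration, then reduce once at the end.
HUall = zeros(P,P,P,P,'like',U);
for k = 1:P
    H = exp(1i*K);
    HUall(:,:,:,k) = U.*H;
end
UN = sum(HUall,[1,3]);         % size 1-by-P-by-1-by-P
UN = permute(UN,[4 2 1 3]);    % reorder to P-by-P; permute needs every dimension listed
end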
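The batching mentioned above could look roughly like this. It is a sketch under loud assumptions: funTestBatched and the batch size B are hypothetical, and because nothing in the posted loop depends on k, the code assumes a phase of the form k*K purely to give each iteration different work. Substitute whatever k-dependence the real problem has.

```matlab
function UN = funTestBatched(P,U,K,B)
% Hypothetical batched variant. ASSUMPTION: the phase depends on k as k*K,
% which is not in the original post; B is an illustrative batch size.
UN = zeros(P,P,'like',U);
for k0 = 1:B:P
    ks = k0:min(k0+B-1,P);              % loop indices handled in this batch
    kv = reshape(ks,1,1,1,[]);          % place them along dimension 4
    HU = U .* exp(1i*(K.*kv));          % P-by-P-by-P-by-numel(ks)
    S  = sum(HU,[1,3]);                 % 1-by-P-by-1-by-numel(ks)
    UN(ks,:) = permute(S,[4 2 1 3]);    % numel(ks)-by-P block of rows
end
end
```

Each batch holds P^3*B complex values at once, so B trades memory for fewer, larger sum calls. B = 1 reduces to one sum per iteration, and B = P corresponds to the single-sum version.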
Damian Suski
on 28 Apr 2023
Damian Suski
on 18 May 2023