Optimize GPU code with nested pagemtimes

Hello all,
I'm trying to speed up computation using the GPUs that are available to me. Right now I have two arrays, Q and W.
size(W) = (16 1 1000)
size(Q) = (16 16 1 2000)
I want to do a pseudo-matrix multiplication M = W'*Q*W to get size(M) = (1000 2000).
To do this I use two calls to pagemtimes, which can run on the GPU. Here's the code:
%%
tic
Sar_pm_gpu = zeros(num_psar_kept, 2, size(shim_pm_gpu,3), 'single', 'gpuArray');
for n = 1:size(W,3)
    inter_calc = pagemtimes(Q_gpu, shim_pm_gpu(:,1,n));
    Sar_this_shim = squeeze(pagemtimes(shim_pm_gpu_left(:,:,n), inter_calc)); % in a test, this one is ~15% faster
    [Sar_maxk, index_maxk] = max(Sar_this_shim);
    Sar_pm_gpu(:,:,n) = [Sar_maxk, index_maxk];
end
With this code I get a ~5x speedup vs. running it on the CPU. However, I'd expect it to be quite a bit faster than that. I then used nvidia-smi, and the power consumption on the GPU was ~35 W. For reference, the resting power consumption is 30 W, so I don't think this code is actually utilizing the GPU. If anyone sees a way to speed this up, it would be much appreciated! (An explanation of why the GPU power consumption is so low with the posted code would also be much appreciated; I assume it has something to do with memory.)

2 comments

Matt J on 28 Jul 2022
You shouldn't be using tic/toc for timing gpuArray operations,
I clipped off the end of the code by accident. I make sure to call
gather(output)
before calling toc, so the timing is accurate.
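
The timing point raised in these comments can be sketched as follows. This is a minimal illustration, not the poster's actual benchmark: the data is random, the handle f is a hypothetical stand-in for the workload, and the CPU branch exists only so the snippet runs on a machine without a supported GPU. gputimeit (Parallel Computing Toolbox) waits for the device to finish and runs the function several times, which avoids the pitfalls of tic/toc with asynchronous gpuArray operations.

```matlab
% Sketch: time a gpuArray computation without tic/toc.
% Array sizes mirror the question; the values are random and illustrative.
W = rand(16, 1, 1000, 'single');
Q = rand(16, 16, 1, 2000, 'single');

% Hypothetical workload: Q*W for every (Q page, W page) pair.
f = @(Q, W) pagemtimes(Q, W);

if canUseGPU()
    Qg = gpuArray(Q);
    Wg = gpuArray(W);
    % gputimeit synchronizes with the device before stopping the clock.
    t = gputimeit(@() f(Qg, Wg));
else
    % CPU fallback so the sketch still runs without a supported GPU.
    t = timeit(@() f(Q, W));
end
fprintf('Elapsed: %.4f s\n', t);
```

If tic/toc is used anyway, calling wait(gpuDevice) or gather on the result before toc serves the same purpose as the gather(output) mentioned above.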


Accepted Answer

Matt J on 28 Jul 2022
I don't think you need either a loop or a second pagemtimes call.
Wr=reshape(W,16,1000);
Qr=reshape(Q,16,16,2000);
M=sum(pagemtimes(Qr,Wr).*Wr,1);
M=reshape(M,1000,2000);
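
For readers wondering why a single pagemtimes call is enough, the algebra can be written out as below. The symbols are introduced here for illustration and are not from the answer: w_n denotes column n of Wr, Q_k denotes page k of Qr, and P is the 16-by-1000-by-2000 result of pagemtimes(Qr,Wr). The derivation assumes real-valued W; for complex W, the W' in the question is a conjugate transpose, so the elementwise factor would need to be conj(Wr) instead of Wr.

```latex
% pagemtimes(Qr, Wr): every page Q_k multiplies every column w_n
P_{i,n,k} = \sum_{j=1}^{16} (Q_k)_{ij}\,(w_n)_j = (Q_k w_n)_i

% .* Wr followed by sum(..., 1): the remaining inner product with w_n
M_{n,k} = \sum_{i=1}^{16} (w_n)_i \, P_{i,n,k} = w_n^{\top} Q_k \, w_n
```

The sum over i replaces the second matrix product, which is why neither the loop nor the outer pagemtimes call is needed.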

4 comments

tiwwexx on 28 Jul 2022
Edited: tiwwexx on 28 Jul 2022
That works amazingly well. Varying the 1000 and 2000 dimensions, this code gets linearly faster than the original code. @Matt J, would you mind a brief explanation as to why your code is so much faster than the code below? (A reference link, or a way to see what memory calls are being made in these different MATLAB functions, would also be very appreciated!)
squeeze(pagemtimes(W,'ctranspose',pagemtimes(Qt,W),'none'))
"Would you mind a brief explanation as to why this code is so much faster than the code below."
I don't find it to be. On my machine, it is even a little bit faster.
W=rand(16,1,1000);
Q=rand(16,16,1,2000);
timeit(@()version1(Q,W))
ans = 0.0974
timeit(@()version2(Q,W))
ans = 0.1330
function M = version1(Q,W)
    Wr = reshape(W,16,1000);
    Qr = reshape(Q,16,16,2000);
    M = sum(pagemtimes(Qr,Wr).*Wr,1);
    M = reshape(M,1000,2000);
end
function M = version2(Q,W)
    M = squeeze(pagemtimes(W,'ctranspose',pagemtimes(Q,'transpose',W,'none'),'none'));
end
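
Before comparing timings, it may be worth confirming that the two versions compute the same thing. The sketch below is an assumption-laden check, not part of the original comment: it uses small, real-valued random arrays on the CPU (sizes 50 and 40 are arbitrary stand-ins for 1000 and 2000) and an explicit double loop as the reference.

```matlab
% Sanity check (CPU, real-valued random data): loop reference vs. both versions.
rng(0);                          % fixed seed, for repeatability only
W = rand(16, 1, 50);             % smaller than in the thread, to keep the loop quick
Q = rand(16, 16, 1, 40);

% Reference: explicit double loop over every (W page, Q page) pair.
Mref = zeros(50, 40);
for n = 1:50
    for k = 1:40
        Mref(n, k) = W(:, 1, n)' * Q(:, :, 1, k) * W(:, 1, n);
    end
end

% Vectorized form from the accepted answer (version1).
Wr = reshape(W, 16, 50);
Qr = reshape(Q, 16, 16, 40);
M1 = reshape(sum(pagemtimes(Qr, Wr) .* Wr, 1), 50, 40);

% Nested pagemtimes form (version2).
M2 = squeeze(pagemtimes(W, 'ctranspose', pagemtimes(Q, 'transpose', W, 'none'), 'none'));

fprintf('max |M1 - Mref| = %g\n', max(abs(M1 - Mref), [], 'all'));
fprintf('max |M2 - Mref| = %g\n', max(abs(M2 - Mref), [], 'all'));
```

Note that version2 transposes Q while version1 does not; for real W the scalar w'*Q*w equals w'*Q.'*w, so both should agree with the reference up to rounding.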
Matt J on 28 Jul 2022
Edited: Matt J on 28 Jul 2022
It seems to be slower only on the GPU. pagemtimes isn't well-optimized for the GPU, it would appear.
tiwwexx on 28 Jul 2022
Hmm, very interesting indeed. I have a feeling that I'm eventually going to need to learn CUDA since I run into these problems quite often...


Release

R2021b
