GPU optimization of looped vector operations
I think I am making a simple mistake. I am comparing a vectorized integration on the GPU and on the CPU. This code takes ~365 s on the GPU (NVIDIA Quadro P5200) and ~96 s on the parallel CPU (6 workers, Xeon E-2176M, 2.7 GHz). The integral is a straightforward operation on vectors ~90,000 points long in this example, repeated ~90,000 times in the loop. As a sanity check, multiplying two 10,000x10,000 arrays of random numbers takes 0.65 s on my GPU and 10.8 s on my CPU. Yet in the example below the GPU is slower, even with large arrays. It seems as though the loop introduces a lot of overhead in the GPU operations.
What is the best strategy to optimize this problem for the GPU?
nubar_low = 2450;
nubar_high = 3350;
p_density = 100; %points per wavenumber
nu_bar = nubar_low:1/p_density:nubar_high;
K = zeros(size(nu_bar));
nub = nu_bar;
n_inf = 0;
k_max = 0.01; %max k
nub_0 = 2800; %nu bar center to absorption
gamma = 50; %width of the absorption
K = k_max * (gamma/2)^2 * ( ((nub-nub_0).^2 + (gamma/2)^2).^-1 - ((nub+nub_0).^2 + (gamma/2)^2).^-1);
% dK data is the derivative of K --> d(K)/d(nubar)
% Use value on either side of the point where possible
dK = zeros(size(K));
dK(2:end-1) = (K(3:end)-K(1:end-2))./(nu_bar(3:end)-nu_bar(1:end-2));
% Endpoints are special case.
dK(1) = (K(2)-K(1))./(nu_bar(2)-nu_bar(1));
dK(end) = (K(end)-K(end-1))./(nu_bar(end)-nu_bar(end-1));
len=length(nu_bar);
dN_KK = zeros(1,len);
% The integral
tic
try
    canUseGPU = parallel.gpu.GPUDevice.isAvailable;
catch ME
    canUseGPU = false;
end
%canUseGPU = false;
if canUseGPU
    %integral using the GPU
    gnu_bar = gpuArray(nu_bar);
    gK = gpuArray(K);
    gdK = gpuArray(dK);
    gdN_KK = gpuArray(dN_KK);
    for i = 1:len
        gdN_KK(i) = sum(gnu_bar([1:i-1, i+1:end]) .* gK([1:i-1, i+1:end]) ./ (gnu_bar([1:i-1, i+1:end]).^2 - gnu_bar(i).^2));
        gdN_KK(i) = 2*gdN_KK(i) + gK(i)./(2*gnu_bar(i)) + gdK(i);
    end
    dN_KK = gather(gdN_KK);
else
    %integral using the CPU (parfor)
    parfor i = 1:len
        dN_KK(i) = sum(nu_bar([1:i-1, i+1:end]) .* K([1:i-1, i+1:end]) ./ (nu_bar([1:i-1, i+1:end]).^2 - nu_bar(i).^2));
        dN_KK(i) = 2*dN_KK(i) + K(i)./(2*nu_bar(i)) + dK(i);
    end
end
% Scales data
dN_KK = (1/(pi*p_density))*dN_KK;
% Adds constant for N infinity
N_KK = dN_KK + n_inf;
toc
4 comments
Joss Knight
on 29 Aug 2019
Lloyd, have you tried running your computation in single precision? Your Quadro P5200 has respectable single precision performance of about 8 or 9 teraflops, but like most graphics cards except special ones, its double precision performance is a small fraction of that at about 280 gigaflops (figures from the Wikipedia page where NVIDIA post their specs). This is why you're not getting much better matrix multiply performance out of your GPU than your CPU - this would be dramatically different in single. It is perfectly normal for an algorithm to run faster on the CPU than on one of these graphics-focussed cards, especially if it is an algorithm with a lot of unvectorized loops and a multiprocess parallelization.
Accepted Answer
Matt J
on 28 Aug 2019
Edited: Matt J on 6 Sep 2019
This modification uses mat2tiles from the File Exchange to help divide the computation into bigger, vectorized chunks.
It runs in about 2 seconds on my graphics card (GeForce GTX 1080 Ti). Aside from increased vectorization, the key is to eliminate all the indexing expressions x([1:i-1, i+1:end]). Those are costly.
tic;
gnu_bar = gpuArray(nu_bar);
gK = gpuArray(K);
gdK = gpuArray(dK);
gdN_KK = gpuArray(dN_KK);
chunksize = 1000;
vv = gnu_bar.^2;
vvchunks = mat2tiles(vv, [1,chunksize]);  % split nu_bar.^2 into 1-by-chunksize tiles
numer = gnu_bar.*gK;
c = 1;  % linear index of the first diagonal (i == j) element in the current chunk
for k = 1:numel(vvchunks)
    % Each column of Q holds every term nu_j*K_j/(nu_j^2 - nu_i^2) for one i
    Q = numer(:)./(vv.' - vvchunks{k});
    Q(c:len+1:end) = 0;         % zero the singular i == j terms
    c = c + size(Q,2);
    vvchunks{k} = sum(Q,1);     % sum over j for each i in this chunk
end
gdN_KK = [vvchunks{:}];
gdN_KK = 2*gdN_KK + gK./(2*gnu_bar) + gdK;
wait(gpuDevice)  % make sure all GPU work has finished before toc
toc %Elapsed time is 2.027665 seconds.
15 comments
Matt J
on 6 Sep 2019
Edited: Matt J on 6 Sep 2019
Okay, I did make a few fixes. To be sure we're on the same page, I'm sharing the test code below; I see strong agreement between the two versions:
nubar_low = 2450;
nubar_high = 2451;
p_density = 100; %points per wavenumber
nu_bar = nubar_low:1/p_density:nubar_high;
K = zeros(size(nu_bar));
nub = nu_bar;
n_inf = 0;
k_max = 0.01; %max k
nub_0 = 2800; %nu bar center to absorption
gamma = 50; %width of the absorption
K = k_max * (gamma/2)^2 * ( ((nub-nub_0).^2 + (gamma/2)^2).^-1 - ((nub+nub_0).^2 + (gamma/2)^2).^-1);
% dK data is the derivative of K --> d(K)/d(nubar)
% Use value on either side of the point where possible
dK = zeros(size(K));
dK(2:end-1) = (K(3:end)-K(1:end-2))./(nu_bar(3:end)-nu_bar(1:end-2));
% Endpoints are special case.
dK(1) = (K(2)-K(1))./(nu_bar(2)-nu_bar(1));
dK(end) = (K(end)-K(end-1))./(nu_bar(end)-nu_bar(end-1));
len=length(nu_bar);
dN_KK = zeros(1,len);
len,  % display the number of points used in this test
tic;
gnu_bar = gpuArray(nu_bar);
gK = gpuArray(K);
gdK = gpuArray(dK);
gdN_KK = gpuArray(dN_KK);
%%%% ORIGINAL %%%%%
for i = 1:len
    gdN_KK(i) = sum(gnu_bar([1:i-1, i+1:end]) .* gK([1:i-1, i+1:end]) ./ (gnu_bar([1:i-1, i+1:end]).^2 - gnu_bar(i).^2));
    gdN_KK(i) = 2*gdN_KK(i) + gK(i)./(2*gnu_bar(i)) + gdK(i);
end
version1 = gdN_KK ;
%%%% OPTIMIZED %%%%%%
chunksize=5;
vv=gnu_bar.^2;
vvchunks=mat2tiles( vv , [1,chunksize]);
numer=gnu_bar.*gK;
c=1;
for k=1:numel(vvchunks)
    Q=numer(:)./(vv.'-vvchunks{k});
    Q(c:len+1:end)=0;
    c=c+size(Q,2);
    vvchunks{k}=sum(Q,1);
end
gdN_KK=[vvchunks{:}];
gdN_KK = 2*gdN_KK + gK./(2*gnu_bar) + gdK;
%wait(gpuDevice)
toc
version2 = gdN_KK ;
plot(1:len,version1,'-',1:len,+version2,'x'); legend('Lloyd','Matt')
More Answers (1)
Lloyd Bumm
on 6 Sep 2019
Edited: Lloyd Bumm on 6 Sep 2019
3 comments
Joss Knight
on 6 Sep 2019
Thanks. The only explanation for that is that your cost is all overhead on the GPU, and not computation.