gather takes really long after using ptx file /CUDA

Question

0 votes

I am trying to implement a matrix multiplication in CUDA via a PTX file, to gain an advantage over MATLAB's built-in functions. My .cu code appears to compute the matrix multiplication faster than MATLAB's built-in function, but the gather command after running the kernel takes much longer than the gather after using MATLAB commands on gpuArray data:

Here is my MATLAB file to compare both:

    g=gpuDevice
    reset(g)
    clear all
    N=1024;
    A=rand(N,N);
    B=rand(N,N);
    % gpuDevice using MATLAB
    A_gpu=gpuArray(A); % create array on GPU
    B_gpu=gpuArray(B); % create array on GPU
    tic
    C_gpu=A_gpu*B_gpu;
    toc
    tic
    C=gather(C_gpu); % copy array from GPU to CPU
    toc
    % now using CUDA
    A=A';
    a_gpu=gpuArray(A(:)'); % create array on GPU, flattened to a vector
    b_gpu=gpuArray(B(:)'); % create array on GPU, flattened to a vector
    c_gpu=gpuArray(zeros(N*N,1));
    k = parallel.gpu.CUDAKernel('matrixmul.ptx', 'matrixmul.cu');
    k.ThreadBlockSize = [N,1,1];
    k.GridSize=[N,N];
    tic
    [o] = feval(k, c_gpu,a_gpu,b_gpu);
    o=reshape(o,N,N);
    toc
    tic
    c2=gather(o); % back to host
    toc
    % check
    max(max(abs(C-c2)))

My .cu file looks like this:

    __global__ void matrixmul(double *c, double *a, double *b) {
        // one partial product per thread, reduced in shared memory
        __shared__ double cache[1024];
        int cacheIndex = threadIdx.x;

        int Aind = threadIdx.x + blockIdx.x * gridDim.y;
        int Bind = threadIdx.x + blockIdx.y * gridDim.x;
        cache[cacheIndex] = a[Aind] * b[Bind];
        __syncthreads();

        // tree reduction over the block
        int i = blockDim.x / 2;
        while (i != 0) {
            if (cacheIndex < i)
                cache[cacheIndex] += cache[cacheIndex + i];
            __syncthreads();
            i /= 2;
        }

        // thread 0 writes the finished dot product
        if (cacheIndex == 0)
            c[blockIdx.y * gridDim.y + blockIdx.x] = cache[0];
    }

In my version I use vectors directly instead of matrices, and I transpose the input before starting the calculation so that the elements are laid out in the vector in a better order for memory access.
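The index mapping behind this layout can be checked on the CPU, without a GPU. The following is a minimal sketch, assuming a small N and emulating the kernel's indexing with plain loops; the names At, C_ref, C_chk, bx, by, and t are illustrative and not part of the original script.

```matlab
% CPU-only emulation of the kernel's indexing (illustrative, small N)
N = 4;
A = rand(N, N);
B = rand(N, N);
C_ref = A * B;              % reference result

At = A';                    % same transpose as A=A' in the script
a = At(:)';                 % a(k + i*N + 1) holds A(i+1, k+1)
b = B(:)';                  % b(k + j*N + 1) holds B(k+1, j+1)
c = zeros(N*N, 1);

t = 0:N-1;                  % plays the role of threadIdx.x
for by = 0:N-1              % plays the role of blockIdx.y
    for bx = 0:N-1          % plays the role of blockIdx.x
        % each "block" forms one dot product from contiguous slices
        c(by*N + bx + 1) = sum(a(t + bx*N + 1) .* b(t + by*N + 1));
    end
end

C_chk = reshape(c, N, N);
err = max(max(abs(C_ref - C_chk)));
```

Both slices read by one block are contiguous in memory, which is presumably the access pattern the transpose is meant to achieve.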

That's what I get back:

    Elapsed time is 0.110911 seconds.
    Elapsed time is 0.007010 seconds.
    Elapsed time is 0.001937 seconds.
    Elapsed time is 3.651635 seconds.

    ans =
         1.0800e-12

As you can see, the first gather command takes only 0.007 seconds, while the second one needs more than 3 seconds. Also, if I put all my calling code into a function, the call to that function takes a long time too (even without reading the gpuArray).

Any suggestions on what is going wrong here?

Thanks

Robert


Answer 1

James Lebak on 24 Jan 2013

1 vote

In MATLAB R2012a and later, GPU functions execute asynchronously. To get accurate timings, you need to call the wait function to make sure that GPU execution has finished. To accurately measure the time taken by MATLAB's multiply or by your kernel, rewrite your code as follows:

tic
C_gpu=A_gpu*B_gpu;
wait(g); % g is the value returned by gpuDevice, above
toc
tic
[o] = feval(k, c_gpu,a_gpu,b_gpu);
o=reshape(o,N,N);
wait(g); % g is the value returned by gpuDevice, above
toc
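
To see the effect on the gather timing as well, the same pattern can be extended to a complete script. This is a minimal sketch, assuming a supported GPU and Parallel Computing Toolbox; it creates g without a later `clear all`, and the variable names tCompute and tGather are illustrative and not from the original post.

```matlab
g = gpuDevice;              % handle used by wait(g) below
N = 1024;
A_gpu = gpuArray(rand(N, N));
B_gpu = gpuArray(rand(N, N));
wait(g);                    % finish any pending GPU work before timing

tic
C_gpu = A_gpu * B_gpu;      % queued asynchronously
wait(g);                    % block until the multiply has actually finished
tCompute = toc;

tic
C = gather(C_gpu);          % result is ready, so this times the copy to the host
tGather = toc;

fprintf('compute: %f s, gather: %f s\n', tCompute, tGather);
```

With the wait call in place, the time previously attributed to gather should show up in the compute measurement instead, since gather has to block until the pending GPU work is finished.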
