Using a "CUDAKernel" type object within a parfor loop
Hello,
Thanks in advance for your help.
I'm trying to make use of a CUDAKernel object from within a parfor loop, but when I do so I am met with the message:
Warning: Cannot load an object of class 'CUDAKernel':
No matching constructor signature found.
> In parallel.internal.pool.optionallyDeserialize (line 7)
In parallel.internal.parfor.cppRemoteParallelFunction (line 25)
As a workaround, I have tried passing in a string containing the name of the kernel and constructing the kernel inside the loop, as follows:
cudaKernel = parallel.gpu.CUDAKernel(cudaKernelName+".ptx", cudaKernelName+".cu");
However, constructing the kernel directly in each iteration of the parfor loop greatly reduces performance.
Is there an efficient way to pass an existing CUDAKernel object into a parfor loop?
Joseph
Accepted Answer
More Answers (1)
Aditya Singh
on 22 Jun 2023
Edited: Aditya Singh
on 22 Jun 2023
Hello Joseph,
As I understand it, you cannot use a CUDAKernel object inside a parfor loop, and your workaround creates a new kernel instance on every iteration, which you want to avoid.
To avoid creating a kernel on every iteration, compile the kernel to PTX ahead of time and reuse it. This assumes that the kernel code does not change between iterations.
Suppose that you have a CUDA kernel called `vectorAddKernel` that adds two vectors of size `n` and writes the result to a third vector of the same size. The example is shown below:
% CUDA source for the kernel. Save it as vectorAddKernel.cu and compile it to
% vectorAddKernel.ptx (for example with "mexcuda -ptx vectorAddKernel.cu").
% A template cannot be declared extern "C", so n is passed as an argument.
%
%   extern "C" __global__ void vectorAddKernel(const float* A, const float* B, float* C, int n)
%   {
%       int i = threadIdx.x + blockDim.x * blockIdx.x;
%       if (i < n) {
%           C[i] = A[i] + B[i];
%       }
%   }

% Define the problem size and the number of iterations
n = 1024;
numIterations = 10;

% A CUDAKernel object cannot be sent to the workers, so build it once per
% worker: parallel.pool.Constant evaluates the function handle on each worker.
kernelConst = parallel.pool.Constant(@() parallel.gpu.CUDAKernel('vectorAddKernel.ptx', 'vectorAddKernel.cu'));

output = cell(1, numIterations);
parfor idx = 1:numIterations
    kern = kernelConst.Value;              % kernel already built on this worker
    kern.ThreadBlockSize = [256, 1, 1];    % block size
    kern.GridSize = [ceil(n/256), 1, 1];   % grid size
    A = rand(n, 1, 'single', 'gpuArray');
    B = rand(n, 1, 'single', 'gpuArray');
    C = zeros(n, 1, 'single', 'gpuArray');
    % Non-const pointer arguments are returned as outputs of feval
    C = feval(kern, A, B, C, int32(n));
    output{idx} = gather(C);               % copy the result back to host memory
end
For reference, kindly see:
- Kernel executable on GPU - MATLAB - MathWorks India
- Decide When to Use parfor - MATLAB & Simulink - MathWorks India
- Nested parfor and for-Loops and Other parfor Requirements - MATLAB & Simulink - MathWorks India
Hope this helps!
4 comments
Joss Knight
on 22 Jun 2023
Aditya, this won't work; you need to compile the code to PTX to create a CUDAKernel object.
Aditya Singh
on 22 Jun 2023
Thanks for pointing it out, I changed the sample snippet accordingly.
Joss Knight
on 22 Jun 2023
You can directly compile to PTX from MATLAB using mexcuda with the -ptx option. You should probably add that (and check that the code runs).
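A minimal sketch of that compile step, followed by a quick check on the client (no parfor involved). The file name vectorAddKernel.cu, the non-templated kernel signature with an extra `int n` argument, and the launch sizes are assumptions made for illustration; writelines needs R2022a or later, and the -ptx option needs a release whose mexcuda supports it.

```matlab
% Sketch only: file name, kernel signature and sizes are illustrative assumptions.
cuFile = 'vectorAddKernel.cu';

% Write a small, non-templated kernel (extern "C" keeps the entry name unmangled).
src = [ ...
    "extern ""C"" __global__ void vectorAddKernel(const float* A, const float* B, float* C, int n)"
    "{"
    "    int i = threadIdx.x + blockDim.x * blockIdx.x;"
    "    if (i < n) {"
    "        C[i] = A[i] + B[i];"
    "    }"
    "}"];
writelines(src, cuFile);

% Compile to PTX from MATLAB; this should produce vectorAddKernel.ptx
% in the current folder (requires a working CUDA compiler setup).
mexcuda('-ptx', cuFile);

% Build the kernel on the client and compare its result with A + B.
kern = parallel.gpu.CUDAKernel('vectorAddKernel.ptx', cuFile);
n = 1024;
kern.ThreadBlockSize = [256, 1, 1];
kern.GridSize = [ceil(n/256), 1, 1];
A = rand(n, 1, 'single', 'gpuArray');
B = rand(n, 1, 'single', 'gpuArray');
C = feval(kern, A, B, zeros(n, 1, 'single', 'gpuArray'), int32(n));
assert(max(abs(gather(C) - gather(A + B))) < 1e-6);
```

Running the check once on the client before moving the kernel into a parfor loop separates compile or prototype problems from parallel-pool problems.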
Joseph DeCunha
on 22 Jun 2023