MATLAB Answers


Allocating pinned memory in matlab mex with CUDA

Asked by Ander Biguri on 18 Feb 2019
Latest activity Commented on by Ander Biguri on 13 Aug 2019
I have an application where I call my own CUDA functions from a MEX file. However, the memory transferred can be very big (both input and output), which means that pinned memory could speed the process up quite a lot.
I have seen several posts on the internet and here mentioning that you cannot use pinned memory (cudaMallocHost) with MATLAB variables; however, all of these are from 2017 or older. Now that we are in 2019, and the Parallel Computing Toolbox, CUDA and MATLAB have changed a lot, is this still true? Can pinned memory still not be used? For applications where memory is critical this is a big drawback.
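For context, a minimal sketch of the two usual workarounds inside a MEX file, under the assumption that MATLAB itself still will not hand out pinned allocations: (a) staging the mxArray data through a cudaMallocHost buffer, or (b) page-locking MATLAB's own buffer in place with cudaHostRegister. Neither changes how MATLAB allocates memory; both only affect the host side of the transfer. The single-precision input type is an assumption for illustration.

```cuda
// Sketch (assumption: MATLAB arrays cannot themselves be allocated pinned).
#include "mex.h"
#include <cuda_runtime.h>
#include <cstring>

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    float  *h_in  = (float *)mxGetData(prhs[0]);
    size_t  bytes = mxGetNumberOfElements(prhs[0]) * sizeof(float);

    // (a) Staging copy: costs one extra host memcpy, but the actual
    //     host-to-device transfer runs at pinned speed and can be async.
    float *h_pinned = NULL, *d_in = NULL;
    cudaMallocHost((void **)&h_pinned, bytes);
    memcpy(h_pinned, h_in, bytes);
    cudaMalloc((void **)&d_in, bytes);
    cudaMemcpyAsync(d_in, h_pinned, bytes, cudaMemcpyHostToDevice, 0);

    // (b) Alternatively, page-lock MATLAB's own buffer in place.
    //     It must be unregistered before returning to MATLAB, and
    //     registration itself has a cost, so it only pays off for
    //     large or repeated transfers:
    // cudaHostRegister(h_in, bytes, cudaHostRegisterPortable);
    // cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, 0);
    // ... kernel launches ...
    // cudaHostUnregister(h_in);

    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFreeHost(h_pinned);
}
```

Whether (b) is safe on memory MATLAB owns is exactly the uncertainty the question raises; treat it as something to verify against the current release.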


Hi Matt,
Indeed, that would be an option. This was not implemented for design reasons, and perhaps it is now a bit too late to restructure the entire toolbox, but definitely that would be a solid option to minimize the transfer times.
In any case, for most algorithms and uses of TIGRE, especially when the data is big, the transfer times are just a small fraction of the computational time, so there is only a small theoretical maximum improvement to be had if TIGRE were to return gpuArray objects.
Perhaps I will test this more scientifically at some point to give numbers, but I think it does not exceed 10%, and for industrial-sized datasets, not even 1%.
"In any case, for most algorithms and uses of TIGRE, especially when the data is big, the transfer times are just a small fraction of the computational time"
I'm not sure which algorithms you had in mind here, but performance will definitely suffer for ordered-subset algorithms if you have to do a transfer after every forward/back projection of a subset. The total data set may be large, but the size of a subset can be small in comparison, and the more subsets you have, the more transfers you will have to do. If I were to undertake the task of creating dedicated gpuArray versions of the forward/back projection modules only, are you saying it would be a highly challenging task?
Hi Matt,
You are absolutely right. In fact, a small test I ran not long ago showed that particularly for SART (which updates the image projection by projection), an acceleration of about 10x is expected if the memory transfer is removed and all the data is kept on the GPU.
For industrial/scientific image sizes, SART would still be very slow and not recommended. For medical images, this improvement may be very welcome.
Now, about modifying TIGRE: it may be a challenging task.
Recently I updated TIGRE to work with multiple GPUs, where the transfer to CPU may be required: TIGRE will now break the problem up into chunks if it does not fit on the GPU, allowing reconstructions bigger than before. Modifying this version would be quite a huge workload, as it would require quite big changes on the CUDA side, where there is a lot of memory management involved.
However, modifying the older single-GPU version will likely be considerably easier. Some changes in the CUDA code will be required (as it is the CUDA side that passes memory in and out of the GPU), but only a few lines are needed to do the job. If you were to modify it to use dedicated gpuArrays and succeed, we could find a way to add it to the TIGRE code, and I could add some logic for choosing between the versions (depending on problem size, number of GPUs, etc.). If you are up for the task, please feel free to email me and we can discuss it further.
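The change discussed above could be sketched with the MEX GPU API that ships with the Parallel Computing Toolbox (mxGPUArray): accept a gpuArray input, operate on it in place on the device, and return a gpuArray, so no host transfer happens between iterations. This is a generic illustration, not TIGRE code; the `scale` kernel and the single-precision type are hypothetical stand-ins for a forward/back projection module.

```cuda
// Sketch, assuming single-precision gpuArray input (hypothetical kernel).
#include "mex.h"
#include "gpu/mxGPUArray.h"

__global__ void scale(float *x, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    mxInitGPU();  // must be called before any other mxGPUArray use

    // If prhs[0] is already a gpuArray, no host<->device copy happens here.
    mxGPUArray const *in  = mxGPUCreateFromMxArray(prhs[0]);
    mxGPUArray       *out = mxGPUCopyGPUArray(in);  // deep copy on device
    float *d_out = (float *)mxGPUGetData(out);
    int    n     = (int)mxGPUGetNumberOfElements(out);

    scale<<<(n + 255) / 256, 256>>>(d_out, 2.0f, n);

    // Return the result as a gpuArray: the data never leaves the device,
    // which is what removes the per-subset transfer cost discussed above.
    plhs[0] = mxGPUCreateMxArrayOnGPU(out);
    mxGPUDestroyGPUArray(in);
    mxGPUDestroyGPUArray(out);
}
```

Such a file is compiled with mexcuda rather than mex; dispatch logic on the MATLAB side could then choose between this version and the CPU-array version depending on problem size and GPU count.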


0 Answers