Alex,
Allow me to quote from the help for parallel.gpu.CUDAKernel and try to parse it for you. If you have suggestions for how to improve the wording, please let me know!
"If specified, FUNC must be a string that unambiguously defines the appropriate kernel entry name in the PTX file. If FUNC is omitted, the PTX file must contain only a single entry point."
In your case, get_nans.cu defines two __global__ functions, both instantiations of the same template:
- get_nans<double>
- get_nans<float>
and get_nans.ptx defines the corresponding two entry points:
- _Z16get_nansIdEvPT_PKS0_S3_S3_PKiS5_ (for the double function)
- _Z16get_nansIfEvPT_PKS0_S3_S3_PKiS5_ (for the float function)
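If you are unsure which entry points a PTX file contains, you can list them from MATLAB, since a PTX file is plain text. The following is only a rough sketch: listEntries is a hypothetical helper name of my own (not part of the toolbox), and it assumes that each kernel appears after an ".entry" directive and that the mangled names consist only of letters, digits, and underscores.

```matlab
% Hypothetical helper (not a toolbox function): return the names that
% follow each ".entry" directive in the text of a PTX file.
% Assumption: entry names contain only letters, digits, and underscores.
listEntries = @(ptxText) cellfun(@(tok) tok{1}, ...
    regexp(ptxText, '\.entry\s+(\w+)', 'tokens'), ...
    'UniformOutput', false);

ptxFile = 'get_nans.ptx';
if exist(ptxFile, 'file')
    names = listEntries(fileread(ptxFile));
    fprintf('%s\n', names{:});   % one mangled entry name per line
end
```

Any name printed this way can be passed verbatim as the FUNC argument.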
When you create the parallel.gpu.CUDAKernel, it is ambiguous whether you want to invoke the double or the float function. Therefore, you must provide the name of the entry point you want to use and construct either the double or the float version:
kDouble = parallel.gpu.CUDAKernel( 'get_nans.ptx', 'get_nans.cu', '_Z16get_nansIdEvPT_PKS0_S3_S3_PKiS5_');
kFloat = parallel.gpu.CUDAKernel( 'get_nans.ptx', 'get_nans.cu', '_Z16get_nansIfEvPT_PKS0_S3_S3_PKiS5_');
Now, this almost works, but not quite, because the parser in parallel.gpu.CUDAKernel cannot parse the templated function definition in the CU file. Therefore, we stop using this form of the CUDAKernel constructor:
KERN = parallel.gpu.CUDAKernel(PTXFILE, CUFILE, FUNC)
and use this one instead:
KERN = parallel.gpu.CUDAKernel(PTXFILE, CPROTO, FUNC)
We then end up with:
kDouble = parallel.gpu.CUDAKernel( 'get_nans.ptx', 'double* out, const int* dims', '_Z16get_nansIdEvPT_PKS0_S3_S3_PKiS5_');
kFloat = parallel.gpu.CUDAKernel( 'get_nans.ptx', 'float* out, const int* dims', '_Z16get_nansIfEvPT_PKS0_S3_S3_PKiS5_');
Does this make sense?
Best,
Narfi