mex CUDA code to calculate Coulomb field

1 view (last 30 days)
Hello, I've been trying to CUDA-ify a mex code I had written for calculating the electric field produced by M point charges whose coordinates are stored in an Mx3 array of [x,y,z]'s. The field is calculated at a set of points specified by a similar Nx3 array.
I had used nested for loops, of which the outer ran over the coordinates of the charges, and the inner - those of the points of interest.
Thus for each charge, with coordinates [xc,yc,zc] the field would be calculated at each location, such that the output of the inner loop would produce Ex, Ey, Ez - so that size(Ex) = size(Ey) = size(Ez) = Nx3.
Each field component generated by the next charge would then be added up in the outer loop, so that the total field created at each point would come out of the outer loop at the end. So something like:
if true
% code
for(i = 0; i<n_charges ;i++) {
for (l = 0; l < length_of_coord_list; l++) {
R[0] = vecr[l] - qxyz[i];
R[1] = vecr[l+length_of_coord_list] - qxyz[i+n_charges];
R[2] = vecr[l+length_of_coord_list*2] - qxyz[i+n_charges*2];
R2 = R[0]*R[0]+R[1]*R[1]+R[2]*R[2]; /* calculate distance */
R3 = pow(sqrt(R21),3); /*calculate R^3*/
Ex = Ex+Q[0]*(R[0]/R3);
Ey = Ey+Q[0]*(R[1]/R3);
Ez = Ez+Q[0]*(R[2]/R1);
vro[ i] = Ex;
vro[ l+length_of_coord_list] = Ey;
vro[l+length_of_coord_list*2] = Ez;
where "vecr" is the list of coordinates where the field is measured, "qxyz"- the coordinates of the charges, R[0], R[1] and R[2] - the x, y and z components of the vector connecting a given point to the given charge. Q[0] is the charge (a constant); "vro" is the output of the mex function.
Now it seems to me that this code would greatly benefit from running on a GPU instead of a CPU. The GPU in question is an "NVidia 650m" on a Macbook Pro Retina.
I have managed to compile and successfully run an example code taken from and thought I would build on that. However being completely new to CUDA programming I don't have a good feel for how to proceed, especially with loops. So here are the main questions I have:
1) After some searching I came across a couple of posts where the authors advise against directly programming nested loops in CUDA, and suggest to instead run the outer loop on the CPU and the inner (the one with more iterations, and hence that which will benefit more from parallelization) on the GPU. It seems that this method would also result in a more transparent code, which is a definite plus for a CUDA beginner.
2) I am trying to set it up to run the loop over the charge coordinates on the CPU and the loop over l : for (l = 0; l < length_of_coord_list; l++) on the GPU. However I'm not sure if a loop is necessary at all - from the example and some literature it appears that the iterations of the loop are "taken care of" by the blocks, threads and such (see
3) In the example code there seems to be a one-to-one correspondence between B[i] and A[i], whereas I would need to something less straightforward using the three components of the arrays representing charge positions and coordinates of the field measurement, which is simpler in an explicit loop. So I'm not sure how to handle the two matrices, of dimensions Mx3 and Nx3 respectively. For instance the indexing in
vro[ i] = Ex;
vro[ l+length_of_coord_list] = Ey;
vro[l+length_of_coord_list*2] = Ez;
is such because the Nx3 is converted to 3*Nx1 for the C code, but such convention may not play well with the CUDA kernel? Perhaps I could feed the x, y and z components of the matrices to the kernel as separate 1-D arrays and go from there? 4) I end up having to invoke gpuDevice(1) to reset the device quite often. Otherwise GPU LAUNCH ERRORS pop up. Is this normal (at least for macs with 650m's)?
Anyway, I apologize for the long-winded question. I realize that the answer must be quite simple but I can't seem to find it. Thanks in advance.
P.S. I have looked into bsxfun, arrayfun etc (who mostly choke on big matrices in my experience), but I would like to squeeze every bit of speed out of the machine, so mex+CUDA seems to be the way to go.

Answers (1)

James Lebak
James Lebak on 9 Apr 2013
Edited: James Lebak on 23 Apr 2013
I'm glad to hear that you're using the GPU MEX API. Below are my answers to your questions, which you can accept or reject as you see fit.
1-2) If I understand your problem, you can think of it in two steps: calculate the contributions of each individual charge at each point of interest, and then accumulate the field strength at each point of interest. If you have the memory available, I would be tempted to separate these steps. Do the first one on the GPU as your first CUDA project, and either do the accumulation on the CPU or use MATLAB's sum function on the GPU for the second step. After you're confident that you understand the code to compute the contributions, then you can try integrating the parallel accumulation into your MEX file. There are good references on how to do this type of operation in CUDA but it's not trivial to get the best performance. If you can make MATLAB's sum function, which is overloaded for the GPU, do the work for you, it might make your life easier.
My advice for parallelizing the first step would be to use your MEX file to create an M by N by 3 matrix A, in which the elements (i,j,:) represent the length-3 vector describing the field contribution of charge i at coordinate j. You could parallelize over a total of MN threads and have each thread calculate the three values A(i,j,:). Then I think B=squeeze(sum(A, 1)) would do the second step, producing an N by 3 matrix where the elements in row j are the Ex, Ey, Ez values calculated by your original code.
3) For the purposes of getting coalesced memory accesses, it is likely to be better to have the individual components of the matrices in different 1-D arrays as you suggest.
4) Most importantly, I want to confirm that when developing MEX files in R2013a, you can get a launch failure after a recompile or a clear mex, unless you reset the device first. This is a bug in MATLAB R2013a, and the bug report is here:
Finally, when I ran GPUBench on the Retina MacBook Pro back in August, I observed much better performance from the GPU for single precision than for double. I know you square values for the distance calculation, so you may not be able to use single precision in this example, but if you can you will probably achieve better performance.
James Lebak
James Lebak on 11 Apr 2013
Looking at the last four lines of your Mex code, are you returning the pointer to the correct matrix? I would move the cudaMemcpy to the end and copy into data3 rather than data3f.
You should not need to use nvmex.

Sign in to comment.

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by