Reset GPU & Clear its Memory

I'm running simulations and computations in MATLAB using some reasonably big data sets, and the bulk of the work is done on the GPU. I can only get through about a third of the work I need to do before I receive an error saying the GPU memory is full:
Warning: An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_OUT_OF_MEMORY
I've had this problem for a while, and have tried to get around it by resetting the GPU between each simulation, using any and all of the following:
gpuDevice;
gpuDevice(1);
reset(gpuDevice(1));
wait(gpuDevice(1));
None of these work, either on their own or combined, nor do they work if I attempt them after my simulations have crashed out. There seems to be no effective way to reset/flush the GPU other than a reboot of my computer.
I'm getting work done this way, but it's slow, and annoying, and means I can't just leave my code running over the weekend as I'd like to - only half of it gets done. I'm sure there must be a way to reset the GPU in MATLAB, and if one of the methods I've tried is correct, what am I doing wrong?
Any ideas?
EDIT: Problem occurs on both R2016a and the R2017a Prerelease.

4 Comments

Matt J on 19 Jan 2017
Edited: Matt J on 19 Jan 2017
I've been having the same problem with resetting the GPU within MATLAB. Hope there's an easy solution to that.
However, judging by the error messages, it seems like the more pertinent problem is that you are consuming too much GPU memory.
Joss Knight on 22 Jan 2017
What GPU do you have? GeForce cards and mobile chips that are also driving the display do not behave well when the allocator runs out of memory, sometimes preventing any further kernels from being launched. If you can create a reproduction example then it would make it possible to investigate this.
gpuDevice(1) and reset(gpuDevice) do the same thing: they call cudaDeviceReset. If the device does not recover, then I suspect there's not much else you can do, because the CUDA driver needs to be reloaded. Again, if you could provide a reproduction then we can take a look to see whether MATLAB can recover better.
On a card running in TCC mode and not driving the display you would typically not expect to get this behaviour. It seems to be an issue with memory corruption because the address space is not correctly divided between compute operations and graphics.
Dan Johnson on 23 Jan 2017
Edited: Dan Johnson on 23 Jan 2017
Thanks for the comments. I'm running a GeForce GTX 960.
I'd love to provide you with an example, but short of copying out my entire codebase I'm not sure what I could post that would be helpful. Here's the code I execute for each data run (I've renamed the functions for clarity):
for m = 1:8
    inputVars = CreateVars();
    SimulateData(inputVars);
    for n = 1:50
        [outputVars] = RunReconstruction(inputVars);
        save([savePath(m,n)],'outputVars');
    end
    close all; clear;
end
NOTE: 1. RunReconstruction() gathers the "outputVars" before passing them back. 2. I typically get to m=4 before I get the CUDA error.
Joss Knight on 20 Jul 2017
I think you're going to have to try to create a minimal reproduction that is a condensed version of your code, otherwise it's impossible to diagnose. Also see below for advice about monitoring your memory usage.


Answers (2)

Joss Knight on 23 Jan 2017

1 vote

Presumably your simulations are adding results continually to some output variables, which are getting larger and larger. Try gathering your results back to the CPU so that you're not clogging up GPU memory with data that isn't being used for computation any more.
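As a minimal sketch of the suggestion above, results can be gathered into a preallocated host array on each iteration, so that only the current working data stays on the GPU. The loop and the per-run computation here are hypothetical stand-ins, not the asker's code:

```matlab
% Hypothetical stand-in for a series of simulation runs.
numRuns = 5;
results = zeros(numRuns, 3);           % preallocated in host (CPU) memory

for k = 1:numRuns
    x = gpuArray(rand(1000, 3));       % working data lives on the GPU
    colMeans = mean(x, 1);             % computed on the GPU, still a gpuArray
    results(k, :) = gather(colMeans);  % copy the small result back to the CPU
end

disp(class(results))                   % the accumulated output is a host array
```

Because x and colMeans are overwritten on every pass, the previous iteration's GPU data is no longer referenced once the next pass begins.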

3 Comments

Dan Johnson on 23 Jan 2017
I'm mid-run now, so I'll have a look when that's finished, thank you. Could you clarify for me: you say that it's likely my output variables are getting larger and larger, but I see no reason for that to be the case.
Is the following scenario possible? Say I were to create a variable (call it myVariable) on the GPU, within a simulation function (MySimFunction()), and then make 50 calls to MySimFunction() as part of my data collection. Would MATLAB create 50 separate myVariables in GPU memory, and leave them there until all 50 iterations of MySimFunction() had completed?
My understanding was that it would not, and that the memory for the previous instance would be free for the next, but if that is possible, I think I have my answer.
No, MATLAB releases variables as soon as they are no longer referenced. But it's common for users to run scripts rather than functions, and to aggregate results into a big output array that sits in their MATLAB workspace, e.g.
results(end+1,:) = myNewResults;
Why don't you run your simulation and monitor GPU memory in a separate terminal or command window using nvidia-smi, something like:
nvidia-smi -l 1 -q -d MEMORY
If memory usage is continually going up then you've got some sort of problem with your simulation not releasing variables.
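GPU memory can also be logged from inside MATLAB between runs. The following is a rough sketch, assuming the Parallel Computing Toolbox gpuDevice object and its AvailableMemory and TotalMemory properties; the workload inside the loop is a hypothetical placeholder for one simulation:

```matlab
% Query the currently selected device. Calling gpuDevice with no
% arguments does not reset it (unlike gpuDevice(1)).
dev = gpuDevice;

numRuns = 3;
freeBytes = zeros(numRuns, 1);

for k = 1:numRuns
    % Hypothetical placeholder workload standing in for one simulation.
    A = gpuArray(rand(2000));
    B = A * A;
    clear A B

    wait(dev);                           % block until queued GPU work is done
    freeBytes(k) = dev.AvailableMemory;  % bytes currently available for data
    fprintf('Run %d: %.0f MB free of %.0f MB\n', ...
        k, freeBytes(k) / 2^20, dev.TotalMemory / 2^20);
end
```

If the reported free memory keeps falling from run to run, that suggests something is still holding references to gpuArray data.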
I have the same problem with clearing GPU memory. After executing this code, 2 GB of GPU memory is in use, even though only the D matrix remains in GPU memory:
A=fix(gpuArray(rand(1,1000))*99)+1;
B=fix(gpuArray(rand(1,1000))*99)+1;
C=gpuArray(rand(100000,100));
E=C(:,A);
F=C(:,B);
D=E.*F;
clear E F C A B
However, if I execute this code instead:
D=gpuArray(rand(100000,1000));
There will also be a D matrix (same size) in GPU memory, but now it only uses 1 GB of GPU memory. Why is there a difference, and how can I clear the memory in the first variant?


Remi D on 19 Jul 2017

0 votes

I also think there is a problem. As soon as I call a CUDA MEX file, running reset(gpuDevice) throws an error:
Error using parallel.gpu.CUDADevice/reset
An unexpected error occurred during CUDA execution. The CUDA error was:
all CUDA-capable devices are busy or unavailable
If I don't try to call reset, I can call the MEX function again and it works fine. But as soon as I use reset, the only way to use the GPU is to restart MATLAB.
I guess I have to go back to C and leave MATLAB in the drawer when I need parallel computing :(

1 Comment

Joss Knight on 20 Jul 2017
Edited: Joss Knight on 20 Jul 2017
If you are using custom MEX functions then we'd have to know more about what they're doing to diagnose. Are you storing state, GPU memory, cufft plans? Are you spinning off threads that are using the GPU? You may need to register a listener to the GPUDeviceManager's DeviceDeselecting event (see the documentation) in order to respond to a call to reset by tidying up your state or waiting for threads to finish.
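A rough sketch of registering such a listener from MATLAB, assuming the manager is reached through parallel.gpu.GPUDeviceManager.instance as described in the Parallel Computing Toolbox documentation. The cleanup action is a hypothetical placeholder, and myCudaMex is an invented name:

```matlab
% Access the GPU device manager (Parallel Computing Toolbox).
mgr = parallel.gpu.GPUDeviceManager.instance;

% Hypothetical cleanup: replace the disp call with whatever releases your
% MEX function's GPU state (for example, "clear myCudaMex" to unload it).
onDeselect = @(src, evt) disp('GPU device is being deselected; tidy up here.');

% Keep the listener handle alive for as long as the callback is needed.
lh = addlistener(mgr, 'DeviceDeselecting', onDeselect);

% Later, when the listener is no longer needed:
% delete(lh);
```

With the listener in place, the callback should run before the device is deselected, which gives the MEX code a chance to release its resources first.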
Another very common scenario is that your custom MEX function is erroring, perhaps seriously, and you are not checking or clearing up that error. If the next thing you do on the GPU is to call reset, then that will be the first place to detect and report the error. So ensure your MEX function ends with something like
cudaDeviceSynchronize();
auto err = cudaGetLastError();
if (err != cudaSuccess) {
    mexPrintf("CUDA error: %s\n", cudaGetErrorString(err));
}



Commented: 29 Oct 2020
