CUDA_ERROR_LAUNCH_FAILED when training large networks

I have trained networks (trainNetwork()) on my GPU with MATLAB R2018b for over a year without any issues.
Since upgrading to MATLAB R2020b, I have only been able to train small networks. The same script that ran flawlessly in R2018b with an arbitrarily large number of units (e.g., n = 2000) works in R2020b up to n = 50 and crashes for n > 100.
The reported error is typically:
Warning: An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_LAUNCH_FAILED
Error using trainNetwork (line 183)
Unexpected error calling cuDNN: CUDNN_STATUS_EXECUTION_FAILED.
Error in RNNprediction (line 170)
net = trainNetwork({traind.x}, {traind.y}, layers, options);
The crash happens between the 2nd and 5th training iteration. When it does, I have to restart MATLAB before I can train anything at all, because reset(gpuDevice) also fails and returns:
Warning: An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_LAUNCH_FAILED
Error using parallel.gpu.CUDADevice/reset
An unexpected error occurred during CUDA execution. The CUDA error was:
all CUDA-capable devices are busy or unavailable
Training of the same network runs smoothly on CPU (although very slowly).
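Since CPU training works, one possible stopgap is to retry on the CPU whenever GPU training throws an error. The following is only a minimal sketch: the function name trainWithCpuFallback, its arguments, and the 'adam' solver are illustrative assumptions, not taken from my script. Whether the CPU retry succeeds in the same session after a CUDA failure has not been verified.

```matlab
% Hypothetical helper (save as trainWithCpuFallback.m): try the GPU first,
% then retry on the CPU if trainNetwork throws an error.
% XTrain, YTrain and layers stand in for the data and network being trained.
function net = trainWithCpuFallback(XTrain, YTrain, layers)
    % The 'adam' solver is an assumption; use the solver from your own options.
    gpuOpts = trainingOptions('adam', 'ExecutionEnvironment', 'gpu');
    cpuOpts = trainingOptions('adam', 'ExecutionEnvironment', 'cpu');
    try
        net = trainNetwork(XTrain, YTrain, layers, gpuOpts);
    catch err
        % Report the GPU failure, then fall back to the (slower) CPU path.
        warning('GPU training failed (%s); retrying on the CPU.', err.message);
        net = trainNetwork(XTrain, YTrain, layers, cpuOpts);
    end
end
```

With the call from the error trace above, this would be used as net = trainWithCpuFallback({traind.x}, {traind.y}, layers);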
NOTE: I have already increased the WDDM TDR delay to 60, but nothing has changed. I have also tried disabling TDR altogether, with no success.
Here are some CUDA properties:
>> gpuDevice
ans =
CUDADevice with properties:
Name: 'GeForce RTX 2070'
Index: 1
ComputeCapability: '7.5'
SupportsDouble: 1
DriverVersion: 10.2000
ToolkitVersion: 10.2000
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 8.5899e+09
MultiprocessorCount: 36
ClockRateKHz: 1620000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1

2 Comments

Hi Alessandro! What GPU do you have?
AS on 14 Dec 2020
Hi Andrea, I have an NVIDIA GeForce RTX 2070.


Accepted Answer

AS on 14 Dec 2020
Edited: AS on 14 Dec 2020

1 vote

This issue seems to be specific to training recurrent neural networks. Following https://www.mathworks.com/matlabcentral/answers/485733-cuda-crashes-when-training-lstm-on-geforce-rtx-2080-super, I fixed it by installing R2020a with CUDA toolkit 10.1 and NVIDIA Studio Driver version 431.86 WHQL (https://www.nvidia.com/Download/driverR ... 1050/en-us).
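To confirm which release, driver, and toolkit MATLAB actually sees after reinstalling, a quick check along these lines may help. It is a sketch that assumes Parallel Computing Toolbox is installed and the GPU can be selected; the printed values depend on the local setup.

```matlab
% Print the MATLAB release and the CUDA driver/toolkit versions that
% MATLAB reports for the currently selected GPU.
g = gpuDevice;   % errors if no usable GPU is available
fprintf('MATLAB release:       R%s\n', version('-release'));
fprintf('GPU:                  %s\n', g.Name);
fprintf('CUDA driver version:  %g\n', g.DriverVersion);
fprintf('CUDA toolkit version: %g\n', g.ToolkitVersion);
```

DriverVersion and ToolkitVersion here are the same properties shown in the gpuDevice output in the question.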

More Answers (0)
