CUDA_ERROR_LAUNCH_FAILED when training large networks

Question


I have trained networks (trainNetwork()) on my GPU with MATLAB R2018b for over a year without any issues.

Since I upgraded to MATLAB R2020b, I have only been able to train small networks. The same script that ran flawlessly in R2018b with an arbitrarily large number of units (e.g., n = 2000) works in R2020b only up to about n = 50, and crashes for n > 100.

The reported error is typically:

Warning: An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_LAUNCH_FAILED
        
Error using trainNetwork (line 183)
Unexpected error calling cuDNN: CUDNN_STATUS_EXECUTION_FAILED.
        
Error in RNNprediction (line 170)
net = trainNetwork({traind.x}, {traind.y}, layers, options);

The crash happens between the 2nd and 5th training iteration. After a crash I have to restart MATLAB before I can do any training at all, because reset(gpuDevice) also fails and returns:

Warning: An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_LAUNCH_FAILED 
Error using parallel.gpu.CUDADevice/reset
An unexpected error occurred during CUDA execution. The CUDA error was:
all CUDA-capable devices are busy or unavailable

Training of the same network runs smoothly on CPU (although very slowly).
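
The GPU-versus-CPU comparison can be sketched as below. This is not the original RNNprediction script: the LSTM layer, the Adam solver, the random data, and all sizes are assumptions chosen only to give a self-contained test in which n can be varied.

```matlab
% Minimal sketch, NOT the original RNNprediction script.
% Assumptions: LSTM layer, Adam solver, random data, hypothetical sizes.
n            = 200;   % number of hidden units to test
numFeatures  = 3;
numResponses = 1;
numSteps     = 100;

% One random sequence for sequence-to-sequence regression
X = {randn(numFeatures, numSteps)};
Y = {randn(numResponses, numSteps)};

layers = [
    sequenceInputLayer(numFeatures)
    lstmLayer(n)
    fullyConnectedLayer(numResponses)
    regressionLayer];

% Identical options except for the execution environment
optsGPU = trainingOptions('adam', 'MaxEpochs', 5, 'Verbose', false, ...
    'ExecutionEnvironment', 'gpu');
optsCPU = trainingOptions('adam', 'MaxEpochs', 5, 'Verbose', false, ...
    'ExecutionEnvironment', 'cpu');

try
    net = trainNetwork(X, Y, layers, optsGPU);
    usedEnvironment = 'gpu';
catch err
    % Report the GPU failure, then repeat the same training on the CPU
    fprintf('GPU training failed: %s\n', err.message);
    net = trainNetwork(X, Y, layers, optsCPU);
    usedEnvironment = 'cpu';
end
```

Increasing n step by step in a script like this may help narrow down the size at which GPU training starts to fail, independently of the original data.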

NOTE: I have already increased the WDDM TDR Delay to 60, but nothing has changed. I have also tried disabling TDR altogether, with no success.
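
One way to confirm that the TDR settings are actually in place is to read them back from the registry inside MATLAB. The sketch below assumes the values live under the standard GraphicsDrivers key as TdrDelay and TdrLevel; the variable names are illustrative only.

```matlab
% Sketch: read the TDR settings back from the Windows registry.
% Assumption: the values were set under the standard GraphicsDrivers key.
rootKey    = 'HKEY_LOCAL_MACHINE';
subKey     = 'SYSTEM\CurrentControlSet\Control\GraphicsDrivers';
valueNames = {'TdrDelay', 'TdrLevel'};
report     = cell(size(valueNames));

for k = 1:numel(valueNames)
    try
        value = winqueryreg(rootKey, subKey, valueNames{k});
        report{k} = sprintf('%s = %d', valueNames{k}, value);
    catch
        % winqueryreg errors if the value is missing or the platform is not Windows
        report{k} = sprintf('%s could not be read', valueNames{k});
    end
    disp(report{k});
end
```

If a value cannot be read, the setting may have been written to a different key or not saved at all.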

Here are some CUDA properties:

>> gpuDevice
ans = 
  CUDADevice with properties:
                      Name: 'GeForce RTX 2070'
                     Index: 1
         ComputeCapability: '7.5'
            SupportsDouble: 1
             DriverVersion: 10.2000
            ToolkitVersion: 10.2000
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 8.5899e+09
       MultiprocessorCount: 36
              ClockRateKHz: 1620000
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 1
          CanMapHostMemory: 1
           DeviceSupported: 1
            DeviceSelected: 1