MATLAB R2024b GPU validation device fail for Multi-Instance GPU (MIG) A100
We are currently installing MATLAB R2024b on our HPC cluster. The installation works beautifully across all of our GPUs except an A100 that uses NVIDIA's Multi-Instance GPU (MIG). When I launch a CLI session using
matlab -nodesktop -nodisplay -nosoftwareopengl
and run "validateGPU", I receive the following error: "Encountered error when calling NVML. The NVML error was: Invalid Argument."
The same sequence does not produce an error when run on one of our other A100 GPUs with the same driver and CUDA version. In MATLAB R2023b we do not receive this error with our MIG GPU, and it is able to run GPU code successfully. Could someone please let me know whether MATLAB R2024b is able to run on A100 GPUs with MIG and, if it can, what the issue might be?
For completeness, here is the full output:
nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12              Driver Version: 550.90.12      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  |   00000000:21:00.0 Off |                   On |
| N/A   29C    P0             32W /  250W |      75MiB /  40960MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          On  |   00000000:81:00.0 Off |                   On |
| N/A   28C    P0             33W /  250W |      75MiB /  40960MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-PCIE-40GB          On  |   00000000:E2:00.0 Off |                   On |
| N/A   28C    P0             34W /  250W |      75MiB /  40960MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC  DEC  OFA  JPG |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  0    2   0   0  |                 38MiB / 19968MiB | 42      0 |  3   0    2    0    0 |
|                  |                  0MiB / 32767MiB |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
Launching a session and attempting to validate the GPU:
matlab -nodesktop -nodisplay -nosoftwareopengl
< M A T L A B (R) >
Copyright 1984-2024 The MathWorks, Inc.
R2024b Update 2 (24.2.0.2773142) 64-bit (glnxa64)
October 22, 2024
To get started, type doc.
For product information, visit www.mathworks.com.
Warning: OpenGL Startup options will be removed in a future release.
>> validateGPU
# Beginning GPU validation
# Performing system validation
# CUDA-supported platform .................................................PASSED
# CUDA-enabled graphics driver exists .....................................PASSED
# Version: 550.90.12
# CUDA-enabled graphics driver load .......................................PASSED
# CUDA environment variables ..............................................PASSED
# CUDA_VISIBLE_DEVICES: "0"
# CUDA device count .......................................................PASSED
# Found 1 devices.
# GPU libraries load ......................................................PASSED
#
# Performing device validation for device index 1
# Device exists ...........................................................FAILED
# Encountered error when calling NVML. The NVML error was:
# Invalid Argument.
#
# Device supported ........................................................SKIPPED
# Device available ........................................................SKIPPED
# Device selectable .......................................................SKIPPED
# Device memory allocation ................................................SKIPPED
# Device kernel launch ....................................................SKIPPED
# Finished GPU validation with 1 failures.
Output using "coder.checkGpuInstall":
>> gpuEnvObj = coder.gpuEnvConfig;
>> gpuEnvObj.GpuId = 0;
>> gpuEnvObj.BasicCodegen = 1;
>> gpuEnvObj.BasicCodeexec = 1;
>> results = coder.checkGpuInstall(gpuEnvObj)
Compatible GPU : FAILED (There is a problem with the graphics driver or with this GPU device. Code execution will not be available. Check that you have a supported GPU and the latest graphics driver.)
CUDA Environment : FAILED (Unable to execute the nvcc command. Check your CUDA Toolkit installation.)
Runtime : PASSED
cuFFT : PASSED
cuSOLVER : PASSED
cuBLAS : PASSED
Host Compiler : PASSED
results =
struct with fields:
gpu: 0
cuda: 0
cudnn: 0
tensorrt: 0
hostcompiler: 1
basiccodegen: 0
basiccodeexec: 0
deepcodegen: 0
tensorrtdatatype: 0
deepcodeexec: 0
Answers (1)
Joss Knight
on 13 Jan 2025
0 votes
Try running nvidia-smi -L in a terminal to get the UUID of the device, and then set CUDA_VISIBLE_DEVICES to that full UUID instead of the device index, following the advice in the Knowledge Article here. I'm not sure device index works properly with MIG in CUDA 12.
Do you have one A100 divided into 3 or 3 A100s with one in MIG mode? If the latter I think something is wrong, your driver should not be able to see anything but the MIG device.
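The UUID suggestion above can be sketched in shell. This is only a minimal sketch under stated assumptions, not a MathWorks-provided script: `first_mig_uuid` is a hypothetical helper name, and it assumes MIG instances show up in the `nvidia-smi -L` listing with a UUID that begins with `MIG-`. It picks the first such UUID and exports it as CUDA_VISIBLE_DEVICES before MATLAB is launched from the same shell.

```shell
#!/usr/bin/env bash
# Sketch: expose a single MIG instance to CUDA applications by UUID
# instead of by device index.

# first_mig_uuid (hypothetical helper): reads `nvidia-smi -L` text on stdin
# and prints the first token that starts with "MIG-".
first_mig_uuid() {
  grep -o 'MIG-[0-9A-Za-z/-]*' | head -n 1
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  uuid="$(nvidia-smi -L | first_mig_uuid)"
  if [ -n "$uuid" ]; then
    # Assumption: the first MIG instance is the one you want to use.
    export CUDA_VISIBLE_DEVICES="$uuid"
    echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
  fi
fi
```

If the machine exposes several MIG instances, replace `head -n 1` with whatever selection rule fits your scheduler, then start MATLAB from the same shell so it inherits the variable.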
14 comments
Brandon
on 16 Jan 2025
Brandon
on 16 Jan 2025
Joss Knight
on 16 Jan 2025
When my A100 was in MIG-mode, I wasn't able to select a second device, but perhaps this behaviour has improved in CUDA 12. I'll look into it.
In the meantime, ignoring the output of validateGPU, have a look at whether you can select the device by calling gpuDevice(1), gpuDevice(2), etc. I'm concerned that we might have an internal issue with the way that NVML expects device indices to be specified, in the sense of not counting from zero to (number of visible devices) but instead going by PCI slot index. We have recently found a separate bug that is potentially related.
Brandon
on 16 Jan 2025
Joss Knight
on 16 Jan 2025
Never mind, I can reproduce this. There is clearly a bug introduced in R2024a. Thanks for reporting it.
I am just trying to determine whether this bug is limited to multi-GPU machines. I note also that if you set CUDA_VISIBLE_DEVICES to your non-MIG devices, they can be selected, but the MIG instances cannot.
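For anyone who wants to try the non-MIG observation above, here is a rough shell sketch. It assumes that `nvidia-smi --query-gpu=index,uuid,mig.mode.current --format=csv,noheader` prints one `index, uuid, mode` line per GPU with the mode reported as `Enabled` for MIG GPUs; `non_mig_uuids` is a hypothetical helper name, not part of any NVIDIA or MathWorks tool.

```shell
#!/usr/bin/env bash
# Sketch: build a CUDA_VISIBLE_DEVICES value that contains only GPUs whose
# MIG mode is not enabled.

# non_mig_uuids (hypothetical helper): reads "index, uuid, mig_mode" CSV
# lines on stdin and prints a comma-separated list of the UUIDs whose
# MIG mode is anything other than "Enabled".
non_mig_uuids() {
  awk -F', *' '$3 != "Enabled" { printf "%s%s", sep, $2; sep = "," } END { print "" }'
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  devices="$(nvidia-smi --query-gpu=index,uuid,mig.mode.current \
               --format=csv,noheader | non_mig_uuids)"
  if [ -n "$devices" ]; then
    export CUDA_VISIBLE_DEVICES="$devices"
    echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
  fi
fi
```

This only sidesteps the problem by hiding the MIG GPUs from MATLAB; it does not make the MIG instances selectable.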
Brandon
on 16 Jan 2025
Walter Roberson
on 16 Jan 2025
Sometimes fixes like this are made available in Updates, but other times fixes need to wait for the next release.
Joss Knight
on 16 Jan 2025
We'll see. We might be able to find a workaround that allows you to continue using MIG without waiting for a new release. If we deem it to be a bug that multiple users are likely to encounter, we'll publish it and any workarounds in our bug reporting system and you can track its progress. I will also update you here.
Brandon
on 16 Jan 2025
Joss Knight
on 14 Feb 2025
It seems that the workaround for this is to avoid selecting the GPU device. Creating and using gpuArrays works, but querying device properties does not, so gpuDevice, gpuDeviceTable, validateGPU, and canUseGPU will all error.
Can you try this and see if this solves your problem?
Chia-Hao Lee
on 10 Jun 2025
Hi, I came across this post and just want to echo this issue.
Our HPC has a similar setup, with A100 nodes split into MIG devices. Our MATLAB 2021a (with CUDA 11.8) works fine with these A100 MIGs, but MATLAB 2024a (CUDA 12.2) has issues recognizing gpuDevice and returns:
> Error using gpuDevice (line 26)
> Encountered error when calling NVML. The NVML error was:
> Invalid Argument.
gpuDeviceCount() did return 1, but parallel.internal.gpu.selectDevice([]) would return [].
A simple computation such as
try
    a = gpuArray(rand(10));
    b = a + a;
    result = gather(b(1,1));
    fprintf('GPU computation works! Result: %f\n', result);
catch ME
    warning('GPU computation failed: %s', ME.message);
end
did work, but it is nearly impossible to convert an existing package to drop all usage of gpuDevice.
I'd be very interested to hear whether there is a solution or a bug-fix timeline for this.
Joss Knight
on 10 Jun 2025
Edited: Joss Knight
on 10 Jun 2025
Hi! This bug is fixed in R2025a. See the bug report
Brandon
on 12 Jun 2025