MATLAB R2024b GPU device validation fails for Multi-Instance GPU (MIG) A100

We are currently installing MATLAB R2024b on our HPC cluster. The installation works beautifully across all of our GPUs except an A100 that uses NVIDIA's Multi-Instance GPU (MIG). When I launch a CLI session using
matlab -nodesktop -nodisplay -nosoftwareopengl
and run "validateGPU", I receive the following error: "Encountered error when calling NVML. The NVML error was: Invalid Argument."
The same sequence does not produce an error when run on one of our other A100 GPUs with the same driver and CUDA version. With MATLAB R2023b we do not receive this error on our MIG GPU, and it runs GPU code successfully. Could someone please let me know whether MATLAB R2024b can run on A100 GPUs with MIG and, if it can, what the issue might be?
For completeness, here is the full output:
nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12              Driver Version: 550.90.12      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  |   00000000:21:00.0 Off |                   On |
| N/A   29C    P0             32W /  250W |      75MiB /  40960MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          On  |   00000000:81:00.0 Off |                   On |
| N/A   28C    P0             33W /  250W |      75MiB /  40960MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-PCIE-40GB          On  |   00000000:E2:00.0 Off |                   On |
| N/A   28C    P0             34W /  250W |      75MiB /  40960MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC  DEC  OFA  JPG |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  0    2   0   0  |              38MiB / 19968MiB    | 42      0 |  3   0    2    0    0 |
|                  |               0MiB / 32767MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
Launching a session and attempting to validate the GPU:
matlab -nodesktop -nodisplay -nosoftwareopengl
< M A T L A B (R) >
Copyright 1984-2024 The MathWorks, Inc.
R2024b Update 2 (24.2.0.2773142) 64-bit (glnxa64)
October 22, 2024
To get started, type doc.
For product information, visit www.mathworks.com.
Warning: OpenGL Startup options will be removed in a future release.
>> validateGPU
# Beginning GPU validation
# Performing system validation
# CUDA-supported platform .................................................PASSED
# CUDA-enabled graphics driver exists .....................................PASSED
# Version: 550.90.12
# CUDA-enabled graphics driver load .......................................PASSED
# CUDA environment variables ..............................................PASSED
# CUDA_VISIBLE_DEVICES: "0"
# CUDA device count .......................................................PASSED
# Found 1 devices.
# GPU libraries load ......................................................PASSED
#
# Performing device validation for device index 1
# Device exists ...........................................................FAILED
# Encountered error when calling NVML. The NVML error was:
# Invalid Argument.
#
# Device supported ........................................................SKIPPED
# Device available ........................................................SKIPPED
# Device selectable .......................................................SKIPPED
# Device memory allocation ................................................SKIPPED
# Device kernel launch ....................................................SKIPPED
# Finished GPU validation with 1 failures.
Output using "coder.checkGpuInstall":
>> gpuEnvObj = coder.gpuEnvConfig;
>> gpuEnvObj.GpuId = 0;
>> gpuEnvObj.BasicCodegen = 1;
>> gpuEnvObj.BasicCodeexec = 1;
>> results = coder.checkGpuInstall(gpuEnvObj)
Compatible GPU : FAILED (There is a problem with the graphics driver or with this GPU device. Code execution will not be available. Check that you have a supported GPU and the latest graphics driver.)
CUDA Environment : FAILED (Unable to execute the nvcc command. Check your CUDA Toolkit installation.)
Runtime : PASSED
cuFFT : PASSED
cuSOLVER : PASSED
cuBLAS : PASSED
Host Compiler : PASSED
results =
struct with fields:
gpu: 0
cuda: 0
cudnn: 0
tensorrt: 0
hostcompiler: 1
basiccodegen: 0
basiccodeexec: 0
deepcodegen: 0
tensorrtdatatype: 0
deepcodeexec: 0

Answers (1)

Try running nvidia-smi -L in a terminal to get the UUID of the device, and then set CUDA_VISIBLE_DEVICES to that full UUID instead of the device index, following the advice in the Knowledge Article here. I'm not sure device index works properly with MIG in CUDA 12.
Do you have one A100 divided into three instances, or three A100s with one in MIG mode? If the latter, I think something is wrong: your driver should not be able to see anything but the MIG device.
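The UUID lookup described above could be sketched in MATLAB roughly as follows. This is only an illustration: the helper name `listMigUuids` is hypothetical, and it assumes `nvidia-smi` is on the PATH and that `nvidia-smi -L` reports MIG instances with a `UUID: MIG-...` field. It prints a suggested `export` line rather than setting the variable itself, because the safest place to set `CUDA_VISIBLE_DEVICES` is in the shell before MATLAB is launched.

```matlab
function uuids = listMigUuids()
% listMigUuids  Hypothetical helper: collect MIG instance UUIDs from "nvidia-smi -L".
% Assumes nvidia-smi is on the PATH and prints entries of the form "UUID: MIG-...".
    [status, out] = system('nvidia-smi -L');
    if status ~= 0
        error('listMigUuids:nvidiaSmiFailed', 'nvidia-smi -L failed: %s', out);
    end

    % Each match is returned as a one-element cell holding the captured UUID.
    tokens = regexp(out, 'UUID:\s*(MIG-[^\s)]+)', 'tokens');
    uuids = cellfun(@(t) t{1}, tokens, 'UniformOutput', false);

    % Print one suggested shell line per MIG instance found.
    for k = 1:numel(uuids)
        fprintf('export CUDA_VISIBLE_DEVICES=%s\n', uuids{k});
    end
end
```

Run the printed line in the shell, then start MATLAB from that same shell so the setting is inherited.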

14 comments

Thank you very much for the reply. I attempted that solution and I still receive a failure from the "validateGPU" command. I tried it with both the UUID for the GPU and the UUID for the MIG instance. In the "validateGPU" output, I see that it is recognizing the "CUDA_VISIBLE_DEVICES" value I provided.
We have 3 A100s per node, and each of these A100s is split into two 3g.20gb MIG instances. I can look into the driver output and see whether our configuration is not quite right and is causing issues. Thank you for that suggestion. It is still strange, though, that MATLAB R2023b works as expected. Do you have any suggestions as to why the older MATLAB version works?
For some context, in the example I provided above I am only requesting 1 MIG instance, which is why only one is visible. I am running this as a regular user rather than root, so I should only see the MIG instance I asked for. From what I can tell from the NVIDIA documentation, the nvidia-smi output I provided is expected for our setup. If I am mistaken, please let me know. Thank you for your time!
When my A100 was in MIG-mode, I wasn't able to select a second device, but perhaps this behaviour has improved in CUDA 12. I'll look into it.
In the meantime, ignoring the output of validateGPU, have a look at whether you can select the device by calling gpuDevice(1), gpuDevice(2), etc. I'm concerned that we might have an internal issue with the way that NVML expects device indices to be specified, in the sense of not counting from zero to (number of visible devices) but instead going by PCI slot index. We have recently found a separate bug that is potentially related.
Joss, thank you for the quick response and your insight into the potential issue, I truly appreciate it.
Here is the output I obtain when using "gpuDevice":
>> gpuDevice(1)
Error using gpuDevice (line 26)
Encountered error when calling NVML. The NVML error was:
Invalid Argument.
>> gpuDevice(2)
Error using gpuDevice (line 26)
Invalid CUDA device id: 2. Select a device id from the range 1:1.
Never mind, I can reproduce this. There is clearly a bug introduced in R2024a. Thanks for reporting it.
I am just trying to determine whether this bug is limited to multi-GPU machines. I note also that if you set CUDA_VISIBLE_DEVICES to your non-MIG devices, they can be selected, but the MIG instances cannot.
Thank you for confirming that there was a bug introduced in R2024a. This is one of the first bugs I've encountered with MATLAB. Out of curiosity, what is the process for resolving these? Would we have to wait until the next version release of MATLAB or would a new version of R2024a/R2024b be released? I only ask so that I can inform my users of a potential timeline.
Yes, you are correct that a GPU without MIG is selected correctly.
Sometimes fixes like this are made available in Updates, but other times fixes need to wait for the next release.
Brandon on 16 Jan 2025
Edited: Brandon on 16 Jan 2025
That makes sense, thank you for your input!
We'll see. We might be able to find a workaround that allows you to continue using MIG without waiting for a new release. If we deem it to be a bug that multiple users are likely to encounter, we'll publish it and any workarounds in our bug reporting system and you can track its progress. I will also update you here.
Fantastic, thank you for the additional information!
It seems that the workaround for this is to avoid selecting the GPU device. Creating and using gpuArrays works but querying device properties does not, so gpuDevice, gpuDeviceTable, validateGPU, and canUseGPU will all error.
Can you try this and see if this solves your problem?
Hi, I came across this post and just want to echo this issue.
Our HPC has a similar setup, with A100 nodes split into MIG devices. Our MATLAB R2021a (with CUDA 11.8) works fine with these A100 MIGs, but MATLAB R2024a (CUDA 12.2) has trouble with gpuDevice and returns:
> Error using gpuDevice (line 26)
> Encountered error when calling NVML. The NVML error was:
> Invalid Argument.
gpuDeviceCount() did return 1, but parallel.internal.gpu.selectDevice([]) would return [].
While some simple computation like
try
    a = gpuArray(rand(10));
    b = a + a;
    result = gather(b(1,1));
    fprintf('GPU computation works! Result: %f\n', result);
catch ME
    warning('GPU computation failed: %s', ME.message);
end
did work, but it is nearly impossible to convert an existing package to drop all usage of gpuDevice.
I'd be very interested to hear of any solution or bug-fix timeline for this.
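For a package that calls gpuDevice in many places, one possible stopgap on affected releases is to route those calls through a small wrapper instead of removing them. The sketch below has not been tested on MIG hardware; `safeGpuDevice` is a hypothetical name, not a MathWorks function, and it assumes callers can tolerate an empty return value when device properties cannot be queried.

```matlab
function dev = safeGpuDevice(varargin)
% safeGpuDevice  Hypothetical wrapper around gpuDevice (not a MathWorks API).
% Returns the GPUDevice object when the query succeeds, or [] when gpuDevice
% throws, e.g. with the NVML "Invalid Argument" error reported in this thread.
    try
        dev = gpuDevice(varargin{:});
    catch err
        warning('safeGpuDevice:queryFailed', ...
            'gpuDevice failed (%s); continuing without device properties.', ...
            err.message);
        dev = [];
    end
end

% Example use in calling code (assumes the wrapper above is on the path):
%   dev = safeGpuDevice();
%   if ~isempty(dev)
%       fprintf('Using GPU: %s\n', dev.Name);
%   end
%   x = gpuArray(rand(10));   % gpuArray work is reported above to still run
```

Returning an empty value keeps the decision about what to do without device properties in the calling code, which is where package-specific fallbacks belong.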
Joss Knight on 10 Jun 2025
Edited: Joss Knight on 10 Jun 2025
Hi! This bug is fixed in R2025a. See the bug report.
Apologies for the delay in my response! Thank you very much for addressing this in R2025a. I really appreciate it.

Version

R2024b

Asked:

8 Jan 2025

Commented:

12 Jun 2025