FFT slowdown even after workspace reset

I'm experiencing behavior with the fft() function that is causing me to have to restart Matlab between executions of a long script that is both processing and memory intensive and requires, among other things, millions of fft's on the CPU and GPU. If I run bench() prior to running the script, my computer (i9-13950HX w/64GB of ram, running Windows 11, Matlab R2024a) clocks in very fast. After I run my script, all performance metrics are basically identical except for fft() which clocks >10x slower than before.
No matter what I do to the workspace (clear all, clear classes, clear functions, close all hidden force, clc, reset(gpuDevice), etc.), or the fft planner I cannot bring the performance of fft() back to what it was before execution of the script.
Am I overlooking anything that could reset the performance of the fft short of restarting Matlab itself? I would like to let the computer loop over a bunch of datasets but right now the slowdown in the fft is making this very inefficient. I am currently considering calling the Matlab engine from Python so that I can restart it between script calls to prevent this. I am running Matlab on 2024a and may be able to update to 2024b but cannot upgrade past 2024b.

31 comments

dpb on 2 June 2026 at 17:13
I believe probably only MathWorks can address this. Submit an official support request at <Product Support Page>.
Paul on 2 June 2026 at 18:43
If the culprit is identified, would you mind posting back here? Others may be interested in the findings.
Walter Roberson on 2 June 2026 at 19:31
I note that the i9-13950HX has 8 performance cores and 16 efficiency cores. I wonder if later in the run, most of the computation is being shunted to the efficiency cores?
Timothy on 2 June 2026 at 22:19
@dpb, thanks I might give that a shot.
@Paul, I found the culprit function, which calls another function which splits a bunch of large complex double precision N x M arrays into three dimensional 32 x M x (N/32) arrays, computes the FFT in the column dimension, multiplies with the conjugate of another 32 x M x (N/32) array, inverse Fourier transforms & normalizes to create a bunch of cross-correlations. However, it seems that if I isolate this sub function and run it a bunch of times by itself it doesn't affect the performance of the fft() function. Still, if I delete the sub function from the larger function I also have no reduction in performance, so I know that it is tied to it somehow. If I can get more specific or create a simple toy function that creates the fft performance loss I'm seeing I will post it here.
@Walter Roberson: Maybe this is happening, I don't know how to monitor what cores are used. Matlab is the main process on my computer however, and this slowdown can be created in under 10 minutes of operations by iterating the function described above in a loop. If I call the bench in each iteration of the loop and store the FFT time I can watch it slow down each iteration starting with the 4th (I get about 25 iterations in 10 minutes, by which time the FFT score has gone from ~0.15 seconds to ~1 second, and keeps slowing down the more times I call the function).
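For readers following along, the cross-correlation step described above might look roughly like the sketch below. It is only an illustration: the array sizes, the names A, B and blk, and the energy normalization are assumptions, not the actual culprit function, and the sizes are far smaller than the arrays discussed in this thread.

```matlab
% Hypothetical sizes for illustration only
N = 128;  M = 50;  blk = 32;
A = complex(randn(N, M), randn(N, M));
B = complex(randn(N, M), randn(N, M));

% Split each N x M array into blk x M x (N/blk): rows are grouped into
% consecutive blocks of blk rows, and each block becomes one page.
A3 = permute(reshape(A, blk, N/blk, M), [1 3 2]);
B3 = permute(reshape(B, blk, N/blk, M), [1 3 2]);

% Circular cross-correlation down each column (dimension 1) via the FFT
X = ifft(fft(A3, [], 1) .* conj(fft(B3, [], 1)), [], 1);

% Normalize by the energy of each column pair (assumed normalization)
nrm = sqrt(sum(abs(A3).^2, 1) .* sum(abs(B3).^2, 1));
X = X ./ nrm;

disp(size(X));   % blk x M x (N/blk)
```

With this normalization every correlation value has magnitude at most 1, which makes results from different blocks directly comparable.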
Paul on 3 June 2026 at 11:58
Edited: Paul on 3 June 2026 at 13:14
fft used to have a memory leak, but that was fixed. Maybe a similar, yet different, issue has reared up since then. Matlab 2020a/b fft function memory leak - MATLAB Answers - MATLAB Central
Also, this thread How can I solve memory leak in fft? - MATLAB Answers - MATLAB Central, which isn't really about a memory leak, discusses memory management with fft and seems like it might be on point based on the problem description. Maybe the
fftw(wisdom,[])
command is worth a try. Though it sounds like all of the FFTs are 32-point, so maybe this isn't the issue.
@Paul I think you mean
fftw('wisdom',[])
dpb on 3 June 2026 at 18:51
Edited: dpb on 3 June 2026 at 19:08
The doc for <fftw> may be a little confusing for that case -- it does show the form as @Paul used, but uses wisdom as a placeholder for either 'swisdom' or 'dwisdom'. 'wisdom' alone isn't documented but doesn't error on local system...but it does need to be a character string (or a variable that would contain the string).
Paul on 3 June 2026 at 19:26
Yes, as dpb suggested I was using wisdom as a variable that would have the value 'dwisdom' or 'swisdom' as appropriate.
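To make the variable form concrete, here is a minimal sketch of inspecting, clearing and restoring the FFTW wisdom and planner setting. The variable names are placeholders, and whether clearing helps in this particular case is exactly what is in question.

```matlab
% Which wisdom database to act on: 'dwisdom' (double) or 'swisdom' (single)
wisdom = 'dwisdom';

savedWisdom  = fftw(wisdom);      % current wisdom, returned as text
savedPlanner = fftw('planner');   % current planner method

fftw(wisdom, []);                 % clear the double-precision wisdom
fftw('swisdom', []);              % clear the single-precision wisdom too

% ... time fft here to see whether clearing made any difference ...

% Optionally put the earlier state back
if ~isempty(savedWisdom)
    fftw(wisdom, savedWisdom);
end
fftw('planner', savedPlanner);
```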
dpb on 3 June 2026 at 19:56
Was going to comment that was good spelunking @Paul to find the thread and refer to fftw; something like that was what I had in mind that would get reset on restart but wasn't affected by normal memory clearing, etc., ...back in days of yore before MATLAB and had to use the libraries directly in FORTRAN (before Fortran days, too) I knew about fftw but it had completely slipped my mind in the ensuing 40 years. Of course, that also predated having multi-cores, GPUs, parallel computing TBs so one had far more direct knowledge of what was going on inside.
Timothy on 4 June 2026 at 2:59
@Paul, @dpb, @Walter Roberson I'm sorry I wasn't specific enough in my original post, I have definitely tried resetting the wisdom in fftw for both single and double precision and it did not affect anything.
Steven Lord on 4 June 2026 at 14:43
You indicate you found the "culprit function". After running that, is it FFT calls in isolation that slow down, or is it subsequent runs of the culprit function as a whole?
You also stated "large complex double precision N x M arrays into three dimensional 32 x M x (N/32) arrays" -- how large is "large" in this context? What are typical values for N and M for the data on which you're operating?
I think without seeing that culprit function it's likely going to be difficult to determine what's going on. Please send it to Technical Support so they can work with the developers to understand the problem and try to determine the root cause of the slowdown.
dpb on 4 June 2026 at 15:40
Edited: dpb on 4 June 2026 at 16:32
I figured from the get-go this would take the developers being able to poke at the innards.
Besides the isolation of the given function, it is curious that something else has to be done to the state of the GPU on a restart before performance recovers...
First of all will be whether it is reproducible on a Mathworks machine or is something unique to @Timothy's particular system. Not too likely, probably, but ya' never know.
Timothy on 4 June 2026 at 16:23
@Steven Lord, the arrays are ~6000 x 16000 complex double precision matrices. Happy to contact tech support but I posted here to check if I'm missing something obvious which is frequently the case. I'm trying to drill down a bit further to see if I can reproduce the problem with a simpler script before I contact tech support.
dpb on 4 June 2026 at 16:36
It might be interesting/useful to see if the symptom were to go away for some smaller size?
I'd suggest if were able to create such a sample case to go ahead and post it here -- those who do have the TB and could run it (I don't) could also see if it is reproducible on other systems.
Steven Lord on 4 June 2026 at 18:00
After running your culprit function is it FFTs on the CPU that are slow, FFTs on the GPU, or both?
Timothy on 4 June 2026 at 23:12
Edited: Timothy on 4 June 2026 at 23:13
@Steven Lord CPU, at least, the GPU hasn't been touched yet when I can generate the problem. Here is a script that reproduces part of the problem. The crazy thing is, I was wrong about the FFT calls being a part of the problem. I can delete all of those cross-correlations and still get a slowdown for fft. The script below is an example:
out = F;

function [out] = F()
    % Build 10 cell arrays, each holding 500 x 500 small random matrices
    for n = 1:10
        NN = 500;
        MM = 500;
        C = cell(MM, NN);
        for nn = 1:NN
            for mm = 1:MM
                C{mm, nn} = randn(21, 21);
            end
        end
        out{n} = C;
        disp(n);
    end
end
If I run this mini-script and call bench() or just tst = randn(1, 2^25); tic; fft(tst); toc (note that I actually execute tic; fft(tst); toc multiple times to get an average and let the planner optimize), I get a slowdown of about 2X. On one machine, the fft time goes from ~0.15 seconds to ~0.3 seconds. If I clear the workspace in this case, the fft time goes back to normal, e.g. ~0.15 seconds. However, if I re-run the mini-script above and then re-run tst = randn(1, 2^25); tic; fft(tst); toc (without clearing the workspace), instead of ~0.3 seconds the fft now takes ~0.65 seconds. If I clear the workspace, I'm back to ~0.15 seconds. If I run it a third time, the fft now takes ~0.78 seconds (for the most recent executions, as I'm writing this, toc registered 0.775059, 0.775901, 0.779967, 0.772605). So something odd seems to be happening with the fft time (tested on R2024a and R2024b, different computers, slightly different results; the R2024b computer slows to ~0.32, ~0.48, ~0.52, ~0.63 as I clear the workspace and execute the mini-script above between speed tests).
The behavior I am having trouble reproducing from my other script, which is doing a lot more, is the persistence of the slowdown. In my other script, the slowdown of the fft persists even after workspace clearing. I will reach out to tech support.
The slowdown can be observed more easily using the following code:
for n = 1:5
    clear out
    % Time the FFT with the large cell array cleared
    tst = randn(1, 2^25);
    FF = @()fft(tst);
    T1 = timeit(FF);
    % Fill memory with many small matrices, then time the FFT again
    out = F;
    tst = randn(1, 2^25);
    FF = @()fft(tst);
    T2 = timeit(FF);
    disp(['Cleared workspace time: ', num2str(T1)]);
    disp(['Uncleared workspace time: ', num2str(T2)]);
    drawnow;
end

function [out] = F()
    out = cell(1, 10);
    for n = 1:15
        NN = 500;
        MM = 500;
        C = cell(MM, NN);
        for nn = 1:NN
            for mm = 1:MM
                C{mm, nn} = randn(21, 21);
            end
        end
        out{n} = C;
    end
end
My output was:
Cleared workspace time: 0.17438
Uncleared workspace time: 0.40605
Cleared workspace time: 0.17214
Uncleared workspace time: 0.9722
Cleared workspace time: 0.17484
Uncleared workspace time: 1.8431
Cleared workspace time: 0.17464
Uncleared workspace time: 1.8422
Cleared workspace time: 0.17555
Uncleared workspace time: 3.3422
on the machine I'm currently at. Note this doesn't reproduce the persistence (despite clearing the workspace) that I'm observing elsewhere, but I don't know if that persistence is necessary to cause the performance drop I'm seeing in my original code.
Walter Roberson on 5 June 2026 at 1:39
Data point: the problem does NOT occur on my Mac Tahoe 26.5 Intel I9-10910 (10 cores @3.6 GHz, no efficiency cores) when running MATLAB R2024a, or R2025b.
R2024a result:
Cleared workspace time: 0.23919
Uncleared workspace time: 0.23026
Cleared workspace time: 0.22924
Uncleared workspace time: 0.22899
Cleared workspace time: 0.22922
Uncleared workspace time: 0.22285
Cleared workspace time: 0.23152
Uncleared workspace time: 0.22992
Cleared workspace time: 0.22808
Uncleared workspace time: 0.23232
Timothy on 5 June 2026 at 1:46
@Walter Roberson GTK thanks!
Paul on 5 June 2026 at 2:04
How long does it take that code to run in wall clock time?
Does it matter if out is preallocated as
out = cell(1,15)
to be consistent with the loop over n = 1:15?
Timothy on 5 June 2026 at 2:27
Edited: Walter Roberson about 3 hours ago
@Paul that was an artifact of me testing different loop lengths, thanks for catching that but it doesn't affect things much. Total run time is ~3 minutes 30 seconds, and with a consistent cell size I got the following output:
Cleared workspace time: 0.1707
Uncleared workspace time: 0.43045
Cleared workspace time: 0.17297
Uncleared workspace time: 1.0535
Cleared workspace time: 0.17199
Uncleared workspace time: 1.687
Cleared workspace time: 0.16579
Uncleared workspace time: 1.908
Cleared workspace time: 0.17425
Uncleared workspace time: 2.4588
dpb on 5 June 2026 at 13:16
Edited: dpb on 5 June 2026 at 19:26
Brings this old system nearly to its knees...much more inconsistent than others
R2021b, Win10
>> fnhrs=@(n)(n-fix(n))*24;
>> h0=fnhrs(now); tim; h1=fnhrs(now); fprintf('\nElapsed time=%f min\n',(h1-h0)*60)
Cleared workspace time: 0.57853
Uncleared workspace time: 0.59925
Cleared workspace time: 0.60983
Uncleared workspace time: 0.63416
Cleared workspace time: 0.58143
Uncleared workspace time: 5.4506
Cleared workspace time: 0.56621
Uncleared workspace time: 2.4645
Cleared workspace time: 0.58384
Uncleared workspace time: 0.64495
Elapsed time=13.275917 min
>>
Timothy on 5 June 2026 at 15:51
@dpb I have 32GB of ram on one computer and 64 on the other so the script might have hammered your computer; I can reproduce my results (but with less of a slowdown) for a matrix in the inner loop of 11 x 11, which takes less memory. Your results, however, don't mirror mine and I am wondering now if this is actually a Matlab & Windows 11 memory management thing.
dpb on 5 June 2026 at 16:18
Edited: dpb on 5 June 2026 at 19:27
Yeah, I'm sure it was disk thrashing -- although I had kinda' forgotten when the other machine died and I resurrected this one that it has only 16GB and I didn't have any compatible sticks around to use so just left it having retired from the real consulting gig so big stuff doesn't come around much any more.
Anyways, if I do then rerun with M=11, the results are significantly different...
>> h0=fnhrs(now); tim; h1=fnhrs(now); fprintf('\nElapsed time=%0.1f min\n',(h1-h0)*60)
Cleared workspace time: 0.58582
Uncleared workspace time: 0.59993
Cleared workspace time: 0.57683
Uncleared workspace time: 0.58008
Cleared workspace time: 0.59105
Uncleared workspace time: 0.568
Cleared workspace time: 0.58876
Uncleared workspace time: 0.56444
Cleared workspace time: 0.58453
Uncleared workspace time: 0.57072
Elapsed time=5.6 min
>>
I wouldn't be surprised about being OS memory management related; which was why I wondered earlier if it could be shown to be related to memory footprint and whether there was any discernible degradation with size or whether just "over the cliff" at some point.
ADDENDUM
I made a slight modification to the F() function to accommodate...
function [out] = F()
    M = 11;
    N = 15;
    out = cell(1, N);
    for n = 1:N
        NN = 500;
        MM = 500;
        C = cell(MM, NN);
        for nn = 1:NN
            for mm = 1:MM
                C{mm, nn} = randn(M);
            end
        end
        out{n} = C;
    end
end
ADDENDUM SECOND
I hadn't looked that closely, but another slight rearrangement as
function [out] = F()
    M = 11;
    N = 15;
    NN = 500;
    MM = 500;
    C = cell(MM, NN);
    out = cell(1, N);
    for n = 1:N
        for nn = 1:NN
            for mm = 1:MM
                C{mm, nn} = randn(M);
            end
        end
        out{n} = C;
    end
end
of moving the constants and preallocation out of the loop resulted in essentially the same timings but ran significantly faster clock time...
>> h0=fnhrs(now); tim; h1=fnhrs(now); fprintf('\nElapsed time=%0.1f min\n',(h1-h0)*60)
Cleared workspace time: 0.58512
Uncleared workspace time: 0.56866
Cleared workspace time: 0.58118
Uncleared workspace time: 0.55005
Cleared workspace time: 0.57309
Uncleared workspace time: 0.55476
Cleared workspace time: 0.58512
Uncleared workspace time: 0.56051
Cleared workspace time: 0.57412
Uncleared workspace time: 0.56525
Elapsed time=3.9 min
>>
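One way to probe the footprint question raised above would be to sweep the inner matrix size and log the FFT time ratio against the size of the cell array. The following is a rough sketch with everything scaled down so it finishes quickly; fftLen, nCells and the list of sizes are placeholders, not values used in this thread, and the scaled-down version may well show no slowdown at all.

```matlab
sizes  = [5 11 21];     % inner matrix edge lengths to try (placeholder)
nCells = 200;           % cell grid is nCells x nCells (thread used 500 x 500)
fftLen = 2^20;          % FFT length (thread used 2^25)

tst   = randn(1, fftLen);
ratio = zeros(size(sizes));

for k = 1:numel(sizes)
    tBefore = timeit(@() fft(tst));      % baseline with no cell array alive

    C = cell(nCells, nCells);            % many small separate allocations
    for idx = 1:numel(C)
        C{idx} = randn(sizes(k));
    end

    tAfter = timeit(@() fft(tst));       % same FFT with the cell array alive
    info   = whos('C');
    ratio(k) = tAfter / tBefore;
    fprintf('M = %2d, cell array = %7.1f MiB, after/before = %.2f\n', ...
        sizes(k), info.bytes / 2^20, ratio(k));

    clear C
end
```

Plotting ratio against the reported footprint would show whether the degradation grows gradually or appears suddenly past some size.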
Timothy on 5 June 2026 at 16:55
Edited: Walter Roberson about 3 hours ago
Thanks @dpb, I ran your updated version of the loop on my computer and you can see performance walk down to >1 second for the fft in the uncleared environment despite having 2X the memory of the system you used and a faster chip.
Cleared workspace time: 0.16423
Uncleared workspace time: 0.28845
Cleared workspace time: 0.17935
Uncleared workspace time: 0.5818
Cleared workspace time: 0.1754
Uncleared workspace time: 0.76682
Cleared workspace time: 0.1651
Uncleared workspace time: 0.9451
Cleared workspace time: 0.17021
Uncleared workspace time: 1.0201
dpb on 5 June 2026 at 17:10
Peculiar, indeed, is all I know to say at this point...
Timothy on 5 June 2026 at 17:18
@dpb yeah I finally took your original advice (comment #1!) and wrote into tech support ...
One last fling before let it ride -- will be interesting to hear what light Mathworks support can shed on the symptoms you observe. But, why not try it on competing OS and recent release to see what happens...
fnhrs = @(n)(n-fix(n))*24;
h0 = fnhrs(now);
for n = 1:5
    clear out
    tst = randn(1, 2^25);
    FF = @()fft(tst);
    T1 = timeit(FF);
    out = F;
    tst = randn(1, 2^25);
    FF = @()fft(tst);
    T2 = timeit(FF);
    disp(['Cleared workspace time: ', num2str(T1,'%0.5f')]);
    disp(['Uncleared workspace time: ', num2str(T2,'%0.5f')]);
    %drawnow;
end
Cleared workspace time: 0.25048
Uncleared workspace time: 0.22922
Cleared workspace time: 0.23914
Uncleared workspace time: 0.24910
Cleared workspace time: 0.27493
Uncleared workspace time: 0.26221
Cleared workspace time: 0.25566
Uncleared workspace time: 0.25958
Cleared workspace time: 0.25795
Uncleared workspace time: 0.26588
h1=fnhrs(now); fprintf('\nElapsed time=%0.1f min\n',(h1-h0)*60)
Elapsed time=1.1 min
function [out] = F()
    M = 11;
    N = 15;
    NN = 500;
    MM = 500;
    C = cell(MM, NN);
    out = cell(1, N);
    for n = 1:N
        for nn = 1:NN
            for mm = 1:MM
                C{mm, nn} = randn(M);
            end
        end
        out{n} = C;
    end
end
Timothy on 5 June 2026 at 20:24
@dpb it is a good idea. I explored this a bit more w/Gemini and as far as I can understand, it appears that Windows 11 (vs 10) incorporates some weird memory location randomization stuff to prevent certain types of memory-based attacks. This, compounded with me smashing the memory with literally millions of individual cell array pointers that get uniformly scattered across memory, causes a stacking fragmentation that penalizes functions like fft() which need large contiguous chunks. So: if operating systems handle memory very differently, then I imagine this problem might not affect other OS's at all...
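If the fragmentation idea is right, one experiment (not something tried in this thread, so treat it as an untested assumption) would be to hold the small matrices in a single numeric array instead of a cell array, so MATLAB makes one large allocation rather than one per cell. A minimal sketch with reduced, hypothetical sizes:

```matlab
M  = 21;  MM = 50;  NN = 50;   % hypothetical; the thread used MM = NN = 500

% Layout used in the thread: MM*NN separately allocated M x M matrices
C = cell(MM, NN);
for nn = 1:NN
    for mm = 1:MM
        C{mm, nn} = randn(M);
    end
end

% Alternative layout: one contiguous M x M x MM x NN block
blocks = zeros(M, M, MM, NN);
for nn = 1:NN
    for mm = 1:MM
        blocks(:, :, mm, nn) = C{mm, nn};
    end
end

% Block (mm, nn) is recovered by plain indexing
sameBlock = isequal(blocks(:, :, 3, 7), C{3, 7});
disp(sameBlock);
```

Timing the fft with only blocks alive, versus only C alive, would show whether the number of allocations matters rather than the total bytes.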
dpb on 5 June 2026 at 21:14
Edited: dpb on 5 June 2026 at 21:50
I wondered about fragmentation being an issue. I haven't moved from W10 to W11 w/ old hardware and had not heard that about it. That's ugly if one can't manage to achieve a block of contiguous memory if/when needed.
dpb on 6 June 2026 at 16:05
Edited: dpb about 9 hours ago
The comment above about W11 memory management got me curious so I asked an AI 'bot -- some most interesting responses--whether they're really right or not only testing will tell, but here's the conversation.
I started just asking about allocating a large block of contiguous memory to see what it said about that. My input is bold italics...
At the very end when I asked about W11 specifically and W10, MacOS and Linux apparently being unaffected, it added some additional information on W11 that just might be the crux of the matter.
Anyways, here's the whole session --
"Can an application acquire large contiguous memory in Windows 11 for computation speed such as large FFT?"
Yes, a Windows 11 application can acquire large contiguous blocks of virtual memory. However, the operating system's paged memory architecture means your block will likely be scattered across non-contiguous physical RAM. To optimize high-speed tasks like large Fast Fourier Transforms (FFT), this distinction determines the optimal approach. [1, 2, 3, 4, 5]
1. Contiguous Virtual Memory (The App Standard)
For user-mode applications, you don't need physically contiguous memory to get high computation speeds; virtually contiguous memory works just as well. [1]
  • How to Allocate: You can use standard C++ functions like new, malloc, or _aligned_malloc for memory alignment. [1]
  • Benefits for FFT: With a contiguous virtual address space, the CPU can calculate addresses via simple arithmetic, preventing cache-misses and maximizing the efficiency of the CPU's memory prefetcher. [1, 2, 3]
  • Best Practice: The Windows Virtual Memory Manager (VMM) might still swap pages to disk if your memory is under pressure. To keep your data resident, you can use the VirtualLock function to lock the memory in your working set. [1]
2. Contiguous Physical Memory
If you are writing a kernel-mode driver that interfaces directly with hardware (like a GPU or FPGA performing the FFT via Direct Memory Access), you can allocate physically contiguous memory. [1, 2, 3]
  • How to Allocate: You must use kernel-mode Windows APIs like MmAllocateContiguousMemory or MmAllocateContiguousMemorySpecifyCache.
  • The Catch: Because the OS constantly allocates and frees physical pages over time, RAM becomes heavily fragmented. Allocating massive amounts of contiguous physical memory becomes difficult and is highly likely to fail on a system that has been running for a while. [1, 2, 3, 4]
3. Hardware Acceleration (GPUs)
For massive multi-dimensional FFTs (often spanning gigabytes), the industry standard is to utilize hardware acceleration rather than stressing the CPU's RAM. [1]
  • How to Allocate: Use dedicated compute libraries like NVIDIA's cuFFT (via CUDA).
  • Best Practice: Applications use cudaHostAlloc or similar device APIs to reserve aligned host memory for rapid transfers to the GPU. [1, 2]
4. Forcing Windows 11 "Large Pages"
If you are doing heavy mathematical computations entirely on the CPU, you can request 2MB Large Pages (instead of the standard 4KB pages) from Windows 11. This reduces Translation Lookaside Buffer (TLB) misses, resulting in a noticeable speedup for heavy processing like an FFT. [1]
  • How to Allocate: Use VirtualAlloc with the MEM_LARGE_PAGES flag (this requires enabling the "Lock Pages in Memory" privilege in Windows Security Policies). [1]
If you are developing a native Windows C++ application, I can help you:
  • Review implementation details for using Win32 API calls like VirtualAlloc with MEM_LARGE_PAGES.
  • Walk through configuring FFT libraries like FFTW to take advantage of these memory configurations. [1]
Let me know which specific route you want to take. [1]
20 sites
  • MmAllocateContiguousMemory function (wdm.h). Microsoft Learn.
  • Virtual memory clarification - allocation of large contiguous ... Stack Overflow.
  • Allocating System-Space Memory - Windows drivers. Microsoft Learn.
"This application is MATLAB which uses FFTW and is showing decreasing performance with multiple calls"
When MATLAB degrades in performance over multiple calls to fft or fftn, the root cause is rarely the lack of contiguous memory. Instead, it is usually caused by FFTW wisdom accumulation (internal planning caches growing too large), variable array sizes forcing constant replanning, or denormal numbers/NaNs corrupting the floating-point pipeline. [1, 2, 3, 4, 5]
The most common causes of this specific behavior can be isolated and resolved using the following techniques.
1. Clear or Freeze FFTW Wisdom (Most Common Fix)
MATLAB uses an internal system called "Wisdom" to remember the fastest way to calculate an FFT of a specific size. If your script processes many arrays of varying or odd dimensions, FFTW continually allocates internal heap memory to store new execution plans. This can balloon memory usage and degrade lookup performance. [1, 2, 3, 4]
You can fix this by clearing the single-precision and double-precision wisdom caches periodically in your loop: [1]
% Clear the accumulated planning cache
fftw('swisdom', []);
fftw('dwisdom', []);
Alternatively, switch the planner to estimate. This forces FFTW to use a fixed, heuristic-based mathematical approach rather than spending time dynamically measuring your hardware's memory architecture for an optimal layout. [1, 2]
fftw('planner', 'estimate');
2. Force Array Sizes to Powers of Two (Zero-Padding)
If your input sizes are odd, prime numbers, or fluctuate constantly, FFTW cannot reuse plans and must rebuild them. [1, 2]
  • The Math: FFTW is exceptionally fast for sizes that are powers of two (\(2^{n}\)), or products of small primes (\(3, 5, 7\)).
  • The Fix: Pad your input matrix to the next highest power of two. This makes execution highly predictable and allows MATLAB to hit the same cached FFTW plans over and over. [1, 2]
% Compute the ideal next power of 2 for size optimization
n_optimal = nextpow2(length(x));
N = 2^n_optimal;
% Pass N directly into the fft function to auto-pad with zeros
X = fft(x, N);
3. Check for Data Contamination (NaNs, Infs, and Denormals)
If your loop progressively introduces invalid values into your dataset, performance will plunge. [1]
  • The Cause: When NaN (Not a Number), Inf, or denormal numbers (values extremely close to zero) slip into an array, the CPU drops out of its fast hardware processing mode. It handles these edge cases using microcode exceptions, slowing the calculation speed down by a factor of 10 to 100.
  • The Fix: Scan your input array right before the FFT call to ensure no bad data is leaking into the operation: [1]
if any(isnan(x), 'all') || any(isinf(x), 'all')
    error('NaN or Inf detected before FFT calculation!');
end
4. Prevent In-Place Memory Allocation Fragmentation
If your loop structure overwrites variables without explicitly preallocating them, MATLAB may fragment your virtual address space over time. Ensure you are not dynamically growing matrices inside the iteration loop.
  • Bad: data(:, i) = fft(matrix(:, i)); (Where data grows dynamically).
  • Good: Preallocate the entire matrix ahead of time using zeros() to force MATLAB to request a single, contiguous block of virtual memory from Windows up front.
To narrow this down, let me know:
  • Are the input array sizes changing on every call, or do they stay exactly the same?
  • Are you executing this on a CPU or utilizing GPU architectures (via gpuArray)? [1]
9 sites
  • Speed Up Fast Fourier Transforms in Generated Standalone ... MathWorks.
  • fftw - Define method for determining FFT algorithm - MATLAB. MathWorks.
  • How to Do FFT in MATLAB (video, 28 Apr 2022). YouTube, MATLAB.
"On CPU and performance still degrades after clearing wisdom. The FFT is using a power of 2 (2^25, precisely)"
Since your input size is fixed to exactly \(2^{25}\) (which is a massive array size occupying either 256MB for single-precision or 512MB for double-precision per array) and performance degrades even after clearing wisdom, the issue is almost certainly a system resource bottleneck or memory pipeline stall rather than an algorithm planning issue. [1, 2, 3]
At \(2^{25}\), your data vastly exceeds the CPU's internal L1/L2/L3 hardware caches. The calculations are completely bounded by main system RAM bandwidth. Three distinct root causes usually trigger this degradation across multiple loop iterations, along with ways to solve them. [1, 2]
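As a side check on the sizes quoted in that answer, the arithmetic is easy to verify. The 256 MB and 512 MB figures correspond to complex single and complex double data; a real double input such as randn(1, 2^25) occupies 256 MiB, and its complex double FFT output occupies 512 MiB. The variable names below are just for this illustration.

```matlab
n   = 2^25;                 % number of elements in the benchmark vector
mib = @(bytes) bytes / 2^20;

realSingleMiB    = mib(n * 4);    % 4 bytes per real single
complexSingleMiB = mib(n * 8);    % 8 bytes per complex single
realDoubleMiB    = mib(n * 8);    % 8 bytes per real double
complexDoubleMiB = mib(n * 16);   % 16 bytes per complex double

fprintf('real single:    %4d MiB\n', realSingleMiB);     % 128
fprintf('complex single: %4d MiB\n', complexSingleMiB);  % 256
fprintf('real double:    %4d MiB\n', realDoubleMiB);     % 256
fprintf('complex double: %4d MiB\n', complexDoubleMiB);  % 512
```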
1. The Multi-Thread Accumulation Trap (Most Common)
For an array of size \(2^{25}\), MATLAB uses its internal multi-threading engine (Intel MKL/FFTW) to spawn threads across all available CPU physical cores. [1]
  • The Problem: In loops, if MATLAB finishes an iteration and immediately starts the next one, the OS thread scheduler doesn't have time to put worker threads to sleep properly. Threads "spin" waiting for work, causing severe thread contention, context switching overhead, and CPU cache thrashing. The processor runs out of memory bus bandwidth, stalling subsequent calls. [1]
  • The Fix: Explicitly restrict MATLAB's multi-threading pool or inject a fractional pause to allow the thread-pool manager to reset its queues:
% Test if single-threading stabilizes the execution time
maxNumCompThreads(1);
% If you must use multithreading, add a tiny pause at the end of the loop
% to let the OS thread-scheduler clear the CPU pipeline
pause(0.005);
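If anyone wants to try the single-thread suggestion without leaving the session stuck on one thread, a small sketch follows. It relies on maxNumCompThreads returning the previous setting when a new one is applied; the reduced FFT length is a placeholder chosen so the check runs quickly.

```matlab
tst = randn(1, 2^20);               % placeholder length (thread used 2^25)

nPrev   = maxNumCompThreads(1);     % force one thread, keep the old setting
tSingle = timeit(@() fft(tst));

maxNumCompThreads(nPrev);           % restore the previous thread count
tMulti  = timeit(@() fft(tst));

fprintf('1 thread: %.4f s, %d threads: %.4f s\n', tSingle, nPrev, tMulti);
```

Running this before and after the memory-filling function would show whether the slowdown is specific to the multi-threaded path.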
2. Physical Memory "Page Fault" Accumulation
Even if your workspace variable has a permanent, fixed size, MATLAB may be reallocating the underlying memory page tables in Windows 11 if the variable is modified or re-assigned inside the loop.
  • The Problem: A \(2^{25}\) double-precision array takes 512MB. If you calculate X = fft(x); inside a loop, MATLAB has to constantly request and release large chunks of virtual memory. This forces the Windows 11 kernel to zero-out old memory pages and remap virtual-to-physical space, causing major page faults that degrade performance over time.
  • The Fix: Force MATLAB to reuse the exact same physical memory buffer using in-place assignment. Do not change the variable name, and overwrite the memory space directly:
% Preallocate exactly once outside the loop
X = zeros(1, 2^25);
for i = 1:num_iterations
    % ... load or generate your data into X ...
    % Force an in-place FFT calculation on the same memory block
    X = fft(X);
    % Avoid creating new variable copies during down-stream processing
end
3. Diverging Accumulation / Denormal Stalls
In some iterative algorithms (such as back-and-forth filtering or updating a signal based on its previous state), repeatedly calculating an FFT without proper scaling causes numbers to exponentially grow or shrink. [1]
  • The Problem: Because MATLAB's fft calculates an unnormalized discrete transform, if numbers shrink down close to zero, they hit the Denormal (subnormal) range. When CPUs process denormal floating-point numbers, they cannot use hardware execution units and drop down to microcode emulation, creating a 10x–100x performance drop. [1]
  • The Fix: Ensure your loop scales data appropriately, or forcefully disable subnormal processing on your CPU if you are using an Intel/AMD processor:
% Ensure data does not contain underflowing values right before the FFT
if max(abs(X(:))) < 1e-20
    X(:) = 0; % Force tiny numbers cleanly to actual zero
end
Step-by-Step Diagnostic Plan
To completely isolate the problem, run this diagnostic sequence inside your loop:
  1. Monitor Memory Garbage Collection: Add pack or java.lang.System.gc(); to the end of your loop to check if forcing a garbage collection prevents the degradation.
  2. Isolate Memory vs. Compute: Temporarily comment out the fft line but keep your array generation logic. If the slowdown stops, it is a threading/FFTW issue. If the slowdown persists, Windows is struggling with memory allocation. [1]
  3. Switch to Single Precision: Change your input array to single(x). This cuts the memory bandwidth in half (from 512MB to 256MB), vastly easing the strain on the Windows 11 virtual memory manager.
Let me know the results of these adjustments:
  • Did switching MATLAB to single-thread mode via maxNumCompThreads(1) stop the performance degradation?
  • Are you overwriting the exact same variable name in-place inside the loop?
"This seems to occur only on Windows 11, not Windows 10, Mac OS or Linux systems running same benchmark code"
Because this degradation happens exclusively on Windows 11—and not on Windows 10, macOS, or Linux—the problem is narrowed down to two structural changes unique to Windows 11: aggressive background thread throttling (Intel Thread Director) and Virtualization-Based Security (VBS) memory overhead. [1]
At \(2^{25}\) elements (512MB for doubles), your workload heavily pushes both the limits of CPU thread scheduling and memory bandwidth. Three Windows 11 specific mechanisms cause this exact loop-degradation behavior along with methods to eliminate them. [1]
1. The Windows 11 "Background" Throttling Trap (Intel/AMD Hybrid Architecture)
Windows 11 introduced a strict thread priority manager linked to the hardware Intel Thread Director (or AMD equivalents). [1]
  • The Problem: If MATLAB loses active window focus (even briefly, for example when you click on another app), or if the OS classifies the execution loop as a background process, Windows 11 can move MATLAB's worker threads to Efficiency Cores (E-Cores) or put the process in Efficiency mode. The next time the loop hits the fft call, the worker threads may still be on the slower cores, or the OS may be slow to move them back to Performance Cores (P-Cores), and the delay then shows up in every subsequent call. [1, 2]
  • The Fix: Discourage Windows 11 from treating MATLAB as a background process.
  1. Open Windows Settings > System > Power & Battery. Change the Power Mode to Best Performance.
  2. Launch MATLAB and run your loop.
  3. Open the Windows Task Manager, go to the Details tab, right-click MATLAB.exe, select Set Priority, and change it to Above Normal or High. [1]
2. Windows 11 Virtualization-Based Security (VBS / HVCI)
By default, new Windows 11 installations enable Virtualization-Based Security (VBS) and Hypervisor-Protected Code Integrity (HVCI), features that are usually disabled or not enforced on machines upgraded from Windows 10.
  • The Problem: VBS runs Windows on top of a thin hypervisor layer to isolate system memory. When MATLAB makes large consecutive allocations or page table updates for a 512MB array, the memory mapping requests must pass through this hypervisor layer. Over repeated iterations this may add memory access latency that does not appear on Linux or macOS. [1]
  • The Fix: Turn off VBS to check if it matches bare-metal Linux/Windows 10 performance:
  1. Press the Windows Key, type Core Isolation, and open it.
  2. Toggle Memory Integrity to Off.
  3. Restart your computer and rerun your benchmark.
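Before toggling anything, it may help to check whether VBS is actually running. The sketch below queries the Win32_DeviceGuard CIM class through PowerShell from inside MATLAB. It assumes PowerShell is on the path; as I understand the class, a VirtualizationBasedSecurityStatus of 0 means not enabled, 1 means enabled but not running, and 2 means running. On non-Windows systems the value is left as NaN.

```matlab
vbsStatus = NaN;   % stays NaN when the query is unavailable or fails

if ispc
    % Build the PowerShell query for the Device Guard status property.
    cmd = ['powershell -NoProfile -Command "(Get-CimInstance ' ...
           '-ClassName Win32_DeviceGuard ' ...
           '-Namespace root\Microsoft\Windows\DeviceGuard)' ...
           '.VirtualizationBasedSecurityStatus"'];
    [exitCode, out] = system(cmd);
    if exitCode == 0
        vbsStatus = str2double(strtrim(out));   % NaN if the output is not a number
    end
end

fprintf('VBS status code: %g\n', vbsStatus);
```

If the code is already 0, turning off Memory Integrity is unlikely to change the benchmark.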
3. Windows 11 Dynamic Memory Compression & Aggressive Working Sets
Windows 11 handles memory management and page-file swapping much more aggressively than Windows 10 to keep system memory looking "free."
  • The Problem: When MATLAB finishes a loop iteration and briefly pauses computation to index data, Windows 11 immediately marks sections of that 512MB virtual workspace chunk as eligible for compression or swapping. On the next loop, MATLAB triggers hidden "hard" or "soft" page faults to reclaim that memory space, leading to a loop that slows down incrementally over time.
  • The Fix: As far as I know, MATLAB has no documented command for locking its pages into physical memory, so the closest in-script test is to request a Java garbage collection at the end of each iteration and see whether the slowdown changes. This only frees JVM heap objects; it does not affect MATLAB arrays or the Windows working set:
matlab
% Request a JVM garbage collection (affects the Java heap only, not MATLAB arrays)
% Put this snippet at the very end of your iteration loop
java.lang.System.gc();
The 60-Second Test to Prove it's Windows 11 Thread Scheduling
To test whether Windows 11 is mismanaging your CPU cores, cap MATLAB's computational thread count at the number of physical Performance Cores. This limits the thread count rather than pinning threads to the P-cores, so treat it as a quick check, not proof. Execute this in your command window before starting the benchmark loop:
matlab
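% MKL_DYNAMIC=FALSE asks Intel MKL not to adjust its thread count at run time.
% It does not set thread affinity, and it may have no effect once MKL is loaded.
setenv('MKL_DYNAMIC', 'FALSE');
% Cap MATLAB's computational threads at the number of Performance Cores
% (e.g., if you have 8 Performance Cores, set this to 8).
% maxNumCompThreads returns the previous setting so it can be restored later.
prevThreads = maxNumCompThreads(8);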
If you apply Best Performance Mode and lock your thread count, let me know if the execution speed finally holds steady across your loop iterations. [1]
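One way to judge whether the speed "holds steady" is to record the time of each fft call and compare the start of the run with the end. This is a minimal sketch with assumed values (20 iterations, a 2^20-element complex input); for the real test, the loop body would be the actual workload at full size.

```matlab
nIter = 20;                                   % assumed iteration count
x = complex(randn(2^20, 1), randn(2^20, 1));  % assumed test signal
t = zeros(nIter, 1);                          % per-iteration fft time in seconds

for k = 1:nIter
    tic;
    y = fft(x);
    t(k) = toc;
end

% Compare the last five iterations with the first five.
ratio = median(t(end-4:end)) / median(t(1:5));
fprintf('late/early median fft time ratio: %.2f\n', ratio);
```

A ratio that stays near 1 suggests the settings helped; a ratio that climbs over the run points back at one of the mechanisms above.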

Answers (0)

Version

R2024a

Question asked:

2 June 2026 at 17:02

Edited:

about 14 hours ago
