Vectorized code slower than loops

121 views (last 30 days)
Alessandro
Alessandro on 4 Dec 2024 at 23:56
Commented: Joss Knight on 7 Dec 2024 at 16:01
I am trying to speed up the code below (it is an MWE; the actual code is more complicated). I vectorized the inner loop (see Method 2) and, while the vectorized code is correct, it is slower than the original code with a nested loop.
I would be grateful if someone has any idea how to improve the performance (given that my attempt at vectorization made it worse).
clear
clc
close all
n_d = 51;
n_a = 601;
n_aprime = n_a;
n_z = 7;
beta = 0.98;
rng("default")
% Fake data
RetMat = rand(n_d,n_aprime,n_a,n_z);
V_next = rand(n_aprime,n_z);
pi_z = rand(n_z,n_z);
pi_z = pi_z./sum(pi_z,2);
RetMat = reshape(RetMat,[n_d*n_aprime,n_a,n_z]); %(d*a',a,z)
disp('Method 1')
Method 1
tic
% Initialize output
V1 = zeros(n_a,n_z);
Policy1 = ones(n_a,n_z);
for z_c=1:n_z
    EVz = V_next*pi_z(z_c,:)'; %(aprime,1)
    %EVz_big = kron(EVz,ones(n_d,1));
    EVz_big = repelem(EVz,n_d);
    for a_c=1:n_a
        entireRHS = RetMat(:,a_c,z_c)+beta*EVz_big; %(d*a',1)
        [max_val,max_ind] = max(entireRHS); %scalar
        %[d_opt,aprime_opt] = ind2sub([n_d,n_aprime],max_ind);
        V1(a_c,z_c) = max_val;
        Policy1(a_c,z_c)=max_ind;
    end
end
Policy1_all = zeros(2,n_a,n_z);
[d_opt,aprime_opt] = ind2sub([n_d,n_aprime],Policy1);
Policy1_all(1,:,:) = d_opt;
Policy1_all(2,:,:) = aprime_opt;
toc
Elapsed time is 0.321922 seconds.
disp('Method 2 (vectorized)')
Method 2 (vectorized)
tic
V2 = zeros(n_a,n_z);
Policy2 = ones(n_a,n_z);
for z_c=1:n_z
    EVz = V_next*pi_z(z_c,:)'; %(aprime,1)
    %EVz_big = kron(EVz,ones(n_d,1));
    EVz_big = repelem(EVz,n_d);
    entireRHS = RetMat(:,:,z_c)+beta*EVz_big; %(d*a',a)
    [max_val,max_ind] = max(entireRHS,[],1); %(1,a)
    V2(:,z_c) = max_val;
    Policy2(:,z_c)=max_ind;
end
Policy2_all = zeros(2,n_a,n_z);
[d_opt,aprime_opt] = ind2sub([n_d,n_aprime],Policy2);
Policy2_all(1,:,:) = d_opt;
Policy2_all(2,:,:) = aprime_opt;
toc
Elapsed time is 0.488226 seconds.
err = max(abs(V2-V1),[],"all");
disp(err)
0
err = max(abs(Policy2_all-Policy1_all),[],"all");
disp(err)
0

Accepted Answer

Epsilon
Epsilon on 5 Dec 2024 at 15:16
Hi Alessandro,
Method 1 handles a single column vector at a time and is faster than Method 2, which computes the entire entireRHS at once by adding the 30651 x 601 matrix RetMat(:,:,z_c) to the expanded vector EVz_big. Vectorization is generally faster because it reduces loop overhead, but in this case the larger temporary arrays being processed can slow the computation down.
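As a back-of-the-envelope illustration of how large that temporary is (my own numbers, assuming double precision):

```matlab
% Size of the temporary produced by RetMat(:,:,z_c) + beta*EVz_big:
% (n_d*n_aprime) rows by n_a columns of 8-byte doubles.
n_d = 51; n_a = 601; n_aprime = n_a;
bytes = (n_d*n_aprime) * n_a * 8;
fprintf('~%.0f MB allocated per z iteration\n', bytes/1e6)  % ~147 MB
```

Allocating and filling roughly 147 MB on every z iteration, just to take a max over it, is where the vectorized version loses its advantage.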
To optimize further, try the method below:
clear
clc
close all
n_d = 51;
n_a = 601;
n_aprime = n_a;
n_z = 7;
beta = 0.98;
rng("default")
% Fake data
RetMat = rand(n_d,n_aprime,n_a,n_z);
V_next = rand(n_aprime,n_z);
pi_z = rand(n_z,n_z);
pi_z = pi_z./sum(pi_z,2);
RetMat = reshape(RetMat,[n_d*n_aprime,n_a,n_z]); %(d*a',a,z)
disp('Method 1')
Method 1
tic
% Initialize output
V1 = zeros(n_a,n_z);
Policy1 = ones(n_a,n_z);
for z_c=1:n_z
    EVz = V_next*pi_z(z_c,:)'; %(aprime,1)
    %EVz_big = kron(EVz,ones(n_d,1));
    EVz_big = repelem(EVz,n_d);
    for a_c=1:n_a
        entireRHS = RetMat(:,a_c,z_c)+beta*EVz_big; %(d*a',1)
        [max_val,max_ind] = max(entireRHS); %scalar
        %[d_opt,aprime_opt] = ind2sub([n_d,n_aprime],max_ind);
        V1(a_c,z_c) = max_val;
        Policy1(a_c,z_c)=max_ind;
    end
end
Policy1_all = zeros(2,n_a,n_z);
[d_opt,aprime_opt] = ind2sub([n_d,n_aprime],Policy1);
Policy1_all(1,:,:) = d_opt;
Policy1_all(2,:,:) = aprime_opt;
toc
Elapsed time is 0.392334 seconds.
disp('Method 2 (suggested)')
Method 2 (suggested)
tic
% Initialize output
V2 = zeros(n_a, n_z);
Policy2 = ones(n_a, n_z);
Policy2_all = zeros(2, n_a, n_z);
% Precompute
RetMat_reshaped = reshape(RetMat, [n_d * n_aprime, n_a, n_z]);
EV = V_next * pi_z';
for z_c = 1:n_z
    EVz_big = repelem(EV(:, z_c), n_d);
    for a_c = 1:n_a
        entireRHS = RetMat_reshaped(:, a_c, z_c) + beta * EVz_big;
        [max_val, max_ind] = max(entireRHS);
        V2(a_c, z_c) = max_val;
        Policy2(a_c, z_c) = max_ind;
    end
end
[d_opt, aprime_opt] = ind2sub([n_d, n_aprime], Policy2);
Policy2_all(1, :, :) = d_opt;
Policy2_all(2, :, :) = aprime_opt;
toc
Elapsed time is 0.310808 seconds.
err = max(abs(V2-V1),[],"all");
disp(err)
0
err = max(abs(Policy2_all-Policy1_all),[],"all");
disp(err)
0
Hope it helps!
1 comment
Alessandro
Alessandro on 5 Dec 2024 at 18:18
Thanks for your help, it is a bit faster.
One small comment: I don't think you need the line
RetMat_reshaped = reshape(RetMat, [n_d * n_aprime, n_a, n_z]);
since the reshape is already done at the very beginning (and I exclude it from the timing because it is part of the data generation). In your code you are doing it twice.

Sign in to comment.

More Answers (5)

Steven Lord
Steven Lord on 5 Dec 2024 at 16:52
Vectorization is one tool in your MATLAB toolbox, but so is the for loop. Each has its own pros and cons.
I think Mike Croucher is planning on writing a blog post about this, but at the risk of stealing his thunder consider the analogy of a bookshelf. If you want to move all the books from one bookshelf to another (or to packing boxes if you're moving), there are multiple approaches you could take.
If there aren't very many books and they're all light (say paperbacks) you could use a vectorized approach, carrying the whole pile at once from the first bookshelf to the next. This could save time by eliminating some or all of the round-trips between the two bookshelves.
If there are a lot of books, or some/all of them are heavy (like my bookshelf of roleplaying books, probably about 250 hard-cover books), the vectorized approach may not be possible, or may be less feasible than a semi-vectorized or iterative approach. I'm no Hafþór Júlíus Björnsson, able to lift half a ton of weight. So instead of transporting all the books at once, I carried one at a time (for loop) or a small group at a time (semi-vectorized approach).
In this case, your vectorized approach tells MATLAB "carry all the books on one shelf of the bookshelf at once" and MATLAB has to take its time each trip carrying hundreds of pounds of books. In the iterative approach it's carrying one book at a time and the round-trip travel time between bookshelves is small enough that it actually saves time over staggering under the weight of those hundreds of pounds of books.
I'm not saying abandon vectorization, of course. But just realize that neither vectorization nor for loops will be the best choice in every single scenario you encounter.
1 comment
Alessandro
Alessandro on 5 Dec 2024 at 18:20
Thanks for the very nice explanation. Looking forward to Mike Croucher's blog post :)

Sign in to comment.


John D'Errico
John D'Errico on 5 Dec 2024 at 17:29
Edited: John D'Errico on 5 Dec 2024 at 17:32
Read the response by @Steven Lord again, in case you missed it. (Actually, that is always good advice for any of his responses.)
Here is my take. Vectorization is a good thing much of the time. But the way I think of most vectorized code is that it employs internal, implicit loops. As such, those loops are going to be quite fast, more efficient than explicit loops written in MATLAB. So you CAN gain there. (Over the years, the MATLAB parser has grown greatly in capability, making some of those explicitly written loops quite efficient. Implicit loops will pretty much always win, though.) But remember that the cost of many vectorizations is memory. They may force your code to generate a huge number of potential solutions, only to throw almost all of them away.
That trade-off between fast implicit loops and memory is important, and you need to understand it.
For example, there are many ways we can see vectorization employed. A simple matrix multiply is one. Could I write this:
A = rand(1000);
B = rand(1000);
C = A*B;
Surely you would see that as a vectorized form of the alternative, a triply nested loop? In this case, there is no additional work being done. The same adds and multiplies are done either way. It is all in the form of implicit loops, and all done by the BLAS. A great form of vectorization.
Another common instance where we might see vectorization is to ask for the set of 20 bit binary numbers with exactly 5 bits represented as 1. The simple vectorized solution might be to generate all 20 bit numbers, something like dec2bin(0:2^20-1), and then discard those with an unacceptable number of bits. And that would work nicely enough, unless you decided to try the same thing with 100 or 1000 bit numbers.
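A minimal sketch of that generate-and-filter idea, alongside a memory-friendlier direct enumeration (my own illustration, not from the answer):

```matlab
nbits = 20; k = 5;
% Generate-and-filter: build all 2^20 numbers, then discard most of them.
% Memory grows as 2^nbits, so this fails long before nbits = 100.
allbits = dec2bin(0:2^nbits-1) - '0';   % 2^20-by-20 matrix of 0/1
sel = allbits(sum(allbits,2) == k, :);  % keep rows with exactly k ones
% Direct enumeration: build only the valid patterns (nchoosek(20,5) = 15504).
pos = nchoosek(1:nbits, k);             % positions of the k one-bits
sel2 = zeros(size(pos,1), nbits);
rows = repmat((1:size(pos,1))', 1, k);
sel2(sub2ind(size(sel2), rows, pos)) = 1;
```

Both produce the same set of bit patterns (in different orders), but the second never materializes the rejected candidates.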
That is, vectorization can fail miserably when it is abused. You as the user need to understand the difference, and how the vectorization was written, what it is doing internally, and when it will fail.
Finally, I've seen examples of solutions to problems that were described as vectorization, but are really not in either of the above categories. Instead, these are cases where an intelligent application of mathematics solves a problem that would have been achieved by brute force approaches otherwise. Feel free to call that not vectorization.

Mike Croucher
Mike Croucher on 5 Dec 2024 at 22:50
Edited: Mike Croucher on 5 Dec 2024 at 22:58
I made a couple more modifications and came up with a double for-loop that is fastest. One thing that I'd like to point out here is that all of your versions are 'vectorised' in some sense. For example, in your original you have
entireRHS = RetMat(:,a_c,z_c)+ beta * EVz_big;
EVz_big is a vector, so that line has got vectorised multiplication and addition. Every one of your examples has a vectorised component and a loop component. What we've all been doing in this thread as we try to improve speed is changing how we chop up the vectors. With your original 'Method 2 (vectorised)' you had this line
entireRHS = RetMat(:,:,z_c)+beta*EVz_big;
and, as suggested by @Epsilon, that line is creating some large temporary matrices. It takes time to do this and you are not doing enough compute on them to justify the bother.
n_d = 51;
n_a = 601;
n_aprime = n_a;
n_z = 7;
beta = 0.98;
rng("default")
% Fake data
RetMat = rand(n_d,n_aprime,n_a,n_z);
V_next = rand(n_aprime,n_z);
pi_z = rand(n_z,n_z);
pi_z = pi_z./sum(pi_z,2);
RetMat = reshape(RetMat,[n_d*n_aprime,n_a,n_z]); %(d*a',a,z)
tic
% Initialize output
V1 = zeros(n_a,n_z);
Policy1 = ones(n_a,n_z);
for z_c=1:n_z
    EVz = V_next*pi_z(z_c,:)'; %(aprime,1)
    %EVz_big = kron(EVz,ones(n_d,1));
    EVz_big = repelem(EVz,n_d);
    for a_c=1:n_a
        entireRHS = RetMat(:,a_c,z_c)+beta*EVz_big; %(d*a',1)
        [max_val,max_ind] = max(entireRHS); %scalar
        %[d_opt,aprime_opt] = ind2sub([n_d,n_aprime],max_ind);
        V1(a_c,z_c) = max_val;
        Policy1(a_c,z_c)=max_ind;
    end
end
Policy1_all = zeros(2,n_a,n_z);
[d_opt,aprime_opt] = ind2sub([n_d,n_aprime],Policy1);
Policy1_all(1,:,:) = d_opt;
Policy1_all(2,:,:) = aprime_opt;
method1 = toc;
tic
V2 = zeros(n_a,n_z);
Policy2 = ones(n_a,n_z);
for z_c=1:n_z
    EVz = V_next*pi_z(z_c,:)'; %(aprime,1)
    %EVz_big = kron(EVz,ones(n_d,1));
    EVz_big = repelem(EVz,n_d);
    entireRHS = RetMat(:,:,z_c)+beta*EVz_big; %(d*a',a)
    [max_val,max_ind] = max(entireRHS,[],1); %(1,a)
    V2(:,z_c) = max_val;
    Policy2(:,z_c)=max_ind;
end
Policy2_all = zeros(2,n_a,n_z);
[d_opt,aprime_opt] = ind2sub([n_d,n_aprime],Policy2);
Policy2_all(1,:,:) = d_opt;
Policy2_all(2,:,:) = aprime_opt;
method2 = toc;
tic
% Initialize output
V2 = zeros(n_a, n_z);
Policy2 = ones(n_a, n_z);
Policy2_all = zeros(2, n_a, n_z);
% Precompute
%RetMat_reshaped = reshape(RetMat, [n_d * n_aprime, n_a, n_z]);
EV = V_next * pi_z';
for z_c = 1:n_z
    EVz_big = repelem(EV(:, z_c), n_d);
    for a_c = 1:n_a
        entireRHS = RetMat(:, a_c, z_c) + beta * EVz_big;
        [max_val, max_ind] = max(entireRHS);
        V2(a_c, z_c) = max_val;
        Policy2(a_c, z_c) = max_ind;
    end
end
[d_opt, aprime_opt] = ind2sub([n_d, n_aprime], Policy2);
Policy2_all(1, :, :) = d_opt;
Policy2_all(2, :, :) = aprime_opt;
method3 = toc;
tic
% Initialize output
V2 = zeros(n_a, n_z);
Policy2 = ones(n_a, n_z);
Policy2_all = zeros(2, n_a, n_z);
% Precompute
%RetMat_reshaped = reshape(RetMat, [n_d * n_aprime, n_a, n_z]);
EV = V_next * pi_z';
for z_c = 1:n_z
    EVz_big = repelem(beta*EV(:, z_c), n_d);
    for a_c = 1:n_a
        entireRHS = RetMat(:, a_c, z_c) + EVz_big;
        [max_val, max_ind] = max(entireRHS);
        V2(a_c, z_c) = max_val;
        Policy2(a_c, z_c) = max_ind;
    end
end
[d_opt, aprime_opt] = ind2sub([n_d, n_aprime], Policy2);
Policy2_all(1, :, :) = d_opt;
Policy2_all(2, :, :) = aprime_opt;
method4 = toc;
tic
% Initialize output
V2 = zeros(n_a,n_z);
Policy2 = ones(n_a,n_z);
for z_c=1:n_z
    EVz = V_next*pi_z(z_c,:)'; %(aprime,1)
    %EVz_big = kron(EVz,ones(n_d,1));
    EVz_big = repelem(beta*EVz,n_d);
    for a_c=1:n_a
        [max_val,max_ind] = max(RetMat(:,a_c,z_c) + EVz_big); %(d*a',1)
        %[d_opt,aprime_opt] = ind2sub([n_d,n_aprime],max_ind);
        V2(a_c,z_c) = max_val;
        Policy2(a_c,z_c)=max_ind;
    end
end
Policy2_all = zeros(2,n_a,n_z);
[d_opt,aprime_opt] = ind2sub([n_d,n_aprime],Policy2);
Policy2_all(1,:,:) = d_opt;
Policy2_all(2,:,:) = aprime_opt;
method5 = toc;
fprintf("Method 1 : %f seconds\n",method1);
fprintf("Method 2 (vectorized) : %f seconds\n",method2);
fprintf("Method 2 (suggested) with Alessandro fix: %f seconds\n",method3);
fprintf("Method 2 (suggested) with Croucher mod: %f seconds\n",method4);
fprintf("Croucher loop : %f seconds\n",method5)
Method 1 : 0.344052 seconds
Method 2 (vectorized) : 0.568002 seconds
Method 2 (suggested) with Alessandro fix: 0.361932 seconds
Method 2 (suggested) with Croucher mod: 0.333267 seconds
Croucher loop : 0.303311 seconds
So, here my final version wins, but it turns out that this seems to be machine-dependent. I ran this code on my M2 Mac and got the following:
Method 1 : 0.170735 seconds
Method 2 (vectorized) : 0.164271 seconds
Method 2 (suggested) with Alessandro fix: 0.152090 seconds
Method 2 (suggested) with Croucher mod: 0.140408 seconds
Croucher loop : 0.142356 seconds
The final vectorised version is very slightly but consistently faster than 'Croucher loop'. The Mac is also faster than any of my other personal machines for this problem.

Mike Croucher
Mike Croucher on 6 Dec 2024 at 0:13
Another option, if you have the hardware, is to use a GPU. Here, you want to do as much vectorisation as you can. I'm using my desktop now, which has a slower CPU than whatever is used when we run code in the forums, but quite a nice GPU (an NVIDIA RTX 3070).
Here's the code
n_d = 51;
n_a = 601;
n_aprime = n_a;
n_z = 7;
beta = 0.98;
rng("default")
% Fake data
RetMat = rand(n_d,n_aprime,n_a,n_z);
V_next = rand(n_aprime,n_z);
pi_z = rand(n_z,n_z);
pi_z = pi_z./sum(pi_z,2);
RetMat = reshape(RetMat,[n_d*n_aprime,n_a,n_z]); %(d*a',a,z)
%% CPU version
tic
% Initialize output
V0 = zeros(n_a,n_z);
Policy1 = ones(n_a,n_z);
for z_c=1:n_z
    EVz = V_next*pi_z(z_c,:)'; %(aprime,1)
    %EVz_big = kron(EVz,ones(n_d,1));
    EVz_big = repelem(beta*EVz,n_d);
    for a_c=1:n_a
        [max_val,max_ind] = max(RetMat(:,a_c,z_c) + EVz_big); %(d*a',1)
        %[d_opt,aprime_opt] = ind2sub([n_d,n_aprime],max_ind);
        V0(a_c,z_c) = max_val;
        Policy1(a_c,z_c)=max_ind;
    end
end
Policy1_all = zeros(2,n_a,n_z);
[d_opt,aprime_opt] = ind2sub([n_d,n_aprime],Policy1);
Policy1_all(1,:,:) = d_opt;
Policy1_all(2,:,:) = aprime_opt;
cpuSpeed = toc;
%% GPU version including transfers
dev = gpuDevice();
tic
V_nextGPU = gpuArray(V_next);
pi_zGPU = gpuArray(pi_z);
RetMatGPU = gpuArray(RetMat);
% Initialize output
GPU_V1 = zeros(n_a,n_z,"gpuArray");
Policy1 = ones(n_a,n_z,"gpuArray");
for z_c=1:n_z
    EVz = V_nextGPU*pi_zGPU(z_c,:)';
    EVz_big = repelem(EVz,n_d);
    entireRHS = RetMatGPU(:,:,z_c) + beta*EVz_big;
    [max_val,max_ind] = max(entireRHS,[],1); %(1,a)
    GPU_V1(:,z_c) = max_val;
    Policy1(:,z_c)=max_ind;
end
Policy1_all = zeros(2,n_a,n_z,"gpuArray");
[d_opt,aprime_opt] = ind2sub([n_d,n_aprime],Policy1);
Policy1_all(1,:,:) = d_opt;
Policy1_all(2,:,:) = aprime_opt;
wait(dev); % Need this to make sure the GPU has finished
gpuWithTransfer = toc;
%% GPU version excluding transfers
dev = gpuDevice();
V_nextGPU = gpuArray(V_next);
pi_zGPU = gpuArray(pi_z);
RetMatGPU = gpuArray(RetMat);
tic
% Initialize output
GPU_V1 = zeros(n_a,n_z,"gpuArray");
Policy1 = ones(n_a,n_z,"gpuArray");
for z_c=1:n_z
    EVz = V_nextGPU*pi_zGPU(z_c,:)'; %(aprime,1)
    EVz_big = repelem(EVz,n_d);
    entireRHS = RetMatGPU(:,:,z_c) + beta*EVz_big;
    [max_val,max_ind] = max(entireRHS,[],1); %(1,a)
    GPU_V1(:,z_c) = max_val;
    Policy1(:,z_c)=max_ind;
end
Policy1_all = zeros(2,n_a,n_z,"gpuArray");
[d_opt,aprime_opt] = ind2sub([n_d,n_aprime],Policy1);
Policy1_all(1,:,:) = d_opt;
Policy1_all(2,:,:) = aprime_opt;
wait(dev); % Need this to make sure the GPU has finished
gpuNoTransfer = toc;
fprintf("CPU version : %f seconds\n",cpuSpeed)
fprintf("GPU including transfer time to the GPU: %f seconds \n",gpuWithTransfer)
fprintf("GPU assuming data starts on GPU: %f seconds \n",gpuNoTransfer)
err = max(abs(GPU_V1-V0),[],"all");
fprintf("Maximum error is %e\n",err)
On my machine I get
>> speedThingGPU
CPU version : 0.395375 seconds
GPU including transfer time to the GPU: 0.349120 seconds
GPU assuming data starts on GPU: 0.010866 seconds
Maximum error is 2.220446e-16
On my machine, the CPU and GPU versions are about the same speed if you have to include the time taken to transfer the data to the GPU. I do this in the most boneheaded way possible:
V_nextGPU = gpuArray(V_next);
pi_zGPU = gpuArray(pi_z);
RetMatGPU = gpuArray(RetMat);
This is what takes most of the time for this calculation. However, this is deliberately unfair to the GPU, to demonstrate the point. You don't include the creation time of those matrices in any of your other versions, and we could easily have created everything directly on the GPU if we wanted. For example,
RetMatGPU = rand(n_d,n_aprime,n_a,n_z,"gpuArray");
Random number generation is much faster on the GPU than the CPU. If we did include creation time in BOTH CPU and GPU versions, the GPU would win overall.
So, if you can construct your starting matrices on the GPU, you will probably win. If you have to transfer large matrices to the GPU, things are not looking so good.
2 comments
Alessandro
Alessandro on 6 Dec 2024 at 17:32
Edited: Alessandro on 6 Dec 2024 at 17:32
Thanks a lot for your detailed answer! The GPU method works very well on my machine too. An interesting point I have learnt from your and others' answers is that on the CPU, vectorizing (i.e. eliminating loops with array-based instructions) does not always improve speed, while on the GPU the more you vectorize the better (subject to total memory limitations, of course). Is it because MATLAB has become good with loops thanks to the JIT (and the JIT happens only on the CPU)?
Joss Knight
Joss Knight on 7 Dec 2024 at 16:01
The JIT is almost certainly part of the answer, as well as some of the other things mentioned in the thread. The GPU also does some JITting, but it doesn't (yet) know how to vectorize loops. Remember that all the GPU's performance comes from data parallelism, so all a JIT would be doing is working out how to write your loops as a single vectorized expression. So it's unlikely it would be able to do better than you can manually.

Sign in to comment.


Umar
Umar on 5 Dec 2024 at 4:01

Hi @Alessandro,

Please see “Recommendations for Performance Improvement” below

1. Avoid Large Temporary Arrays: Instead of using `repelem`, consider avoiding temporary arrays altogether. You can directly compute values without expanding them into larger matrices.

2. Use Logical Indexing: If applicable, replace some array operations with logical indexing or direct computations that avoid unnecessary replication of data.

3. Preallocate Arrays Wisely: Ensure that all arrays are preallocated appropriately without excessive resizing during iterations.

4. Parallel Processing: If your MATLAB version supports it, consider using parallel processing capabilities (like `parfor`) to distribute computations across multiple cores.

5. Profile the Code: Use MATLAB's built-in profiler (`profile on`, `profile viewer`) to identify bottlenecks in your code execution and focus optimization efforts on those areas.

6. Consider Built-in Functions: Some operations might have optimized built-in functions that can perform better than custom implementations.
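For point 5 above, a minimal sketch of driving the profiler (the script name myMWE is hypothetical; substitute your own):

```matlab
% Profile a script run and inspect where the time goes.
profile on
myMWE            % run the code to be measured
profile viewer   % opens a report with time spent per function/line
profile off
```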

Here's a modified version of Method 2 that addresses some of these issues:

disp('Optimized Method 2')
tic
V2 = zeros(n_a, n_z);
Policy2 = zeros(n_a, n_z); % Changed to zeros instead of ones
for z_c = 1:n_z
    EVz = V_next * pi_z(z_c,:)';
    % Directly calculate entireRHS without expanding
    for a_c = 1:n_a
        entireRHS = RetMat(:, a_c, z_c) + beta * EVz;
        [max_val, max_ind] = max(entireRHS);
        V2(a_c, z_c) = max_val;
        Policy2(a_c, z_c) = max_ind;
    end
end
% Reshape policy outputs as before
Policy2_all = zeros(2, n_a, n_z);
[d_opt, aprime_opt] = ind2sub([n_d, n_aprime], Policy2);
Policy2_all(1,:,:) = d_opt;
Policy2_all(2,:,:) = aprime_opt;
toc

After implementing changes, it's crucial to profile the new version against both original methods to quantify performance improvements. Investigate other forms of vectorization such as `bsxfun` or implicit expansion features in newer MATLAB versions that might offer better performance with less memory overhead.
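As a small sketch of the `bsxfun` vs. implicit-expansion point (my own example; implicit expansion is available in R2016b and later):

```matlab
% Implicit expansion and bsxfun compute the same elementwise sum:
A = rand(5,3);
v = rand(5,1);
B1 = A + v;               % v is implicitly expanded along dimension 2
B2 = bsxfun(@plus, A, v); % equivalent call for older releases
isequal(B1, B2)           % true
```

In recent releases the plain `A + v` form is preferred; `bsxfun` remains for compatibility.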

By following these strategies and continually iterating on your approach based on profiling results, you should be able to enhance the performance of your MATLAB code effectively.

2 comments
Alessandro
Alessandro on 5 Dec 2024 at 13:18
Edited: Alessandro on 5 Dec 2024 at 13:24
Thanks for your answer, but your code generates an error:
```
Optimized Method 2
Arrays have incompatible sizes for this operation.
Error in idea (line 75)
entireRHS = RetMat(:, a_c, z_c) + beta * EVz;
^^^^^^^^^
Related documentation
```
The reason is that the array EVz has size [n_a,1] while RetMat(:, a_c, z_c) has size [n_d*n_a,1]. Your suggestion is to avoid using repelem, but that is precisely why I was using it: you have to make EVz conformable with RetMat.
By the way, I put a bit of effort into creating a minimum working example to test modifications and improvements to the code. Whenever you come up with a modification, you can run it and see (1) whether it gives an error and (2) whether the results are still correct (the two err variables at the end of the MWE must both equal zero).
Umar
Umar on 6 Dec 2024 at 3:00

Hi @Alessandro ,

To address the issues you're facing with your MATLAB code, I took care of the errors you reported and created a full, updated example that incorporates the optimization strategies while ensuring that array sizes are compatible. The goal was to enhance performance without generating errors. Below, I define all necessary parameters, using synthetic data to ensure the code is functional and correct.

% Define synthetic data
n_a = 5;       % Number of actions
n_z = 3;       % Number of states
n_d = 4;       % Number of dimensions in RetMat
beta = 0.95;   % Discount factor
% Generate synthetic RetMat
RetMat = rand(n_d * n_a, n_z); % Random returns matrix
pi_z = rand(n_z, n_z);         % Random policy matrix (transition probabilities)
V_next = rand(n_a, n_z);       % Expected value for next period
% Initialize value and policy matrices
V2 = zeros(n_a, n_z);
Policy2 = zeros(n_a, n_z);
% Optimization Process
disp('Optimized Method 2')
tic
for z_c = 1:n_z
    EVz = V_next(:, z_c);  % Keep EVz as [n_a, 1]
    for a_c = 1:n_a
        entireRHS = RetMat((a_c-1)*n_d + (1:n_d), z_c) + beta * EVz(a_c);
        [max_val, max_ind] = max(entireRHS);
        V2(a_c, z_c) = max_val;
        Policy2(a_c, z_c) = max_ind;
    end
end
% Reshape policy outputs
Policy2_all = zeros(2, n_a, n_z);
[d_opt, aprime_opt] = ind2sub([n_d, n_a], Policy2);
Policy2_all(1,:,:) = d_opt;
Policy2_all(2,:,:) = aprime_opt;
toc
% Display results
disp('Value Function:');
disp(V2);
disp('Policy Indices:');
disp(Policy2);

Please see attached.

Explanations of the modifications in the code above:

1. Array Size Compatibility: The original issue arose because `EVz` was not conformable with `RetMat`. In this update, I ensured that `EVz` is extracted correctly as a column vector corresponding to the current action `a_c`. The indexing of `RetMat` was adjusted to ensure it aligns properly with the number of dimensions.

2. Avoiding Large Temporary Arrays: By computing values directly without creating unnecessarily large intermediate arrays (as `repelem` does), we minimize memory usage and improve computational efficiency.

3. Logical Indexing: The revised indexing within the loops ensures that we avoid unnecessary data replication.

4. Preallocation: We maintain preallocation of `V2` and `Policy2`, which prevents dynamic resizing during iterations.

5. Profiling and Performance Monitoring: After implementing these changes, it is recommended to use MATLAB's profiler (`profile on`, `profile viewer`) to assess performance improvements and identify any remaining bottlenecks.

By following these modifications and employing profiling tools, you should be able to significantly enhance the performance of your MATLAB code while avoiding errors associated with array size mismatches.

Sign in to comment.

Categories

Find more on Get Started with GPU Coder in Help Center and File Exchange

Version

R2024b
