Speeding up nested for-loops when vectorization seems to fail

Question

Adam Pigon le 13 Fév 2020

1
Lien

Utiliser le lien direct vers cette question

https://fr.mathworks.com/matlabcentral/answers/505402-speeding-up-nested-for-loops-when-vectorization-seems-to-fail

Commenté : Adam Pigon le 19 Fév 2020

Dear All, I have the following problem.

There are relatively simple operations of barycentric interpolation in a large, multi-dimensional array, which leads to long computing times. The code is given as follows:

clear all;
clc;
% auxilliary matrices and vectors
na = 100;
nm = 100;
nL = 10;
a  = 1:na;
m  = 1:nm;
[ap_grid, mp_grid] = ndgrid(a,m);
[a_grid, m_grid]   = ndgrid(a,m);
A                  = randn(100,na,nm,2,2,3); 
M                  = randn(100,na,nm,2,2,3);
AP_int1            = zeros(100,na,nm,2,2,3,nL);
MP_int1            = zeros(100,na,nm,2,2,3,nL);
AP_int2            = zeros(100,na,nm,2,2,3,nL);
MP_int2            = zeros(100,na,nm,2,2,3,nL);
a_gridVEC          = repmat( permute( a_grid, [3 1 2 4 5 6] ), 100,1,1,2,2,3);  
m_gridVEC          = repmat( permute( m_grid, [3 1 2 4 5 6] ), 100,1,1,2,2,3);  
% looped computations
tic
for q1 = 1:100
    for q2 = 1:2
        for q3 = 1:2
            for q4 = 1:3
                
                for r1 = 1:nL % a time consuming loop
                
                ap1 = ap_grid(1,1); % in the real situation points ap and mp change every iteration
                ap2 = ap_grid(1,2); 
                ap3 = ap_grid(2,1); 
                mp1 = mp_grid(1,1); 
                mp2 = mp_grid(1,2); 
                mp3 = mp_grid(2,1); 
                
                a1 = A(q1,1,1,q2,q3,q4); % in the real situation the 2nd and 3rd indices of A and M change every iteration
                a2 = A(q1,1,2,q2,q3,q4); 
                a3 = A(q1,2,1,q2,q3,q4); 
                m1 = M(q1,1,1,q2,q3,q4); 
                m2 = M(q1,1,2,q2,q3,q4); 
                m3 = M(q1,2,1,q2,q3,q4);
                
                for r2 = 1:na 
                    for r3 = 1:nm
                    a_int1 = a_grid(r2,r3); 
                    m_int1 = m_grid(r2,r3);
                
                    % barycentric interpolation (weights)
                    w1 = ( (m2-m3)*(a_int1-a3) + (a3-a2)*(m_int1-m3) ) / ( (m2-m3)*(a1-a3) + (a3-a2)*(m1-m3) );
                    w2 = ( (m3-m1)*(a_int1-a3) + (a1-a3)*(m_int1-m3) ) / ( (m2-m3)*(a1-a3) + (a3-a2)*(m1-m3) );
                    w3 = 1 - w1 - w2;
                    
                    % barycentric interpolation (results)
                    if w1 < -0.25 || w2 < -0.25 || w3 < -0.25 % we allow for only 'a bit' of extrapolation
                    AP_int2(q1,r2,r3,q2,q3,q4,r1) = NaN; 
                    MP_int2(q1,r2,r3,q2,q3,q4,r1) = NaN;
                    else
                    AP_int2(q1,r2,r3,q2,q3,q4,r1) = w1 * ap1 + w2 * ap2 + w3 * ap3; 
                    MP_int2(q1,r2,r3,q2,q3,q4,r1) = w1 * mp1 + w2 * mp2 + w3 * mp3;
                    end
                    end % loop with nm iterations
                end % loop with na iterations
                end % a time consuming loop
                
            end
        end
    end
end
toc
% vectorized computations
tic               
for r1 = 1:nL % a time consuming loop (in the real situation points ap,mp,a and m change every iteration of this loop)
ap1 = repmat( ap_grid(1,1), 100,na,nm,2,2,3); 
ap2 = repmat( ap_grid(1,2), 100,na,nm,2,2,3);  
ap3 = repmat( ap_grid(2,1), 100,na,nm,2,2,3);  
mp1 = repmat( mp_grid(1,1), 100,na,nm,2,2,3);  
mp2 = repmat( mp_grid(1,2), 100,na,nm,2,2,3);  
mp3 = repmat( mp_grid(2,1), 100,na,nm,2,2,3);  
a1 = repmat( A(:,1,1,:,:,:), 1,na,nm,1,1,1);
a2 = repmat( A(:,1,2,:,:,:), 1,na,nm,1,1,1);
a3 = repmat( A(:,2,1,:,:,:), 1,na,nm,1,1,1); 
m1 = repmat( M(:,1,1,:,:,:), 1,na,nm,1,1,1); 
m2 = repmat( M(:,1,2,:,:,:), 1,na,nm,1,1,1); 
m3 = repmat( M(:,2,1,:,:,:), 1,na,nm,1,1,1);
% barycentric interpolations (weights)
w1 = ( (m2-m3).*(a_gridVEC-a3) + (a3-a2).*(m_gridVEC-m3) ) ./ ( (m2-m3).*(a1-a3) + (a3-a2).*(m1-m3) );
w2 = ( (m3-m1).*(a_gridVEC-a3) + (a1-a3).*(m_gridVEC-m3) ) ./ ( (m2-m3).*(a1-a3) + (a3-a2).*(m1-m3) );
w3 = 1 - w1 - w2;
w1(w1<-0.25) = NaN; % we allow for only 'a bit' of extrapolation
w2(w2<-0.25) = NaN;
w3(w3<-0.25) = NaN;
  
% barycentric interpolations (results)
AP_int1(:,:,:,:,:,:,r1) = w1 .* ap1 + w2 .* ap2 + w3 .* ap3;
MP_int1(:,:,:,:,:,:,r1) = w1 .* mp1 + w2 .* mp2 + w3 .* mp3;
                
end    
toc

Unfortunately, vectorization of these looped operations makes computational times only worse, contradicting the common Matlab knowledge that vectorization always helps. Why is it so? Is there any other way of speeding up the code? The parfor loop also slows down the code. Would it help to write this piece of code in C++ and call it from Matlab?

3 commentaires
Afficher 1 commentaire plus ancienMasquer 1 commentaire plus ancien

Stephen23 le 14 Fév 2020

Modifié(e) : Stephen23 le 14 Fév 2020

"...contradicting the common Matlab knowledge that vectorization always helps."

It doesn't.

"Why is it so?"

Because that "common Matlab knowledge" is incorrect.

Vectorized code is often much faster than using loops, because the underlying operations can be performed very efficiently using highly optimized code (usually written in C). But not always faster, as you write. That is simply incorrect. Particularly in cases where the vectorized code generates large intermediate arrays, vectorized code can easily be much slower than looped code. The real "common Matlab knowledge" requires the user to consider memory consumption, array allocation, etc. and if it really is critical, measure using the profiler, etc.

Whoever told you that "vectorization always helps" does not understand MATLAB.

Subhadeep Koley le 14 Fév 2020

@ Stephen Cobeldick +1 Very nice explanation.

Connectez-vous pour commenter.

Connectez-vous pour répondre à cette question.

Answer 1

Matt J le 14 Fév 2020

0
Lien

Utiliser le lien direct vers cette réponse

https://fr.mathworks.com/matlabcentral/answers/505402-speeding-up-nested-for-loops-when-vectorization-seems-to-fail#answer_415563

Modifié(e) : Matt J le 14 Fév 2020

Ouvrir dans MATLAB Online

Getting rid of repmat (requires R2016b or later) and working with single float precision will get you some speed-up.

In doubles:

Elapsed time is 5.396878 seconds.
Elapsed time is 4.874902 seconds.

In singles:

Elapsed time is 3.312500 seconds.
Elapsed time is 2.872178 seconds.

% auxilliary matrices and vectors
na = 100;
nm = 100;
nL = 10;
a  = 1:na;
m  = 1:nm;
[ap_grid, mp_grid] = ndgrid(a,m);
[a_grid, m_grid]   = ndgrid(a,m);
A                  = randn(100,na,nm,2,2,3); 
M                  = randn(100,na,nm,2,2,3);
AP_int1            = zeros(100,na,nm,2,2,3,nL);
MP_int1            = zeros(100,na,nm,2,2,3,nL);
AP_int2            = zeros(100,na,nm,2,2,3,nL);
MP_int2            = zeros(100,na,nm,2,2,3,nL);
a_gridVEC          = repmat( permute( a_grid, [3 1 2 4 5 6] ), 100,1,1,2,2,3);  
m_gridVEC          = repmat( permute( m_grid, [3 1 2 4 5 6] ), 100,1,1,2,2,3);   
C=cellfun(@single,{A,M,AP_int1,MP_int1 , AP_int2, MP_int2, a_gridVEC,m_gridVEC,ap_grid, mp_grid, a_grid, m_grid},'uni',0);
[A,M,AP_int1,MP_int1 , AP_int2, MP_int2, a_gridVEC,m_gridVEC,ap_grid, mp_grid, a_grid, m_grid]=deal(C{:});
% looped computations
tic
for q1 = 1:100
    for q2 = 1:2
        for q3 = 1:2
            for q4 = 1:3
                
                for r1 = 1:nL % a time consuming loop
                
                ap1 = ap_grid(1,1); % in the real situation points ap and mp change every iteration
                ap2 = ap_grid(1,2); 
                ap3 = ap_grid(2,1); 
                mp1 = mp_grid(1,1); 
                mp2 = mp_grid(1,2); 
                mp3 = mp_grid(2,1); 
                
                a1 = A(q1,1,1,q2,q3,q4); % in the real situation the 2nd and 3rd indices of A and M change every iteration
                a2 = A(q1,1,2,q2,q3,q4); 
                a3 = A(q1,2,1,q2,q3,q4); 
                m1 = M(q1,1,1,q2,q3,q4); 
                m2 = M(q1,1,2,q2,q3,q4); 
                m3 = M(q1,2,1,q2,q3,q4);
                
                for r2 = 1:na 
                    for r3 = 1:nm
                    a_int1 = a_grid(r2,r3); 
                    m_int1 = m_grid(r2,r3);
                
                    % barycentric interpolation (weights)
                    w1 = ( (m2-m3)*(a_int1-a3) + (a3-a2)*(m_int1-m3) ) / ( (m2-m3)*(a1-a3) + (a3-a2)*(m1-m3) );
                    w2 = ( (m3-m1)*(a_int1-a3) + (a1-a3)*(m_int1-m3) ) / ( (m2-m3)*(a1-a3) + (a3-a2)*(m1-m3) );
                    w3 = 1 - w1 - w2;
                    
                    % barycentric interpolation (results)
                    if w1 < -0.25 || w2 < -0.25 || w3 < -0.25 % we allow for only 'a bit' of extrapolation
                    AP_int2(q1,r2,r3,q2,q3,q4,r1) = NaN; 
                    MP_int2(q1,r2,r3,q2,q3,q4,r1) = NaN;
                    else
                    AP_int2(q1,r2,r3,q2,q3,q4,r1) = w1 * ap1 + w2 * ap2 + w3 * ap3; 
                    MP_int2(q1,r2,r3,q2,q3,q4,r1) = w1 * mp1 + w2 * mp2 + w3 * mp3;
                    end
                    end % loop with nm iterations
                end % loop with na iterations
                end % a time consuming loop
                
            end
        end
    end
end
toc
% vectorized computations
tic               
for r1 = 1:nL % a time consuming loop (in the real situation points ap,mp,a and m change every iteration of this loop)
ap1 = ( ap_grid(1,1)); 
ap2 = ( ap_grid(1,2));  
ap3 = ( ap_grid(2,1));  
mp1 = ( mp_grid(1,1));  
mp2 = ( mp_grid(1,2));  
mp3 = ( mp_grid(2,1));  
a1 = ( A(:,1,1,:,:,:));
a2 = ( A(:,1,2,:,:,:));
a3 = ( A(:,2,1,:,:,:)); 
m1 = ( M(:,1,1,:,:,:)); 
m2 = ( M(:,1,2,:,:,:)); 
m3 = ( M(:,2,1,:,:,:));
% barycentric interpolations (weights)
w1 = ( (m2-m3).*(a_gridVEC-a3) + (a3-a2).*(m_gridVEC-m3) ) ./ ( (m2-m3).*(a1-a3) + (a3-a2).*(m1-m3) );
w2 = ( (m3-m1).*(a_gridVEC-a3) + (a1-a3).*(m_gridVEC-m3) ) ./ ( (m2-m3).*(a1-a3) + (a3-a2).*(m1-m3) );
w3 = 1 - w1 - w2;
w1(w1<-0.25) = NaN; % we allow for only 'a bit' of extrapolation
w2(w2<-0.25) = NaN;
w3(w3<-0.25) = NaN;
  
% barycentric interpolations (results)
AP_int1(:,:,:,:,:,:,r1) = w1 .* ap1 + w2 .* ap2 + w3 .* ap3;
MP_int1(:,:,:,:,:,:,r1) = w1 .* mp1 + w2 .* mp2 + w3 .* mp3;
                
end    
toc

2 commentaires
Afficher AucuneMasquer Aucune

Matt J le 14 Fév 2020

Modifié(e) : Matt J le 14 Fév 2020

Ouvrir dans MATLAB Online

Using the gpuArrays would also help the vectorized version even further. The following was using the GeForce GTX 1050,

na = 50;
nm = 50;
nL = 10;
....
Elapsed time is 1.304855 seconds.  %double floats on CPU
Elapsed time is 0.521651 seconds.  %double floats on GPU
Elapsed time is 0.785172 seconds.  %single floats on CPU
Elapsed time is 0.278735 seconds.  %single floats on GPU

Adam Pigon le 19 Fév 2020

Thank you for your answer. I can only add that experimenting with the loop ordering (matrix column vs row indexing) and vectorizing only certain parts of the problem can speed up the code a bit.

Connectez-vous pour commenter.

Answer 2

Matt J le 14 Fév 2020

0
Lien

Utiliser le lien direct vers cette réponse

https://fr.mathworks.com/matlabcentral/answers/505402-speeding-up-nested-for-loops-when-vectorization-seems-to-fail#answer_415559

Modifié(e) : Matt J le 14 Fév 2020

I tend to think you should be using scatteredInterpolant rather than implementing your own interpolation routine with loops.

1 commentaire
Afficher -1 commentaires plus anciensMasquer -1 commentaires plus anciens

Adam Pigon le 19 Fév 2020

Although scatteredInterpolant could be a solution in such problems, in this very case it is not. Interpolation serves here as an approximation for solving equations. As long as these equations have precisely one solution the scatteredInterpolant works well. Unfortunately, it stops working when we have multiple soluitions, which is a case in my problem.

Connectez-vous pour commenter.

Speeding up nested for-loops when vectorization seems to fail

3 commentaires
Afficher 1 commentaire plus ancienMasquer 1 commentaire plus ancien

Réponse acceptée

2 commentaires
Afficher AucuneMasquer Aucune

Plus de réponses (1)

1 commentaire
Afficher -1 commentaires plus anciensMasquer -1 commentaires plus anciens

Voir également

Catégories

Tags

Produits

Community Treasure Hunt

Speeding up nested for-loops when vectorization seems to fail

3 commentaires Afficher 1 commentaire plus ancienMasquer 1 commentaire plus ancien

Réponse acceptée

2 commentaires Afficher AucuneMasquer Aucune

Plus de réponses (1)

1 commentaire Afficher -1 commentaires plus anciensMasquer -1 commentaires plus anciens

Voir également

Catégories

Tags

Produits

Community Treasure Hunt

3 commentaires
Afficher 1 commentaire plus ancienMasquer 1 commentaire plus ancien

2 commentaires
Afficher AucuneMasquer Aucune

1 commentaire
Afficher -1 commentaires plus anciensMasquer -1 commentaires plus anciens