PARFOR loop is too slow

1 vue (au cours des 30 derniers jours)
rambiz
rambiz le 25 Juin 2013
Hi all, this is the code I wrote to simulate 8PSK in Rayleigh Channel. It is working and gives the correct results but it is too slow, compared to the case when I use an ordinary for-loop and run the code on only one core. Right now I run the "parfor-loop-code" on 4 cores. Is it because, I use too much memory for the all the {k}-indexed variables? How can I avoid this problem? TYVM
% 1 Transmit Antenna
% Nr Receive Antenna
% 3 bits per Transmission
clc;
clear all;
Nt = 1; % Tx antennas
Nr = 1; % Rx antennas
Ebit_N0_dB =0;
Es_N0_dec =1/Nr*3*10.^(Ebit_N0_dB/10); %Multiply by 3 because Symbol Energy is trice as much as Bit Energy for 3-Bit-Transmission
%divide by 1/Nr to normalize the received signal energy
%in other words every receiver should receive 1/Nr of the total energy received at the receiver side
BER = zeros(1,length(Ebit_N0_dB));
symbol_errors = zeros(1,length(Ebit_N0_dB));
bit_errors = zeros(1,length(Ebit_N0_dB));
% Defining the Gray Coded Constellation
demap =[0 0 0;0 0 1;0 1 0;0 1 1;1 0 0;1 0 1;1 1 0;1 1 1];
% 8PSK Symbols
S000=1;
S001=exp(1j*1*pi/4);
S010=exp(1j*3*pi/4);
S011=exp(1j*2*pi/4);
S100=exp(1j*7*pi/4);
S101=exp(1j*6*pi/4);
S110=exp(1j*4*pi/4);
S111=exp(1j*5*pi/4);
N=10^6;
symbols=0;
xjq=cell(1,N);
argmin=cell(1,N);
tic
parfor K=1:N
symbols=symbols+1
bin_ip{K} = round(rand(1,3)); % Generate Binary Input
%-------------------------------
if bin_ip{K}==[0 0 0]
xjq{K}=S000
elseif bin_ip{K}==[0 0 1]
xjq{K}=S001
elseif bin_ip{K}==[0 1 0]
xjq{K}=S010
elseif bin_ip{K}==[0 1 1]
xjq{K}=S011
elseif bin_ip{K}==[1 0 0]
xjq{K}=S100
elseif bin_ip{K}==[1 0 1]
xjq{K}=S101
elseif bin_ip{K}==[1 1 0]
xjq{K}=S110
elseif bin_ip{K}==[1 1 1]
xjq{K}=S111
end
n{K} = (1/sqrt(2))*(randn(Nr,1) + 1i*randn(Nr,1)); % white Gaussian Noise with variance of 1
%==================================================================
%Generate Channel Matrix
%==================================================================
H{K}=(1/sqrt(2)*( randn(Nr,Nt)+1i*randn(Nr,Nt)));
y{K} = sqrt(Es_N0_dec)*H{K}*xjq{K} + n{K}; %=================================================================
%Optimal Detection Scheme
=================================================================
p = sqrt(Es_N0_dec); %8PSK
y000{K}=H{K}*p*S000; argmin{K}(1)=norm(y{K}-y000{K});
y001{K}=H{K}*p*S001; argmin{K}(2)=norm(y{K}-y001{K});
y010{K}=H{K}*p*S010; argmin{K}(3)=norm(y{K}-y010{K});
y011{K}=H{K}*p*S011; argmin{K}(4)=norm(y{K}-y011{K});
y100{K}=H{K}*p*S100; argmin{K}(5)=norm(y{K}-y100{K});
y101{K}=H{K}*p*S101; argmin{K}(6)=norm(y{K}-y101{K});
y110{K}=H{K}*p*S110; argmin{K}(7)=norm(y{K}-y110{K});
y111{K}=H{K}*p*S111; argmin{K}(8)=norm(y{K}-y111{K});
[~,est(K)] = min(argmin{K});
est_bin{K} = demap(est(K),:);
%Count bit errors
bit_errors = bit_errors + length(find( est_bin{K}~=bin_ip{K}));
end
toc
BER = bit_errors/(symbols*3)

Réponses (2)

Matt J
Matt J le 25 Juin 2013
Modifié(e) : Matt J le 25 Juin 2013
You have a lot of on-the-fly memory allocation, which probably isn't helping things. H{k},y{k}, y000{k},etc... are not pre-allocated and get generated element-by-element in the loop. The contents of each argmin{K}, apparently a length 8 vector, are also not pre-allocated. Finally, it doesn't seem to make much sense holding these things in cell arrays. It looks like they all contain same-sized quantities and could be held instead in numeric arrays.
As another miscellaneous note, you can make things faster by skipping calls to find(),
bit_errors = bit_errors + sum( est_bin{K}~=bin_ip{K} );
  3 commentaires
Matt J
Matt J le 25 Juin 2013
sum(.) is definitely slower!
No way.
x=rand(1e8,1)>.5;
tic;
sum(x);
toc
%Elapsed time is 0.053452 seconds.
tic;
length(find(x));
toc
%Elapsed time is 0.904927 seconds.
rambiz
rambiz le 21 Juil 2013
Modifié(e) : rambiz le 21 Juil 2013
Hi guys, tyvm for your comments so far. OK, I changed the original code, taking Matt's suggestions into consideration, i.e. I preallocated memory and used normal arrays instead of cells. The code runs in about 105 seconds (for 10^5 runs) and it is still very slow. Using sum(.) or length(find(.)) doesn't change the speed considerably. Any new comments, please?!
% Nt Transmit Antenna
% Nr Receive Antenna
% 3 bits per Transmission
% clc;
clear all;
tic
Nt = 1; % Tx antennas
Nr = 1; % Rx antennas
Ebit_N0_dB =0;
Es_N0_dec =1/Nr*3*10.^(Ebit_N0_dB/10); %Multiply by 3 because Symbol Energy is trice as much as Bit Energy for 3-Bit-Transmission
%divide by 1/Nr to normalize the received signal energy
%in other words every receiver should receive 1/Nr of the total energy received at the receiver side
BER = zeros(1,length(Ebit_N0_dB));
symbol_errors = zeros(1,length(Ebit_N0_dB));
bit_errors = zeros(1,length(Ebit_N0_dB));
% symbols = zeros(1,length(Ebit_N0_dB));
% Defining the Gray Coded Constellation
demap =[0 0 0;0 0 1;0 1 0;0 1 1;1 0 0;1 0 1;1 1 0;1 1 1];
% % 8PSK Symbols
S000=1;
S001=exp(1j*1*pi/4);
S010=exp(1j*3*pi/4);
S011=exp(1j*2*pi/4);
S100=exp(1j*7*pi/4);
S101=exp(1j*6*pi/4);
S110=exp(1j*4*pi/4);
S111=exp(1j*5*pi/4);
N=5*10^5;
symbols=0;
bin_ip=round(rand(N,3));
xjq=zeros(1,N);
n= (1/sqrt(2))*(randn(Nr,N) + 1i*randn(Nr,N));
H=(1/sqrt(2)*( randn(Nr,Nt,N)+1i*randn(Nr,Nt,N)));
y=zeros(Nr,N);
y000=zeros(Nr,N);
y001=zeros(Nr,N);
y010=zeros(Nr,N);
y011=zeros(Nr,N);
y100=zeros(Nr,N);
y101=zeros(Nr,N);
y110=zeros(Nr,N);
y111=zeros(Nr,N);
argmin=zeros(8,N);
parfor K=1:N
symbols=symbols+1;
%-------------------------------
if bin_ip(K,:)==[0 0 0]
xjq(K)=S000;
elseif bin_ip(K,:)==[0 0 1]
xjq(K)=S001;
elseif bin_ip(K,:)==[0 1 0]
xjq(K)=S010;
elseif bin_ip(K,:)==[0 1 1]
xjq(K)=S011;
elseif bin_ip(K,:)==[1 0 0]
xjq(K)=S100;
elseif bin_ip(K,:)==[1 0 1]
xjq(K)=S101;
elseif bin_ip(K,:)==[1 1 0]
xjq(K)=S110;
elseif bin_ip(K,:)==[1 1 1]
xjq(K)=S111;
end
y(:,K) = sqrt(Es_N0_dec)*H(:,:,K)*xjq(K) + n(:,K); %8PSK
%Optimal Detection Scheme
% =============================================================
p = sqrt(Es_N0_dec); %8PSK
y000(:,K)=H(:,:,K)*p*S000;
y001(:,K)=H(:,:,K)*p*S001;
y010(:,K)=H(:,:,K)*p*S010;
y011(:,K)=H(:,:,K)*p*S011;
y100(:,K)=H(:,:,K)*p*S100;
y101(:,K)=H(:,:,K)*p*S101;
y110(:,K)=H(:,:,K)*p*S110;
y111(:,K)=H(:,:,K)*p*S111;
argmin(:,K)=[norm(y(:,K)-y000(:,K));
norm(y(:,K)-y001(:,K));
norm(y(:,K)-y010(:,K));
norm(y(:,K)-y011(:,K));
norm(y(:,K)-y100(:,K));
norm(y(:,K)-y101(:,K));
norm(y(:,K)-y110(:,K));
norm(y(:,K)-y111(:,K))];
[~,est(K)] = min(argmin(:,K));
est_bin{K} = demap(est(K),:);
%Count bit errors
bit_errors = bit_errors + length(find( est_bin{K}~=bin_ip(K,:)));
%bit_errors = bit_errors + sum( est_bin{K}~=bin_ip(K,:));
end
toc
BER = bit_errors/(symbols*3)

Connectez-vous pour commenter.


Muthu Annamalai
Muthu Annamalai le 25 Juin 2013
Modifié(e) : Muthu Annamalai le 25 Juin 2013
Key to using parfor must be the independence of each iteration; i.e. you cannot write variable updates like,
a = a + 1
So change the lines,
bit_errors = bit_errors + length(find( est_bin{K}~=bin_ip{K}));
bit_errors(k) = length(find( est_bin{K}~=bin_ip{K}));
and rewrite
BER = bit_errors/(symbols*3)
as
BER = sum(bit_errors)/(symbols*3)
and remove the line,
symbols=symbols+1
because this variable is going to be N at end of all iterations.
This should give you some speedup.
  2 commentaires
Matt J
Matt J le 25 Juin 2013
you cannot write variable updates like a=a+1
Why not? Don't they qualify as legitimate reduction assignments as described here?
Muthu Annamalai
Muthu Annamalai le 25 Juin 2013
@Matt I probably stand corrected in that case, while still my approach is explicitly parallelizable.

Connectez-vous pour commenter.

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by