Main Content

Accelerate MIMO-OFDM Link Simulation Using GPU

Since R2024a

This example shows how to accelerate a multiple input multiple output (MIMO) orthogonal frequency division multiplexing (OFDM) link simulation with low-density parity-check (LDPC) channel coding using a graphical processing unit (GPU).


The example accelerates the simulation of a MIMO OFDM link by using a GPU. As many functions in MATLAB automatically run on a GPU if you supply a gpuArray (Parallel Computing Toolbox) data argument, you can accelerate the simulation by simply generating the information bits as a gpuArray (Parallel Computing Toolbox). The figure below depicts the simulated link components. The gpuArray (Parallel Computing Toolbox) objects flow through each link component executing its computations on the GPU. Note that using a GPU can accelerate a wireless link simulation if the link components have a high number of computations that can be executed in parallel.


The example compares runtime performance of one point for a BER curve for running on:

  • A single AMD EPYC 7276 CPU worker.

  • A pool of 16 AMD EPYC 7276 CPU workers.

  • NVIDIA RTX A5000 GPU with 24 GB memory.

The timing results show that when each component in the system has a high number of parallelizable computations, the following acceleration factors can be achieved when compared to a single CPU worker:

  • A pool of 16 CPU workers: up to a factor of 11.1x.

  • A single GPU: up to a factor of 32x.

In the next section, you learn how to run the link simulation for one iteration on the GPU. Afterwards, you see how using GPU processing becomes more beneficial when the simulation is set up to execute a large number of calculations in parallel.

Detect GPU Existence

if canUseGPU
    gpuExist = true;
    D = gpuDevice;
    fprintf('GPU detected, %s Device, %d multiprocessors, %s compute capability.\n', ...
        D.Name, D.MultiprocessorCount, D.ComputeCapability);
    gpuExist = false;
    warning(['Could not find an appropriate GPU. ' ...
        'The example will skip all GPU simulations.']);
GPU detected, NVIDIA RTX A5000 Device, 64 multiprocessors, 8.6 compute capability.

Set Simulation Parameters

In a wireless communications link, GPU processing can be useful for simulating a system with many transmit or receive antennas, encoding multiple codewords in parallel, modulating many OFDM symbols or simulating a fading channel with many fading paths. Start with simulating a link with a short LDPC block length of 648 bits, 2 transmit (Tx) antennas, 4 receive (Rx) antennas and 50 codewords per frame. In the later sections, you see how a system with higher parameters achieves higher GPU utilization and a higher acceleration factor.

MIMO & Beamforming Matrix

Set the number of transmitted data streams to 2. Map the data streams to 2 antennas using an analog beamforming matrix with consecutive phase shifters.

numTx = 2;
numRx = 4;
numStreams = 2;
fc = 1e9;
lambda = physconst('LightSpeed')/fc;
beamAngles = 15;
antIdx = 0:numTx-1;
antDelay = 2*pi*sin(2*pi*beamAngles*(0:numStreams-1).'/360)/lambda;
B = single(exp(1i*antIdx.*antDelay));

Channel Coding

Use a 16-QAM modulator and transmit 50 LDPC codewords in each OFDM symbol. Use an LDPC encoder with coding rate 5/6 and block length 648.

numBitsPerCarrier = 4;
numCWPerFrame = 50;
crcCfg = crcConfig;
ldpcEncCfg = ldpcEncoderConfig;
ldpcDecCfg = ldpcDecoderConfig;
numMsgBits = ldpcEncCfg.NumInformationBits-16;
modOrder = 2^numBitsPerCarrier;

OFDM Modulation

Set FFT length to 512, cyclic prefix length to 64 and use 20 null subcarriers for the guard band on each side.

fftLength = 512;
cpLength = fftLength/8;
ofdmNullIdx = [1:19 ( fftLength/2+1) (fftLength-19: fftLength)]'; % Guard bands and DC subcarrier
numDataSC = fftLength-length(ofdmNullIdx);
activeIdx = setdiff(1:fftLength,ofdmNullIdx);
numOFDMSym = ceil(numCWPerFrame * ldpcEncCfg.BlockLength / numBitsPerCarrier / numDataSC/ numStreams);
paddingSymbols = (numDataSC - mod(numCWPerFrame * ldpcEncCfg.BlockLength /numBitsPerCarrier,numDataSC));

Fading Channel and AWGN

Set the energy per bit to noise power ratio (EbNo) to 0.5 dB. Set the subcarrier spacing to 60 KHz and the maximum Doppler shift to 5 Hz. Use a Rayleigh fading channel with 24 fading paths.

EbNodB = 0.5;

snrdB = convertSNR(EbNodB,"ebno", ...
    BitsPerSymbol=numBitsPerCarrier, ...

SCS = 60;
fs = SCS*1e3*fftLength/0.85;
maxDoppler = 5;

pathDelays = [0.0000 0.2099 0.2219 0.2329 0.2176 ...
            0.6366 0.6448 0.6560 0.6584 0.7935 ...
            0.8213 0.9336 1.2285 1.3083 2.1704 ...
            2.7105 4.2589 4.6003 5.4902 5.6077 ...
            6.3065 6.6374 7.0427 8.6523]; % TDL-C delay profile

delaySpread = 10*1e-9; % (ns)

avgPathGains = [-4.4 -1.2 -3.5 -5.2 -2.5 ...
            0.0 -2.2 -3.9 -7.4 -7.1 ...
            -10.7 -11.1 -5.1 -6.8 -8.7 ....
            -13.2 -13.9 -13.9 -15.8 -17.1 ....
            -16.0 -15.7 -21.6 -22.8]; % TDL-C delay profile

mimoChannel = comm.MIMOChannel( ...
    SampleRate=fs, ...
    PathDelays=pathDelays*delaySpread, ...
    AveragePathGains=avgPathGains, ...
    SpatialCorrelationSpecification="None", ...
    NumTransmitAntennas=numTx, ...
    NumReceiveAntennas=numRx, ...
    MaximumDopplerShift=maxDoppler, ...
    FadingTechnique="Sum of sinusoids", ...

Simulate OFDM MIMO Link using GPU

By generating the initial data on the GPU and without making any further modifications to your code, all subsequent computations happen on the GPU. Measure the run time of one instance of the link on the GPU using tic and toc functions.

if gpuExist

tStart = tic;

% Generate information bits
msg = randi([0 1],numMsgBits,numCWPerFrame,"int8","gpuArray");

% Channel coding
crcgen = crcGenerate(msg, crcCfg);

The ldpcEncode function does not support GPU processing, use the gather (Parallel Computing Toolbox) function to move crcgen variable from the GPU memory to the CPU memory. Move the output of ldpcEncode function back to the GPU memory using gpuArray (Parallel Computing Toolbox) function.

encldpc = ldpcEncode(gather(crcgen),ldpcEncCfg);
encldpc = gpuArray(encldpc);

% Modulation
modsymbols = qammod(encldpc, modOrder, ...
    InputType="bit", ...
    UnitAveragePower=true, ...
txSigScaling = (fftLength/sqrt( numDataSC));
ofdmdata = txSigScaling*ofdmmod(reshape([modsymbols(:); ...
    zeros(paddingSymbols,1)],numDataSC,numOFDMSym,numStreams), ...

% Beamforming
txsig = ofdmdata*B;

% Fading channel with AWGN
[channelout,pathgains] = mimoChannel(txsig);
sigpowerdB = 10.*log10((numTx/numStreams)/numRx);
[rxsig, nvar] = awgn(channelout,snrdB,sigpowerdB);
pathfilters = info(mimoChannel).ChannelFilterCoefficients;
timingoffset = info(mimoChannel).ChannelFilterDelay;

% Perfect channel estimation
Hest = ofdmChannelResponse(pathgains,pathfilters,fftLength, ...
H = reshape(Hest,[],numTx,numRx);

% OFDM demodulation
% Handle timing offset before OFDM demodulation:
% Remove initial samples and pad zeros to keep signal length unchanged
shiftedSig = (sqrt(numDataSC) / fftLength)* ...
    [rxsig(timingoffset+1:end,:); zeros(timingoffset, numRx)];
% Sampling offset for OFDM demodulation
symoffset =  cpLength/2;
% The padded zeros are not used if timingOffset <= cpLen/2
dmodofdm = ofdmdemod(shiftedSig,fftLength,cpLength,symoffset,ofdmNullIdx);

% Frequency domain equalization
heffperm = pagemtimes( B, permute(H, [2,3,1]));
heff = permute(heffperm, [3, 1, 2]);
eqsym = ofdmEqualize(dmodofdm,heff,nvar,Algorithm="mmse");
eqsym(end- paddingSymbols+1:end) = [];

% QAM demodulation
dmodqam = qamdemod(reshape(eqsym,ldpcEncCfg.BlockLength/numBitsPerCarrier,[]),...

% Decoding
decldpc = ldpcDecode(dmodqam,ldpcDecCfg,10,Termination="max");
crcdet = crcDetect(decldpc, crcCfg);

To ensure that all the GPU computations are complete, you must call wait (Parallel Computing Toolbox) before calling toc.

% Error rate calculations
numBitErrors = biterr(msg, crcdet)
numBlockErrors = sum(biterr(msg, crcdet,"column-wise")>0)

Note that the gpuArray (Parallel Computing Toolbox) objects passes throughout the link, except for ldpcEncode function, executing every component on the GPU.

Elapsed time is 0.110380 seconds.
numBitErrors =

numBlockErrors =

ans = logical

Accelerate Link Using GPU Versus Parfor-loop

In this section, you simulate the link shown above on the CPU, on a parallel pool of 16 CPU workers and on the GPU. Then you compare the run time, BER and BLER of each simulation. Set maxNumFrames to 200 and maxNumBlockErrors to 100 to run this section quickly. For more meaningful results, set maxNumFrames to at least 1000. Note that the simulation on the CPU might take up to 20 minutes for 1000 frames.

maxNumFrames = 200;
maxNumBlockErrors = 100;

simParam = helperLinkSetup(numTx,numRx,numStreams,numBitsPerCarrier,numCWPerFrame,fftLength);

Set UseGPU flag in the simParam structure to false to run the link on the CPU. The helperLDPCMIMOOFDMSimulation function simulates the link shown in the previous section for up to maxNumFrames or until maxNumBlockErrors occurs.

simParam.UseGPU =false;
simOutputCPU = helperLDPCMIMOOFDMSimulation(maxNumFrames,maxNumBlockErrors,EbNodB,simParam);
Start simulating link on the CPU ...
Simulation on the CPU is DONE.
simToCompare = simOutputCPU;

Parfor-loop Parallelization Strategy

Use the helperLDPCMIMOOFDMSimulationParfor function to run the same simulation on a pool of 16 CPU workers. The function uses the existing parallel pool if one is open. Otherwise, it opens one with the default parallel profile.

To achieve high acceleration for BLER Monte-Carlo simulations with parfor (Parallel Computing Toolbox), distribute the maximum number of iterations evenly across the parallel workers for all EbNo values. Otherwise, if you set each worker to run all iterations for one EbNo value, the acceleration factor will be limited by the run time of the highest EbNo value until the specified number of errors occurs. For a comparison between the two parallelization techniques, see Simulation Acceleration Using MATLAB Coder and Parallel Computing Toolbox.

numFramesPerWrkr = 10;
simOutputParfor = helperLDPCMIMOOFDMSimulationParfor(maxNumFrames,maxNumBlockErrors, ...
Start simulating link using parfor ...
Simulation using parfor is DONE.
simToCompare = [simToCompare, simOutputParfor];

GPU Parallelization Strategy

To run the link on the GPU, set UseGPU flag in the simParam structure to true. With no other modifications to the helperLDPCMIMOOFDMSimulation function, it generates the information bits as gpuArray (Parallel Computing Toolbox) objects executing the link on the GPU.

if gpuExist
simParam.UseGPU = true;
simParam.Device = D;
simOutputGPU = helperLDPCMIMOOFDMSimulation(maxNumFrames,maxNumBlockErrors,EbNodB,simParam);
simToCompare = [simToCompare, simOutputGPU];
Start simulating link on the GPU ...
Simulation on the GPU is DONE.

Speed-up Results

Use helperCompareSimulatedLinks function to compare the BER, BLER and run time of each simulation. The GPU accelerates the simulation with a factor of 3.7x whereas a parallel pool of 16 workers offers an acceleration factor of 11.1x. Note that a parfor (Parallel Computing Toolbox) loop parallelizes over the simulation iterations offering a higher speed-up factor for systems with low parallelizable components. In the next section, you see how the GPU acceleration factor increases when simulating a highly parallelizable system.

For reproducibility, the helperLDPCMIMOOFDMSimulation function fixes the random number generator algorithm, seed, and normal transformation on the CPU and the GPU. For details on random number generators supported in both environments, see Random Number Streams on a GPU (Parallel Computing Toolbox). The parfor (Parallel Computing Toolbox) loop, however, uses a different random number stream on each worker resulting in slightly different results. For additional details, see Control Random Number Streams on Workers (Parallel Computing Toolbox).

Environment            | EbNo(dB)| BER          | Num Bits   | BLER         | Num Blocks | Elapsed Time(sec)   | Speed-up factor
CPU                    | 0.50    | 6.3044e-05   | 5266200   | 3.2836e-03   | 10050      | 22.63               |  1.00 
parfor                 | 0.50    | 3.8836e-04   | 4192000   | 2.2875e-02   | 8000       | 2.03                |  11.16 
GPU                    | 0.50    | 6.3044e-05   | 5266200   | 3.2836e-03   | 10050      | 6.12                |  3.70 

Accelerate a Link with Highly Parallelizable Components

Increase the number of transmit antennas to 64, number of receive antennas to 8, and use an LDPC code with block length 64800 and coding rate 1/2. Set the number of bits per carrier to 6 and EbNo to 3.5 dB. Decrease the number of codewords per OFDM symbol and number of data streams to 1 to avoid running out of memory on the GPU for the new block length. You see how the GPU can accelerate the link simulation with a higher factor as you increase its utilization. Set the maxNumFrames to 1000.

maxNumFrames = 1000;
maxNumBlockErrors = 200;
EbNodB = 3.5;

numCWPerFrame = 1;
numStreams = 1;
numBitsPerCarrier = 6;
numTx = 64;
numRx = 8;

load parityCheckMatrix.mat
simParam = helperLinkSetup(numTx,numRx,numStreams,numBitsPerCarrier,numCWPerFrame,fftLength,parityCheck);

Running this section on the CPU takes a long time. By default, the example loads pre-saved results for the simulation using the CPU and parfor (Parallel Computing Toolbox) loop. Set the runNow flag to true to run the simulation yourself. Note that it could take more than 20 minutes to complete.

runNow = false;
if ~runNow
    load simulationResultsCPU.mat
    simParam.UseGPU = false; %#ok<UNRCH>
    simOutputCPU = helperLDPCMIMOOFDMSimulation(maxNumFrames,maxNumBlockErrors,EbNodB,simParam);

    numFramesPerWrkr = 10;
    simOutputParfor = helperLDPCMIMOOFDMSimulationParfor(maxNumFrames,maxNumBlockErrors, ...
simToCompare = [simOutputCPU, simOutputParfor];

Set UseGPU flag in the simParam structure to true to run the link on the GPU.

if gpuExist
simParam.UseGPU = true;
simParam.Device = D;
simOutputGPU = helperLDPCMIMOOFDMSimulation(maxNumFrames,maxNumBlockErrors,EbNodB,simParam);
simToCompare = [simToCompare, simOutputGPU];
Start simulating link on the GPU ...
Simulation on the GPU is DONE.

Speed-up Results

Compare the simulation results. The GPU shows the highest acceleration factor of 32x when compared to using a parallel pool of 16 workers which provides an acceleration factor of 6.7x. This is expected since the GPU parallelizes the computations within each component of the simulated link rather than parallelizing over the simulation iterations.

Environment            | EbNo(dB)| BER          | Num Bits   | BLER         | Num Blocks | Elapsed Time(sec)   | Speed-up factor
CPU                    | 3.50    | 4.8559e-04   | 26522496   | 2.4542e-01   | 819        | 1466.98             |  1.00 
parfor                 | 3.50    | 8.6076e-05   | 25907200   | 4.2625e-01   | 800        | 218.45              |  6.72 
GPU                    | 3.50    | 4.8559e-04   | 26522496   | 2.4542e-01   | 819        | 45.71               |  32.09 

BLER Curves

The figure below compares the BLER of the previous section for EbNo values ranging between -3 and 5 dB. Note that the CPU and GPU curves are identical since the same random number generator is used. The curve simulated using the parfor (Parallel Computing Toolbox) loop is slightly different since every worker in the parallel pool uses a different random number stream.

To regenerate these curves, set the following parameters and rerun the previous section:

  • maxNumBlockErrors = 500

  • EbNodB = -3 : 5

The figure below compares the run times of the BLER simulations. The CPU takes >12000 (200 minutes) seconds, a parallel pool of 16 CPU workers takes about 1700 seconds (28.3 minutes) achieving a speed-up factor of 7.2x. Finally, the GPU takes about 400 seconds (6.6 minutes) with a speedup factor of 30.2x for simulating the same BLER curve without losing any numeric accuracy compared to the CPU.



This example showed how to run a MIMO OFDM link simulation on the GPU by generating the transmitted data bits as gpuArray (Parallel Computing Toolbox) objects. You compared the acceleration gained by using a single GPU versus a parallel pool of 16 CPU workers against using one CPU worker. You learned that higher acceleration gains can be achieved using the GPU for systems with highly parallelizable components.