
Optimize Deep Learning Processor Configuration for Network Performance

This example shows how to generate a deep learning processor configuration and estimate the performance of a pretrained network. Generate a deep learning processor configuration optimized for the target frames-per-second value of the network, then generate a custom bitstream by using the optimized processor configuration.

Load Pretrained Network and Create Processor Configuration

To load a pretrained ResNet-18 network, enter:

net = resnet18;

Create a custom deep learning processor configuration. For more information, see dlhdl.ProcessorConfig.

hPC = dlhdl.ProcessorConfig;

Estimate Network Performance

Establish a baseline by estimating the performance of the ResNet-18 network with the estimatePerformance method of the dlhdl.ProcessorConfig object. The method returns the estimated layer latency, network latency, and network performance in frames per second.

estimatePerformance(hPC,net);
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### Notice: The layer 'data' of type 'ImageInputLayer' is split into an image input layer 'data', an addition layer 'data_norm_add', and a multiplication layer 'data_norm' for hardware normalization.
### The network includes the following layers:
     1   'data'                              Image Input                  224×224×3 images with 'zscore' normalization                          (SW Layer)
     2   'conv1'                             2-D Convolution              64 7×7×3 convolutions with stride [2  2] and padding [3  3  3  3]     (HW Layer)
     3   'conv1_relu'                        ReLU                         ReLU                                                                  (HW Layer)
     4   'pool1'                             2-D Max Pooling              3×3 max pooling with stride [2  2] and padding [1  1  1  1]           (HW Layer)
     5   'res2a_branch2a'                    2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
     6   'res2a_branch2a_relu'               ReLU                         ReLU                                                                  (HW Layer)
     7   'res2a_branch2b'                    2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
     8   'res2a'                             Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
     9   'res2a_relu'                        ReLU                         ReLU                                                                  (HW Layer)
    10   'res2b_branch2a'                    2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
    11   'res2b_branch2a_relu'               ReLU                         ReLU                                                                  (HW Layer)
    12   'res2b_branch2b'                    2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
    13   'res2b'                             Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    14   'res2b_relu'                        ReLU                         ReLU                                                                  (HW Layer)
    15   'res3a_branch2a'                    2-D Convolution              128 3×3×64 convolutions with stride [2  2] and padding [1  1  1  1]   (HW Layer)
    16   'res3a_branch2a_relu'               ReLU                         ReLU                                                                  (HW Layer)
    17   'res3a_branch2b'                    2-D Convolution              128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    18   'res3a_branch1'                     2-D Convolution              128 1×1×64 convolutions with stride [2  2] and padding [0  0  0  0]   (HW Layer)
    19   'res3a'                             Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    20   'res3a_relu'                        ReLU                         ReLU                                                                  (HW Layer)
    21   'res3b_branch2a'                    2-D Convolution              128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    22   'res3b_branch2a_relu'               ReLU                         ReLU                                                                  (HW Layer)
    23   'res3b_branch2b'                    2-D Convolution              128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    24   'res3b'                             Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    25   'res3b_relu'                        ReLU                         ReLU                                                                  (HW Layer)
    26   'res4a_branch2a'                    2-D Convolution              256 3×3×128 convolutions with stride [2  2] and padding [1  1  1  1]  (HW Layer)
    27   'res4a_branch2a_relu'               ReLU                         ReLU                                                                  (HW Layer)
    28   'res4a_branch2b'                    2-D Convolution              256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    29   'res4a_branch1'                     2-D Convolution              256 1×1×128 convolutions with stride [2  2] and padding [0  0  0  0]  (HW Layer)
    30   'res4a'                             Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    31   'res4a_relu'                        ReLU                         ReLU                                                                  (HW Layer)
    32   'res4b_branch2a'                    2-D Convolution              256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    33   'res4b_branch2a_relu'               ReLU                         ReLU                                                                  (HW Layer)
    34   'res4b_branch2b'                    2-D Convolution              256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    35   'res4b'                             Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    36   'res4b_relu'                        ReLU                         ReLU                                                                  (HW Layer)
    37   'res5a_branch2a'                    2-D Convolution              512 3×3×256 convolutions with stride [2  2] and padding [1  1  1  1]  (HW Layer)
    38   'res5a_branch2a_relu'               ReLU                         ReLU                                                                  (HW Layer)
    39   'res5a_branch2b'                    2-D Convolution              512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    40   'res5a_branch1'                     2-D Convolution              512 1×1×256 convolutions with stride [2  2] and padding [0  0  0  0]  (HW Layer)
    41   'res5a'                             Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    42   'res5a_relu'                        ReLU                         ReLU                                                                  (HW Layer)
    43   'res5b_branch2a'                    2-D Convolution              512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    44   'res5b_branch2a_relu'               ReLU                         ReLU                                                                  (HW Layer)
    45   'res5b_branch2b'                    2-D Convolution              512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    46   'res5b'                             Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    47   'res5b_relu'                        ReLU                         ReLU                                                                  (HW Layer)
    48   'pool5'                             2-D Global Average Pooling   2-D global average pooling                                            (HW Layer)
    49   'fc1000'                            Fully Connected              1000 fully connected layer                                            (HW Layer)
    50   'prob'                              Softmax                      softmax                                                               (SW Layer)
    51   'ClassificationLayer_predictions'   Classification Output        crossentropyex with 'tench' and 999 other classes                     (SW Layer)
                                                                                                                                              
### Notice: The layer 'prob' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'ClassificationLayer_predictions' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.


              Deep Learning Processor Estimator Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                   21328236                  0.10664                       1           21328236              9.4
    ____data_norm_add       210750                  0.00105 
    ____data_norm           210750                  0.00105 
    ____conv1              2164124                  0.01082 
    ____pool1               515064                  0.00258 
    ____res2a_branch2a      966221                  0.00483 
    ____res2a_branch2b      966221                  0.00483 
    ____res2a               210750                  0.00105 
    ____res2b_branch2a      966221                  0.00483 
    ____res2b_branch2b      966221                  0.00483 
    ____res2b               210750                  0.00105 
    ____res3a_branch1       540861                  0.00270 
    ____res3a_branch2a      540749                  0.00270 
    ____res3a_branch2b      919117                  0.00460 
    ____res3a               105404                  0.00053 
    ____res3b_branch2a      919117                  0.00460 
    ____res3b_branch2b      919117                  0.00460 
    ____res3b               105404                  0.00053 
    ____res4a_branch1       503405                  0.00252 
    ____res4a_branch2a      509261                  0.00255 
    ____res4a_branch2b      905421                  0.00453 
    ____res4a                52724                  0.00026 
    ____res4b_branch2a      905421                  0.00453 
    ____res4b_branch2b      905421                  0.00453 
    ____res4b                52724                  0.00026 
    ____res5a_branch1       744525                  0.00372 
    ____res5a_branch2a      751693                  0.00376 
    ____res5a_branch2b     1415373                  0.00708 
    ____res5a                26368                  0.00013 
    ____res5b_branch2a     1415373                  0.00708 
    ____res5b_branch2b     1415373                  0.00708 
    ____res5b                26368                  0.00013 
    ____pool5                54594                  0.00027 
    ____fc1000              207351                  0.00104 
 * The clock frequency of the DL processor is: 200MHz
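The Frames/s value in the table follows from the network latency in cycles and the processor clock frequency. As a quick check, this sketch uses only the numbers printed above; the variable names are illustrative and are not part of the toolbox.

```matlab
% Values copied from the estimator output above.
latencyCycles = 21328236;   % Network LastFrameLatency in cycles
clockHz       = 200e6;      % DL processor clock frequency (200 MHz)

% One frame is processed (FramesNum = 1), so throughput is 1/latency.
latencySeconds  = latencyCycles / clockHz;   % seconds per frame
framesPerSecond = 1 / latencySeconds;        % frames per second

fprintf('Latency: %.5f s, throughput: %.1f frames/s\n', ...
    latencySeconds, framesPerSecond);
```

The same arithmetic applies to each per-layer row: dividing the cycle count by the clock frequency gives the latency in seconds.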

The estimated performance is 9.4 frames per second. To improve the network performance, you can either modify the properties of the custom deep learning processor configuration hPC manually or use the optimizeConfigurationForNetwork method. This example uses the optimizeConfigurationForNetwork method. To learn about modifying the properties manually, see Effects of Custom Deep Learning Processor Parameters on Performance and Resource Utilization.
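For comparison, the manual route might look like the following sketch, which uses the getModuleProperty and setModuleProperty methods of dlhdl.ProcessorConfig. The variable name hPC_manual is hypothetical, and the value 8 for FCThreadNumber is chosen only for illustration. The resulting estimate is not shown here because it is not part of the original example.

```matlab
% Sketch only: requires Deep Learning HDL Toolbox and the ResNet-18 support package.
net = resnet18;
hPC_manual = dlhdl.ProcessorConfig;   % start from the default configuration

% Read the default value, then change it. The value 8 is illustrative.
defaultThreads = getModuleProperty(hPC_manual,'fc','FCThreadNumber');
setModuleProperty(hPC_manual,'fc','FCThreadNumber',8);

% Re-estimate to see how the change affects latency and frames per second.
estimatePerformance(hPC_manual,net);
```

Changing one property at a time and re-running estimatePerformance makes it easier to attribute a latency change to a specific setting.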

Generate Optimized Processor Configuration

Optimize the processor configuration by using the optimizeConfigurationForNetwork method. Set the optional FramesPerSecond name-value argument to the target value, which is 10 in this example.

hPC_optimized = optimizeConfigurationForNetwork(hPC,net,FramesPerSecond=10);
### Optimizing processor configuration for deep learning network...


              Deep Learning Processor Estimator Resource Results

                             DSPs          Block RAM*     LUTs(CLB/ALUT)  
                        -------------    -------------    ------------- 
Available                    2520              912           274080
                        -------------    -------------    ------------- 
Total                       438( 18%)        600( 66%)     270396( 99%)
ReferenceDesign               3(  1%)         78(  9%)      35000( 13%)
DL_Processor                435( 18%)        522( 58%)     235396( 86%)
* Block RAM represents Block RAM tiles in Xilinx devices and Block RAM bits in Intel devices
### Note: Processing module "conv" property "InputMemorySize" changed from "[227 227 3]" to "[217 217 3]".
### Note: Processing module "conv" property "OutputMemorySize" changed from "[227 227 3]" to "[217 217 3]".
### Note: Processing module "conv" property "SegmentationBlockGeneration" changed from "true" to "false".
### Note: Processing module "fc" property "FCThreadNumber" changed from "4" to "8".
### Note: Processing module "fc" property "WeightAXIDataBitwidth" changed from "128" to "256".
### Note: Processing module "fc" property "SoftmaxBlockGeneration" changed from "false" to "true".

                    Processing Module "conv"
                            ModuleGeneration: 'on'
                          LRNBlockGeneration: 'off'
                 SegmentationBlockGeneration: 'off'
                            ConvThreadNumber: 16
                             InputMemorySize: [217 217 3]
                            OutputMemorySize: [217 217 3]
                            FeatureSizeLimit: 2048

                      Processing Module "fc"
                            ModuleGeneration: 'on'
                      SoftmaxBlockGeneration: 'on'
                      SigmoidBlockGeneration: 'off'
                              FCThreadNumber: 8
                             InputMemorySize: 25088
                            OutputMemorySize: 4096

                  Processing Module "custom"
                            ModuleGeneration: 'on'
                                    Addition: 'on'
                              Multiplication: 'on'
                                    Resize2D: 'off'
                                     Sigmoid: 'off'
                                   TanhLayer: 'off'
                             InputMemorySize: 40
                            OutputMemorySize: 120

              Processor Top Level Properties
                              RunTimeControl: 'register'
                               RunTimeStatus: 'register'
                          InputStreamControl: 'register'
                         OutputStreamControl: 'register'
                                SetupControl: 'register'
                           ProcessorDataType: 'single'

                     System Level Properties
                              TargetPlatform: 'Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit'
                             TargetFrequency: 200
                               SynthesisTool: 'Xilinx Vivado'
                             ReferenceDesign: 'AXI-Stream DDR Memory Access : 3-AXIM'
                     SynthesisToolChipFamily: 'Zynq UltraScale+'
                     SynthesisToolDeviceName: 'xczu9eg-ffvb1156-2-e'
                    SynthesisToolPackageName: ''
                     SynthesisToolSpeedValue: ''

### Optimizing processor configuration for deep learning network complete.
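The percentages in the resource table can be reproduced from the raw counts. In this sketch, the values are copied from the output above. The use of ceil is an assumption inferred from the printed numbers (for example, 438 of 2520 DSPs is about 17.4% but is displayed as 18%), not documented behavior.

```matlab
% Columns: DSPs, Block RAM, LUTs. Rows: Total, ReferenceDesign, DL_Processor.
available = [2520 912 274080];
used = [438 600 270396;
          3  78  35000;
        435 522 235396];

% Utilization in percent, rounded up (assumed rounding rule).
pct = ceil(100 * used ./ available);
disp(pct)

% The Total row is the sum of the reference design and the DL processor.
assert(isequal(used(1,:), used(2,:) + used(3,:)));
```

In this table, LUT usage is the resource closest to its limit.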

Estimate the performance of the ResNet-18 network by using the optimized deep learning processor configuration.

estimatePerformance(hPC_optimized,net);
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### Notice: The layer 'data' of type 'ImageInputLayer' is split into an image input layer 'data', an addition layer 'data_norm_add', and a multiplication layer 'data_norm' for hardware normalization.
### The network includes the following layers:
     1   'data'                              Image Input                  224×224×3 images with 'zscore' normalization                          (SW Layer)
     2   'conv1'                             2-D Convolution              64 7×7×3 convolutions with stride [2  2] and padding [3  3  3  3]     (HW Layer)
     3   'conv1_relu'                        ReLU                         ReLU                                                                  (HW Layer)
     4   'pool1'                             2-D Max Pooling              3×3 max pooling with stride [2  2] and padding [1  1  1  1]           (HW Layer)
     5   'res2a_branch2a'                    2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
     6   'res2a_branch2a_relu'               ReLU                         ReLU                                                                  (HW Layer)
     7   'res2a_branch2b'                    2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
     8   'res2a'                             Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
     9   'res2a_relu'                        ReLU                         ReLU                                                                  (HW Layer)
    10   'res2b_branch2a'                    2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
    11   'res2b_branch2a_relu'               ReLU                         ReLU                                                                  (HW Layer)
    12   'res2b_branch2b'                    2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
    13   'res2b'                             Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    14   'res2b_relu'                        ReLU                         ReLU                                                                  (HW Layer)
    15   'res3a_branch2a'                    2-D Convolution              128 3×3×64 convolutions with stride [2  2] and padding [1  1  1  1]   (HW Layer)
    16   'res3a_branch2a_relu'               ReLU                         ReLU                                                                  (HW Layer)
    17   'res3a_branch2b'                    2-D Convolution              128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    18   'res3a_branch1'                     2-D Convolution              128 1×1×64 convolutions with stride [2  2] and padding [0  0  0  0]   (HW Layer)
    19   'res3a'                             Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    20   'res3a_relu'                        ReLU                         ReLU                                                                  (HW Layer)
    21   'res3b_branch2a'                    2-D Convolution              128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    22   'res3b_branch2a_relu'               ReLU                         ReLU                                                                  (HW Layer)
    23   'res3b_branch2b'                    2-D Convolution              128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    24   'res3b'                             Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    25   'res3b_relu'                        ReLU                         ReLU                                                                  (HW Layer)
    26   'res4a_branch2a'                    2-D Convolution              256 3×3×128 convolutions with stride [2  2] and padding [1  1  1  1]  (HW Layer)
    27   'res4a_branch2a_relu'               ReLU                         ReLU                                                                  (HW Layer)
    28   'res4a_branch2b'                    2-D Convolution              256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    29   'res4a_branch1'                     2-D Convolution              256 1×1×128 convolutions with stride [2  2] and padding [0  0  0  0]  (HW Layer)
    30   'res4a'                             Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    31   'res4a_relu'                        ReLU                         ReLU                                                                  (HW Layer)
    32   'res4b_branch2a'                    2-D Convolution              256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    33   'res4b_branch2a_relu'               ReLU                         ReLU                                                                  (HW Layer)
    34   'res4b_branch2b'                    2-D Convolution              256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    35   'res4b'                             Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    36   'res4b_relu'                        ReLU                         ReLU                                                                  (HW Layer)
    37   'res5a_branch2a'                    2-D Convolution              512 3×3×256 convolutions with stride [2  2] and padding [1  1  1  1]  (HW Layer)
    38   'res5a_branch2a_relu'               ReLU                         ReLU                                                                  (HW Layer)
    39   'res5a_branch2b'                    2-D Convolution              512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    40   'res5a_branch1'                     2-D Convolution              512 1×1×256 convolutions with stride [2  2] and padding [0  0  0  0]  (HW Layer)
    41   'res5a'                             Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    42   'res5a_relu'                        ReLU                         ReLU                                                                  (HW Layer)
    43   'res5b_branch2a'                    2-D Convolution              512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    44   'res5b_branch2a_relu'               ReLU                         ReLU                                                                  (HW Layer)
    45   'res5b_branch2b'                    2-D Convolution              512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    46   'res5b'                             Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    47   'res5b_relu'                        ReLU                         ReLU                                                                  (HW Layer)
    48   'pool5'                             2-D Global Average Pooling   2-D global average pooling                                            (HW Layer)
    49   'fc1000'                            Fully Connected              1000 fully connected layer                                            (HW Layer)
    50   'prob'                              Softmax                      softmax                                                               (HW Layer)
    51   'ClassificationLayer_predictions'   Classification Output        crossentropyex with 'tench' and 999 other classes                     (SW Layer)
                                                                                                                                              
### Notice: The layer 'ClassificationLayer_predictions' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.


              Deep Learning Processor Estimator Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                   19966252                  0.09983                       1           19966252             10.0
    ____data_norm_add       210750                  0.00105 
    ____data_norm           210750                  0.00105 
    ____conv1              2224339                  0.01112 
    ____pool1               632402                  0.00316 
    ____res2a_branch2a     1038708                  0.00519 
    ____res2a_branch2b     1038708                  0.00519 
    ____res2a               210750                  0.00105 
    ____res2b_branch2a     1038708                  0.00519 
    ____res2b_branch2b     1038708                  0.00519 
    ____res2b               210750                  0.00105 
    ____res3a_branch1       630228                  0.00315 
    ____res3a_branch2a      625092                  0.00313 
    ____res3a_branch2b      919117                  0.00460 
    ____res3a               105404                  0.00053 
    ____res3b_branch2a      919117                  0.00460 
    ____res3b_branch2b      919117                  0.00460 
    ____res3b               105404                  0.00053 
    ____res4a_branch1       503405                  0.00252 
    ____res4a_branch2a      509261                  0.00255 
    ____res4a_branch2b      905421                  0.00453 
    ____res4a                52724                  0.00026 
    ____res4b_branch2a      905421                  0.00453 
    ____res4b_branch2b      905421                  0.00453 
    ____res4b                52724                  0.00026 
    ____res5a_branch1       506957                  0.00253 
    ____res5a_branch2a      514125                  0.00257 
    ____res5a_branch2b      940237                  0.00470 
    ____res5a                26368                  0.00013 
    ____res5b_branch2a      940237                  0.00470 
    ____res5b_branch2b      940237                  0.00470 
    ____res5b                26368                  0.00013 
    ____pool5                54594                  0.00027 
    ____fc1000              103438                  0.00052 
    ____prob                  1262                  0.00001 
 * The clock frequency of the DL processor is: 200MHz

The new estimated performance is 10 frames per second, up from 9.4 frames per second with the default configuration.
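To see where the change comes from, the two estimator tables can be compared directly. This sketch copies the network totals and a few layer rows from the outputs above; the variable names and the choice of layers are illustrative.

```matlab
% Network latency in cycles, copied from the two estimator outputs.
baselineCycles  = 21328236;
optimizedCycles = 19966252;
clockHz = 200e6;   % 200 MHz in both runs

fpsBaseline  = clockHz / baselineCycles;
fpsOptimized = clockHz / optimizedCycles;
reductionPct = 100 * (baselineCycles - optimizedCycles) / baselineCycles;

fprintf('%.1f -> %.1f frames/s (%.1f%% fewer cycles)\n', ...
    fpsBaseline, fpsOptimized, reductionPct);

% Selected layers: [baseline optimized] latency in cycles.
layers = {'conv1','res2a_branch2a','res5a_branch2b','fc1000'};
cycles = [2164124 2224339;
           966221 1038708;
          1415373  940237;
           207351  103438];

% Relative change per layer (negative means fewer cycles after optimization).
for k = 1:numel(layers)
    change = 100 * (cycles(k,2) - cycles(k,1)) / cycles(k,1);
    fprintf('%-16s %+6.1f%%\n', layers{k}, change);
end
```

In these numbers, the res5 convolutions and fc1000 take fewer cycles after optimization, while conv1 and the res2 convolutions take slightly more, so the net gain is a trade-off across layers.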

This image shows the comparison between the original processor configuration and the optimized processor configuration:

The optimized processor configuration has:

  • SegmentationBlockGeneration turned off.

  • InputMemorySize and OutputMemorySize reduced from [227 227 3] to [217 217 3].

  • SoftmaxBlockGeneration turned on.

  • FCThreadNumber increased from 4 to 8.

  • WeightAXIDataBitwidth increased from 128 to 256.

Generate Optimized Custom Bitstream

Use the optimized custom deep learning processor configuration to build and generate a custom bitstream. Use the custom bitstream to deploy the pretrained ResNet-18 network to your target FPGA board.

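% Point MATLAB to the Vivado installation. Adjust the path for your system.
hdlsetuptoolpath('ToolName', 'Xilinx Vivado', 'ToolPath', 'C:\Xilinx\Vivado\2023.1\bin\vivado.bat');
% Build the processor and generate the bitstream from the optimized configuration.
dlhdl.buildProcessor(hPC_optimized);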
