Estimate Performance of Deep Learning Network

To reduce the time required to design a custom deep learning network that meets your performance requirements, analyze the layer-level latencies before you deploy the network. Compare the performance of the network on custom bitstream processor configurations with its performance on reference (shipping) bitstream processor configurations.

To learn how to use the table data from the estimatePerformance function to calculate your network performance, see Profile Inference Run.

Estimate Performance of Custom Deep Learning Network for Custom Processor Configuration

This example shows how to calculate the performance of a deep learning network for a custom processor configuration.

  1. Create a file in your current working folder called getLogoNetwork.m. In the file, enter:

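    function net = getLogoNetwork()
        % Download the pretrained LogoNet network if it is not already present.
        if ~isfile('LogoNet.mat')
            url = 'https://www.mathworks.com/supportfiles/gpucoder/cnn_models/logo_detection/LogoNet.mat';
            websave('LogoNet.mat',url);
        end
        % Load the MAT-file and return the network stored in the convnet variable.
        data = load('LogoNet.mat');
        net  = data.convnet;
    end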

    Call the function and save the result in snet.

    snet = getLogoNetwork;
  2. Create a dlhdl.ProcessorConfig object.

    hPC = dlhdl.ProcessorConfig;
  3. Call estimatePerformance with snet to retrieve the layer-level latencies and the overall performance for the LogoNet network.

    hPC.estimatePerformance(snet)
    3 Memory Regions created.
    
    
    
                  Deep Learning Processor Estimator Performance Results
    
                       LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                             -------------             -------------              ---------        ---------       ---------
    Network                   39853460                  0.19927                       1           39853460              5.0
            conv_1             6825287                  0.03413 
            maxpool_1          3755088                  0.01878 
            conv_2            10440701                  0.05220 
            maxpool_2          1447840                  0.00724 
            conv_3             9393397                  0.04697 
            maxpool_3          1765856                  0.00883 
            conv_4             1770484                  0.00885 
            maxpool_4            28098                  0.00014 
            fc_1               2644884                  0.01322 
            fc_2               1692532                  0.00846 
            fc_3                 89293                  0.00045 
     * The clock frequency of the DL processor is: 200MHz
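The columns of the table are related by simple arithmetic, which the following sketch reproduces by hand. It assumes only the cycle counts and the 200 MHz clock frequency printed above; the variable names are illustrative and are not part of the toolbox.

```matlab
% Layer names and LastFrameLatency(cycles), copied from the table above.
layerNames  = ["conv_1" "maxpool_1" "conv_2" "maxpool_2" "conv_3" ...
               "maxpool_3" "conv_4" "maxpool_4" "fc_1" "fc_2" "fc_3"];
layerCycles = [6825287 3755088 10440701 1447840 9393397 ...
               1765856 1770484 28098 2644884 1692532 89293];
clockHz = 200e6;   % DL processor clock frequency reported below the table

% The Network row is the sum of the per-layer cycle counts.
totalCycles     = sum(layerCycles);         % 39853460
latencySeconds  = totalCycles / clockHz;    % about 0.19927 s
framesPerSecond = 1 / latencySeconds;       % about 5.0 frames/s

% Find the layer that contributes the most cycles.
[maxCycles, idx] = max(layerCycles);
slowestLayer = layerNames(idx);             % "conv_2"
slowestShare = maxCycles / totalCycles;     % about 0.26

fprintf('Total: %d cycles, %.5f s, %.1f frames/s\n', ...
    totalCycles, latencySeconds, framesPerSecond);
fprintf('Slowest layer: %s (%.1f%% of total)\n', slowestLayer, 100*slowestShare);
```

In this estimate, conv_2 accounts for roughly a quarter of the total cycles, so it is a natural first place to look when tuning the network or the processor configuration.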

Evaluate Performance of Deep Learning Network on Custom Processor Configuration

Benchmark the performance of a deep learning network on a custom bitstream configuration by comparing it to the performance on a reference (shipping) bitstream configuration. Use the comparison results to adjust your custom deep learning processor parameters to achieve optimum performance.

In this example, compare the performance of the ResNet-18 network on the zcu102_single bitstream configuration to its performance on the default custom bitstream configuration.

Load Pretrained Network

Load the pretrained network.

snet = resnet18;

Retrieve zcu102_single Bitstream Configuration

To retrieve the zcu102_single bitstream configuration, use the dlhdl.ProcessorConfig object. For more information, see dlhdl.ProcessorConfig. To learn about modifiable parameters of the processor configuration, see getModuleProperty and setModuleProperty.

hPC_shipping = dlhdl.ProcessorConfig('Bitstream',"zcu102_single")
hPC_shipping = 
                    Processing Module "conv"
                            ModuleGeneration: 'on'
                          LRNBlockGeneration: 'on'
                            ConvThreadNumber: 16
                             InputMemorySize: [227 227 3]
                            OutputMemorySize: [227 227 3]
                            FeatureSizeLimit: 2048

                      Processing Module "fc"
                            ModuleGeneration: 'on'
                      SoftmaxBlockGeneration: 'off'
                              FCThreadNumber: 4
                             InputMemorySize: 25088
                            OutputMemorySize: 4096

                   Processing Module "adder"
                            ModuleGeneration: 'on'
                             InputMemorySize: 40
                            OutputMemorySize: 40

              Processor Top Level Properties
                              RunTimeControl: 'register'
                          InputDataInterface: 'External Memory'
                         OutputDataInterface: 'External Memory'
                           ProcessorDataType: 'single'

                     System Level Properties
                              TargetPlatform: 'Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit'
                             TargetFrequency: 220
                               SynthesisTool: 'Xilinx Vivado'
                             ReferenceDesign: 'AXI-Stream DDR Memory Access : 3-AXIM'
                     SynthesisToolChipFamily: 'Zynq UltraScale+'
                     SynthesisToolDeviceName: 'xczu9eg-ffvb1156-2-e'
                    SynthesisToolPackageName: ''
                     SynthesisToolSpeedValue: ''

Estimate ResNet-18 Performance for zcu102_single Bitstream Configuration

To estimate the performance of the ResNet-18 DAG network, use the estimatePerformance function of the dlhdl.ProcessorConfig object. The function returns the estimated layer latency, network latency, and network performance in frames per second (Frames/s).

hPC_shipping.estimatePerformance(snet)
### Optimizing series network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### Notice: The layer 'data' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: The layer 'ClassificationLayer_predictions' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.


              Deep Learning Processor Estimator Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                   23634966                  0.10743                       1           23634966              9.3
    ____conv1              2165372                  0.00984 
    ____pool1               646226                  0.00294 
    ____res2a_branch2a      966221                  0.00439 
    ____res2a_branch2b      966221                  0.00439 
    ____res2a               210750                  0.00096 
    ____res2b_branch2a      966221                  0.00439 
    ____res2b_branch2b      966221                  0.00439 
    ____res2b               210750                  0.00096 
    ____res3a_branch1       540749                  0.00246 
    ____res3a_branch2a      763860                  0.00347 
    ____res3a_branch2b      919117                  0.00418 
    ____res3a               105404                  0.00048 
    ____res3b_branch2a      919117                  0.00418 
    ____res3b_branch2b      919117                  0.00418 
    ____res3b               105404                  0.00048 
    ____res4a_branch1       509261                  0.00231 
    ____res4a_branch2a      509261                  0.00231 
    ____res4a_branch2b      905421                  0.00412 
    ____res4a                52724                  0.00024 
    ____res4b_branch2a      905421                  0.00412 
    ____res4b_branch2b      905421                  0.00412 
    ____res4b                52724                  0.00024 
    ____res5a_branch1      1046605                  0.00476 
    ____res5a_branch2a     1046605                  0.00476 
    ____res5a_branch2b     2005197                  0.00911 
    ____res5a                26368                  0.00012 
    ____res5b_branch2a     2005197                  0.00911 
    ____res5b_branch2b     2005197                  0.00911 
    ____res5b                26368                  0.00012 
    ____pool5                54594                  0.00025 
    ____fc1000              207852                  0.00094 
 * The clock frequency of the DL processor is: 220MHz
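To see where the estimated time goes, you can group rows of the table by network stage. The following sketch is a hand calculation, not toolbox output: it uses the cycle counts of the five res5 convolution rows and the Network row from the table above, and the variable names are illustrative.

```matlab
% Values copied from the zcu102_single estimate above.
networkCycles = 23634966;   % Network row, LastFrameLatency(cycles)
clockHz       = 220e6;      % DL processor clock frequency

% res5a_branch1, res5a_branch2a, res5a_branch2b, res5b_branch2a, res5b_branch2b
res5ConvCycles = [1046605 1046605 2005197 2005197 2005197];

res5Cycles  = sum(res5ConvCycles);          % 8108801
res5Share   = res5Cycles / networkCycles;   % about 0.343
res5Seconds = res5Cycles / clockHz;         % about 0.0369 s

fprintf('res5 convolutions: %d cycles (%.1f%% of the network), %.4f s\n', ...
    res5Cycles, 100*res5Share, res5Seconds);
```

In this estimate, the res5 convolutions account for about a third of the network latency.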

Create Custom Processor Configuration

To create a custom processor configuration, use the dlhdl.ProcessorConfig object. For more information, see dlhdl.ProcessorConfig. To learn about modifiable parameters of the processor configuration, see getModuleProperty and setModuleProperty.

hPC_custom = dlhdl.ProcessorConfig
hPC_custom = 
                    Processing Module "conv"
                            ModuleGeneration: 'on'
                          LRNBlockGeneration: 'on'
                            ConvThreadNumber: 16
                             InputMemorySize: [227 227 3]
                            OutputMemorySize: [227 227 3]
                            FeatureSizeLimit: 2048

                      Processing Module "fc"
                            ModuleGeneration: 'on'
                      SoftmaxBlockGeneration: 'off'
                              FCThreadNumber: 4
                             InputMemorySize: 25088
                            OutputMemorySize: 4096

                   Processing Module "adder"
                            ModuleGeneration: 'on'
                             InputMemorySize: 40
                            OutputMemorySize: 40

              Processor Top Level Properties
                              RunTimeControl: 'register'
                          InputDataInterface: 'External Memory'
                         OutputDataInterface: 'External Memory'
                           ProcessorDataType: 'single'

                     System Level Properties
                              TargetPlatform: 'Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit'
                             TargetFrequency: 200
                               SynthesisTool: 'Xilinx Vivado'
                             ReferenceDesign: 'AXI-Stream DDR Memory Access : 3-AXIM'
                     SynthesisToolChipFamily: 'Zynq UltraScale+'
                     SynthesisToolDeviceName: 'xczu9eg-ffvb1156-2-e'
                    SynthesisToolPackageName: ''
                     SynthesisToolSpeedValue: ''

Estimate ResNet-18 Performance for Custom Bitstream Configuration

To estimate the performance of the ResNet-18 DAG network, use the estimatePerformance function of the dlhdl.ProcessorConfig object. The function returns the estimated layer latency, network latency, and network performance in frames per second (Frames/s).

hPC_custom.estimatePerformance(snet)
### Optimizing series network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### Notice: The layer 'data' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: The layer 'ClassificationLayer_predictions' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.


              Deep Learning Processor Estimator Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                   21219873                  0.10610                       1           21219873              9.4
    ____conv1              2165372                  0.01083 
    ____pool1               646226                  0.00323 
    ____res2a_branch2a      966221                  0.00483 
    ____res2a_branch2b      966221                  0.00483 
    ____res2a               210750                  0.00105 
    ____res2b_branch2a      966221                  0.00483 
    ____res2b_branch2b      966221                  0.00483 
    ____res2b               210750                  0.00105 
    ____res3a_branch1       540749                  0.00270 
    ____res3a_branch2a      708564                  0.00354 
    ____res3a_branch2b      919117                  0.00460 
    ____res3a               105404                  0.00053 
    ____res3b_branch2a      919117                  0.00460 
    ____res3b_branch2b      919117                  0.00460 
    ____res3b               105404                  0.00053 
    ____res4a_branch1       509261                  0.00255 
    ____res4a_branch2a      509261                  0.00255 
    ____res4a_branch2b      905421                  0.00453 
    ____res4a                52724                  0.00026 
    ____res4b_branch2a      905421                  0.00453 
    ____res4b_branch2b      905421                  0.00453 
    ____res4b                52724                  0.00026 
    ____res5a_branch1       751693                  0.00376 
    ____res5a_branch2a      751693                  0.00376 
    ____res5a_branch2b     1415373                  0.00708 
    ____res5a                26368                  0.00013 
    ____res5b_branch2a     1415373                  0.00708 
    ____res5b_branch2b     1415373                  0.00708 
    ____res5b                26368                  0.00013 
    ____pool5                54594                  0.00027 
    ____fc1000              207351                  0.00104 
 * The clock frequency of the DL processor is: 200MHz

The custom bitstream configuration differs from the zcu102_single bitstream configuration in its target frequency: 200 MHz instead of 220 MHz. At the lower clock frequency, each cycle takes longer, so a layer with the same cycle count in both estimates, such as conv1 (2165372 cycles), has a higher latency in seconds on the custom configuration (0.01083 s) than on the zcu102_single configuration (0.00984 s).
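The effect of the target frequency on a layer whose cycle count is the same in both estimates can be checked by hand. This sketch uses the conv1 cycle count from the tables above and assumes that latency in seconds is the cycle count divided by the clock frequency; the variable names are illustrative.

```matlab
conv1Cycles = 2165372;     % conv1 row, identical in both estimates
freqMHz     = [200 220];   % custom and zcu102_single target frequencies

% Latency in seconds at each clock frequency.
latencySeconds = conv1Cycles ./ (freqMHz * 1e6);   % about [0.01083 0.00984]

% Ratio of the 200 MHz latency to the 220 MHz latency.
slowdown = latencySeconds(1) / latencySeconds(2);  % 220/200 = 1.1

fprintf('conv1: %.5f s at %d MHz, %.5f s at %d MHz (ratio %.2f)\n', ...
    latencySeconds(1), freqMHz(1), latencySeconds(2), freqMHz(2), slowdown);
```

For a fixed cycle count, running at 200 MHz instead of 220 MHz increases the latency by 10 percent.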

Modify Custom Processor Configuration

Modify the custom processor configuration to increase the target frequency. To learn about modifiable parameters of the processor configuration, see dlhdl.ProcessorConfig.

hPC_custom.TargetFrequency = 220;
hPC_custom
hPC_custom = 
                    Processing Module "conv"
                            ModuleGeneration: 'on'
                          LRNBlockGeneration: 'on'
                            ConvThreadNumber: 16
                             InputMemorySize: [227 227 3]
                            OutputMemorySize: [227 227 3]
                            FeatureSizeLimit: 2048

                      Processing Module "fc"
                            ModuleGeneration: 'on'
                      SoftmaxBlockGeneration: 'off'
                              FCThreadNumber: 4
                             InputMemorySize: 25088
                            OutputMemorySize: 4096

                   Processing Module "adder"
                            ModuleGeneration: 'on'
                             InputMemorySize: 40
                            OutputMemorySize: 40

              Processor Top Level Properties
                              RunTimeControl: 'register'
                          InputDataInterface: 'External Memory'
                         OutputDataInterface: 'External Memory'
                           ProcessorDataType: 'single'

                     System Level Properties
                              TargetPlatform: 'Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit'
                             TargetFrequency: 220
                               SynthesisTool: 'Xilinx Vivado'
                             ReferenceDesign: 'AXI-Stream DDR Memory Access : 3-AXIM'
                     SynthesisToolChipFamily: 'Zynq UltraScale+'
                     SynthesisToolDeviceName: 'xczu9eg-ffvb1156-2-e'
                    SynthesisToolPackageName: ''
                     SynthesisToolSpeedValue: ''

Re-estimate ResNet-18 Performance for Modified Custom Bitstream Configuration

Estimate the performance of the ResNet-18 DAG network on the modified custom bitstream configuration.

hPC_custom.estimatePerformance(snet)
### Optimizing series network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### Notice: The layer 'data' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: The layer 'ClassificationLayer_predictions' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.


              Deep Learning Processor Estimator Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                   23634966                  0.10743                       1           23634966              9.3
    ____conv1              2165372                  0.00984 
    ____pool1               646226                  0.00294 
    ____res2a_branch2a      966221                  0.00439 
    ____res2a_branch2b      966221                  0.00439 
    ____res2a               210750                  0.00096 
    ____res2b_branch2a      966221                  0.00439 
    ____res2b_branch2b      966221                  0.00439 
    ____res2b               210750                  0.00096 
    ____res3a_branch1       540749                  0.00246 
    ____res3a_branch2a      763860                  0.00347 
    ____res3a_branch2b      919117                  0.00418 
    ____res3a               105404                  0.00048 
    ____res3b_branch2a      919117                  0.00418 
    ____res3b_branch2b      919117                  0.00418 
    ____res3b               105404                  0.00048 
    ____res4a_branch1       509261                  0.00231 
    ____res4a_branch2a      509261                  0.00231 
    ____res4a_branch2b      905421                  0.00412 
    ____res4a                52724                  0.00024 
    ____res4b_branch2a      905421                  0.00412 
    ____res4b_branch2b      905421                  0.00412 
    ____res4b                52724                  0.00024 
    ____res5a_branch1      1046605                  0.00476 
    ____res5a_branch2a     1046605                  0.00476 
    ____res5a_branch2b     2005197                  0.00911 
    ____res5a                26368                  0.00012 
    ____res5b_branch2a     2005197                  0.00911 
    ____res5b_branch2b     2005197                  0.00911 
    ____res5b                26368                  0.00012 
    ____pool5                54594                  0.00025 
    ____fc1000              207852                  0.00094 
 * The clock frequency of the DL processor is: 220MHz
