Estimate Resource Utilization for Custom Processor Configuration
To estimate the resource utilization of a custom processor configuration, compare resource utilization for a custom processor configuration to resource utilization of a reference (shipping) bitstream processor configuration. Analyze the effects of custom deep learning processor parameters on resource utilization.
Estimate Resource Utilization
Calculate resource utilization for a custom processor configuration.
Create a
dlhdl.ProcessorConfig
object.hPC = dlhdl.ProcessorConfig
hPC = Processing Module "conv" ModuleGeneration: 'on' LRNBlockGeneration: 'on' ConvThreadNumber: 16 InputMemorySize: [227 227 3] OutputMemorySize: [227 227 3] FeatureSizeLimit: 2048 Processing Module "fc" ModuleGeneration: 'on' SoftmaxBlockGeneration: 'off' FCThreadNumber: 4 InputMemorySize: 25088 OutputMemorySize: 4096 Processing Module "adder" ModuleGeneration: 'on' InputMemorySize: 40 OutputMemorySize: 40 Processor Top Level Properties RunTimeControl: 'register' InputDataInterface: 'External Memory' OutputDataInterface: 'External Memory' ProcessorDataType: 'single' System Level Properties TargetPlatform: 'Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit' TargetFrequency: 200 SynthesisTool: 'Xilinx Vivado' ReferenceDesign: 'AXI-Stream DDR Memory Access : 3-AXIM' SynthesisToolChipFamily: 'Zynq UltraScale+' SynthesisToolDeviceName: 'xczu9eg-ffvb1156-2-e' SynthesisToolPackageName: '' SynthesisToolSpeedValue: ''
Call
estimateResources
to retrieve resource utilization.hPC.estimateResources
Deep Learning Processor Estimator Resource Results DSPs Block RAM* LUTs(CLB/ALUT) ------------- ------------- ------------- Available 2520 912 274080 ------------- ------------- ------------- DL_Processor 377( 15%) 508( 56%) 234175( 86%) * Block RAM represents Block RAM tiles in Xilinx devices and Block RAM bits in Intel devices
The returned table contains resource utilization for the entire processor and individual modules.
Customize Bitstream Configuration to Meet Resource Use Requirements
The user wants to deploy a digit recognition network with a target performance of 500 frames per second (FPS) to a Xilinx™ ZCU102 ZU4CG device. The target device resource counts are:
Digital signal processor (DSP) slice count - 240
Block random access memory (BRAM) count -128
The reference (shipping) zcu102_int8
bitstream configuration is for a Xilinx ZCU102 ZU9EG device. The default board resource counts are:
Digital signal processor (DSP) slice count - 2520
Block random access memory (BRAM) count -912
The default board resource counts exceed the user resource budget and is on the higher end of the cost spectrum. You can achieve target performance and resource use budget by quantizing the target deep learning network and customizing the custom default bitstream configuration.
In this example create a custom bitstream configuration to match your resource budget and performance requirements.
Prerequisites
Deep Learning HDL Toolbox™ Support Package for Xilinx FPGA and SoC
Deep Learning Toolbox™
Deep Learning HDL Toolbox™
Deep Learning Toolbox Model Quantization Library
Load Pretrained Network
To load the pretrained series network, that has been trained on the Modified National Institute Standards of Technology (MNIST) database, enter:
snet = getDigitsNetwork;
Quantize Network
To quantize the MNIST based digits network, enter:
dlquantObj = dlquantizer(snet,'ExecutionEnvironment','FPGA'); Image = imageDatastore('five_28x28.pgm','Labels','five'); dlquantObj.calibrate(Image);
Retrieve zcu102_int Bitstream Configuration
To retrieve the zcu102_int8
bitstream configuration, use the dlhdl.ProcessorConfig
object. For more information, see dlhdl.ProcessorConfig
. To learn about modifiable parameters of the processor configuration, see getModuleProperty
and setModuleProperty
.
hPC_reference = dlhdl.ProcessorConfig('Bitstream','zcu102_int8')
hPC_reference = Processing Module "conv" ModuleGeneration: 'on' LRNBlockGeneration: 'off' SegmentationBlockGeneration: 'on' ConvThreadNumber: 64 InputMemorySize: [227 227 3] OutputMemorySize: [227 227 3] FeatureSizeLimit: 2048 Processing Module "fc" ModuleGeneration: 'on' SoftmaxBlockGeneration: 'off' SigmoidBlockGeneration: 'off' FCThreadNumber: 16 InputMemorySize: 25088 OutputMemorySize: 4096 Processing Module "custom" ModuleGeneration: 'on' Addition: 'on' Multiplication: 'on' Resize2D: 'off' Sigmoid: 'off' TanhLayer: 'off' InputMemorySize: 40 OutputMemorySize: 120 Processor Top Level Properties RunTimeControl: 'register' RunTimeStatus: 'register' InputStreamControl: 'register' OutputStreamControl: 'register' SetupControl: 'register' ProcessorDataType: 'int8' System Level Properties TargetPlatform: 'Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit' TargetFrequency: 250 SynthesisTool: 'Xilinx Vivado' ReferenceDesign: 'AXI-Stream DDR Memory Access : 3-AXIM' SynthesisToolChipFamily: 'Zynq UltraScale+' SynthesisToolDeviceName: 'xczu9eg-ffvb1156-2-e' SynthesisToolPackageName: '' SynthesisToolSpeedValue: ''
Estimate Network Performance and Resource Utilization for zcu102_int8
Bitstream Configuration
To estimate the performance of the digits series network, use the estimatePerformance
function of the dlhdl.ProcessorConfig
object. The function returns the estimated layer latency, network latency, and network performance in frames per second (Frames/s).
To estimate the resource use of the zcu102_int8
bitstream,
use the estimateResources
function of the dlhdl.ProcessorConfig
object. The function returns the estimated DSP slice and BRAM usage.
hPC_reference.estimatePerformance(dlquantObj)
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer' ### The network includes the following layers: 1 'imageinput' Image Input 28×28×1 images with 'zerocenter' normalization (SW Layer) 2 'conv_1' Convolution 8 3×3×1 convolutions with stride [1 1] and padding 'same' (HW Layer) 3 'relu_1' ReLU ReLU (HW Layer) 4 'maxpool_1' Max Pooling 2×2 max pooling with stride [2 2] and padding [0 0 0 0] (HW Layer) 5 'conv_2' Convolution 16 3×3×8 convolutions with stride [1 1] and padding 'same' (HW Layer) 6 'relu_2' ReLU ReLU (HW Layer) 7 'maxpool_2' Max Pooling 2×2 max pooling with stride [2 2] and padding [0 0 0 0] (HW Layer) 8 'conv_3' Convolution 32 3×3×16 convolutions with stride [1 1] and padding 'same' (HW Layer) 9 'relu_3' ReLU ReLU (HW Layer) 10 'fc' Fully Connected 10 fully connected layer (HW Layer) 11 'softmax' Softmax softmax (SW Layer) 12 'classoutput' Classification Output crossentropyex with '0' and 9 other classes (SW Layer) ### Notice: The layer 'imageinput' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software. ### Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software. ### Notice: The layer 'classoutput' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software. Deep Learning Processor Estimator Performance Results LastFrameLatency(cycles) LastFrameLatency(seconds) FramesNum Total Latency Frames/s ------------- ------------- --------- --------- --------- Network 58101 0.00023 1 58101 4302.9 ____conv_1 4391 0.00002 ____maxpool_1 2877 0.00001 ____conv_2 2351 0.00001 ____maxpool_2 2265 0.00001 ____conv_3 2651 0.00001 ____fc 43566 0.00017 * The clock frequency of the DL processor is: 250MHz
hPC_reference.estimateResources
Deep Learning Processor Estimator Resource Results DSPs Block RAM* LUTs(CLB/ALUT) ------------- ------------- ------------- Available 2520 912 274080 ------------- ------------- ------------- DL_Processor 797( 32%) 386( 43%) 142494( 52%) * Block RAM represents Block RAM tiles in Xilinx devices and Block RAM bits in Intel devices
The estimated performance is 4303 FPS and the estimated resource use counts are:
Digital signal processor (DSP) slice count - 797
Block random access memory (BRAM) count -386
The estimated DSP slice count and BRAM count use exceeds the target device resource budget. Customize the bitstream configuration to reduce resource use.
Create Custom Bitstream Configuration
To create a custom processor configuration, use the dlhdl.ProcessorConfig
object. For more information, see dlhdl.ProcessorConfig
. To learn about modifiable parameters of the processor configuration, see getModuleProperty
and setModuleProperty
.
To reduce the resource use for the custom bitstream, modify the KernelDataType
for the conv, fc, and
adder modules. Modify the ConvThreadNumber
to reduce DSP slice count. Reduce the InputMemorySize
and OutputMemorySize
for the conv
module to reduce BRAM count.
hPC_custom = dlhdl.ProcessorConfig; hPC_custom.ProcessorDataType = 'int8'; hPC_custom.setModuleProperty('conv','ConvThreadNumber',4); hPC_custom.setModuleProperty('conv','InputMemorySize',[30 30 1]); hPC_custom.setModuleProperty('conv','OutputMemorySize',[30 30 1]); hPC_custom
hPC_custom = Processing Module "conv" ModuleGeneration: 'on' LRNBlockGeneration: 'off' SegmentationBlockGeneration: 'on' ConvThreadNumber: 4 InputMemorySize: [30 30 1] OutputMemorySize: [30 30 1] FeatureSizeLimit: 2048 Processing Module "fc" ModuleGeneration: 'on' SoftmaxBlockGeneration: 'off' SigmoidBlockGeneration: 'off' FCThreadNumber: 4 InputMemorySize: 25088 OutputMemorySize: 4096 Processing Module "custom" ModuleGeneration: 'on' Addition: 'on' Multiplication: 'on' Resize2D: 'off' Sigmoid: 'off' TanhLayer: 'off' InputMemorySize: 40 OutputMemorySize: 120 Processor Top Level Properties RunTimeControl: 'register' RunTimeStatus: 'register' InputStreamControl: 'register' OutputStreamControl: 'register' SetupControl: 'register' ProcessorDataType: 'int8' System Level Properties TargetPlatform: 'Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit' TargetFrequency: 200 SynthesisTool: 'Xilinx Vivado' ReferenceDesign: 'AXI-Stream DDR Memory Access : 3-AXIM' SynthesisToolChipFamily: 'Zynq UltraScale+' SynthesisToolDeviceName: 'xczu9eg-ffvb1156-2-e' SynthesisToolPackageName: '' SynthesisToolSpeedValue: ''
Estimate Network Performance and Resource Utilization for Custom Bitstream Configuration
To estimate the performance of the digits series network, use the estimatePerformance
function of the dlhdl.ProcessorConfig
object. The function returns the estimated layer latency, network latency, and network performance in frames per second (Frames/s).
To estimate the resource use of the hPC_custom
bitstream,
use the estimateResources
function of the dlhdl.ProcessorConfig
object. The function returns the estimated DSP slice and BRAM usage.
hPC_custom.estimatePerformance(dlquantObj)
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer' ### The network includes the following layers: 1 'imageinput' Image Input 28×28×1 images with 'zerocenter' normalization (SW Layer) 2 'conv_1' Convolution 8 3×3×1 convolutions with stride [1 1] and padding 'same' (HW Layer) 3 'relu_1' ReLU ReLU (HW Layer) 4 'maxpool_1' Max Pooling 2×2 max pooling with stride [2 2] and padding [0 0 0 0] (HW Layer) 5 'conv_2' Convolution 16 3×3×8 convolutions with stride [1 1] and padding 'same' (HW Layer) 6 'relu_2' ReLU ReLU (HW Layer) 7 'maxpool_2' Max Pooling 2×2 max pooling with stride [2 2] and padding [0 0 0 0] (HW Layer) 8 'conv_3' Convolution 32 3×3×16 convolutions with stride [1 1] and padding 'same' (HW Layer) 9 'relu_3' ReLU ReLU (HW Layer) 10 'fc' Fully Connected 10 fully connected layer (HW Layer) 11 'softmax' Softmax softmax (SW Layer) 12 'classoutput' Classification Output crossentropyex with '0' and 9 other classes (SW Layer) ### Notice: The layer 'imageinput' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software. ### Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software. ### Notice: The layer 'classoutput' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software. Deep Learning Processor Estimator Performance Results LastFrameLatency(cycles) LastFrameLatency(seconds) FramesNum Total Latency Frames/s ------------- ------------- --------- --------- --------- Network 433577 0.00217 1 433577 461.3 ____conv_1 26160 0.00013 ____maxpool_1 31888 0.00016 ____conv_2 44736 0.00022 ____maxpool_2 22337 0.00011 ____conv_3 265045 0.00133 ____fc 43411 0.00022 * The clock frequency of the DL processor is: 200MHz
hPC_custom.estimateResources
Deep Learning Processor Estimator Resource Results DSPs Block RAM* LUTs(CLB/ALUT) ------------- ------------- ------------- Available 2520 912 274080 ------------- ------------- ------------- DL_Processor 131( 6%) 108( 12%) 56270( 21%) * Block RAM represents Block RAM tiles in Xilinx devices and Block RAM bits in Intel devices
The estimated performance is 461.3 FPS and the estimated resource use counts are:
Digital signal processor (DSP) slice count - 131
Block random access memory (BRAM) count -108
The estimated resources of the customized bitstream match the user target device resource budget and the estimated performance matches the target network performance.
See Also
dlhdl.ProcessorConfig
| getModuleProperty
| setModuleProperty
| estimatePerformance
| estimateResources