Reidentify People Throughout a Video Sequence Using ReID Network
This example shows how to track people throughout a video sequence using re-identification with a residual network.
Re-identification (ReID) is a critical component in visual object tracking that aims to solve the problem of temporary object occlusion in videos. In real-world scenarios, an object being tracked can be temporarily occluded by other objects or leave the field of view of the camera, making it difficult to track consistently. These objects can also differ frame-to-frame in pose, orientation, and lighting conditions. In these complicated scenarios, the tracker often fails to reidentify the object when it reappears in a new video frame. The tracker then starts tracking the object as a new object. This misidentification leads to errors and inconsistencies in object tracking.
ReID aims to solve this problem by identifying the same object in the new frame by matching its features to the previously tracked object features, even if it appears in a different location or orientation, or has dissimilar lighting compared to the previous frame. This approach ensures that the tracker can maintain consistent tracking information for a given object.
ReID is typically used in tracking applications such as surveillance, automated driving systems, robot vision, and sports analytics, where accurate and consistent tracking of objects is essential.
This example first shows how to perform re-identification in a video sequence with a pretrained ReID network. The second part of the example shows how to train a ReID network as a traditional classification network with cross-entropy loss. After training is complete, you remove the classification output layers so that the network outputs an appearance feature vector with a length of 128.
Load Pretrained Re-Identification Network
Load the pretrained ReID network trained on the pedestrian dataset. To train the network, see the Train ReID Network section of this example.
pretrainedNet = helperDownloadReIDNetwork;
pretrainedNet = initialize(pretrainedNet);
Reidentify Pedestrian in Video Sequence
Download the pedestrian tracking test video file.
datasetname="PedestrianTracking"; videoURL = "https://ssd.mathworks.com/supportfiles/vision/data/PedestrianTrackingVideo.avi"; if ~exist("PedestrianTrackingVideo.avi","file") disp("Downloading Pedestrian Tracking Video (35 MB)") websave("PedestrianTrackingVideo.avi",videoURL); end
Load a pretrained object detector.
detector = yolov4ObjectDetector("csp-darknet53-coco");
Read the pedestrian tracking video.
pedestrianVideo = VideoReader("PedestrianTrackingVideo.avi");
To detect all objects in each frame, iterate through the video sequence. Compute the output of the pretrained ReID network by passing pretrainedNet and the cropped pedestrian objects as inputs to the predict function. The output, appearanceDLArray, is an appearance feature vector with a length of 128.
To identify the same individual throughout the video sequence, compare the appearance feature vector of the first pedestrian with that of each subsequently detected pedestrian using cosine similarity. Cosine similarity values range from -1 to 1, where 1 indicates that the pedestrian images are identical, 0 indicates that the images are not very alike, and -1 indicates that the images are vastly different. To match only images that are closely related to one another, set the similarity threshold similarityThreshold to 0.85.
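As a toy illustration of this comparison (the vectors and values below are made up, not produced by the network), the cosine similarity of two appearance vectors is the dot product of their L2-normalized forms:

a = [0.2 0.9 0.1];                   % hypothetical appearance vectors
b = [0.25 0.85 0.05];
cosSim = (a/norm(a))*(b/norm(b))';   % close to 1 for similar appearances
isMatch = cosSim > 0.85;             % apply the similarity threshold

The loop below applies the same comparison to every pedestrian detected in each frame.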
pedestrianFeature = [];
pedestrianMontage = {};
while hasFrame(pedestrianVideo)
    % Read the current frame.
    vidFrame = readFrame(pedestrianVideo);

    % Run the detector, crop all bounding boxes to the frame, and round the
    % bounding box to integer values.
    [bboxes,scores,labels] = detect(detector,vidFrame,Threshold=0.5);
    bboxes = bboxcrop(bboxes,[1 1 size(vidFrame,2) size(vidFrame,1)]);
    bboxes = round(bboxes);

    % Count the number of each object detected in the frame and find the
    % number of people detected.
    numLabels = countcats(labels);
    numPedestrians = numLabels(1);

    % Crop each detected person and pass the cropped pedestrian through the
    % pretrained ReID network to obtain appearance feature vectors.
    appearanceData = zeros(numPedestrians,128);
    croppedPerson = cell(numPedestrians);
    pedestrian = 1;
    for i = 1:size(bboxes,1)
        % Pass only detected pedestrian objects through the pretrained network.
        if labels(i) == "person"
            bbox = bboxes(i,:);
            croppedImg = vidFrame(bbox(2):bbox(2)+bbox(4),bbox(1):bbox(1)+bbox(3),:);
            croppedPerson{pedestrian} = imresize(croppedImg,[128 64]);
            appearanceDLArray = predict(pretrainedNet,dlarray(im2single(croppedPerson{pedestrian})*255));
            appearanceData(pedestrian,:) = appearanceDLArray;
            pedestrian = pedestrian + 1;
        end
    end

    % Obtain the first pedestrian feature vector and use the best matching
    % feature vector in each frame to continuously track the pedestrian
    % through the video sequence.
    if isempty(pedestrianFeature)
        pedestrianFeature = appearanceData(1,:);
        pedestrianMontage{end+1} = croppedPerson{1};
    else
        normAppearanceData = appearanceData./vecnorm(appearanceData,2,2);
        normPedestrianFeature = pedestrianFeature./vecnorm(pedestrianFeature,2,2);
        cosineSimilarity = normAppearanceData*normPedestrianFeature';
        [cosSim,matchIdx] = max(cosineSimilarity);

        % Update the pedestrian feature vector to the latest frame data.
        % Filter out the best matching feature vector if it is not close
        % enough to the last known feature vector. This approach helps
        % handle the case where the person is no longer in the video frame.
        similarityThreshold = 0.85;
        if cosSim > similarityThreshold
            pedestrianFeature = appearanceData(matchIdx,:);
            pedestrianMontage{end+1} = croppedPerson{matchIdx};
        end
    end
end
Display the pedestrian identified throughout the video sequence.
montage(pedestrianMontage)
The network reidentifies the individual in approximately 75% of the 73 distinct video frames that contain the individual. The simple tracking logic, which uses a single cosine similarity threshold to filter and match appearance feature vectors, causes this imperfect performance. Increasing the threshold causes the network to miss the individual in more frames, while decreasing it allows other pedestrians to take over the object track. In addition, this simple tracking logic does not attempt to match the other pedestrians in each frame.
To significantly improve the network tracking performance, implement the robust tracking logic in the Multi-Object Tracking with DeepSORT (Sensor Fusion and Tracking Toolbox) example.
Load Training Data
To train the ReID network, first label the video sequence data with a labeling tool such as Image Labeler or Ground Truth Labeler (Automated Driving Toolbox). Each detected object identity must be tracked through every frame for each video, ensuring the identity label is consistent across video sequences. To ensure that the object is consistently labeled in each frame, assign different labels for each identity or use a string attribute. For videos that have minimal variation per object, use Create Automation Algorithm Function for Labeling to help with manual labeling tasks.
Once the data is fully labeled and exported from a labeler, use the groundTruth object to create imageDatastore and boxLabelDatastore objects directly by using the objectDetectorTrainingData function. To train the classifier that you convert into a ReID network, process the data further so that only the object of interest is in each bounding box. Resize these cropped images immediately or during the preprocessing stage of training the classifier.
In this example, the pedestrianDataset.zip file contains a folder that has 30 subfolders with cropped training images. Each object identity is organized into its own subfolder, for a total of 30 identities. See the Load Test Data section of this example for the entire preprocessing workflow to use with your own labeled data.
Unzip the pedestrian training data using the helperUnzipData helper function.
unzipDirectory = pwd;
helperUnzipData(unzipDirectory)
Generate Training Data with Synthetic Object Occlusions
One of the main challenges with re-identification is identifying an object when it is partially occluded from view. In the pedestrian case, other pedestrians can block much of the individual of interest from view. Because training data often does not contain such images, generating synthetic training data that includes occlusions improves the robustness of the network to partial occlusion.
First, set the random seed for repeatable occlusion data generation.
rng(0)
Generate synthetic training data using the helperGenerateOcclusionData helper function. Store the occlusion training data, including the original images, in occlusionDatasetDirectory.
datasetFolder = "pedestrianDataset"; trainingDataFolder = fullfile(unzipDirectory,datasetFolder); occlusionDatasetDirectory = fullfile("pedestrianOcclusionDataset"); imds = imageDatastore(trainingDataFolder,IncludeSubfolders=true,LabelSource="foldernames"); if ~exist("generateOcclusionData","var") generateOcclusionData = true; end if generateOcclusionData && ~exist(occlusionDatasetDirectory,"dir") writeall(imds,occlusionDatasetDirectory,WriteFcn=@(img,writeInfo,outputFormat) ... helperGenerateOcclusionData(img,writeInfo,outputFormat,datasetFolder,imds)); generateOcclusionData = false; end
The helperGenerateOcclusionData helper function inserts occlusions into each image by performing these steps:
1. Segment the object within the crop using the grabcut function.
2. Remove additional background pixels from the grabcut segmentation with the activecontour function.
3. Apply a Gaussian blur to offset the potentially sharp edges of the inserted segmented object.
4. Resize and shift the segmented object to ensure that it does not entirely block the pedestrian in the base image.
Apply this process before the main data augmentation step because you must review the quality of the generated training images. This algorithm assumes that the training images are closely cropped around a single individual. If the training images contain excessive background, tune the number of superpixels and the grabcut and activecontour function properties.
Load Training Data into Datastore
Load the cropped and organized training data into an ImageDatastore object. Set the IncludeSubfolders argument to true to use all of the data in the occlusion data set directory. Set the LabelSource argument to "foldernames" to use the corresponding folder names as the training data labels.
trainImds = imageDatastore(fullfile(occlusionDatasetDirectory,datasetFolder),IncludeSubfolders=true,LabelSource="foldernames");
Prepare Data for Training
Shuffle the datastore prior to splitting into training and validation sets to ensure that the training and validation sets include a mix of individuals.
ds = shuffle(trainImds);
numTraining = round(size(trainImds.Files,1)*0.8);
dsTrain = subset(ds,1:numTraining);
dsVal = subset(ds,numTraining+1:size(trainImds.Files,1));
To improve the robustness of the ReID network, use an imageDataAugmenter (Deep Learning Toolbox) object to apply several training data augmentations, including shifting, flipping, rotating, and scaling.
inputSize = [128 64 3];
pixelShiftRange = [-16 16];
imageAugmenter = imageDataAugmenter( ...
    RandXTranslation=pixelShiftRange, ...
    RandYTranslation=pixelShiftRange, ...
    RandXReflection=true, ...
    RandRotation=[-15 15], ...
    RandScale=[0.75 1.25]);
Create an augmentedImageDatastore (Deep Learning Toolbox) object with imageAugmenter.
augDSTrain = augmentedImageDatastore( ...
    inputSize(1:2), ...
    dsTrain, ...
    DataAugmentation=imageAugmenter);
Preview the augmented training data, which includes the inserted occlusions.
previewImg = readByIndex(augDSTrain,19:22);
montage(previewImg.input,Size=[1 4])
Reset the datastore to its initial state.
reset(augDSTrain)
Define ReID Network Architecture
The helperCreateReIDNetResnet helper function uses resnetLayers (Deep Learning Toolbox) to create a custom residual network whose design is based on the DeepSORT ReID network [1]. Set the feature vector length to 128. To learn more identifying details per individual, increase the feature dimension at the expense of slower training and inference times.
allLabels = unique(trainImds.Labels);
numClasses = numel(allLabels);
featureDim = 128;
net = helperCreateReIDNetResnet(numClasses,featureDim,inputSize);
Specify Training Options
Specify the training options.
numEpochs = 150;
miniBatchSize = 64;
options = trainingOptions("adam", ...
    MaxEpochs=numEpochs, ...
    ValidationData=dsVal, ...
    InitialLearnRate=0.01, ...
    LearnRateDropFactor=0.1, ...
    LearnRateDropPeriod=round(numEpochs/3), ...
    LearnRateSchedule="piecewise", ...
    MiniBatchSize=miniBatchSize, ...
    Shuffle="every-epoch", ...
    VerboseFrequency=10, ...
    ValidationFrequency=30, ...
    OutputNetwork="best-validation-loss", ...
    Verbose=false, ...
    Plots="training-progress");
Train ReID Network
Use the trainnet (Deep Learning Toolbox) function to train the ReID network if the doTraining variable is true. Training takes about 1 hour on a 24 GB GPU. To prevent out-of-memory errors, reduce the mini-batch size if your system has less memory.
doTraining = false;
if doTraining
    net = trainnet(augDSTrain,net,"crossentropy",options);
else
    load("personReIDResNet.mat","net");
end
Remove Cosine Softmax Layer
After training, remove the cosine softmax layer, if the network contains one, so that the network outputs only the appearance feature vector. The classification output is required only while training the network with trainnet.
if strcmp(net.OutputNames{1},"Cosine_Softmax")
    % Remove the cosine softmax layer.
    net = removeLayers(net,net.OutputNames{1});
end
net = initialize(net);
Evaluate ReID Network
Load Test Data
Load the labeled pedestrian ground truth test data.
load("pedestrianLabelTestData.mat","gTruth");
Process the test data and store network-ready input images. The helperCropImagesWithGroundtruth helper function uses the ground truth data to crop out all the labeled test data within the video frames. The function also resizes the cropped images to 128-by-64 pixels and organizes the labels into individual folders under the root folder testDataFolder.
testDataFolder = fullfile("pedestrianTestData");
if ~isfolder(testDataFolder)
    helperCropImagesWithGroundtruth(gTruth,testDataFolder)
end
Load the cropped and organized test data into an ImageDatastore object. Set the IncludeSubfolders name-value argument to true to use all of the data in testDataFolder, and set LabelSource to "foldernames" to use the corresponding folder names as the test data labels.
testImds = imageDatastore(testDataFolder,IncludeSubfolders=true,LabelSource="foldernames");
Obtain Appearance Feature Vectors For Test Data
Create a mini-batch queue and pass the test datastore testImds through the ReID network to obtain appearance feature vectors for each test image. Set the minibatchqueue object to read data in batches of 64. To prevent out-of-memory errors, reduce the mini-batch size if your system has limited resources.
miniBatchSize = 64;
testImds.ReadSize = miniBatchSize;

mbq = minibatchqueue(testImds, ...
    MiniBatchSize=miniBatchSize, ...
    MiniBatchFcn=@(data)cat(4,data{1:end}), ...
    MiniBatchFormat="SSCB");
Read through the test data in batches and extract the appearance feature vectors.
appearanceFeatures = [];
while hasdata(mbq)
    dlX = next(mbq);
    dlYPred = predict(net,dlX);
    appearanceFeatures = [appearanceFeatures dlYPred];
end
appearanceData = extractdata(appearanceFeatures);
Calculate Cumulative Matching Characteristics
Evaluate the ReID network with the cumulative matching characteristic (CMC) metric [2]. Given a query image and an image gallery that contains exactly one match to the query image, the CMC metric measures the ability of an identification system to retrieve the correct match from the gallery among the top k retrieved items. An image gallery is a subset of the test data that consists of at least one instance of each individual in the test data. A high CMC value at rank k = 1 indicates better performance, because the correct match is then the most similar gallery image for a larger fraction of queries.
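In other words, if you record the rank at which the correct gallery match is retrieved for each query trial, the CMC at rank k is the fraction of trials with rank less than or equal to k. A toy illustration with made-up ranks (not values from this example):

retrievedRanks = [1 3 1 2 5 1];          % hypothetical rank of the correct match per query
k = 2;
cmcAtRankK = mean(retrievedRanks <= k)   % fraction of queries matched within the top k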
Use the test appearance feature vectors to obtain an M-by-M cosine similarity matrix, where M is the total number of data points in the test set.
normAppearanceData = appearanceData'./vecnorm(appearanceData',2,2);
cosineSimilarity = normAppearanceData*normAppearanceData';
Although no standard CMC calculation exists for a multi-instance gallery, you can calculate the CMC for a single-instance gallery using a well-defined method. To generate an accurate CMC curve, create multiple gallery sets that each contain a single instance of every individual. Each gallery must contain only one image per identity, which results in a 1-by-N gallery, where N is the total number of identities in the test data. The helperCalculateCMC helper function performs numTrials queries on randomly constructed galleries.
To determine where the ReID network is performing poorly, calculate the CMC for each identity.
First, set the random seed.
rng(0)
Set the number of query trials. A higher number of trials captures the ReID network performance more effectively.
numTrials = 1000;
Count the number of instances of each identity in the test set.
numOfIdentityInstances = countEachLabel(testImds);
Set the total number of identities in the gallery, which is also the highest gallery return rank to evaluate.
identities = size(numOfIdentityInstances,1);
Calculate the CMC for the given ReID network.
[cmcPerIdentity,galleryIdx] = helperCalculateCMC(testImds,cosineSimilarity,numTrials,numOfIdentityInstances,identities);
Average the per-identity CMC values to obtain the overall CMC.
cmc = mean(cmcPerIdentity);
Plot the CMC curve of each identity in the test set, as well as the average CMC over all identities. The CMC curves show that the ReID network performs well on average, identifying the correct individual at rank 1 around 74% of the time. Within the top two ranks, the network reidentifies the correct individual 85% of the time.
plot(1:identities,cmcPerIdentity',1:identities,cmc,"--",LineWidth=2);
labels = addcats(numOfIdentityInstances.Label,"Average");
legend(categories(labels),Location="southeast")
xlabel("Rank {\it k}");
ylabel("Re-identification Rate");
xlim([0 identities]);
ylim([max(0,min(cmcPerIdentity(:,1))-0.1) 1]);
title("Cumulative Match Characteristic (CMC) Curve");
According to the CMC curve, the ReID network struggles to identify person 5 and person 8. To understand the variation in network performance, visualize the test data samples for persons 1, 5, and 8.
galleryMontage = {};
for i = 1:3
    galleryMontage{end+1} = imtile(testImds.Files(galleryIdx(i,[1 5 8])),GridSize=[1,3]);
end
montage(galleryMontage,Size=[3,1],BorderSize=[3 0])
title("Sample Gallery Sets - Person 1, 5, and 8 (Left to Right)")
The severely distorted images of person 5 in the gallery sets account for the low performance, as the network cannot generate strong correlations when feature details are minimal. Person 8 is similarly distorted in some frames despite wearing a distinguishing clothing color. Because the training data contains mostly individuals with darker clothing, the network may place greater weight on other identifying characteristics. Additionally, person 8 appears far in the background in many frames and is mostly backward-facing, so the network may rely on characteristics that are more visible from side-facing or front-facing body positions.
Display two examples of low-quality images of person 5 and person 8 that hinder performance.
lowQualityImgs = cell(1,2);
lowQualityImgs{1,1} = imread(fullfile(testDataFolder,"person_5","24_03.jpg"));
lowQualityImgs{1,2} = imread(fullfile(testDataFolder,"person_8","169_03.jpg"));
montage(lowQualityImgs,ThumbnailSize=[256 128])
title("Low Quality Examples of Person 5 and Person 8 (Left to Right)")
To address the performance issues seen with person 8, train the network with a data set that contains more identities and more varied features. If your test data contains objects whose size varies significantly throughout an image sequence, consider training only with low-resolution samples of that object. This approach requires further empirical analysis: the network is likely to fall back on less unique characteristics, such as clothing color, because fewer definitive physical differences between individuals are visible at low resolution.
Summary
Object re-identification for visual tracking is a challenging problem and remains an active research area. In this example, you train a small residual network with a cosine softmax layer as a classifier. You then remove the custom cosine softmax layer to form the final ReID network. Network performance is reasonable for the amount of training data you use.
To improve the network performance, increase the amount of training data. Additional training data must include more challenging scenarios, such as occluded objects, and more varied individuals. If those challenging scenarios are missing, add synthetically occluded objects to increase network robustness. Addressing the distortion of individual features in problematic frames also improves network performance.
Helper Functions
helperCreateReIDNetResnet
Create the base residual network (ResNet) using the resnetLayers (Deep Learning Toolbox) function.
Add a second convolutional layer near the input for complex feature learning, replace the traditional softmax layer with a scaled cosine softmax layer, and replace all ReLU activations with ELU activations. By using the cosine softmax instead of Euclidean distance as the similarity metric, the network normalizes the feature vectors to lie on the unit hypersphere, enhancing performance in extreme lighting, and becoming robust to variations in pose and orientation. The software then scales this hypersphere to allow for more defined and spaced-out feature clusters. The ELU activations aid in learning the variability of the pedestrian data.
function net = helperCreateReIDNetResnet(numClasses,featureDim,imageSize)
% Set the number of residual blocks and filters in convolutional layers per stack.
stackDepth = [2 2 2];
numFilters = [32 64 128];

% Set the initial number of filters in the first convolutional layers.
initialNumFilters = 32;

% Create the base residual network.
lgraph = resnetLayers(imageSize,numClasses, ...
    BottleneckType="none", ...
    InitialNumFilters=initialNumFilters, ...
    InitialFilterSize=3, ...
    InitialStride=1, ...
    StackDepth=stackDepth, ...
    NumFilters=numFilters);

% Remove the output classification layer.
lgraph = removeLayers(lgraph,"output");

% Create a dlnetwork object from the layer graph.
net = dlnetwork(lgraph);

% Define a second convolutional layer to allow for more complex feature learning.
conv2 = convolution2dLayer(3,initialNumFilters,Padding=1,Stride=1,Name="conv2");
conv2.Weights = randn([3 3 initialNumFilters initialNumFilters])*0.1;
conv2.Bias = randn([1 1 initialNumFilters])*0.1;

% Define the second batch normalization layer.
bn2 = batchNormalizationLayer(Name="bn2");

% Define the second ELU activation layer.
elu2 = eluLayer(Name="elu_2");

% Define the appearance feature dense layer.
weightSize = imageSize(1)*imageSize(2)*numFilters(end)/(4^size(numFilters,2));
dense1 = fullyConnectedLayer(featureDim,Name="fc_128");
dense1.Weights = randn([featureDim weightSize])*0.01;
dense1.Bias = randn([featureDim 1])*0.01;

% Define the cosine softmax layer.
weights = randn([numClasses featureDim])*0.01;
cosineSoftmax = cosineSoftmaxLayer("Cosine_Softmax",weights);
cosineSoftmax = setLearnRateFactor(cosineSoftmax,"K",3);

% Define layer arrays to add near the beginning of the network and at the end.
layersNearInput = [
    conv2
    bn2
    elu2
    ];
layersNearOutput = [
    dense1
    batchNormalizationLayer
    cosineSoftmax
    ];

% Add and connect the layers to the base residual network.
net = helperAddAndConnectLayers(net,layersNearInput,layersNearOutput);

% Replace all ReLU activation layers with ELU activation layers.
net = helperReplaceActivations(net);
end
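The custom cosineSoftmaxLayer class used above is not listed in this example. As a rough, illustrative sketch only (an assumed implementation, not the actual supporting file), such a layer could normalize the incoming feature vectors and the class weights to unit length and apply a softmax over the scaled cosine similarities, with the scale K as a learnable parameter:

classdef cosineSoftmaxLayer < nnet.layer.Layer
    % Illustrative sketch of a scaled cosine softmax layer (assumed
    % implementation, not the supporting file shipped with the example).
    properties (Learnable)
        Weights % numClasses-by-featureDim class weight matrix
        K       % Scalar scale applied to the cosine similarities
    end
    methods
        function layer = cosineSoftmaxLayer(name,weights)
            layer.Name = name;
            layer.Weights = weights;
            layer.K = 1;
        end
        function Z = predict(layer,X)
            % X is a featureDim-by-batchSize "CB" dlarray. Normalize the
            % features and class weights to unit length, then apply a
            % softmax over the scaled cosine similarities.
            Xd = stripdims(X);
            Xn = Xd./max(vecnorm(Xd,2,1),eps);
            Wn = layer.Weights./max(vecnorm(layer.Weights,2,2),eps);
            Z = softmax(layer.K.*(Wn*Xn),DataFormat="CB");
        end
    end
end

In this sketch, only the Weights and K property names are taken from how the layer is constructed and configured above; the remaining details are assumptions.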
helperAddAndConnectLayers
Connect all the layers of the modified residual network.
function net = helperAddAndConnectLayers(net,layersNearInput,layersNearOutput)
% Grab the final stack addition and ReLU layer names.
finalAddName = net.Layers(end-4).Name;
finalReluName = net.Layers(end-3).Name;

% Add layers to the dlnetwork.
net = addLayers(net,layersNearInput);
net = addLayers(net,layersNearOutput);

% Disconnect relu1 and maxpool1 to add the additional convolutional layer set.
net = disconnectLayers(net,"relu1","maxpool1");

% Disconnect the global average pooling and fully connected layers.
net = disconnectLayers(net,finalAddName,finalReluName);

% Remove the base output layers.
layersToRemove = {
    finalReluName
    "gap"
    "fc"
    "softmax"
    };
net = removeLayers(net,layersToRemove);

% Connect relu1 to the conv2 layer and elu_2 to maxpool1.
net = connectLayers(net,"relu1","conv2");
net = connectLayers(net,"elu_2","maxpool1");

% Connect the final addition layer to the appearance feature fully connected layer set.
net = connectLayers(net,finalAddName,"fc_128");
end
helperReplaceActivations
Replace all ReLU activation functions with ELU activations.
function net = helperReplaceActivations(net)
% Find all ReLU layers in the network and grab all layer names.
allReluIdx = arrayfun(@(x)isa(x,"nnet.cnn.layer.ReLULayer"),net.Layers);
reluIdx = find(allReluIdx);
layerNames = string({net.Layers.Name});

% Loop through all ReLU layers and replace each one with an ELU layer.
eluNum = 1;
for i = 1:size(reluIdx,1)
    idx = reluIdx(i);
    prevLayerName = layerNames(idx);
    newLayerName = strcat("elu_",num2str(eluNum));
    net = replaceLayer(net,prevLayerName,eluLayer(Name=newLayerName));
    eluNum = eluNum + 1;

    % Skip the layer name "elu_2" because you add this layer manually when
    % you create the network.
    if eluNum == 2
        eluNum = eluNum + 1;
    end
end
end
helperCropImagesWithGroundtruth
Crop all source images in the ground truth data gTruth using its bounding box labels. Store the cropped images in organized subdirectories in dataFolder.
function helperCropImagesWithGroundtruth(gTruth,dataFolder)
% Use objectDetectorTrainingData to convert the groundTruth data into an
% imageDatastore and a boxLabelDatastore.
imageFrameWriteLoc = fullfile("videoFrames");
if ~isfolder(imageFrameWriteLoc)
    mkdir(imageFrameWriteLoc)
end
[imds,blds] = objectDetectorTrainingData(gTruth,SamplingFactor=1,WriteLocation=imageFrameWriteLoc);
combineDs = combine(imds,blds);
writeall(combineDs,"videoFrames",WriteFcn=@(data,info,format)helperWriteCroppedData(data,info,format,dataFolder))

% Remove the video frame images.
fprintf(1,"\nCleaning up %s directory.\n",imageFrameWriteLoc);
rmdir(imageFrameWriteLoc,"s")
end
helperWriteCroppedData
Crop, resize, and store image regions of interest (ROIs) from a combined datastore.
function helperWriteCroppedData(data,info,~,dataFolder)
num = 1;
for i = 1:size(data{1,2},1)
    personID = string(data{1,3}(i));
    personIDFolder = fullfile(dataFolder,personID);
    if ~isfolder(personIDFolder)
        mkdir(personIDFolder)
    end
    frame = num2str(info.ReadInfo{1,2}.CurrentIndex);
    imgPath = fullfile(personIDFolder,strcat(frame,"_",num2str(num,'%02.f'),".jpg"));
    roi = data{1,2}(i,:);
    croppedImage = imcrop(data{1,1},roi);
    resizedImg = imresize(croppedImage,[128 64]);
    imwrite(resizedImg,imgPath);
    num = num + 1;
end
end
helperCalculateCMC
Perform queries for the specified number of trials to calculate the CMC of the ReID network. Use the countEachLabel function to calculate how often each identity occurs in the test set, as well as the total number of identities. Run more trials to increase the accuracy of the CMC estimates.
function [cmcPerIdentity,galleryIdx] = helperCalculateCMC(testImds,cosineSimilarity,numTrials,numOfIdentityInstances,identities)
cmcPerIdentity = zeros(identities);
totalTrialsPerID = zeros(identities,1);
galleryIdx = ones(numTrials,identities);

for trial = 1:numTrials
    % Build up a random gallery that consists of one image per identity.
    [galleryIdx,probeIdx] = helperBuildGalleries(testImds,galleryIdx,numOfIdentityInstances,identities,trial);

    % Choose a random probe image.
    probe = probeIdx(randi([1,size(probeIdx,1)]));

    % Obtain the sorted similarities for the given probe and random gallery.
    probeSim = cosineSimilarity(probe,:);
    similarities = [gather(probeSim(galleryIdx(trial,:)))', string(testImds.Labels(galleryIdx(trial,:)))];
    sortedSims = sortrows(similarities,"descend");

    % Determine the logical array that indicates the correct gallery image rank.
    identity = testImds.Labels(probe);
    galleryRank = strcmp(sortedSims(:,2),string(identity));

    % Check all ranks to determine whether the probe obtains the correct
    % identity. You can then calculate the CMC for each identity.
    for rank = 1:identities
        idLabel = find(numOfIdentityInstances.Label == identity);
        cmcPerIdentity(idLabel,rank) = cmcPerIdentity(idLabel,rank) + any(galleryRank(1:rank));
    end

    % Accumulate the total number of trials per identity.
    totalTrialsPerID(idLabel) = totalTrialsPerID(idLabel) + 1;
end

% Divide the accumulated CMC by the number of trials per identity to obtain the true CMC.
cmcPerIdentity = cmcPerIdentity./totalTrialsPerID;
end
helperBuildGalleries
Obtain all gallery and probe indices for each query trial.
function [galleryIdx,probeIdx] = helperBuildGalleries(testImds,galleryIdx,numOfIdentityInstances,identities,trial)
probeIdx = [];
for id = 1:identities
    identity = numOfIdentityInstances.Label(id);

    % Find all indices associated with the current identity.
    idIdx = find(testImds.Labels == identity);

    % Choose a random index for the given identity to place in the gallery.
    randIdx = randi([1,size(idIdx,1)]);
    galleryIdx(trial,id) = idIdx(randIdx);

    % Remove the random index from the list of indices for the given identity.
    idIdx(randIdx) = [];

    % Add the remaining indices to the potential probe images.
    probeIdx = [probeIdx; idIdx];
end
end
helperGenerateOcclusionData
Generate new training data images that have an individual inserted from another training image. See the Generate Training Data with Synthetic Object Occlusions section for more details.
function helperGenerateOcclusionData(img,writeInfo,~,datasetFolder,imds)
info = writeInfo.ReadInfo;
occlusionDataFolder = writeInfo.Location;

% Get the name of the training image.
fileName = info.Filename;

% Find the last slash in the image file path to extract only the actual
% image file name.
if ispc
    slash = "\";
else
    slash = "/";
end
slashIdx = strfind(info.Filename,slash);
imgName = info.Filename(slashIdx(end)+1:end);

% Set the output folder for the given identity.
imagesDataFolder = fullfile(occlusionDataFolder,datasetFolder,string(info.Label));

% Copy the original file to the occlusion training data folder if it
% does not already exist.
if ~isfile(fullfile(imagesDataFolder,imgName))
    copyfile(fileName,fullfile(imagesDataFolder,imgName));
end

% Grab a random image file from the dataset.
randImgIdx = randi([1 size(imds.Files,1)]);
randImg = imread(imds.Files{randImgIdx});

% Perform grabcut on the random image to mask some or all of the
% individual in the random image.

% Create a label matrix from superpixels. Choose 1,000 due to the low
% image quality.
numSuperpixels = 1000;
l = superpixels(randImg,numSuperpixels);

% Create a region of interest in the random image. Because the
% expectation is that the image is almost entirely the individual, the
% region of interest is all but a narrow border around the image.
borderSize = 2;
roi = false(size(randImg,1:2));
roi(:,borderSize:end-borderSize) = true;

% Create the grabcut mask and use it to obtain the masked image. Set the
% connectivity to 4 because the image quality is low.
bw = grabcut(randImg,l,roi,Connectivity=4);

% The grabcut algorithm often leaves a lot of background for lower
% quality photos. To better mask the individual of interest, use
% activecontour to contract the mask. Only perform active contour when
% 20 percent or more of the image is masked.
if sum(bw,"all") > size(randImg,1)*size(randImg,2)*0.2
    bw = activecontour(randImg,bw,"Chan-vese",SmoothFactor=0.3);
end

% Mask the random image with the segmented mask.
maskedImg = bsxfun(@times,randImg,cast(bw,class(randImg)));

% Transform the masked image to have some variation in facing direction,
% location, and scale. Always make the masked image smaller to ensure it
% does not entirely block out the main person.
tform = randomAffine2d(XReflection=true, ...
    XTranslation=[-24 24], ...
    YTranslation=[-16 32], ...
    Scale=[0.6 0.8]);
rout = affineOutputView(size(img),tform,BoundsStyle="centerOutput");
augMaskedImg = imwarp(maskedImg,tform,OutputView=rout);

% Insert the augmented masked image into the original image to add
% artificial occlusion. Also blur the masked image so that it blends into
% the target image.
img(augMaskedImg ~= 0) = imgaussfilt(augMaskedImg(augMaskedImg ~= 0),1);

imwrite(img,fullfile(imagesDataFolder,strcat(imgName(1:end-5),"_occlusion.jpeg")));
end
helperDownloadReIDNetwork
Download a pretrained ReID network.
function net = helperDownloadReIDNetwork()
url = "https://ssd.mathworks.com/supportfiles/vision/data/pretrainedPersonReIDResNet.zip";
zipFile = fullfile(pwd,"pretrainedPersonReIDResNet.zip");
if ~exist(zipFile,"file")
    websave(zipFile,url);
end

fileName = fullfile(pwd,"personReIDResNet.mat");
if ~exist(fileName,"file")
    unzip(zipFile,pwd);
end

pretrained = load(fileName);
net = pretrained.net;
end
helperUnzipData
Unzip the training data ZIP file.
function helperUnzipData(folder)
zipFile = fullfile(folder,"pedestrianDataset.zip");
dataFolder = fullfile(folder,"pedestrianDataset");
if ~exist(dataFolder,"dir")
    unzip(zipFile,folder);
end
end
References
[1] Wojke, Nicolai, Alex Bewley, and Dietrich Paulus. "Simple Online and Realtime Tracking with a Deep Association Metric." In 2017 IEEE international conference on image processing (ICIP), 3645–49. Beijing: IEEE, 2017. https://doi.org/10.1109/ICIP.2017.8296962.
[2] Zheng, Liang, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. "Scalable Person Re-identification: A Benchmark." In 2015 IEEE International Conference on Computer Vision (ICCV), 1116–24. Santiago, Chile: IEEE, 2015. https://doi.org/10.1109/ICCV.2015.133.