MATLAB trainNetwork CNN training pauses intermittently at random, then continues

I'm trying to train a DnCNN network on a dataset of grayscale image patches that I've collected and aggregated into training and validation imageDatastore objects, using trainNetwork to run the training. When the training and validation datastores contain 50,000 and 5,000 files, respectively, every iteration takes about the same time: each mini-batch of 128 completes in less than 1 second.
However, when I increase the datastores passed to trainNetwork to 350,000 training files and 35,000 validation files, random iterations hang: a "paused" iteration takes 20-30 seconds longer than the normal ~1 second. The pauses are intermittent but frequent, they significantly increase my training time, and I don't understand why. Both RAM and GPU memory have plenty of headroom during training, and changing the mini-batch size, learning rate, or optimizer (Adam, SGDM) does not eliminate the pauses. The problem appears to be directly related to the number of files in the imageDatastore objects used for training.
Has anyone dealt with this before? Is trainNetwork performing some kind of data cleanup that causes iterations to pause at random when the imageDatastore objects contain a large number of files?
Any insight would be greatly appreciated! Thanks

Answers (1)

Joss Knight on 11 Aug 2022
Is the pause associated with a validation measurement being added to the training plot? With 7 times as much validation data it will take 7 times longer to take a validation measurement.
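To check whether validation alone could explain the pauses, the arithmetic can be sketched as below. This is only a rough estimate using the dataset sizes and mini-batch size quoted in the question. It assumes ValidationFrequency was left at 50 iterations (the documented default), which the question does not state, so substitute the value from your own trainingOptions.

```matlab
% Dataset sizes and mini-batch size quoted in the question
% (first entry: small run, second entry: large run).
miniBatchSize = 128;
numTrain = [50000 350000];
numVal   = [5000  35000];

% Assumption: validation runs every 50 iterations.
% Replace with the ValidationFrequency actually passed to trainingOptions.
validationFrequency = 50;

% trainNetwork drops a trailing partial mini-batch during training,
% while a validation pass has to cover every validation file.
itersPerEpoch     = floor(numTrain ./ miniBatchSize);
valBatches        = ceil(numVal ./ miniBatchSize);
valPassesPerEpoch = floor(itersPerEpoch ./ validationFrequency);

for k = 1:numel(numTrain)
    fprintf('%d train / %d val files: %d iterations per epoch, %d validation passes per epoch, %d mini-batches per pass\n', ...
        numTrain(k), numVal(k), itersPerEpoch(k), valPassesPerEpoch(k), valBatches(k));
end
```

If the pauses line up with every 50th iteration (or whatever validation frequency is set), validation is the likely cause. If they occur at irregular iterations, it is not.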
  3 comments
Joss Knight on 12 Aug 2022
Hi. It does seem to be an issue with your datastores. You could try reading your entire dataset in a loop to see whether the behaviour reproduces, like:
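imds.ReadSize = miniBatchSize;    % files returned per read call (same as the training mini-batch size)
for epoch = 1:maxEpochs
    % shuffle returns a new datastore, so keep its output rather than discarding it
    shuffled = shuffle(imds);
    reset(shuffled);
    while hasdata(shuffled)
        read(shuffled);           % read one mini-batch worth of images
    end
end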
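To see whether individual reads stall, the same loop can be timed per read. The following is a minimal sketch, not a definitive diagnostic. It builds a small throwaway datastore of random patches so that it runs on its own, and the file names, patch size, demo mini-batch size and the "10 times the median" threshold are all arbitrary assumptions. In practice, replace the temporary datastore with the real training imageDatastore.

```matlab
% Build a tiny throwaway datastore so the sketch is self-contained.
tmpDir = tempname;
mkdir(tmpDir);
for k = 1:32
    patch = uint8(randi([0 255], 40, 40));   % random grayscale patch
    imwrite(patch, fullfile(tmpDir, sprintf('patch_%03d.png', k)));
end
imds = imageDatastore(tmpDir);

miniBatchSize = 8;             % demo value; use the training mini-batch size
ds = shuffle(imds);            % shuffle returns a new datastore
ds.ReadSize = miniBatchSize;
reset(ds);

% Time every read call individually.
readTimes = zeros(0, 1);
while hasdata(ds)
    t = tic;
    read(ds);
    readTimes(end+1, 1) = toc(t);
end

% Flag reads that took far longer than a typical read (threshold is arbitrary).
slowReads = find(readTimes > 10 * median(readTimes));
fprintf('%d reads, median %.4f s, max %.4f s, %d slow reads\n', ...
    numel(readTimes), median(readTimes), max(readTimes), numel(slowReads));

rmdir(tmpDir, 's');            % remove the temporary files
```

If slow reads of 20-30 seconds show up here, outside trainNetwork, the pauses come from reading the files rather than from the training loop itself.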
This sort of randomness speaks to either a multithreading or a file system issue. If it's a multithreading issue you could force your datastores to run serially using a ReadFcn, e.g.
imds.ReadFcn = @imread;
This will make everything a lot slower but if the random pauses go away it implies you have some sort of multithreading issue.
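The comparison can be sketched by timing a full pass with the default reader and with a custom ReadFcn side by side. This is an illustration under assumptions: it uses a small temporary datastore of random patches (file names and sizes are made up for the demo) and the default ReadSize of 1. On a toy dataset like this the timings will not show the stall. The point is the measurement pattern to apply to the real datastore.

```matlab
% Temporary demo data; substitute the folder holding the real patches.
tmpDir = tempname;
mkdir(tmpDir);
for k = 1:16
    imwrite(uint8(randi([0 255], 40, 40)), fullfile(tmpDir, sprintf('patch_%03d.png', k)));
end

% Same files, once with the built-in reader and once with a custom ReadFcn.
stores = {imageDatastore(tmpDir), imageDatastore(tmpDir, 'ReadFcn', @imread)};
labels = {'default reader', 'custom ReadFcn'};

numReads = zeros(1, 2);
maxTime  = zeros(1, 2);
total    = zeros(1, 2);
for s = 1:2
    ds = stores{s};
    reset(ds);
    times = zeros(0, 1);
    while hasdata(ds)
        t = tic;
        read(ds);
        times(end+1, 1) = toc(t);
    end
    numReads(s) = numel(times);
    maxTime(s)  = max(times);
    total(s)    = sum(times);
    fprintf('%s: %d reads, total %.4f s, slowest read %.4f s\n', ...
        labels{s}, numReads(s), total(s), maxTime(s));
end

rmdir(tmpDir, 's');            % remove the temporary files
```

Comparing the slowest single read against the total for each configuration shows whether the cost is spread evenly or bunched into a few long stalls.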
If it is a multithreading issue, it's probably harmless. The file I/O queue may be backed up and the thread management system may be periodically flushing the queue. All this means is that the file I/O cost has been shifted around so that it is bunched up in one place; it does not necessarily mean you've lost performance overall.
If it's a file system issue then that's out of MATLAB's hands.
Nicholas Hopkins on 12 Aug 2022
Joss, copy all, and thank you for the quick responses and troubleshooting tips. The iteration pausing is definitely an interesting deviation from normal training behavior once the datastores scale up in size. I'll analyze the image datastores using the tips you suggested and will hopefully update this thread with an explanation. It may be a few days before I get back to running this code and focusing on this area of my research, so I'll hold off on accepting your answer until I've looked into your suggestions.
