RL DQN agent Episode Q0 does not converge to Average Reward

Amin Moradi on 24 Feb 2022
Answered: Ronit on 16 Feb 2024
I'm using the Reinforcement Learning Toolbox in MATLAB R2021b to train a DQN agent. After choosing an appropriate discount factor and the other parameters, the average reward looks good and correct, but the problem is that Episode Q0 won't converge to the average reward. I have attached the training results. I would be grateful if someone could help me correct this or point out possible reasons for this behaviour. Here is my training code; you can see the training parameters in it:
ObservationInfo = rlNumericSpec([1 11]);
ObservationInfo.Name = 'Line State';
ObservationInfo.Description = 'line1, line2, line3, line4, line5, line6, line7, line8, line9, line10, line11';
ObservationInfo.LowerLimit=0;
ObservationInfo.UpperLimit=1;
ActionInfo = rlFiniteSetSpec([1 2 3 4 5 6 7 8 9 10 11]);
ActionInfo.Name = 'Attacker Action';
ActionInfo.Description = ['attack-line1, attack-line2, attack-line3, attack-line4, ' ...
'attack-line5, attack-line6, attack-line7, attack-line8, attack-line9, attack-line10, attack-line11'];
env = rlFunctionEnv(ObservationInfo, ActionInfo,'WW6_StepFunction_genloss','WW6_ResetFunction');
dnn = [
    featureInputLayer(ObservationInfo.Dimension(2),'Normalization','none','Name','state')
    fullyConnectedLayer(120,'Name','CriticStateFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(120,'Name','CriticStateFC2')
    reluLayer('Name','CriticCommonRelu')
    fullyConnectedLayer(length(ActionInfo.Elements),'Name','output')];
criticOpts = rlRepresentationOptions('LearnRate',0.001,'GradientThreshold',1);
critic = rlQValueRepresentation(dnn,ObservationInfo,ActionInfo,'Observation',{'state'},criticOpts);
agentOpts = rlDQNAgentOptions(...
'NumStepsToLookAhead',1,... % used for parallel computing
'UseDoubleDQN',true, ...
'TargetSmoothFactor',1e-1, ...
'TargetUpdateFrequency',4, ...
'ExperienceBufferLength',100000, ...
'DiscountFactor',0.7, ...
'MiniBatchSize',256 ...
);
agentOpts.EpsilonGreedyExploration.Epsilon=1;
agentOpts.EpsilonGreedyExploration.EpsilonDecay=0.005;
agentOpts.EpsilonGreedyExploration.EpsilonMin=0.1;
agent = rlDQNAgent(critic,agentOpts);
trainOpts = rlTrainingOptions(...
'UseParallel',true,... % used for parallel computing
'MaxEpisodes',8000, ...
'MaxStepsPerEpisode',5, ...
'Verbose',false, ...
'Plots','training-progress',...
'StopTrainingCriteria','AverageReward',...
'StopTrainingValue',900);
trainOpts.ScoreAveragingWindowLength=20;
trainingStats = train(agent,env,trainOpts);

Answers (1)

Ronit on 16 Feb 2024
Hi,
I've noticed your concern regarding the convergence of Episode Q0 to the average reward. It's important to recognize that if your model is already producing good results, the behaviour of Episode Q0 may not be a significant issue.
Remember that reinforcement learning can be sensitive to hyperparameter settings and usually requires a lot of trial and error to find the right combination for a given problem. Should you decide to bring Episode Q0 closer to the average-reward track, here are some adjustments you might consider (a rough sketch follows the list):
  • Epsilon Decay Rate: Adjust the epsilon decay rate to ensure enough exploration throughout the training.
  • Learning rate: Experiment with different learning rates.
  • Discount Factor: Adjust the discount factor to better balance immediate and future rewards.
  • Target Network Update Frequency: Change the target network update frequency to improve stability.
  • Episodes: Increase the number of episodes or steps per episode.
  • Reward Function: Review and possibly redesign the reward function.
  • Step and Reset Functions: Check the implementation of your environment's step and reset functions for potential issues.
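For illustration, here is roughly how those adjustments could look with the options objects from your script. The specific values below are assumptions meant as starting points, not tuned settings:
% Illustrative values only -- starting points, not tuned settings.
criticOpts = rlRepresentationOptions('LearnRate',1e-4,'GradientThreshold',1); % smaller learning rate

agentOpts = rlDQNAgentOptions(...
    'UseDoubleDQN',true, ...
    'TargetSmoothFactor',1e-3, ...      % softer target updates for stability
    'TargetUpdateFrequency',10, ...     % update the target network less often
    'ExperienceBufferLength',100000, ...
    'DiscountFactor',0.95, ...          % weight future rewards more heavily
    'MiniBatchSize',256);

% A slower epsilon decay keeps the agent exploring for more episodes.
agentOpts.EpsilonGreedyExploration.Epsilon = 1;
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 1e-3;
agentOpts.EpsilonGreedyExploration.EpsilonMin = 0.05;

% More episodes give the Q0 estimate time to catch up with the average reward.
trainOpts = rlTrainingOptions(...
    'MaxEpisodes',20000, ...
    'MaxStepsPerEpisode',5, ...
    'ScoreAveragingWindowLength',20, ...
    'Plots','training-progress');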
You can also use Bayesian optimization, available in MATLAB through the 'bayesopt' function. It is an efficient method for global optimization of black-box functions and can be used to tune the hyperparameters of an RL agent.
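As a rough sketch rather than a definitive implementation, and assuming your env, ObservationInfo and ActionInfo are already in the workspace, 'bayesopt' (from the Statistics and Machine Learning Toolbox) could wrap the training loop like this. The variable ranges, the helper name dqnObjective and the objective (negative mean reward over the last averaging window) are all assumptions you would adapt:
% Hypothetical sketch: tune three hyperparameters with bayesopt (requires the
% Statistics and Machine Learning Toolbox). bayesopt minimizes its objective,
% so the objective returns the negative of the final average reward.
optimVars = [
    optimizableVariable('LearnRate',[1e-5 1e-2],'Transform','log')
    optimizableVariable('DiscountFactor',[0.5 0.99])
    optimizableVariable('EpsilonDecay',[1e-4 1e-2],'Transform','log')];

results = bayesopt(@(p) dqnObjective(p,env,ObservationInfo,ActionInfo), ...
    optimVars,'MaxObjectiveEvaluations',20);

% Local function (place at the end of the script or in its own file).
function negReward = dqnObjective(p,trainEnv,obsInfo,actInfo)
    % Rebuild the critic and agent with the candidate hyperparameters p.
    dnn = [
        featureInputLayer(obsInfo.Dimension(2),'Normalization','none','Name','state')
        fullyConnectedLayer(120,'Name','fc1')
        reluLayer('Name','relu1')
        fullyConnectedLayer(120,'Name','fc2')
        reluLayer('Name','relu2')
        fullyConnectedLayer(numel(actInfo.Elements),'Name','output')];
    criticOpts = rlRepresentationOptions('LearnRate',p.LearnRate,'GradientThreshold',1);
    critic = rlQValueRepresentation(dnn,obsInfo,actInfo,'Observation',{'state'},criticOpts);

    agentOpts = rlDQNAgentOptions('UseDoubleDQN',true, ...
        'DiscountFactor',p.DiscountFactor, ...
        'ExperienceBufferLength',100000,'MiniBatchSize',256);
    agentOpts.EpsilonGreedyExploration.EpsilonDecay = p.EpsilonDecay;
    agent = rlDQNAgent(critic,agentOpts);

    % Short training run per evaluation; lengthen once the search narrows down.
    trainOpts = rlTrainingOptions('MaxEpisodes',2000,'MaxStepsPerEpisode',5, ...
        'Verbose',false,'Plots','none','ScoreAveragingWindowLength',20);
    stats = train(agent,trainEnv,trainOpts);

    % Score: mean reward over the last averaging window, negated for bayesopt.
    negReward = -mean(stats.EpisodeReward(max(1,end-19):end));
end
Note that each objective evaluation retrains the agent from scratch, so keep MaxObjectiveEvaluations and the per-evaluation MaxEpisodes modest unless your environment is fast to simulate.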
Hope this helps!
Ronit Jain
