Epsilon greedy algorithm and environment reset do not work during DQN agent training
Matteo Padovani
22 Feb 2021
Commented: Weihao Zhou
14 Apr 2021
I'm using the Reinforcement Learning Toolbox to design and train a DQN agent. The agent's action space consists of 24 discrete actions representing 24 locations on a grid map; the agent's action is to select one of them as a target point and move there. The environment is a custom environment for which I've defined custom reset and step functions. As suggested in the documentation and in some answers, the reset function sets the agent's starting point randomly in order to promote exploration of the action space.
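For reference, a minimal sketch of what such a reset function might look like (the function and variable names here are illustrative, not my actual code):

```matlab
% Sketch of a custom reset function for rlFunctionEnv that randomizes
% the starting position. All names are illustrative.
function [InitialObservation, LoggedSignal] = myResetFunction()
    gridLocations = 1:24;                     % the 24 candidate start points
    startIdx = randi(numel(gridLocations));   % uniform random pick
    LoggedSignal.State = gridLocations(startIdx);
    InitialObservation = LoggedSignal.State;
    fprintf('%% starting position %d\n', LoggedSignal.State);
end
```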
The issue I'm facing is that during training (except for the very first episodes), the agent's starting position at the beginning of each episode is always the same, as if the environment reset did not run. Furthermore, the agent performs the same actions in a loop, as if the epsilon-greedy algorithm were not working. Here is an example of what I mean:
% starting position 14
% Action 4 Action 17 Action 22 Action 18 Action 1
% Episode: 6/1500 | Episode Reward : 7.72 | Episode Steps: 5 | Avg Reward : 6.79 | Step Count : 40 | Episode Q0 : 0.19
% starting position 14
% Action 4 Action 17 Action 22 Action 18 Action 1
% Episode: 7/1500 | Episode Reward : 7.72 | Episode Steps: 5 | Avg Reward : 6.30 | Step Count : 45 | Episode Q0 : 0.19
% starting position 14
% Action 4 Action 17 Action 22 Action 18 Action 1
% Episode: 8/1500 | Episode Reward : 7.72 | Episode Steps: 5 | Avg Reward : 7.46 | Step Count : 50 | Episode Q0 : 0.19
% starting position 14
% Action 4 Action 17 Action 22 Action 18 Action 1
% Episode: 9/1500 | Episode Reward : 7.72 | Episode Steps: 5 | Avg Reward : 7.69 | Step Count : 55 | Episode Q0 : 0.19
% And so on for 20 Episodes.....
Every time an episode starts the environment should be reset, so I would expect a different starting point for every episode. Moreover, the epsilon-greedy algorithm does not seem to work: I expect the agent to perform random actions to explore the action space, since epsilon is very high during the first episodes. These are my settings for the epsilon-greedy algorithm:
agentOpts.EpsilonGreedyExploration.Epsilon = 1;
agentOpts.EpsilonGreedyExploration.EpsilonMin = 0.1;
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 0.001;
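As far as I understand from the documentation, epsilon is decayed once per training step as epsilon = epsilon*(1 - EpsilonDecay), floored at EpsilonMin. A quick sketch of how that plays out with my settings (assuming this decay rule is what the toolbox applies):

```matlab
% Trace the epsilon schedule under the assumed per-step decay rule.
epsilon = 1;       % Epsilon
epsMin  = 0.1;     % EpsilonMin
decay   = 0.001;   % EpsilonDecay
steps   = 0;
while epsilon > epsMin
    epsilon = max(epsMin, epsilon*(1 - decay));
    steps = steps + 1;
end
fprintf('epsilon reaches %.2f after %d steps\n', epsMin, steps);
% With these values epsilon stays above 0.1 for roughly 2300 steps,
% so the first episodes should be almost entirely random.
```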
My questions are the following:
- In my reset function the initial position is selected randomly from a set of locations through randi(). Could there be a problem with the rng settings for reproducibility? Is there a special setting to make the initial position truly random?
- I would like to understand how the epsilon-greedy algorithm works and whether there is a way to make the agent explore intensively during the first episodes, avoiding repeated action selection.
- Are there other agent/training parameters that may affect exploration/exploitation during training?
Thank you in advance for your help and your time!
1 comment
Weihao Zhou
14 Apr 2021
Hello, I am a beginner in reinforcement learning. I would like to ask where you observe the specific results of each episode of DQN training.
Thank you in advance for your help and your time!
Accepted Answer
Emmanouil Tzorakoleftherakis
23 Feb 2021
Hello,
Here are some comments:
1. The reset function should not produce the same output. You should first double-check that the reset function works as expected by calling it as a standalone function outside of RL training. Right now it seems there is an implementation issue in the function itself. Are you perhaps fixing the random seed (e.g. with rng) somewhere before randi is called?
2. Looking at the numbers you provided, the step count changes, so it does not seem like exactly the same behavior is repeated forever. I would give it some more time, and maybe increase the EpsilonMin value a bit as well, just to double-check.
3. For DQN, epsilon is the main parameter affecting exploration, but there could also be issues with the network design.
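For point 1, a quick standalone check could look like this (myResetFunction stands in for your custom reset handle; the name is assumed):

```matlab
% Call the reset function repeatedly outside of training and confirm
% that the starting position actually varies between calls.
starts = zeros(1, 10);
for k = 1:10
    obs = myResetFunction();
    starts(k) = obs;
end
disp(starts)
% If all ten values are identical, look for an rng(seed) call that
% runs inside (or right before) the reset function.
```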
2 comments
Emmanouil Tzorakoleftherakis
5 Mar 2021
The ideal convergence scenario for DQN would be for Q0 to approximately track the average episode reward (not individual episode rewards). There is no standard recipe for this; it's all about hyperparameter tuning.
More Answers (0)