Epsilon greedy algorithm and environment reset do not work during DQN agent training
Matteo Padovani
22 Feb 2021
Commented: Weihao Zhou
14 Apr 2021
I'm using the Reinforcement Learning Toolbox to design and train a DQN agent. The agent's action space consists of 24 discrete actions representing 24 locations on a grid map; the agent's action is to select one of them as a target point and move there. The environment is a custom environment for which I've defined custom reset and step functions. As suggested in the documentation and in some answers, the reset function sets the agent's starting point randomly in order to promote exploration of the action space.
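For reference, a minimal sketch of what such a reset function might look like (the function and variable names here are illustrative, not my actual code):

```matlab
% Sketch of a custom reset function for rlFunctionEnv that randomizes
% the starting position. All names are illustrative.
function [InitialObservation, LoggedSignal] = myResetFunction()
    gridLocations = 1:24;                     % the 24 candidate start points
    startIdx = randi(numel(gridLocations));   % uniform random pick
    LoggedSignal.State = gridLocations(startIdx);
    InitialObservation = LoggedSignal.State;
    fprintf('%% starting position %d\n', LoggedSignal.State);
end
```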
The issue I'm facing is that during training (except for the very first episodes), the agent's starting position at the beginning of each episode is always the same, as if the environment reset did not run. Furthermore, the agent performs the same actions in a loop, as if the epsilon-greedy algorithm were not working. Here is an example of what I mean:
% starting position 14
% Action 4 Action 17 Action 22 Action 18 Action 1
% Episode: 6/1500 | Episode Reward : 7.72 | Episode Steps: 5 | Avg Reward : 6.79 | Step Count : 40 | Episode Q0 : 0.19
% starting position 14
% Action 4 Action 17 Action 22 Action 18 Action 1
% Episode: 7/1500 | Episode Reward : 7.72 | Episode Steps: 5 | Avg Reward : 6.30 | Step Count : 45 | Episode Q0 : 0.19
% starting position 14
% Action 4 Action 17 Action 22 Action 18 Action 1
% Episode: 8/1500 | Episode Reward : 7.72 | Episode Steps: 5 | Avg Reward : 7.46 | Step Count : 50 | Episode Q0 : 0.19
% starting position 14
% Action 4 Action 17 Action 22 Action 18 Action 1
% Episode: 9/1500 | Episode Reward : 7.72 | Episode Steps: 5 | Avg Reward : 7.69 | Step Count : 55 | Episode Q0 : 0.19
% And so on for 20 Episodes.....
Every time an episode starts the environment should be reset, so I would expect a different starting point for every episode. Moreover, the epsilon-greedy algorithm does not seem to work: I expect the agent to perform random actions to explore the action space, since epsilon is very high during the first episodes. These are my settings for the epsilon-greedy algorithm:
agentOpts.EpsilonGreedyExploration.Epsilon = 1;
agentOpts.EpsilonGreedyExploration.EpsilonMin = 0.1;
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 0.001;
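As far as I understand from the documentation, epsilon is decayed once per training step as epsilon = epsilon*(1 - EpsilonDecay), floored at EpsilonMin. A quick sketch of how that plays out with my settings (assuming this decay rule is what the toolbox applies):

```matlab
% Trace the epsilon schedule under the assumed per-step decay rule.
epsilon = 1;       % Epsilon
epsMin  = 0.1;     % EpsilonMin
decay   = 0.001;   % EpsilonDecay
steps   = 0;
while epsilon > epsMin
    epsilon = max(epsMin, epsilon*(1 - decay));
    steps = steps + 1;
end
fprintf('epsilon reaches %.2f after %d steps\n', epsMin, steps);
% With these values epsilon stays above 0.1 for roughly 2300 steps,
% so the first episodes should be almost entirely random.
```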
My questions are the following:
- In my reset function the initial position is selected randomly from a set of locations through randi(). Could there be a problem with the rng settings for reproducibility? Is there a special setting to make the initial position truly random?
- I would like to understand how the epsilon-greedy algorithm works and whether there is a way to make the agent explore intensively during the first episodes, avoiding repeated action selection.
- Are there other agent/training parameters that may affect exploration/exploitation during training?
Thank you in advance for your help and your time!
1 comment
Weihao Zhou
14 Apr 2021
Hello, I am a beginner in reinforcement learning. I would like to ask where you observe the specific results of each episode of DQN training.
Thank you in advance for your help and your time!
Accepted Answer
Emmanouil Tzorakoleftherakis
23 Feb 2021
Hello,
Here are some comments:
1. The reset function should not produce the same output. You should first double-check that the reset function works as expected by calling it as a standalone function outside of RL training. Right now it seems there is an implementation issue in the function itself. Are you perhaps fixing the random seed (e.g. with rng) somewhere before randi is called?
2. Looking at the numbers you provided, the step count changes, so it does not seem like exactly the same behavior is repeated forever. I would give it some more time, and maybe increase the EpsilonMin value a bit as well, just to double-check.
3. For DQN, epsilon is the main parameter affecting exploration, but there could also be issues with the network design.
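For point 1, a quick standalone check could look like this (myResetFunction stands in for your custom reset handle; the name is assumed):

```matlab
% Call the reset function repeatedly outside of training and confirm
% that the starting position actually varies between calls.
starts = zeros(1, 10);
for k = 1:10
    obs = myResetFunction();
    starts(k) = obs;
end
disp(starts)
% If all ten values are identical, look for an rng(seed) call that
% runs inside (or right before) the reset function.
```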
2 comments
Emmanouil Tzorakoleftherakis
5 Mar 2021
The ideal convergence scenario for DQN would be for Q0 to approximately track the average episode reward (not individual episode rewards). There is no standard recipe for this; it's all about hyperparameter tuning.
More Answers (0)