Epsilon-greedy Algorithm in RL DQN
ches
on 30 Nov 2020
Commented: Cecilia S.
on 9 Jun 2021
Hello,
I'm currently training a DQN agent for my RL problem. As training progresses, I can see that the episode reward, the running average, and Q0 converge to (approximately) the same value, which is a good sign. However, I am uncertain whether it has actually found the optimal policy or is just stuck in a local minimum.
With this in mind, I have the following questions about exploration with the epsilon-greedy algorithm (whose parameters are configurable in rlDQNAgentOptions; a rough sketch of how I set them follows the list below).
1. Does epsilon decay at every time step and continuously across all episodes (meaning it does not reset to epsilon-max at the start of each new episode)?
2. Do the number of time steps per episode and the total number of episodes have a direct impact on the exploration process, and are there other parameters besides the epsilon settings that affect exploration?
3. How is the Q0 estimate calculated? Is it solely based on the output of my DNN policy representation?
4. How is the episode reward calculated? My understanding is that it is just the sum of the actual rewards over all time steps within an episode.
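For reference, this is roughly how I am setting the exploration options (the numbers below are illustrative placeholders, not my exact values):

% Illustrative exploration settings in rlDQNAgentOptions (placeholder values)
agentOpts = rlDQNAgentOptions;
agentOpts.EpsilonGreedyExploration.Epsilon      = 1;     % initial exploration rate
agentOpts.EpsilonGreedyExploration.EpsilonMin   = 0.01;  % lower bound on epsilon
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 0.005; % per-step decay rate
agentOpts.DiscountFactor = 0.99;                         % discount used by the DQN update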
Thank you in advance for your help! :)
0 comments
Accepted Answer
Emmanouil Tzorakoleftherakis
on 30 Nov 2020
Hello,
First off, RL typically solves a complex, nonlinear optimization problem, so at the end of the day you will almost certainly not get a global solution, but a local one. The question then becomes how good that local solution is compared to some other one.
Some comments to your questions:
1. Yes, I believe epsilon does not reset after an episode; it keeps decaying across episode boundaries (see the decay sketch after this list).
2. Exploration for DQN in Reinforcement Learning Toolbox is primarily determined by the epsilon parameters. Of course, since this is still a trial-and-error method, the number of steps and episodes may play a role in how well you learn, but you don't have much direct control over that. For example, if during an episode the agent is in a good spot, exploring a critical part of the state space, you don't want to hit the maximum number of steps and terminate the episode; again, though, you don't have much control over that. What you can do is make sure to reset and randomize your environment state, so that the agent gets to explore different parts of the state space.
3. I believe so. It uses the observation values at the beginning of the episode to estimate how much potential that initial state has, i.e., the return the critic expects from it.
4. Correct. Note the distinction between the reward you see in the Episode Manager and the return used by the DQN algorithm: the latter also takes the discount factor into account (there is a short numeric example below).
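To illustrate point 1, here is a quick sketch of the epsilon schedule, assuming epsilon is reduced at every training step via Epsilon = Epsilon*(1 - EpsilonDecay) until it reaches EpsilonMin, and is never reset between episodes (that matches my reading of the behavior). The parameter values are only placeholders:

% Sketch of the epsilon schedule across training (placeholder values)
epsilon      = 1;      % EpsilonGreedyExploration.Epsilon
epsilonMin   = 0.01;   % EpsilonGreedyExploration.EpsilonMin
epsilonDecay = 0.005;  % EpsilonGreedyExploration.EpsilonDecay
numEpisodes  = 50;
stepsPerEp   = 200;
history = zeros(1, numEpisodes*stepsPerEp);
k = 0;
for ep = 1:numEpisodes
    for t = 1:stepsPerEp
        k = k + 1;
        history(k) = epsilon;
        if epsilon > epsilonMin
            epsilon = epsilon*(1 - epsilonDecay);   % decay continues across episode boundaries
        end
    end
end
plot(history); xlabel('Training step'); ylabel('Epsilon');

Regarding point 3, my understanding is that Q0 is just the critic evaluated at the first observation of the episode, so (assuming your critic outputs one Q-value per action) something along the lines of max(getValue(getCritic(agent),{obs0})) should reproduce the number plotted by the Episode Manager.

And to make point 4 concrete, here is a small numeric sketch (not tied to any particular environment) contrasting the undiscounted episode reward shown in the Episode Manager with the discounted return that the learning update works with:

% Rewards collected over one episode (made-up values)
r     = [1 0.5 -2 3 1];
gamma = 0.99;                                       % DiscountFactor in rlDQNAgentOptions
episodeReward    = sum(r);                          % what the Episode Manager plots
discountedReturn = sum(gamma.^(0:numel(r)-1) .* r); % what the DQN target is built from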
Hope that helps
4 comments
Cecilia S.
on 9 Jun 2021
Hello, I have a question concerning the episode reward:
Why is it that it does not take the discount factor into account? How exactly is it calculated? I have a system that has produced some VERY negative rewards that I cannot account for, and I would like to have more information on this.
More Answers (0)