
Question about convergence of Q0 and average reward

12 views (in the last 30 days)
Kun Cheng on 22 Jul 2023
Hi guys,
I am training a DDQN agent. The average reward converges after about 1000 episodes, but the Q0 value needs about 3000 more episodes to converge. Can I stop the training once the average reward has converged?
If not, how can I accelerate the convergence of Q0? As far as I know, Q0 gives the prediction of the target critic. How can I change the frequency of the target critic update?
My second question is: Q0 converges to a value larger than the maximum reward. How can I fix this problem?
For example, the maximum reward is around -6, but Q0 is -3.
  1 comment
Ayush Aniket on 24 Aug 2023
Can you share your code? I will try to replicate it on my end. It will help me answer the question.


Accepted Answer

Rishi on 4 Jan 2024
Hi Kun,
I understand from your query that you want to know whether you can stop the training after the average reward converges, and how to accelerate the convergence of Q0.
Stopping training after the convergence of the average reward might be tempting, but it's essential to ensure that your Q-values (such as Q0) have also converged. The Q-values represent the expected return of taking an action in a given state and following a particular policy thereafter. If these values have not converged, it might mean that your agent hasn't learned the optimal policy yet, even if the average reward seems stable.
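If you still want training to stop automatically once the average reward is stable, one way is to set a stop criterion in the training options. Here is a minimal sketch; the episode budget, averaging window, and reward threshold are placeholder values for illustration, not recommendations for your problem:
% Stop automatically once the moving-average reward reaches a target value
trainOpts = rlTrainingOptions( ...
    'MaxEpisodes', 3000, ...
    'ScoreAveragingWindowLength', 100, ...   % window used to compute the average reward
    'StopTrainingCriteria', 'AverageReward', ...
    'StopTrainingValue', -6);                % illustrative target, e.g. near your best episode reward
trainingStats = train(agent, env, trainOpts);
Keep in mind that this only checks the average reward; it does not guarantee that Q0 has converged.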
To accelerate the convergence of Q0, you can try the following steps:
  • Learning rate adjustment: You can try to adjust the learning rate of your optimizer. A smaller learning rate can lead to more stable but slower convergence, whereas a larger learning rate can speed up the convergence but might overshoot the optimal values. You can change the learning rate of the agent in the following way:
agent.AgentOptions.CriticOptimizerOptions.LearnRate = lr;
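Regarding the frequency of the target critic update mentioned in your question: for a DQN/DDQN agent this is controlled through the agent options. A minimal sketch, with illustrative values only:
% Target critic update settings (illustrative values)
agent.AgentOptions.TargetUpdateFrequency = 4;     % update the target critic every 4 learning steps
agent.AgentOptions.TargetSmoothFactor = 1e-3;     % smoothing factor applied to the target update
A less frequent or more heavily smoothed target update makes the target critic change more slowly, which generally stabilizes learning at the cost of slower convergence.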
In addition to these, you can try other Reinforcement Learning techniques such as Reward Shaping, Exploration Strategies, and Regularization Techniques like dropout or L2 regularization.
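For the exploration side, the epsilon-greedy schedule of a DQN agent can also be tuned through the agent options. A minimal sketch with placeholder values:
% Epsilon-greedy exploration settings (placeholder values)
agent.AgentOptions.EpsilonGreedyExploration.Epsilon = 1.0;       % initial exploration rate
agent.AgentOptions.EpsilonGreedyExploration.EpsilonDecay = 1e-3; % decay applied after each step
agent.AgentOptions.EpsilonGreedyExploration.EpsilonMin = 0.01;   % lower bound on exploration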
If the Q0 value is converging to a value greater than the maximum possible reward, this might be a sign of overestimation bias. To address it, you can try the following methods (a short sketch of reward clipping and regularization follows the list):
  • Clipping rewards: Clip the rewards during the training to prevent excessively high values.
  • Huber Loss: Instead of mean squared error for the loss function, try using Huber loss, which is less sensitive to outliers.
  • Regularization: Use regularization techniques to prevent the network from assigning excessively high values to the Q-estimates.
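As a rough sketch of the first and last points: reward clipping is usually done inside your environment's step function, while L2 regularization can be set on the critic optimizer options. The clipping bounds and regularization factor below are assumed placeholder values:
% Inside your environment's step function: clip the reward to an assumed range
reward = min(max(reward, -50), 0);
% L2 regularization on the critic optimizer (placeholder factor)
agent.AgentOptions.CriticOptimizerOptions.L2RegularizationFactor = 1e-4;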
Hope this helps!

More Answers (0)


