
Question about convergence of Q0 and average reward

12 views (in the last 30 days)
Kun Cheng on 22 Jul 2023
Hi guys,
I am training a DDQN agent. The average reward converges after about 1000 episodes, but the Q0 value needs about 3000 more episodes to converge. Can I stop the training once the average reward has converged?
If not, how can I accelerate the convergence of Q0? As far as I know, Q0 gives the prediction of the target critic. How can I change the frequency of the target critic update?
My second question is: Q0 converges to a value larger than the maximum reward. How can I fix this problem?
For example, the maximum reward is around -6, but Q0 is -3.
  1 comment
Ayush Aniket on 24 Aug 2023
Can you share your code? I will try to replicate it on my end. It will help me answer the question.


Accepted Answer

Rishi on 4 Jan 2024
Hi Kun,
I understand from your query that you want to know whether you can stop the training after the average reward converges, and how to accelerate the convergence of Q0.
Stopping training after the convergence of the average reward might be tempting, but it's essential to ensure that your Q-values (such as Q0) have also converged. The Q-values represent the expected return of taking an action in a given state and following a particular policy thereafter. If these values have not converged, it might mean that your agent hasn't learned the optimal policy yet, even if the average reward seems stable.
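If you still want training to stop automatically once the average reward is stable, one way is to set a stop criterion in the training options. Here is a minimal sketch; the episode budget, averaging window, and reward threshold are placeholder values for illustration, not recommendations for your problem:
% Stop automatically once the moving-average reward reaches a target value
trainOpts = rlTrainingOptions( ...
    'MaxEpisodes', 3000, ...
    'ScoreAveragingWindowLength', 100, ...   % window used to compute the average reward
    'StopTrainingCriteria', 'AverageReward', ...
    'StopTrainingValue', -6);                % illustrative target, e.g. near your best episode reward
trainingStats = train(agent, env, trainOpts);
Keep in mind that this only checks the average reward; it does not guarantee that Q0 has converged.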
To accelerate the convergence of Q0, you can try the following steps:
  • Learning rate adjustment: You can try to adjust the learning rate of your optimizer. A smaller learning rate can lead to more stable but slower convergence, whereas a larger learning rate can speed up the convergence but might overshoot the optimal values. You can change the learning rate of the agent in the following way:
agent.AgentOptions.CriticOptimizerOptions.LearnRate = lr;
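Regarding the frequency of the target critic update mentioned in your question: for a DQN/DDQN agent this is controlled through the agent options. A minimal sketch, with illustrative values only:
% Target critic update settings (illustrative values)
agent.AgentOptions.TargetUpdateFrequency = 4;     % update the target critic every 4 learning steps
agent.AgentOptions.TargetSmoothFactor = 1e-3;     % smoothing factor applied to the target update
A less frequent or more heavily smoothed target update makes the target critic change more slowly, which generally stabilizes learning at the cost of slower convergence.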
In addition to these, you can try other Reinforcement Learning techniques such as Reward Shaping, Exploration Strategies, and Regularization Techniques like dropout or L2 regularization.
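For the exploration side, the epsilon-greedy schedule of a DQN agent can also be tuned through the agent options. A minimal sketch with placeholder values:
% Epsilon-greedy exploration settings (placeholder values)
agent.AgentOptions.EpsilonGreedyExploration.Epsilon = 1.0;       % initial exploration rate
agent.AgentOptions.EpsilonGreedyExploration.EpsilonDecay = 1e-3; % decay applied after each step
agent.AgentOptions.EpsilonGreedyExploration.EpsilonMin = 0.01;   % lower bound on exploration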
If the Q0 value is converging to a value greater than the maximum possible reward, this might be a sign of overestimation bias. To address it, you can try the following methods (a short sketch of reward clipping and regularization follows the list):
  • Clipping rewards: Clip the rewards during the training to prevent excessively high values.
  • Huber Loss: Instead of mean squared error for the loss function, try using Huber loss, which is less sensitive to outliers.
  • Regularization: Use regularization techniques to prevent the network from assigning excessively high values to the Q-estimates.
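As a rough sketch of the first and last points: reward clipping is usually done inside your environment's step function, while L2 regularization can be set on the critic optimizer options. The clipping bounds and regularization factor below are assumed placeholder values:
% Inside your environment's step function: clip the reward to an assumed range
reward = min(max(reward, -50), 0);
% L2 regularization on the critic optimizer (placeholder factor)
agent.AgentOptions.CriticOptimizerOptions.L2RegularizationFactor = 1e-4;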
Hope this helps!

More Answers (0)


