## Twin-Delayed Deep Deterministic Policy Gradient Agents

The twin-delayed deep deterministic policy gradient (TD3) algorithm is a model-free, online, off-policy reinforcement learning method. A TD3 agent is an actor-critic reinforcement learning agent that computes an optimal policy that maximizes the long-term reward.

For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.

The TD3 algorithm is an extension of the DDPG algorithm. DDPG agents can overestimate value functions, which can produce suboptimal policies. To reduce value function overestimation, the TD3 algorithm includes the following modifications of the DDPG algorithm.

1. A TD3 agent learns two Q-value functions and uses the minimum value function estimate during policy updates.

2. A TD3 agent updates the policy and targets less frequently than the Q functions.

3. When updating the policy, a TD3 agent adds noise to the target action, which makes the policy less likely to exploit actions with high Q-value estimates.
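The effect of the first modification can be illustrated with a small numerical experiment. The following is a minimal sketch, not part of the agent implementation: it assumes a hypothetical finite set of ten actions that are all truly worth 0, and two independent noisy estimates of those values (`q1` and `q2`). The function name, noise level, and trial count are illustrative choices.

```python
import random

def max_of_noisy_estimates(true_values, noise_std, rng):
    """Return (single-estimate max, twin-estimate max) for one noisy draw."""
    # Two independent noisy estimates of the same true action values.
    q1 = [v + rng.gauss(0.0, noise_std) for v in true_values]
    q2 = [v + rng.gauss(0.0, noise_std) for v in true_values]
    single = max(q1)                                 # greedy value from one estimate
    twin = max(min(a, b) for a, b in zip(q1, q2))    # greedy value from the minimum
    return single, twin

rng = random.Random(0)
true_values = [0.0] * 10  # hypothetical: every action is truly worth 0
trials = [max_of_noisy_estimates(true_values, 1.0, rng) for _ in range(2000)]

mean_single = sum(s for s, _ in trials) / len(trials)
mean_twin = sum(t for _, t in trials) / len(trials)
print(f"single estimate: {mean_single:.3f}, minimum of two: {mean_twin:.3f}")
```

Because the minimum of two estimates never exceeds either one, the value obtained from the minimum is never larger than the value obtained from a single estimate, so the upward bias caused by maximizing over noisy estimates is reduced.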

You can use a TD3 agent to implement one of the following training algorithms, depending on the number of critics you specify.

• TD3 — Train the agent with two Q-value functions. This algorithm implements all three of the preceding modifications.

• Delayed DDPG — Train the agent with a single Q-value function. This algorithm trains a DDPG agent with target policy smoothing and delayed policy and target updates.

TD3 agents can be trained in environments with the following observation and action spaces.

| Observation Space | Action Space |
| --- | --- |
| Continuous or discrete | Continuous |

During training, a TD3 agent:

• Updates the actor and critic properties at each time step during learning.

• Stores past experience using a circular experience buffer. The agent updates the actor and critic using a mini-batch of experiences randomly sampled from the buffer.

• Perturbs the action chosen by the policy using a stochastic noise model at each training step.
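The circular experience buffer described above can be sketched as follows. This is a minimal illustration rather than the toolbox implementation: the class name `CircularReplayBuffer`, its methods, and the choice of uniform sampling without replacement are assumptions made for the example.

```python
import random

class CircularReplayBuffer:
    """Fixed-capacity buffer that overwrites the oldest experience when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.storage = []
        self.next_index = 0  # slot that the next experience is written to

    def append(self, experience):
        if len(self.storage) < self.capacity:
            self.storage.append(experience)
        else:
            # Buffer is full: overwrite the oldest stored experience.
            self.storage[self.next_index] = experience
        self.next_index = (self.next_index + 1) % self.capacity

    def sample(self, batch_size, rng=random):
        """Draw a mini-batch uniformly at random (assumed: without replacement)."""
        return rng.sample(self.storage, batch_size)

    def __len__(self):
        return len(self.storage)

buffer = CircularReplayBuffer(capacity=3)
for step in range(5):
    # Hypothetical scalar experience (S, A, R, S')
    buffer.append((float(step), 0.1 * step, 1.0, float(step + 1)))

mini_batch = buffer.sample(2, random.Random(0))
```

After five insertions into a buffer of capacity three, only the three most recent experiences remain available for sampling.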

### Actor and Critic Functions

To estimate the policy and value function, a TD3 agent maintains the following function approximators:

• Deterministic actor μ(S) — The actor takes observation S and outputs the corresponding action that maximizes the long-term reward.

• Target actor μ'(S) — To improve the stability of the optimization, the agent periodically updates the target actor based on the latest actor parameter values.

• One or two Q-value critics Qk(S,A) — The critics take observation S and action A as inputs and output the corresponding expectation of the long-term reward.

• One or two target critics Q'k(S,A) — To improve the stability of the optimization, the agent periodically updates the target critics based on the latest parameter values of the critics. The number of target critics matches the number of critics.

Both μ(S) and μ'(S) have the same structure and parameterization.

For each critic, Qk(S,A) and Q'k(S,A) have the same structure and parameterization.

When using two critics, Q1(S,A) and Q2(S,A), each critic can have a different structure, though TD3 works best when the critics have the same structure. When the critics have the same structure, they must have different initial parameter values.

When training is complete, the trained optimal policy is stored in actor μ(S).

For more information on creating actors and critics for function approximation, see Create Policy and Value Function Representations.

### Agent Creation

To create a TD3 agent:

1. Create an actor using an rlDeterministicActorRepresentation object.

2. Create one or two critics using rlQValueRepresentation objects.

3. Specify agent options using an rlTD3AgentOptions object.

4. Create the agent using an rlTD3Agent object.

### Training Algorithm

TD3 agents use the following training algorithm, in which they update their actor and critic models at each time step. To configure the training algorithm, specify options using an rlTD3AgentOptions object. Here, K = 2 is the number of critics and k is the critic index.

• Initialize each critic Qk(S,A) with random parameter values θQk, and initialize each target critic with the same random parameter values: ${\theta }_{Qk\text{'}}={\theta }_{Qk}$.

• Initialize the actor μ(S) with random parameter values θμ, and initialize the target actor with the same parameter values: ${\theta }_{\mu \text{'}}={\theta }_{\mu }$.

• For each training time step:

1. For the current observation S, select action A = μ(S) + N, where N is stochastic noise from the noise model. To configure the noise model, use the ExplorationModel option.

2. Execute action A. Observe the reward R and next observation S'.

3. Store the experience (S,A,R,S') in the experience buffer.

4. Sample a random mini-batch of M experiences (Si,Ai,Ri,S'i) from the experience buffer. To specify M, use the MiniBatchSize option.

5. If S'i is a terminal state, set the value function target yi to Ri. Otherwise set it to:

$y_i = R_i + \gamma \, \underset{k}{\mathrm{min}}\left(Q_k'\left(S_i', \text{clip}\left(\mu'\left(S_i' | \theta_{\mu'}\right) + \epsilon\right) | \theta_{Qk'}\right)\right)$

The value function target is the sum of the experience reward Ri and the minimum discounted future reward from the critics. To specify the discount factor γ, use the DiscountFactor option.

To compute the cumulative reward, the agent first computes a next action by passing the next observation S'i from the sampled experience to the target actor. Then, the agent adds noise ε to the computed action using the TargetPolicySmoothModel, and clips the action based on the upper and lower noise limits. The agent finds the cumulative rewards by passing the next action to the target critics.

6. At every training time step, update the parameters of each critic by minimizing the loss Lk across all sampled experiences.

$L_k = \frac{1}{M}\sum_{i=1}^{M}\left(y_i - Q_k\left(S_i, A_i | \theta_{Qk}\right)\right)^2$

7. Every D1 steps, update the actor parameters using the following sampled policy gradient to maximize the expected discounted reward. To set D1, use the PolicyUpdateFrequency option.

$\begin{array}{l}\nabla_{\theta_\mu} J \approx \frac{1}{M}\sum_{i=1}^{M} G_{ai} G_{\mu i}\\ G_{ai} = \nabla_A \underset{k}{\mathrm{min}}\left(Q_k\left(S_i, A | \theta_{Qk}\right)\right) \text{ where } A = \mu\left(S_i | \theta_\mu\right)\\ G_{\mu i} = \nabla_{\theta_\mu} \mu\left(S_i | \theta_\mu\right)\end{array}$

Here, Gai is the gradient of the minimum critic output with respect to the action computed by the actor network, and Gμi is the gradient of the actor output with respect to the actor parameters. Both gradients are evaluated for observation Si.

8. Every D2 steps, update the target actor and critics depending on the target update method. To specify D2, use the TargetUpdateFrequency option. For more information, see Target Update Methods.

For simplicity, the actor and critic updates in this algorithm show a gradient update using basic stochastic gradient descent. The actual gradient update method depends on the optimizer specified using rlRepresentationOptions.
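Steps 5 and 6 of the algorithm can be sketched in plain Python as shown here. This is an illustrative sketch under stated assumptions, not the toolbox implementation: observations and actions are scalars, the target actor and target critics are ordinary callables, each experience carries a hypothetical `is_terminal` flag, and the noise is clipped to `±noise_limit` before the resulting action is clipped to hypothetical action bounds. The function and parameter names are invented for the example. The actor update in step 7 is omitted because it requires automatic differentiation.

```python
import random

def clip(value, low, high):
    return max(low, min(high, value))

def td3_targets(batch, target_actor, target_critics, gamma,
                noise_std, noise_limit, action_low, action_high, rng):
    """Compute the value function target y_i for each sampled experience."""
    targets = []
    for _, _, reward, next_obs, is_terminal in batch:
        if is_terminal:
            targets.append(reward)  # terminal next state: y_i = R_i
            continue
        # Target policy smoothing: add clipped noise to the target action.
        eps = clip(rng.gauss(0.0, noise_std), -noise_limit, noise_limit)
        next_action = clip(target_actor(next_obs) + eps, action_low, action_high)
        # Use the minimum estimate over all target critics.
        q_min = min(q(next_obs, next_action) for q in target_critics)
        targets.append(reward + gamma * q_min)
    return targets

def critic_loss(batch, targets, critic):
    """Mean squared error L_k between the targets and one critic's outputs."""
    errors = [(y - critic(obs, act)) ** 2
              for (obs, act, _, _, _), y in zip(batch, targets)]
    return sum(errors) / len(errors)

# Hypothetical linear function approximators for demonstration only.
target_actor = lambda s: 0.5 * s
target_critics = [lambda s, a: s + a, lambda s, a: s + 2.0 * a]
critic_1 = lambda s, a: s + a

# Experiences are (S, A, R, S', is_terminal).
batch = [(0.0, 0.1, 1.0, 2.0, False), (2.0, 0.3, 5.0, 3.0, True)]
y = td3_targets(batch, target_actor, target_critics, gamma=0.9,
                noise_std=0.0, noise_limit=0.5,
                action_low=-1.0, action_high=1.0, rng=random.Random(0))
loss_1 = critic_loss(batch, y, critic_1)
```

Setting `noise_std=0.0` in the usage lines makes the example deterministic. With a nonzero value, each target is computed from a slightly perturbed target action, which is the smoothing effect that step 5 describes.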

### Target Update Methods

TD3 agents update their target actor and critic parameters using one of the following target update methods.

• Smoothing — Update the target parameters at every time step using smoothing factor τ. To specify the smoothing factor, use the TargetSmoothFactor option.

• Periodic — Update the target parameters periodically without smoothing (TargetSmoothFactor = 1). To specify the update period, use the TargetUpdateFrequency parameter.

$\begin{array}{l}{\theta }_{Qk\text{'}}={\theta }_{Qk}\\ {\theta }_{\mu \text{'}}={\theta }_{\mu }\end{array}$

• Periodic Smoothing — Update the target parameters periodically with smoothing.

To configure the target update method, create an rlTD3AgentOptions object, and set the TargetUpdateFrequency and TargetSmoothFactor parameters as shown in the following table.

| Update Method | TargetUpdateFrequency | TargetSmoothFactor |
| --- | --- | --- |
| Smoothing (default) | 1 | Less than 1 |
| Periodic | Greater than 1 | 1 |
| Periodic smoothing | Greater than 1 | Less than 1 |
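The three update methods can be expressed with a single rule, sketched below. The sketch assumes the common smoothing form θ' ← τθ + (1 − τ)θ', which reduces to a plain copy when the smoothing factor is 1, in line with the table above. The function `update_targets` and its argument names are hypothetical, and parameters are represented as flat lists of floats.

```python
def update_targets(target_params, params, step, update_frequency, smooth_factor):
    """Return the target parameters after training step `step` (1-based)."""
    if step % update_frequency != 0:
        return list(target_params)  # no target update on this step
    tau = smooth_factor
    # Assumed smoothing rule: blend the latest parameters into the targets.
    return [tau * p + (1.0 - tau) * tp for p, tp in zip(params, target_params)]

params = [1.0]
targets = [0.0]

# Smoothing: update on every step with a small smoothing factor.
smoothing = update_targets(targets, params, step=1,
                           update_frequency=1, smooth_factor=0.005)

# Periodic: copy the parameters every 4 steps, otherwise leave targets as is.
periodic_skip = update_targets(targets, params, step=3,
                               update_frequency=4, smooth_factor=1.0)
periodic_copy = update_targets(targets, params, step=4,
                               update_frequency=4, smooth_factor=1.0)

# Periodic smoothing: blend every 4 steps.
periodic_smooth = update_targets(targets, params, step=4,
                                 update_frequency=4, smooth_factor=0.5)
```

The same function is applied to the target actor and to each target critic, using the parameters of the corresponding actor or critic.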
