## Deep Q-Network Agents

The deep Q-network (DQN) algorithm is a model-free, online, off-policy reinforcement learning method. A DQN agent is a value-based reinforcement learning agent that trains a critic to estimate the return or future rewards. DQN is a variant of Q-learning. For more information on Q-learning, see Q-Learning Agents.

For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.

DQN agents can be trained in environments with the following observation and action spaces.

| Observation Space | Action Space |
|---|---|
| Continuous or discrete | Discrete |

DQN agents use the following critic representation.

| Critic | Actor |
|---|---|
| Q-value function critic Q(S,A), which you create using `rlQValueRepresentation` | DQN agents do not use an actor. |

During training, the agent:

• Updates the critic properties at each time step during learning.

• Explores the action space using epsilon-greedy exploration. During each control interval, the agent either selects a random action with probability ϵ or selects an action greedily with respect to the value function with probability 1-ϵ. This greedy action is the action for which the value function is greatest.

• Stores past experiences using a circular experience buffer. The agent updates the critic based on a mini-batch of experiences randomly sampled from the buffer.
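The exploration and experience-storage behavior described above can be illustrated with a minimal, language-agnostic sketch in Python. This is not the toolbox implementation: the names `epsilon_greedy` and `CircularBuffer` are hypothetical, and the sketch assumes a discrete action space indexed from 0 and uniform sampling without replacement.

```python
import random
from collections import deque


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action index with probability epsilon, else the greedy one.

    q_values: list with one Q-value estimate per discrete action (assumed layout).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy action: index of the largest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


class CircularBuffer:
    """Fixed-capacity experience buffer; the oldest entry is dropped when full."""

    def __init__(self, capacity):
        self._data = deque(maxlen=capacity)

    def append(self, experience):
        self._data.append(experience)

    def sample(self, batch_size, rng=random):
        # Uniform random mini-batch, drawn without replacement.
        return rng.sample(list(self._data), batch_size)

    def __len__(self):
        return len(self._data)
```

For example, with `epsilon=0` the call `epsilon_greedy([0.1, 0.9, 0.3], 0.0)` always selects action index 1, and a buffer of capacity 3 that receives five experiences retains only the last three.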

### Critic Function

To estimate the value function, a DQN agent maintains two function approximators:

• Critic Q(S,A) — The critic takes observation S and action A as inputs and returns the corresponding expectation of the long-term reward.

• Target critic Q'(S,A) — To improve the stability of the optimization, the agent periodically updates the target critic based on the latest critic parameter values.

Both Q(S,A) and Q'(S,A) have the same structure and parameterization.

For more information on creating critics for value function approximation, see Create Policy and Value Function Representations.

When training is complete, the trained value function approximator is stored in critic Q(S,A).

### Agent Creation

You can create a DQN agent with a critic representation based on the observation and action specifications from the environment. To do so, perform the following steps.

1. Create observation specifications for your environment. If you already have an environment interface object, you can obtain these specifications using `getObservationInfo`.

2. Create action specifications for your environment. If you already have an environment interface object, you can obtain these specifications using `getActionInfo`.

3. If needed, specify the number of neurons in each learnable layer or whether to use an LSTM layer. To do so, create an agent initialization option object using `rlAgentInitializationOptions`.

4. If needed, specify agent options using an `rlDQNAgentOptions` object.

5. Create the agent using an `rlDQNAgent` object.

Alternatively, you can create a critic representation and use it to create your agent. In this case, ensure that the input and output dimensions of the critic representation match the corresponding action and observation specifications of the environment.

1. Create a critic using an `rlQValueRepresentation` object.

2. Specify agent options using an `rlDQNAgentOptions` object.

3. Create the agent using an `rlDQNAgent` object.

DQN agents support critics that use recurrent deep neural networks as function approximators.

For more information on creating actors and critics for function approximation, see Create Policy and Value Function Representations.

### Training Algorithm

DQN agents use the following training algorithm, in which they update their critic model at each time step. To configure the training algorithm, specify options using an `rlDQNAgentOptions` object.

• Initialize the critic Q(S,A) with random parameter values $\theta_Q$, and initialize the target critic with the same values: $\theta_{Q'} = \theta_Q$.

• For each training time step:

1. For the current observation S, select a random action A with probability ϵ. Otherwise, select the action for which the critic value function is greatest.

$$A = \underset{A}{\arg\max}\, Q(S, A | \theta_Q)$$

To specify ϵ and its decay rate, use the `EpsilonGreedyExploration` option.

2. Execute action A. Observe the reward R and next observation S'.

3. Store the experience (S,A,R,S') in the experience buffer.

4. Sample a random mini-batch of M experiences $(S_i, A_i, R_i, S'_i)$ from the experience buffer. To specify M, use the `MiniBatchSize` option.

5. If $S'_i$ is a terminal state, set the value function target $y_i$ to $R_i$. Otherwise, set it to

$$
\begin{array}{ll}
\begin{array}{l}
A_{\max} = \underset{A'}{\arg\max}\, Q(S'_i, A' | \theta_Q) \\
y_i = R_i + \gamma\, Q'(S'_i, A_{\max} | \theta_{Q'})
\end{array} & (\text{double DQN}) \\[1ex]
y_i = R_i + \gamma \max_{A'} Q'(S'_i, A' | \theta_{Q'}) & (\text{DQN})
\end{array}
$$

To set the discount factor γ, use the `DiscountFactor` option. To use double DQN, set the `UseDoubleDQN` option to `true`.

6. Update the critic parameters by one-step minimization of the loss L across all sampled experiences.

$$L = \frac{1}{M} \sum_{i=1}^{M} \left( y_i - Q(S_i, A_i | \theta_Q) \right)^2$$

7. Update the target critic parameters depending on the target update method. For more information, see Target Update Methods.

8. Update the probability threshold ϵ for selecting a random action based on the decay rate you specify in the `EpsilonGreedyExploration` option.
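Steps 5 and 6 can be made concrete with a small sketch in plain Python. It is an illustration under stated assumptions, not the toolbox code: `dqn_targets` and `mse_loss` are hypothetical names, the critics `q` and `q_target` are assumed to be callables that return a list of Q-values (one per discrete action) for an observation, and each experience is assumed to carry an extra `is_terminal` flag alongside (S, A, R, S').

```python
def dqn_targets(batch, q, q_target, gamma, use_double_dqn=False):
    """Compute the value function targets y_i for a mini-batch.

    batch: list of (S, A, R, S_next, is_terminal) tuples (assumed layout).
    q, q_target: callables mapping an observation to a list of Q-values.
    """
    targets = []
    for _, _, reward, next_obs, is_terminal in batch:
        if is_terminal:
            # Terminal next state: the target is the reward alone.
            targets.append(reward)
            continue
        target_values = q_target(next_obs)
        if use_double_dqn:
            # Double DQN: the critic picks the action, the target critic scores it.
            online = q(next_obs)
            a_max = max(range(len(online)), key=lambda a: online[a])
            targets.append(reward + gamma * target_values[a_max])
        else:
            # DQN: the target critic both picks and scores the action.
            targets.append(reward + gamma * max(target_values))
    return targets


def mse_loss(batch, targets, q):
    """Mean squared error between targets y_i and Q(S_i, A_i) over the batch."""
    m = len(batch)
    return sum((y - q(s)[a]) ** 2 for (s, a, *_), y in zip(batch, targets)) / m
```

The two target rules differ only in which critic selects the maximizing action. When the critic and the target critic disagree about the best action in S', double DQN yields a target no larger than the DQN one, which is the mechanism by which it reduces overestimation.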

### Target Update Methods

DQN agents update their target critic parameters using one of the following target update methods.

• Smoothing — Update the target parameters at every time step using smoothing factor τ. To specify the smoothing factor, use the `TargetSmoothFactor` option.

$$\theta_{Q'} = \tau \theta_Q + (1 - \tau)\, \theta_{Q'}$$

• Periodic — Update the target parameters periodically without smoothing (`TargetSmoothFactor = 1`). To specify the update period, use the `TargetUpdateFrequency` parameter.

• Periodic Smoothing — Update the target parameters periodically with smoothing.

To configure the target update method, create an `rlDQNAgentOptions` object, and set the `TargetUpdateFrequency` and `TargetSmoothFactor` parameters as shown in the following table.

| Update Method | `TargetUpdateFrequency` | `TargetSmoothFactor` |
|---|---|---|
| Smoothing (default) | `1` | Less than `1` |
| Periodic | Greater than `1` | `1` |
| Periodic smoothing | Greater than `1` | Less than `1` |
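The three update methods reduce to one rule with two settings, as the following Python sketch suggests. The function name `update_target` is hypothetical, parameters are represented as flat lists of floats for simplicity, and the sketch assumes that a periodic update fires whenever the 1-based step count is a multiple of the update frequency; the toolbox may count steps differently.

```python
def update_target(theta_target, theta, step, tau, update_frequency):
    """Return the target critic parameters after training step `step` (1-based).

    tau plays the role of TargetSmoothFactor and update_frequency the role of
    TargetUpdateFrequency; the exact step-counting convention is an assumption.
    """
    if step % update_frequency != 0:
        # Not an update step: the target parameters stay as they are.
        return list(theta_target)
    # theta_Q' = tau * theta_Q + (1 - tau) * theta_Q'
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(theta, theta_target)]
```

With `update_frequency=1` and `tau < 1` this is smoothing, with `update_frequency > 1` and `tau = 1` it copies the critic parameters periodically, and with both `update_frequency > 1` and `tau < 1` it is periodic smoothing.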

## References

[1] Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. “Playing Atari with Deep Reinforcement Learning.” ArXiv:1312.5602 [Cs], December 19, 2013. https://arxiv.org/abs/1312.5602.