How many critics and actors does MAPPO use with centralized training?

Yiwen Zhang on 12 Oct 2024
Commented: Umar on 15 Oct 2024
When setting the "centralized" learning strategy in rlMultiAgentTrainingOptions for the PPO algorithm, do all the agents share one critic network? Does each agent have its own actor network?
In other words, for a MAPPO training task with N agents, are there 1 critic and N actors?

Answers (1)

Umar on 13 Oct 2024

Hi @Yiwen Zhang,

You asked whether, with the "centralized" learning strategy in rlMultiAgentTrainingOptions for PPO, all agents share one critic network while each agent keeps its own actor network, that is, whether a MAPPO task with N agents has 1 critic and N actors.

Please see my response below.

Based on the documentation at

https://www.mathworks.com/help/reinforcement-learning/ref/rlmultiagenttrainingoptions.html

the actor and critic architecture of a typical MAPPO setup that uses rlMultiAgentTrainingOptions with a centralized learning strategy can be summarized as follows:

Critic Network: When you set the learning strategy to "centralized", all agents share a single critic network. A shared critic aggregates experiences across all agents, which can make training more stable. Because a centralized critic estimates the value function from the states and actions of all agents, it can also promote cooperative behavior.

Actor Networks: Each agent maintains its own distinct actor network. This means that while the critic is shared, each agent's policy (actor) is independent. This design allows each agent to learn its unique strategy based on its observations and interactions within the environment while still benefiting from the collective knowledge provided by the shared critic.

In summary, for an MAPPO training task with N agents configured under a centralized learning strategy:

Number of Critic Networks: 1 (shared among all agents)
Number of Actor Networks: N (one for each agent)
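For reference, a minimal sketch of how the centralized strategy might be selected for three agents. The group layout, the episode limits, and the names agentA, agentB, agentC, and env are illustrative assumptions, not taken from the question. The snippet only shows how the option is set; it does not by itself confirm how many critic networks the toolbox creates internally, which is worth verifying against the documentation for your release.

```matlab
% Sketch only: requires Reinforcement Learning Toolbox.
% Assumes three PPO agents already exist and are passed to train in this order.
trainOpts = rlMultiAgentTrainingOptions( ...
    "AgentGroups", {[1 2 3]}, ...          % put all three agents in one group
    "LearningStrategy", "centralized", ... % the group learns together
    "MaxEpisodes", 1000, ...               % illustrative value
    "MaxStepsPerEpisode", 500);            % illustrative value

% Hypothetical usage, with agentA/agentB/agentC and env defined elsewhere:
% results = train([agentA, agentB, agentC], env, trainOpts);
```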

When implementing this architecture, it is essential to ensure that the shared critic can adequately process inputs from all agents. This often involves designing input layers that can concatenate or otherwise aggregate observations from multiple agents. While sharing a critic can provide advantages, it may also introduce challenges such as increased complexity in managing gradients during backpropagation and potential interference among agents if their policies are not well-aligned.
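As a rough illustration of the concatenation idea above, the sketch below stacks per-agent observations into one joint input vector. The number of agents, the observation size, and all variable names are assumptions made for the example, and it uses only base MATLAB.

```matlab
% Assumed sizes for illustration (not from the original post).
numAgents = 3;
obsDim = 4;                       % observation length per agent

% One column-vector observation per agent, here filled with random values.
obs = cell(1, numAgents);
for k = 1:numAgents
    obs{k} = rand(obsDim, 1);
end

% Stack the per-agent observations into one joint critic input.
jointObs = vertcat(obs{:});       % (numAgents*obsDim)-by-1 vector
criticInputDim = numel(jointObs); % size the critic's input layer would need
```

With these sizes the critic's input layer would need 12 features, and agent 2's observation occupies rows 5 to 8 of jointObs.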

Hope this helps.

Please let me know if you have any further questions.

5 Comments
Umar on 15 Oct 2024
Hi @Yiwen Zhang,
All my questions have been answered by reviewing the MathWorks function documentation. You did accept my answer. So far, I have helped many students and professors. Hope this helps clarify.
Umar on 15 Oct 2024

Hi @Yiwen Zhang,

I took some time to elaborate on my comments and to help by providing a code snippet and a summary of my thoughts.

In multi-agent reinforcement learning where multiple agents share a single critic network, updating the critic requires careful management of the replay buffer and of how experiences are sampled. The code snippet below outlines a basic framework for this process, and I explain each part in turn.

Understanding the Code Structure

Initialization of Parameters: The code begins by defining essential parameters such as the number of agents, the size of the replay buffer, the mini-batch size, and the number of training iterations. This setup is crucial as it lays the groundwork for how the agents will interact with the environment and learn from their experiences.

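numAgents = 3; % Number of agents
stateDim = 4; % State dimension (example value; set to match your environment)
actionDim = 2; % Action dimension (example value; set to match your environment)
replayBufferSize = 1000; % Size of replay buffer
miniBatchSize = 32; % Size of mini-batch for updates
numIterations = 100; % Number of training iterations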

Replay Buffer Initialization: The replay buffer is initialized with random data, simulating the experiences of the agents. Each entry in the buffer consists of a state, action, and reward, which are essential for training the critic network.

% Each row is [state, action, reward], filled with random placeholder data
replayBuffer = rand(replayBufferSize, stateDim + actionDim + 1);

Critic Network Creation: A simple neural network is defined to serve as the critic. This network takes the state as input and outputs a value estimate, which is crucial for evaluating the actions taken by the agents.

layers = [
  featureInputLayer(stateDim)
  fullyConnectedLayer(24)
  reluLayer
  fullyConnectedLayer(12)
  reluLayer
  fullyConnectedLayer(1)
  regressionLayer
];

The core of the question pertains to how the critic network is updated using a mini-batch sampled from the replay buffer. The updateCritic function is designed to handle this process.

Sampling a Mini-Batch: The function begins by randomly selecting indices from the replay buffer to create a mini-batch. This randomness is essential to ensure that the learning process is robust and not biased by the order of experiences.

indices = randi(size(replayBuffer, 1), miniBatchSize, 1);
miniBatch = replayBuffer(indices, :);
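Note that randi draws indices with replacement, so the same row can appear more than once in a mini-batch. If unique rows are preferred, one alternative is randperm. The sketch below assumes a placeholder buffer with example dimensions so that it runs on its own.

```matlab
% Assumed sizes and placeholder data so the snippet is self-contained.
replayBufferSize = 1000;
miniBatchSize = 32;
stateDim = 4;
actionDim = 2;
replayBuffer = rand(replayBufferSize, stateDim + actionDim + 1);

% Draw miniBatchSize distinct row indices (sampling without replacement).
indices = randperm(replayBufferSize, miniBatchSize);
miniBatch = replayBuffer(indices, :);
```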

Extracting States and Rewards: From the sampled mini-batch, the states and rewards are extracted. The states are used as inputs to the critic network, while the rewards serve as the target values for training.

states = miniBatch(:, 1:stateDim); % Extract states
rewards = miniBatch(:, end); % Extract rewards

Computing Target Values: In this implementation, the target values for training are directly taken from the rewards. This is a straightforward approach, but in more complex scenarios, you might want to compute target values based on the Bellman equation or other methods to incorporate future rewards.

targetValues = rewards; % Use actual rewards for training
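As a hedged illustration of the Bellman-style alternative mentioned above, a one-step temporal-difference target could look like the sketch below. The discount factor, the numbers, and the nextValues and isDone vectors are assumptions: the buffer in this post stores only state, action, and reward, so next states and termination flags would have to be stored as well, and nextValues would come from evaluating the critic on the next states.

```matlab
gamma = 0.99;                   % assumed discount factor
rewards    = [1; 0; 2];         % r_t for three sampled transitions
nextValues = [0.5; 1.0; 3.0];   % placeholder for critic output V(s_{t+1})
isDone     = [0; 0; 1];         % 1 where the episode ended at this step

% One-step TD target: r + gamma * V(s') for non-terminal steps, r otherwise.
targetValues = rewards + gamma .* nextValues .* (1 - isDone);
```

For the third transition the episode has ended, so its target is just the reward.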

Training the Critic Network: The trainNetwork function is called to update the critic network using the states and target values. The training options specify the optimization algorithm and the number of epochs, ensuring that the network learns effectively.

options = trainingOptions('adam', 'MaxEpochs', 10, 'Verbose', 0);

Note: for more information on the trainingOptions function, see

https://www.mathworks.com/help/deeplearning/ref/trainingoptions.html

% The ellipsis continues the statement onto the next line
criticNetwork = trainNetwork(states, targetValues, ...
    criticNetwork.Layers, options);

Please see attached.

The current implementation initializes the replay buffer with random data. In practice, you would want to ensure that the buffer is filled with meaningful experiences collected from the agents' interactions with the environment. This can be achieved by appending experiences to the buffer during training. While using actual rewards as target values is a valid approach, consider implementing a more sophisticated method for calculating target values, such as using the critic's own predictions or incorporating future rewards. This can enhance the learning process and lead to better performance.
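The buffer-filling suggestion above can be sketched as a simple circular buffer that overwrites the oldest row once it is full. The dimensions, the step count, and the random "experiences" are placeholders; in a real run each row would come from an agent's interaction with the environment.

```matlab
% Assumed sizes for illustration.
stateDim = 4;
actionDim = 2;
replayBufferSize = 1000;
replayBuffer = zeros(replayBufferSize, stateDim + actionDim + 1);
writeIdx = 0;    % row written most recently
numStored = 0;   % number of valid rows in the buffer

for step = 1:1200                      % more steps than capacity, to show wrap-around
    state  = rand(1, stateDim);        % placeholder for an observed state
    action = rand(1, actionDim);       % placeholder for the action taken
    reward = rand();                   % placeholder for the received reward
    writeIdx = mod(writeIdx, replayBufferSize) + 1;   % advance and wrap to row 1
    replayBuffer(writeIdx, :) = [state, action, reward];
    numStored = min(numStored + 1, replayBufferSize);
end
```

When sampling, indices should be drawn from 1:numStored so that rows not yet written are never used.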

Also, depending on the complexity of the task, you might want to consider adding batch normalization layers or dropout layers to the critic network to improve generalization and stability during training. It would be beneficial to include metrics for monitoring the performance of the critic network during training, such as loss values or average rewards, to assess the effectiveness of the learning process.
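To illustrate the monitoring suggestion without any toolbox dependency, the sketch below swaps the neural network for a linear stand-in critic trained by plain mini-batch gradient descent and records the mean squared error at every iteration. The synthetic data, the weights used to generate it, the learning rate, and all names are assumptions for the example only; it is not the updateCritic function described in this post.

```matlab
% Assumed sizes and synthetic data (not from the original post).
stateDim = 4;
bufferSize = 1000;
miniBatchSize = 32;
numIterations = 200;
learnRate = 0.1;
states  = rand(bufferSize, stateDim);
rewards = states * [1; -2; 0.5; 3];   % synthetic, exactly linear targets

w = zeros(stateDim, 1);               % linear stand-in critic: V(s) = s*w + b
b = 0;
lossHistory = zeros(numIterations, 1);

for it = 1:numIterations
    idx = randi(bufferSize, miniBatchSize, 1);   % sample a mini-batch
    s = states(idx, :);
    err = s * w + b - rewards(idx);              % prediction error
    lossHistory(it) = mean(err .^ 2);            % log the loss before the update
    w = w - learnRate * (2 / miniBatchSize) * (s' * err);  % gradient step on w
    b = b - learnRate * 2 * mean(err);                     % gradient step on b
end

fprintf('loss: first %.4f, last %.4f\n', lossHistory(1), lossHistory(end));
```

Plotting lossHistory (for example with plot(lossHistory)) gives a quick view of whether the critic is still improving.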

As you continue to develop this framework, consider the technical feedback provided to refine the learning process and improve the overall performance of your multi-agent system.

Hope this helps.

Version

R2023b
