RL Agent learns a constant trajectory instead of actual trajectory
Hi,
I have a conceptual question about my problem. I am trying to learn an engine control model with a DDPG agent, where I use an LSTM model of my engine as the plant. I simulate the engine for a given random trajectory and use the engine outputs, together with the engine states (LSTM states) and the load trajectory, as the observation for my agent.
I am training the DDPG agent by asking it to follow a reference load trajectory, shown below (dashed line in the top-left graph). Despite trying various network architectures, noise options and learning rates, the learned agent simply delivers a constant load of around 6 (orange line in the top-left graph) rather than following the given reference trajectory. The outputs vary reasonably (shown in blue), but the learning is still not acceptable.
I also change the reference trajectory every episode to aid learning, so the agent sees various load profiles.
Could you kindly advise what might be going on here?
Additional information: the same effect happens if I ask the controller to match a constant load trajectory (constant within an episode, then changing to another random constant for the next episode). I have attached my code here.
Thanks in advance :)
Code:
%% H2DF DDPG Trainer
%
% clc
% clear all
% close all
%% Creating environment
obsInfo = rlNumericSpec([16 1],...
    'LowerLimit',-inf(16,1),...
    'UpperLimit',inf(16,1));
obsInfo.Name = "Engine Outputs";
obsInfo.Description = ' IMEP, NOX, SOOT, MPRR, IMEP_t-1,IMEP_ref,IMEP_ref_t-1, IMEP_error, states';
numObservations = obsInfo.Dimension(1);
actInfo = rlNumericSpec([4 1],'LowerLimit',[0.17e-3;440;-1;1e-3],'UpperLimit',[0.5e-3;440;-1;5.5e-3]);
actInfo.Name = "Engine Inputs";
actInfo.Description = 'DOI, P2M, SOI, DOI_H2';
numActions = actInfo.Dimension(1);
env = rlSimulinkEnv('MPC_RL_H2DF','MPC_RL_H2DF/RL Agent',...
obsInfo,actInfo);
env.ResetFcn = @(in)localResetFcn(in);
Ts = 0.08;
Tf = 20;
% 375 engine cycle results
rng(0)
% 1200 - 0.1| 1900: 0.06
%% Creating Agent
L = 60; % number of neurons
statePath = [
sequenceInputLayer(numObservations, 'Normalization', 'none', 'Name', 'observation')
fullyConnectedLayer(L, 'Name', 'fc1')
reluLayer('Name', 'relu1')
fullyConnectedLayer(L, 'Name', 'fc11')
reluLayer('Name', 'relu11')
fullyConnectedLayer(2*L, 'Name', 'fc12')
reluLayer('Name', 'relu12')
fullyConnectedLayer(4*L, 'Name', 'fc15')
reluLayer('Name', 'relu15')
fullyConnectedLayer(8*L)
reluLayer
fullyConnectedLayer(8*L)
reluLayer
fullyConnectedLayer(4*L)
reluLayer
fullyConnectedLayer(2*L)
reluLayer
fullyConnectedLayer(L, 'Name', 'fc2')
concatenationLayer(1,2,'Name','add')
reluLayer('Name','relu2')
fullyConnectedLayer(L, 'Name', 'fc3')
reluLayer('Name','relu3')
fullyConnectedLayer(8*L)
reluLayer
fullyConnectedLayer(8*L)
reluLayer
fullyConnectedLayer(8*L)
reluLayer
gruLayer(4)
reluLayer
fullyConnectedLayer(8*L)
reluLayer
fullyConnectedLayer(4*L)
reluLayer
fullyConnectedLayer(L, 'Name', 'fc7')
reluLayer('Name','relu7')
fullyConnectedLayer(1, 'Name', 'fc4','BiasInitializer','ones','WeightsInitializer','he')];
actionPath = [
sequenceInputLayer(numActions, 'Normalization', 'none', 'Name', 'action')
fullyConnectedLayer(2*L, 'Name', 'fc6')
reluLayer('Name','relu6')
fullyConnectedLayer(4*L, 'Name', 'fc13')
reluLayer('Name','relu13')
fullyConnectedLayer(8*L)
reluLayer
fullyConnectedLayer(8*L)
reluLayer
fullyConnectedLayer(4*L)
reluLayer
fullyConnectedLayer(2*L, 'Name', 'fc14')
reluLayer('Name','relu14')
fullyConnectedLayer(L, 'Name', 'fc5','BiasInitializer','ones','WeightsInitializer','he')];
criticNetwork = layerGraph(statePath);
criticNetwork = addLayers(criticNetwork, actionPath);
criticNetwork = connectLayers(criticNetwork,'fc5','add/in2');
figure
plot(criticNetwork)
criticOptions = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1);
critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,...
'Observation',{'observation'},'Action',{'action'},criticOptions);
%%
actorNetwork = [
sequenceInputLayer(numObservations, 'Normalization', 'none', 'Name', 'observation')
fullyConnectedLayer(L, 'Name', 'fc1')
reluLayer('Name', 'relu1')
fullyConnectedLayer(2*L, 'Name', 'fc2')
reluLayer('Name', 'relu2')
fullyConnectedLayer(2*L, 'Name', 'fc3')
reluLayer('Name', 'relu3')
fullyConnectedLayer(4*L, 'Name', 'fc8')
reluLayer('Name', 'relu8')
fullyConnectedLayer(4*L)
reluLayer
fullyConnectedLayer(8*L)
reluLayer
gruLayer(4)
reluLayer
fullyConnectedLayer(8*L)
reluLayer
fullyConnectedLayer(4*L)
reluLayer
fullyConnectedLayer(2*L, 'Name', 'fc9')
reluLayer('Name', 'relu9')
fullyConnectedLayer(L, 'Name', 'fc10')
reluLayer('Name', 'relu10')
fullyConnectedLayer(numActions, 'Name', 'fc4')
tanhLayer('Name','tanh1')
scalingLayer('Name','ActorScaling1','Scale',-(actInfo.UpperLimit-actInfo.LowerLimit)/2,'Bias',(actInfo.UpperLimit+actInfo.LowerLimit)/2)];
actorOptions = rlRepresentationOptions('LearnRate',1e-4,'GradientThreshold',1);
actor = rlDeterministicActorRepresentation(actorNetwork,obsInfo,actInfo,...
'Observation',{'observation'},'Action',{'ActorScaling1'},actorOptions);
%% Deep Deterministic Policy Gradient (DDPG) agent
agentOpts = rlDDPGAgentOptions(...
'SampleTime',Ts,...
'TargetSmoothFactor',1e-3,...
'DiscountFactor',0.99, ...
'MiniBatchSize',128, ...
'SequenceLength',8,...
'ExperienceBufferLength',1e5, ...
'TargetUpdateFrequency', 10);
% agentOpts.NoiseOptions.Variance =
% [0.005*(70/sqrt(Ts));0.005*(12/sqrt(Ts));0.005*(0.4/sqrt(Ts))] v01
% Earlier noise-variance tuning attempt (superseded by the assignment below):
% agentOpts.NoiseOptions.Variance = [1e-5*0.0025*(80/sqrt(Ts));1e2*0.003*(12/sqrt(Ts));30*0.003*(0.4/sqrt(Ts));1e-1*0.00025*(0.4/sqrt(Ts))];
agentOpts.NoiseOptions.Variance =10*[1.65000000000000e-05;0;0;0.000225000000000000];
agentOpts.NoiseOptions.VarianceDecayRate = [1e-5;1e-5;1e-5;1e-5];
% Note: UseDevice must be set on the options before the representations are
% created above for GPU training to take effect.
criticOptions.UseDevice = "gpu";
actorOptions.UseDevice = "gpu";
% agent = rlDDPGAgent(actor,critic,agentOpts);
% variance*sqrt(Ts) = (0.01 - 0.1)*(action range)
% At each sample time step, the noise model is updated using the following formula, where Ts is the agent sample time.
%
% x(k) = x(k-1) + MeanAttractionConstant.*(Mean - x(k-1)).*Ts
% + Variance.*randn(size(Mean)).*sqrt(Ts)
% At each sample time step, the variance decays as shown in the following code.
%
% decayedVariance = Variance.*(1 - VarianceDecayRate);
% Variance = max(decayedVariance,VarianceMin);
% For continuous action signals, it is important to set the noise variance appropriately to encourage exploration. It is common to have Variance*sqrt(Ts) be between 1% and 10% of your action range.
%
% If your agent converges on local optima too quickly, promote agent exploration by increasing the amount of noise; that is, by increasing the variance. Also, to increase exploration, you can reduce the VarianceDecayRate.
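% Optional sanity check (illustrative sketch, not part of the original post):
% compare the effective noise amplitude Variance*sqrt(Ts) against each action
% range, per the guideline above. Actions 2 and 3 have zero range here, so any
% noise applied to them is wasted.
actRange = actInfo.UpperLimit - actInfo.LowerLimit;
noiseAmplitude = agentOpts.NoiseOptions.Variance.*sqrt(Ts);
disp(table(actRange, noiseAmplitude, noiseAmplitude./max(actRange,eps), ...
    'VariableNames',{'ActionRange','NoiseAmplitude','FractionOfRange'}))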
%% Training agent
maxepisodes = 10000;
maxsteps = ceil(Tf/Ts);
trainOpts = rlTrainingOptions(...
'MaxEpisodes',maxepisodes, ...
'MaxStepsPerEpisode',maxsteps, ...
'ScoreAveragingWindowLength',100, ...
'Verbose',true, ...
'UseParallel',false,...
'Plots','training-progress',...
'StopTrainingCriteria','AverageReward',...
'StopTrainingValue',0,...
'SaveAgentCriteria','EpisodeReward','SaveAgentValue',-0.05);
%%
% Set to true to resume training from a saved agent
resumeTraining = false;
% Set ResetExperienceBufferBeforeTraining to false to keep experience from the previous session
agentOpts.ResetExperienceBufferBeforeTraining = ~(resumeTraining);
if resumeTraining
% Load the agent from the previous session
fprintf('- Resuming training from: %s\n', 'Agent253.mat');
trained_agent = load('D:\Masters\HiWi\h2dfannbasedmpc\acados_implementation\rl\savedAgents\Agent253.mat');
agent = trained_agent.saved_agent;
else
% Create a fresh new agent
agent = rlDDPGAgent(actor, critic, agentOpts);
end
%% Train the agent
trainingStats = train(agent, env, trainOpts);
% get the agent's actor, which predicts next action given the current observation
actor = getActor(agent);
% get the actor's parameters (neural network weights)
%actorParams = getLearnableParameterValues(actor);
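% The environment's ResetFcn references localResetFcn, which is not included in
% the post. A minimal sketch of the per-episode reference randomization the
% question describes might look like the local function below; the block path
% 'MPC_RL_H2DF/IMEP Reference' and the load range are assumptions used purely
% for illustration.
function in = localResetFcn(in)
% Draw a new constant reference load for this episode (hypothetical range)
newRef = 2 + 8*rand;
% Pass the new reference to the Simulink model via the SimulationInput object
in = setBlockParameter(in,'MPC_RL_H2DF/IMEP Reference','Value',num2str(newRef));
end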
0 comments
Answers (1)
Emmanouil Tzorakoleftherakis
on 31 Jan 2024
Edited: Emmanouil Tzorakoleftherakis
on 31 Jan 2024
Thanks for adding all the details. The first thing I will say is that the average reward on the Episode Manager is moving in the right direction, so from the point of view of the RL algorithm it is learning... something. If that something is not what you would expect, you should revisit the reward signal you have put together and make sure it makes sense. That would be my first instinct.
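For illustration only, a reference-tracking reward often penalizes the squared tracking error (and, optionally, large control moves). The signal names below (IMEP, IMEP_ref, u, uPrev) are placeholders rather than the actual signals in your model:
% Hypothetical tracking reward, e.g. computed in a MATLAB Function block:
trackingError = IMEP - IMEP_ref;         % placeholder signal names
reward = -trackingError^2 ...            % reward following the reference
         - 0.01*sum((u - uPrev).^2);     % penalize aggressive input changes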
Another point: I noticed that the upper and lower limits for a couple of your actions are the same (e.g., 440 and -1). Is that expected? You can see that this is the case in the respective blue curves as well.
2 comments
Emmanouil Tzorakoleftherakis
on 11 Feb 2024
What you mention seems normal since the agent needs to take a step first to be able to collect a reward. Two things I would look at next are:
1) The upper and lower limits:
actInfo = rlNumericSpec([4 1],'LowerLimit',[0.17e-3;440;-1;1e-3],'UpperLimit',[0.5e-3;440;-1;5.5e-3]);
This line implies that the second and third actions have the same upper and lower limits, so their values are essentially always constrained to 440 and -1. There is no reason to use those as actions if that is the case (see the combined sketch after point 2 below).
2) Your neural network architectures seem unnecessarily large. Also, I wouldn't use a sequenceInputLayer since these are not LSTM networks; a featureInputLayer would work. In fact, I would let the software come up with an initial architecture and modify it afterwards if needed. Take a look here to see how this can be accomplished.
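For example, a minimal sketch along these lines (it assumes DOI and DOI_H2 are the two inputs you actually want the agent to control, and uses the default-agent creation workflow; the hidden-unit count is just a placeholder):
% Keep only the two non-constant inputs as actions (assumption: DOI and DOI_H2)
actInfo2 = rlNumericSpec([2 1], ...
    'LowerLimit',[0.17e-3; 1e-3], ...
    'UpperLimit',[0.5e-3; 5.5e-3]);
actInfo2.Name = "Engine Inputs";
actInfo2.Description = 'DOI, DOI_H2';
% Let the toolbox generate default actor/critic networks, then inspect and modify them if needed
initOpts = rlAgentInitializationOptions('NumHiddenUnit',64);
agent = rlDDPGAgent(obsInfo, actInfo2, initOpts, rlDDPGAgentOptions('SampleTime',Ts));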