How to create an custom Reinforcement Learning Environment + DDPG agent
Afficher commentaires plus anciens
Hello,
i´m working on an Agent for a problem in the spectral domain. I want to dump frequencies in a spectrum in a way that the resulting spectrum is looking like a rect() function.
So i created the following Environment with a [3 1] continuous Observation and Actionspace as an abstract Version of the real Problem. The initial Observation is a rndm [3 1] vector with values between 0 and 2. The Actionspace is a [3 1] vector with values between -1 and 1. The reward function is maximal when every vector element equals a final vector like[1 1 1].
the Environment file looks like this:
classdef SimpleContinuousEnv < rl.env.MATLABEnvironment
%SIMPLETESTENV: Template for defining custom environment in MATLAB.
%% Properties (set properties' attributes accordingly)
properties
% Specify and initialize environment's necessary properties
MaxForce=1
% Reward for a good correlation between observation and
% targetfunction
RewardForGoodShaping = 10
% Penalty for a bad correlation beween observation and
% targetfunction
PenaltyForBadShaping = -10
end
properties
% Initialize system state
State = zeros(1,3)
end
properties(Access = protected)
% Initialize internal flag to indicate episode termination
IsDone = false
end
%% Necessary Methods
methods
% Contructor method creates an instance of the environment
% Change class name and constructor name accordingly
function this = SimpleContinuousEnv()
% Initialize Observation settings
ObservationInfo = rlNumericSpec([3 1],'UpperLimit',[5;5;5],'LowerLimit',[-5;-5;-5]);
ObservationInfo.Name = 'Test SpectraObs';
ObservationInfo.Description = '1D rndm Number Vector';
% Initialize Action settings
ActionInfo = rlNumericSpec([3 1],'UpperLimit',[1;1;1],'LowerLimit',[-1;-1;-1])
ActionInfo.Name = 'Attenuation Vector';
ActionInfo.Description = '1D Attenuation Vector';
% The following line implements built-in functions of RL env
this = this@rl.env.MATLABEnvironment(ObservationInfo,ActionInfo);
% Initialize property values and pre-compute necessary values
updateActionInfo(this);
end
% Apply system dynamics and simulates the environment with the
% given action for one step.
function [Observation,Reward,IsDone,LoggedSignals] = step(this,Action)
LoggedSignals = [];
% Get action
Force = getForce(this,Action);
Observation=this.State+Force;
% Update system states
this.State = Observation;
% Check terminal condition
parse=(Observation==[1 1 1]);
IsDone=all(ismember(parse,1));
this.IsDone=IsDone;
% Get reward
Reward = getReward(this,parse);
% (optional) use notifyEnvUpdated to signal that the
% environment has been updated (e.g. to update visualization)
notifyEnvUpdated(this);
end
% Reset environment to initial state and output initial observation
function InitialObservation = reset(this)
InitialObservation = double([randi([1 2]);randi([1 2]);randi([1 2])]);
this.State = InitialObservation;
% (optional) use notifyEnvUpdated to signal that the
% environment has been updated (e.g. to update visualization)
notifyEnvUpdated(this);
end
end
%% Optional Methods (set methods' attributes accordingly)
methods
% Helper methods to create the environment
% Discrete force 1 or 2
function force = getForce(this,action)
force = action;
end
% update the action info based
function updateActionInfo(this)
% this.ActionInfo.Elements = this.ActionInfo.Elements;
end
% Reward function
function Reward = getReward(this,parse)
if this.IsDone==1
Reward = sum(parse*this.RewardForGoodShaping);
else
Reward = sum(~parse*this.PenaltyForBadShaping);
end
end
% (optional) Properties validation through set methods
function set.State(this,state)
validateattributes(state,{'numeric'},{'finite','real','vector','numel',3},'','State');
this.State = double(state);
notifyEnvUpdated(this);
end
function set.RewardForGoodShaping(this,val)
validateattributes(val,{'numeric'},{'real','finite','scalar'},'','RewardForGoodShaping');
this.RewardForGoodShaping = val;
end
function set.PenaltyForBadShaping(this,val)
validateattributes(val,{'numeric'},{'real','finite','scalar'},'','PenaltyForBadShaping');
this.PenaltyForBadShaping = val;
end
end
methods (Access = protected)
% (optional) update visualization everytime the environment is updated
% (notifyEnvUpdated is called)
function envUpdatedCallback(this)
plot(this.State)
hold off
XLimMode = 'auto';
YLimMode = 'auto';
end
end
end
This Environment validates but when i start Training with the folowing DNN and Training Options:
%%
%Create the Agent: Deep Q-Network Agent
%A DQN agent approximates the long-term reward given observations and
%actions using a critiv value function represenntation. To create the
%critic, frist create a deep neural network with two inputs, the state and
%the action, and one output.
statePath = [
imageInputLayer([3 1], 'Normalization','none','Name','state')
fullyConnectedLayer(24,'Name','CriticStateFC1')
reluLayer('Name','CriticRelu1')
fullyConnectedLayer(24,'Name','CriticStateFC2')];
actionPath = [
imageInputLayer([3 1],'Normalization','none','Name','action')
fullyConnectedLayer(24,'Name','CriticActionFC1')];
commonPath = [
additionLayer(2,'Name','add')
reluLayer('Name','CriticCommonRelu')
fullyConnectedLayer(1,'Name','output')];
criticNetwork = layerGraph(statePath);
criticNetwork = addLayers(criticNetwork, actionPath);
criticNetwork = addLayers(criticNetwork, commonPath);
criticNetwork = connectLayers(criticNetwork,'CriticStateFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActionFC1','add/in2');
figure(1)
plot(criticNetwork)
%figure(2)
%hold on
%%
%Specify Options for the critic representation using
%rlRepresentationOptions
criticOpts = rlRepresentationOptions('LearnRate', 0.01, 'GradientThreshold',1, 'UseDevice',"gpu");
obsInfo=getObservationInfo(env);
actInfo=getActionInfo(env);
critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,'Observation',{'state'},'Action',{'action'},criticOpts);
%%
%To Create a DDPG Agent with an Continuous action space.
actorNetwork = [
imageInputLayer([3 1],'Normalization','none','Name','state')
fullyConnectedLayer(3,'Name','action','BiasLearnRateFactor',1,'BiasInitializer','zeros','Bias',[0;0;0])];
actorOpts = rlRepresentationOptions('LearnRate',1e-04,'GradientThreshold',1);
actor = rlDeterministicActorRepresentation(actorNetwork,obsInfo,actInfo,'Observation',{'state'},'Action',{'action'},actorOpts);
%To create the DDPG agent, first specify the DDPG agent options using
%rlDDPGAgentoptions
agentOpts = rlDDPGAgentOptions(...
'SampleTime',1, ...
'TargetSmoothFactor',1e-3, ...
'ExperienceBufferLength',1e6, ...
'DiscountFactor',0.99, ...
'MiniBatchSize',32);
agentOpts.NoiseOptions.Variance = 0.3;
agentOpts.NoiseOptions.VarianceDecayRate = 1e-6;
%Then, create the DDPG Agent using the specified actor representation,
%critic representation and agent options.
agent = rlDDPGAgent(actor,critic,agentOpts);
trainOpts = rlTrainingOptions(...
'MaxEpisodes', 500, ...
'MaxStepsPerEpisode', 10, ...
'Verbose', true, ...
'Plots','training-progress',...
'StopTrainingCriteria','EpisodeCount',...
'StopTrainingValue',5);
i get a big error in the training process calculating the cumulative reward. Anyway there seem to be many problems in this code i can´t figure out completly using the given examples from the toolbox.
Error using rl.agent.AbstractPolicy/step (line 116)
Invalid input argument type or size such as observation, reward, isdone or loggedSignals.
Error in rl.env.MATLABEnvironment/simLoop (line 241)
action = step(policy,observation,reward,isdone);
Error in rl.env.MATLABEnvironment/simWithPolicyImpl (line 106)
[expcell{simCount},epinfo,siminfos{simCount}] = simLoop(env,policy,opts,simCount,usePCT);
Error in rl.env.AbstractEnv/simWithPolicy (line 70)
[experiences,varargout{1:(nargout-1)}] = simWithPolicyImpl(this,policy,opts,varargin{:});
Error in rl.task.SeriesTrainTask/runImpl (line 33)
[varargout{1},varargout{2}] = simWithPolicy(this.Env,this.Agent,simOpts);
Error in rl.task.Task/run (line 21)
[varargout{1:nargout}] = runImpl(this);
Error in rl.task.TaskSpec/internal_run (line 159)
[varargout{1:nargout}] = run(task);
Error in rl.task.TaskSpec/runDirect (line 163)
[this.Outputs{1:getNumOutputs(this)}] = internal_run(this);
Error in rl.task.TaskSpec/runScalarTask (line 187)
runDirect(this);
Error in rl.task.TaskSpec/run (line 69)
runScalarTask(task);
Error in rl.train.SeriesTrainer/run (line 24)
run(seriestaskspec);
Error in rl.train.TrainingManager/train (line 291)
run(trainer);
Error in rl.train.TrainingManager/run (line 160)
train(this);
Error in rl.agent.AbstractAgent/train (line 54)
TrainingStatistics = run(trainMgr);
Error in DQN_Agent_for_LaserSpectrum_optimization (line 124)
trainingStats = train(agent,env,trainOpts);
Caused by:
Error using rl.agent.AbstractPolicy/step (line 103)
Error setting property 'CumulativeReward' of class 'rl.util.EpisodeInfo'. Value must be a scalar.
So i have many questions. First how to fix this error?
Second is such an agent even possible if my system is doing its work in 1 timestep. Like getting the observation vector and then with choosing one action vector reaching maximum reward?
If Yes which kind of agent maybe the best?
Third: How can i scale up the example to a much bigger vector and how can i shrink the action space maybe with constraints?
Fourth: how can i define one action which is a 1D vector and than define the range each element of the vector can have?In my case i wld have a n x 1 vector and each element can be in the range between 0-35.
Fifth: If my model is fine in simulation, how can i create an agent which works with a "real" hardware environment? Do i ve to write the environment in a way that i coontrols the hardware? Is there an example how this may work?
Thanks in advance for the answers and sry for the very long question.
best regards
Kai
2 commentaires
Kai Tybussek
le 2 Juil 2020
Fangyuan Chang
le 31 Oct 2020
have the same issue in the defination of UpdateActionInfo...have you figured it out? Thx!
Réponse acceptée
Plus de réponses (0)
Catégories
En savoir plus sur Environments dans Centre d'aide et File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!