How to create an custom Reinforcement Learning Environment + DDPG agent

Hello,
i´m working on an Agent for a problem in the spectral domain. I want to dump frequencies in a spectrum in a way that the resulting spectrum is looking like a rect() function.
So i created the following Environment with a [3 1] continuous Observation and Actionspace as an abstract Version of the real Problem. The initial Observation is a rndm [3 1] vector with values between 0 and 2. The Actionspace is a [3 1] vector with values between -1 and 1. The reward function is maximal when every vector element equals a final vector like[1 1 1].
the Environment file looks like this:
classdef SimpleContinuousEnv < rl.env.MATLABEnvironment
%SIMPLETESTENV: Template for defining custom environment in MATLAB.
%% Properties (set properties' attributes accordingly)
properties
% Specify and initialize environment's necessary properties
MaxForce=1
% Reward for a good correlation between observation and
% targetfunction
RewardForGoodShaping = 10
% Penalty for a bad correlation beween observation and
% targetfunction
PenaltyForBadShaping = -10
end
properties
% Initialize system state
State = zeros(1,3)
end
properties(Access = protected)
% Initialize internal flag to indicate episode termination
IsDone = false
end
%% Necessary Methods
methods
% Contructor method creates an instance of the environment
% Change class name and constructor name accordingly
function this = SimpleContinuousEnv()
% Initialize Observation settings
ObservationInfo = rlNumericSpec([3 1],'UpperLimit',[5;5;5],'LowerLimit',[-5;-5;-5]);
ObservationInfo.Name = 'Test SpectraObs';
ObservationInfo.Description = '1D rndm Number Vector';
% Initialize Action settings
ActionInfo = rlNumericSpec([3 1],'UpperLimit',[1;1;1],'LowerLimit',[-1;-1;-1])
ActionInfo.Name = 'Attenuation Vector';
ActionInfo.Description = '1D Attenuation Vector';
% The following line implements built-in functions of RL env
this = this@rl.env.MATLABEnvironment(ObservationInfo,ActionInfo);
% Initialize property values and pre-compute necessary values
updateActionInfo(this);
end
% Apply system dynamics and simulates the environment with the
% given action for one step.
function [Observation,Reward,IsDone,LoggedSignals] = step(this,Action)
LoggedSignals = [];
% Get action
Force = getForce(this,Action);
Observation=this.State+Force;
% Update system states
this.State = Observation;
% Check terminal condition
parse=(Observation==[1 1 1]);
IsDone=all(ismember(parse,1));
this.IsDone=IsDone;
% Get reward
Reward = getReward(this,parse);
% (optional) use notifyEnvUpdated to signal that the
% environment has been updated (e.g. to update visualization)
notifyEnvUpdated(this);
end
% Reset environment to initial state and output initial observation
function InitialObservation = reset(this)
InitialObservation = double([randi([1 2]);randi([1 2]);randi([1 2])]);
this.State = InitialObservation;
% (optional) use notifyEnvUpdated to signal that the
% environment has been updated (e.g. to update visualization)
notifyEnvUpdated(this);
end
end
%% Optional Methods (set methods' attributes accordingly)
methods
% Helper methods to create the environment
% Discrete force 1 or 2
function force = getForce(this,action)
force = action;
end
% update the action info based
function updateActionInfo(this)
% this.ActionInfo.Elements = this.ActionInfo.Elements;
end
% Reward function
function Reward = getReward(this,parse)
if this.IsDone==1
Reward = sum(parse*this.RewardForGoodShaping);
else
Reward = sum(~parse*this.PenaltyForBadShaping);
end
end
% (optional) Properties validation through set methods
function set.State(this,state)
validateattributes(state,{'numeric'},{'finite','real','vector','numel',3},'','State');
this.State = double(state);
notifyEnvUpdated(this);
end
function set.RewardForGoodShaping(this,val)
validateattributes(val,{'numeric'},{'real','finite','scalar'},'','RewardForGoodShaping');
this.RewardForGoodShaping = val;
end
function set.PenaltyForBadShaping(this,val)
validateattributes(val,{'numeric'},{'real','finite','scalar'},'','PenaltyForBadShaping');
this.PenaltyForBadShaping = val;
end
end
methods (Access = protected)
% (optional) update visualization everytime the environment is updated
% (notifyEnvUpdated is called)
function envUpdatedCallback(this)
plot(this.State)
hold off
XLimMode = 'auto';
YLimMode = 'auto';
end
end
end
This Environment validates but when i start Training with the folowing DNN and Training Options:
%%
%Create the Agent: Deep Q-Network Agent
%A DQN agent approximates the long-term reward given observations and
%actions using a critiv value function represenntation. To create the
%critic, frist create a deep neural network with two inputs, the state and
%the action, and one output.
statePath = [
imageInputLayer([3 1], 'Normalization','none','Name','state')
fullyConnectedLayer(24,'Name','CriticStateFC1')
reluLayer('Name','CriticRelu1')
fullyConnectedLayer(24,'Name','CriticStateFC2')];
actionPath = [
imageInputLayer([3 1],'Normalization','none','Name','action')
fullyConnectedLayer(24,'Name','CriticActionFC1')];
commonPath = [
additionLayer(2,'Name','add')
reluLayer('Name','CriticCommonRelu')
fullyConnectedLayer(1,'Name','output')];
criticNetwork = layerGraph(statePath);
criticNetwork = addLayers(criticNetwork, actionPath);
criticNetwork = addLayers(criticNetwork, commonPath);
criticNetwork = connectLayers(criticNetwork,'CriticStateFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActionFC1','add/in2');
figure(1)
plot(criticNetwork)
%figure(2)
%hold on
%%
%Specify Options for the critic representation using
%rlRepresentationOptions
criticOpts = rlRepresentationOptions('LearnRate', 0.01, 'GradientThreshold',1, 'UseDevice',"gpu");
obsInfo=getObservationInfo(env);
actInfo=getActionInfo(env);
critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,'Observation',{'state'},'Action',{'action'},criticOpts);
%%
%To Create a DDPG Agent with an Continuous action space.
actorNetwork = [
imageInputLayer([3 1],'Normalization','none','Name','state')
fullyConnectedLayer(3,'Name','action','BiasLearnRateFactor',1,'BiasInitializer','zeros','Bias',[0;0;0])];
actorOpts = rlRepresentationOptions('LearnRate',1e-04,'GradientThreshold',1);
actor = rlDeterministicActorRepresentation(actorNetwork,obsInfo,actInfo,'Observation',{'state'},'Action',{'action'},actorOpts);
%To create the DDPG agent, first specify the DDPG agent options using
%rlDDPGAgentoptions
agentOpts = rlDDPGAgentOptions(...
'SampleTime',1, ...
'TargetSmoothFactor',1e-3, ...
'ExperienceBufferLength',1e6, ...
'DiscountFactor',0.99, ...
'MiniBatchSize',32);
agentOpts.NoiseOptions.Variance = 0.3;
agentOpts.NoiseOptions.VarianceDecayRate = 1e-6;
%Then, create the DDPG Agent using the specified actor representation,
%critic representation and agent options.
agent = rlDDPGAgent(actor,critic,agentOpts);
trainOpts = rlTrainingOptions(...
'MaxEpisodes', 500, ...
'MaxStepsPerEpisode', 10, ...
'Verbose', true, ...
'Plots','training-progress',...
'StopTrainingCriteria','EpisodeCount',...
'StopTrainingValue',5);
i get a big error in the training process calculating the cumulative reward. Anyway there seem to be many problems in this code i can´t figure out completly using the given examples from the toolbox.
Error using rl.agent.AbstractPolicy/step (line 116)
Invalid input argument type or size such as observation, reward, isdone or loggedSignals.
Error in rl.env.MATLABEnvironment/simLoop (line 241)
action = step(policy,observation,reward,isdone);
Error in rl.env.MATLABEnvironment/simWithPolicyImpl (line 106)
[expcell{simCount},epinfo,siminfos{simCount}] = simLoop(env,policy,opts,simCount,usePCT);
Error in rl.env.AbstractEnv/simWithPolicy (line 70)
[experiences,varargout{1:(nargout-1)}] = simWithPolicyImpl(this,policy,opts,varargin{:});
Error in rl.task.SeriesTrainTask/runImpl (line 33)
[varargout{1},varargout{2}] = simWithPolicy(this.Env,this.Agent,simOpts);
Error in rl.task.Task/run (line 21)
[varargout{1:nargout}] = runImpl(this);
Error in rl.task.TaskSpec/internal_run (line 159)
[varargout{1:nargout}] = run(task);
Error in rl.task.TaskSpec/runDirect (line 163)
[this.Outputs{1:getNumOutputs(this)}] = internal_run(this);
Error in rl.task.TaskSpec/runScalarTask (line 187)
runDirect(this);
Error in rl.task.TaskSpec/run (line 69)
runScalarTask(task);
Error in rl.train.SeriesTrainer/run (line 24)
run(seriestaskspec);
Error in rl.train.TrainingManager/train (line 291)
run(trainer);
Error in rl.train.TrainingManager/run (line 160)
train(this);
Error in rl.agent.AbstractAgent/train (line 54)
TrainingStatistics = run(trainMgr);
Error in DQN_Agent_for_LaserSpectrum_optimization (line 124)
trainingStats = train(agent,env,trainOpts);
Caused by:
Error using rl.agent.AbstractPolicy/step (line 103)
Error setting property 'CumulativeReward' of class 'rl.util.EpisodeInfo'. Value must be a scalar.
So i have many questions. First how to fix this error?
Second is such an agent even possible if my system is doing its work in 1 timestep. Like getting the observation vector and then with choosing one action vector reaching maximum reward?
If Yes which kind of agent maybe the best?
Third: How can i scale up the example to a much bigger vector and how can i shrink the action space maybe with constraints?
Fourth: how can i define one action which is a 1D vector and than define the range each element of the vector can have?In my case i wld have a n x 1 vector and each element can be in the range between 0-35.
Fifth: If my model is fine in simulation, how can i create an agent which works with a "real" hardware environment? Do i ve to write the environment in a way that i coontrols the hardware? Is there an example how this may work?
Thanks in advance for the answers and sry for the very long question.
best regards
Kai

2 commentaires

and how do define UpdateActionInfo within the environment when the actionspace is continuous?
have the same issue in the defination of UpdateActionInfo...have you figured it out? Thx!

Connectez-vous pour commenter.

 Réponse acceptée

Hi Kai,
What the very first error is telling you is that there is an issue with the dimensions of either your observation, reward, isdone or loggedSignals. In fact if you check the lines
% Check terminal condition
parse=(Observation==[1 1 1]);
IsDone=all(ismember(parse,1));
this.IsDone=IsDone;
in your environment you will see that you are assigning a vector to IsDone, but you IsDone is supposed to be a scalar. I changed it to a scalar and training started properly (cannot comment on the other hyperparameters of the problem).
Some more answers to your questions:
Second is such an agent even possible if my system is doing its work in 1 timestep. Like getting the observation vector and then with choosing one action vector reaching maximum reward?
If Yes which kind of agent maybe the best?
Not sure what you mean here. The time step and agent sample time are determined on a case by case depending on the problem you work with.
Third: How can i scale up the example to a much bigger vector and how can i shrink the action space maybe with constraints?
Reinforcement learning does not typically consider hard constraints in the problem formulation, so if you have constraints in your problem you would probably need to treat them as soft and add penalties in your reward signal if they are violated.
Fourth: how can i define one action which is a 1D vector and than define the range each element of the vector can have?In my case i wld have a n x 1 vector and each element can be in the range between 0-35.
actInfo = rlNumericSpec([n 1],'LowerLimit',0,'UpperLimit',35);%for continous action spaces
See this example if your inputs are discrete.
Fifth: If my model is fine in simulation, how can i create an agent which works with a "real" hardware environment? Do i ve to write the environment in a way that i coontrols the hardware? Is there an example how this may work?
There is no out-of-the-box functionality for this yet so you would have to implement the communication part yourself (but we are actively working on this).
and how do define UpdateActionInfo within the environment when the actionspace is continuous?
For the UpdateActionInfo question, this is only called once in the constructor of the environment class so it is certainly not necessary to implement if you set up the actionspace otherwise.
Hope this helps

9 commentaires

When i changed IsDone to a skalar training windows pops up but i get an error as soon as training starts. It seems a mismatch of dimensions happening during training after a few steps?!
Error using rl.representation.rlAbstractRepresentation/gradient (line 181)
Unable to compute gradient from representation.
Unable to evaluate the loss function. Check the loss function and ensure it runs successfully.
Error using rl.representation.rlAbstractRepresentation/validateInputData (line 500)
Input data dimensions must match the dimensions specified in the corresponding observation and action info specifications.
"There is no out-of-the-box functionality for this yet so you would have to implement the communication part yourself "
Ok, but how is there an explanation how to implement this communication? can i simply implement these communication functions in the environment ? like in the step function reading a com port for next observation?
The only changes I made to run your environment were to set IsDone to 'false' and I also removed the ''UseDevice',"gpu"' option since I don't have a GPU.
You can see how the predefined environmenrs are implemented if you run 'edit rl.env+TAB key' in the command prompt. You will then get a few options for autocomplete. Note that these are implemented as object-oriented classes, so you may need to go deeper to see some implementation details.
I do not have enough information to help with the communication/hw part, but effectively you would need to modify the 'step' method to apply an action to the real system (through whatever communication you have implemented), and then read information/observation values from your system (whether that's from sensors or something else) to calculate the reward.
Kai Tybussek
Kai Tybussek le 3 Juil 2020
Modifié(e) : Kai Tybussek le 3 Juil 2020
but how to fix the error? can´t get rid of it. The first 3 Episodes are running and then i get the error! i updated the code like i changd it
Kai Tybussek
Kai Tybussek le 3 Juil 2020
Modifié(e) : Kai Tybussek le 3 Juil 2020
ok its always happening when the first minibatch is finished. when i do 5000 episodes with 1 step and a batchsize of 512 its happening after 511 episodes
Error using rl.agent.AbstractPolicy/step (line 116)
Invalid input argument type or size such as observation, reward, isdone or loggedSignals.
Error in rl.env.MATLABEnvironment/simLoop (line 241)
action = step(policy,observation,reward,isdone);
Error in rl.env.MATLABEnvironment/simWithPolicyImpl (line 106)
[expcell{simCount},epinfo,siminfos{simCount}] = simLoop(env,policy,opts,simCount,usePCT);
Error in rl.env.AbstractEnv/simWithPolicy (line 70)
[experiences,varargout{1:(nargout-1)}] = simWithPolicyImpl(this,policy,opts,varargin{:});
Error in rl.task.SeriesTrainTask/runImpl (line 33)
[varargout{1},varargout{2}] = simWithPolicy(this.Env,this.Agent,simOpts);
Error in rl.task.Task/run (line 21)
[varargout{1:nargout}] = runImpl(this);
Error in rl.task.TaskSpec/internal_run (line 159)
[varargout{1:nargout}] = run(task);
Error in rl.task.TaskSpec/runDirect (line 163)
[this.Outputs{1:getNumOutputs(this)}] = internal_run(this);
Error in rl.task.TaskSpec/runScalarTask (line 187)
runDirect(this);
Error in rl.task.TaskSpec/run (line 69)
runScalarTask(task);
Error in rl.train.SeriesTrainer/run (line 24)
run(seriestaskspec);
Error in rl.train.TrainingManager/train (line 291)
run(trainer);
Error in rl.train.TrainingManager/run (line 160)
train(this);
Error in rl.agent.AbstractAgent/train (line 54)
TrainingStatistics = run(trainMgr);
Error in RFL_Agents_for_LaserSpectrum_optimization (line 133)
trainingStats = train(agent,env,trainOpts);
Caused by:
Error using rl.representation.rlAbstractRepresentation/gradient (line 181)
Unable to compute gradient from representation.
Unable to evaluate the loss function. Check the loss function and ensure it runs successfully.
Error using rl.representation.rlAbstractRepresentation/validateInputData (line 500)
Input data dimensions must match the dimensions specified in the corresponding observation and action info specifications.
Ok after i switched from gpu to cpu the vector mismatch was gone and training startet as u described.
I found out that the gpu arrays have been 2 single and just one double gpu array. Is this a problem in my code or a problem in the toolbox? with cpu calculation there was no vector mismatch.
Are you using R2020a to train? If yes have a look at this answer. There was a bug which was fixed in Update 2.
i updated to Update 3 but still got an error when i train with GPU and no error with CPU.
"InitialObservation = double([randi([1 2]);randi([1 2]);randi([1 2])]);"
i think this may cause trouble in gpu calculation. When i set breakpoints one gpu array is single the other array is double.
I think it would be better to contact technical support at this point and provide the exact reproduction model and error you are seeing. They would be able to get in touch with the development team if necessary.

Connectez-vous pour commenter.

Plus de réponses (0)

Catégories

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by