Policy Gradient with Baseline Reward Oscillation (MATLAB Reinforcement Learning Toolbox)
I'm trying to train a Policy Gradient Agent with Baseline for my RL research. I'm using the built-in RL Toolbox from MATLAB (https://www.mathworks.com/help/reinforcement-learning/ug/pg-agents.html) and have created my own environment following the MATLAB documentation.
The goal is to train the system to sample an underlying time series y subject to battery constraints (each sample incurs a battery cost). I also have a prediction model which outputs an estimate ŷ of y given an exogenous input time series X.
Environment setup:
- Geophysical time series to be sampled: y
- State/Obs. time series: X
- X includes some exogenous time series, along with some system info such as battery level, date/time, etc.
- N ~ 3 years of hourly data (~30k samples)
- A binary action a_t is taken at each time-step t. If a_t = 0, keep the model prediction ŷ_t; if a_t = 1, sample the "true" time series y_t (see the sketch after this list).
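In pseudo-code, the action semantics are roughly the following (a minimal sketch; ystar, yhat and ytrue are placeholder names for the reconstructed, predicted and true series, not the actual environment code):
% Action semantics at time-step t (placeholder names, simplified)
if a_t == 0
    ystar(t) = yhat(t);   % keep the model prediction (no sampling cost)
else
    ystar(t) = ytrue(t);  % take a real sample of the underlying series
end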
Episodes setup:
- Get a random snippet of the time series y and X, each with the same length.
- The episode start and end points (and hence the length) are currently set randomly at the beginning of each episode. Overlaps between episodes are OK.
- Randomly set the system's initial battery level (a sketch of the reset logic follows this list).
- **At each time-step t the policy receives the observation x_t from the State/Obs. time series X and should determine the action a_t.**
- Episodes end when the time-step reaches the end of the time-series snippet or the system runs out of battery.
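A rough sketch of the episode reset logic described above (snippetLength, minLen, maxLen, maxBattery and N are placeholder names used for illustration, not the actual environment code):
% Episode initialization (placeholder names, simplified)
snippetLength = randi([minLen maxLen]);                % random episode length
startIdx      = randi(N - snippetLength + 1);          % random start index; overlaps are OK
yEp = ytrue(startIdx : startIdx + snippetLength - 1);  % snippet of the true series
XEp = X(:, startIdx : startIdx + snippetLength - 1);   % matching snippet of the observation series
battery = maxBattery * rand;                           % random initial battery level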
Reward function:
- The per-step reward is based on the RMSE between the sampled/reconstructed time series ŷ* and the true time series y.
- The terminal-state rewards are T1 = -100 if the sensor runs out of battery, and T2 = +100 if the episode reaches the end of the snippet with RMSE < threshold and some battery level remaining. The goal is to always end in T2 (a sketch of the step/reward logic follows).
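A minimal sketch of how one environment step could combine the battery cost, the per-step RMSE penalty and the terminal rewards (the exact reward shape and the names sampleCost, rmseStep, rmseTotal and rmseThreshold are assumptions, not the real environment code):
% One environment step (assumed reward shape; placeholder names)
if a_t == 1
    battery = battery - sampleCost;   % sampling drains the battery
end
reward = -rmseStep;                   % per-step penalty based on RMSE of ystar vs. ytrue
isDone = false;
if battery <= 0
    reward = reward - 100;            % terminal state T1: ran out of battery
    isDone = true;
elseif t == snippetLength
    if rmseTotal < rmseThreshold
        reward = reward + 100;        % terminal state T2: desired outcome
    end
    isDone = true;
end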
RL Code:
obsInfo = getObservationInfo(env);
numObservations = obsInfo.Dimension(1); % 13x1
actInfo = getActionInfo(env);
numActions = numel(actInfo.Elements); % 2x1
learning_rate = 1e-4;
% Actor Network
ActorNetwork_ = [
    imageInputLayer([numObservations 1 1],'Normalization','none','Name','state')
    fullyConnectedLayer(32,'Name','fc1')
    reluLayer
    fullyConnectedLayer(16,'Name','fc2')
    reluLayer
    fullyConnectedLayer(numActions,'Name','action')];
actorOpts = rlRepresentationOptions('LearnRate',learning_rate,'GradientThreshold',1);
ActorNetwork = rlRepresentation(ActorNetwork_,obsInfo,actInfo,'Observation',{'state'},...
    'Action',{'action'},actorOpts);
% Critic Network
CriticNetwork_ = [
    imageInputLayer([numObservations 1 1],'Normalization','none','Name','state')
    fullyConnectedLayer(32,'Name','fc1')
    reluLayer
    fullyConnectedLayer(16,'Name','fc2')
    reluLayer
    fullyConnectedLayer(1,'Name','value')];
baselineOpts = rlRepresentationOptions('LearnRate',learning_rate,'GradientThreshold',1);
CriticNetwork = rlRepresentation(CriticNetwork_,obsInfo,'Observation',{'state'},baselineOpts);
agentOpts = rlPGAgentOptions(...
    'UseBaseline',true, ...
    'DiscountFactor',0.99, ...
    'EntropyLossWeight',0.2);
agent = rlPGAgent(ActorNetwork,CriticNetwork,agentOpts);
validateEnvironment(env)
%
warning('off','all')
trainOpts = rlTrainingOptions(...
    'MaxEpisodes',2500, ...
    'MaxStepsPerEpisode',envConstants.MaxEpisodeSteps, ...
    'Verbose',true, ...
    'Plots','training-progress', ...
    'StopTrainingCriteria','AverageReward', ...
    'StopTrainingValue',100, ...
    'ScoreAveragingWindowLength',20, ...
    'SaveAgentDirectory',save_path, ...
    'SaveAgentCriteria','AverageReward', ...
    'SaveAgentValue',-50, ...
    'UseParallel',true);
trainOpts.ParallelizationOptions.DataToSendFromWorkers = 'Gradients';
trainingStats = train(agent,env,trainOpts);
My current setup uses mostly default RL settings from MATLAB, with a learning rate of 1e-4 and the Adam optimizer. The training is slow and shows a lot of reward oscillation between the two terminal states. The MATLAB RL toolbox also outputs an Episode Q0 value, which the documentation states is:

Episode Q0 is the estimate of the discounted long-term reward at the start of each episode, given the initial observation of the environment. As training progresses, Episode Q0 should approach the true discounted long-term reward if the critic is well-designed.
**Questions**
- Are my training and episodes too random? i.e., time-series of different lengths and random initial sensor setup.
- Should I simplify my reward function to be just T2? (probably not)
- Why doesn't Q0 change at all?
- Why not use DQN? I'll give that a try as well.

0 Comments
Answers (1)
Emmanouil Tzorakoleftherakis
on 19 Mar 2020
Hello,
Some suggestions:
1) For a 13-to-2 mapping, you may need another set of FC + ReLU layers in your actor (see the sketch after this list).
2) Since you have a discrete action space, have you considered trying DQN instead? PG is Monte Carlo based, so training will be slower (a DQN sketch is included at the end of this answer).
3) I wouldn't reduce the reward to just T2 - that would make it very sparse and would make training harder.
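For suggestion 1, a deeper actor could look roughly like this (just a sketch; the extra layer sizes are arbitrary):
% Actor with an extra FC + ReLU block (layer sizes are arbitrary)
ActorNetwork_ = [
    imageInputLayer([numObservations 1 1],'Normalization','none','Name','state')
    fullyConnectedLayer(64,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(32,'Name','fc2')
    reluLayer('Name','relu2')
    fullyConnectedLayer(16,'Name','fc3')
    reluLayer('Name','relu3')
    fullyConnectedLayer(numActions,'Name','action')];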
In terms of randomness, it's not clear to me how the time-series is processed in the environment/how the environment works. How often does the actor take an action? You mentioned your observations are 13x1, so does this mean that you have a 12xlength time series of data coming into the sensor? At each time step, the policy should receive 13 values so I am trying to understand how the time-series is being processed by the policy.
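Regarding suggestion 2, here is a rough sketch of how a DQN agent could be set up with a similar network, assuming the same R2019b-era rlRepresentation call pattern you used for the actor (the Q-network output has one value per discrete action; the agent options shown are just a starting point, not tuned values):
% Sketch: DQN critic with one Q-value output per discrete action (placeholder options)
QNetwork_ = [
    imageInputLayer([numObservations 1 1],'Normalization','none','Name','state')
    fullyConnectedLayer(32,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(16,'Name','fc2')
    reluLayer('Name','relu2')
    fullyConnectedLayer(numActions,'Name','output')];
criticOpts = rlRepresentationOptions('LearnRate',learning_rate,'GradientThreshold',1);
QCritic = rlRepresentation(QNetwork_,obsInfo,actInfo,'Observation',{'state'},...
    'Action',{'output'},criticOpts);
dqnOpts = rlDQNAgentOptions(...
    'UseDoubleDQN',true, ...
    'DiscountFactor',0.99, ...
    'ExperienceBufferLength',1e5, ...
    'MiniBatchSize',128);
dqnAgent = rlDQNAgent(QCritic,dqnOpts);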