MATLAB Answers


Reinforcement Learning Toolbox - Initialise Experience Buffer

Asked by Enrico Anderlini on 9 May 2019
Latest activity Edited by Enrico Anderlini on 30 May 2019
I must say I am impressed by the reinforcement learning toolbox that came out with the 2019a MATLAB version. It greatly simplifies the development of reinforcement learning algorithms for control purposes. However, I have encountered difficulties in having the algorithms work with complex problems.
I have modelled the dynamics of my system in Simulink. At the moment, S-functions do not work for the computation of the reward and end conditions due to algebraic loops, which is annoying. Nevertheless, I have been able to bypass that by using Simulink blocks. My biggest problem, though, is that I cannot initialise (or even access) the experience buffer property of the DDPG agent.
The system I am modelling is a vehicle attempting to perform a particular, complex manoeuvre in three degrees of freedom. As the manoeuvre is well understood, I have the benchmark of other control algorithms such as PID and NMPC. At the moment, the DDPG agent is struggling to learn (i.e. not converging) despite my playing with the network sizes, noise options and reward function. I fear this is because of the size of the search space (seven continuous states and two continuous actions). I am also using a random reset function to explore different starting points. My intention would be to use the data I can collect from multiple PID and NMPC simulations to initialise the experience buffer. Although, with the noise, the agent would still be exploring the state and action spaces, it would at least experience the higher rewards earlier in the exploration process and thus hopefully learn faster.
Any ideas whether the experience buffer may be modified or initialised? At the moment, I fear it is a protected property. Could this be looked at for the next release, in addition to apprenticeship learning?
Many thanks in advance for the help! Also, if you have additional suggestions, they would be greatly appreciated!



1 Answer


Hi Enrico,
Glad to see that Reinforcement Learning Toolbox is helpful. Regarding your comment about algebraic loops, have you tried some of the methods in the following links to break the loop?
About DDPG, on this doc page, it is mentioned that
"Be patient with DDPG and DQN agents, since they may not learn anything for some time during the early episodes, and they typically show a dip in cumulative reward early in the training process. Eventually, they can show signs of learning after the first few thousand episodes."
You can see this for example in the walking robot example here, where it takes more than 1000 episodes to see significant progress. Also, the policy in that example takes in 29 observations and outputs 6 continuous actions, so a large search space could be one reason you are not seeing improvement, but I think there is more to it. Some suggestions:
1) Try using tanh layers to squash values to the range [-1, 1], and then for the final layer of your actor use a scalingLayer to map everything to the desired actuation range (see the actor in the example above).
2) The larger the actuation range the harder/longer it will be to learn with continuous methods. Try limiting this range to the absolutely necessary values.
3) If this range is still large, one alternative would be to learn the control increment Δu from observations (instead of the absolute values). The Δu range is inherently small, so that could help limit the search space.
4) Make sure the exploration variance in the DDPG options is comparable to your actuation range (but not larger, otherwise you will not learn anything).
5) A reasonable first approach for reward design is to use the MPC cost as a starting point, or the error in case of PID.
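Suggestions 1) and 4) might look roughly like the sketch below for this seven-observation, two-action problem. Layer sizes and names are placeholders for illustration, and `maxAction` stands in for your (assumed symmetric) action limits; adapt as needed.

```matlab
% Placeholder action limit -- replace with your actual actuation range.
maxAction = 10;

% Actor network sketch (suggestion 1): tanh squashes the output to [-1, 1],
% then a scalingLayer maps it onto the actuation range.
actorNetwork = [
    imageInputLayer([7 1 1], 'Normalization', 'none', 'Name', 'observation')
    fullyConnectedLayer(128, 'Name', 'fc1')
    reluLayer('Name', 'relu1')
    fullyConnectedLayer(2, 'Name', 'fc2')
    tanhLayer('Name', 'tanh')                          % outputs in [-1, 1]
    scalingLayer('Name', 'scale', 'Scale', maxAction)  % map to [-maxAction, maxAction]
    ];

% Exploration noise comparable to the actuation range (suggestion 4).
agentOpts = rlDDPGAgentOptions;
agentOpts.NoiseOptions.Variance = 0.1*maxAction;
agentOpts.NoiseOptions.VarianceDecayRate = 1e-5;
```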
The idea of using knowledge from traditional controllers to initialize the policy makes a lot of sense. While (I believe) it is not possible in R2019a to initialize the experience buffer (unless you are training an existing agent and have saved its previous experience buffer), some alternatives could be to:
1) Create a dataset from many simulations using traditional PID/MPC and train an actor network with supervised learning to behave as a controller.
2) Instead of supervised learning, have the MPC/PID controller in the loop during training, and shape your reward to be based on the error between the traditional controller and RL agent.
Then you can use this network as initial value for further training with reinforcement learning. I hope this helps.
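Alternative 1) could be sketched as below. Here `obsData` (N-by-7 states) and `actData` (N-by-2 actions) are hypothetical arrays logged from your PID/NMPC simulations, and `maxAction` is a placeholder for the action limit; a regressionLayer is appended so the actor layers can be pre-trained with trainNetwork before being reused in the RL actor representation.

```matlab
% Placeholders -- replace with logged controller data and your action limit.
maxAction = 10;
obsData = rand(1000, 7);             % N-by-7 observations from PID/NMPC runs
actData = rand(1000, 2)*maxAction;   % N-by-2 actions taken by the controller

% Same actor body as before, with a regression head for supervised training.
layers = [
    imageInputLayer([7 1 1], 'Normalization', 'none', 'Name', 'observation')
    fullyConnectedLayer(128, 'Name', 'fc1')
    reluLayer('Name', 'relu1')
    fullyConnectedLayer(2, 'Name', 'fc2')
    tanhLayer('Name', 'tanh')
    scalingLayer('Name', 'scale', 'Scale', maxAction)
    regressionLayer('Name', 'output')   % target: the controller's action
    ];

opts = trainingOptions('adam', 'MaxEpochs', 50, 'MiniBatchSize', 128);

% Reshape observations to the 7x1x1xN "image" format expected by the network.
X = reshape(obsData', 7, 1, 1, []);
pretrained = trainNetwork(X, actData, layers, opts);
```

After training, the idea would be to drop the regression head and carry the learned weights over into the DDPG actor representation as its initial value.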

  1 Comment

Sorry, but I have just seen your answer.
Firstly, many thanks for the help on the algebraic loop: it was my fault and not the reinforcement learning block's. Now that works, and it is much smoother than using multiple Simulink blocks.
The best way to avoid this problem when using the reinforcement learning block in Simulink is to use a delay or memory block for the action signal.
Since my question, I have also managed to get the algorithms to work in a similar fashion to what you recommend. You are right: for very large problems the training time is very long. Changing the variance turned out to be quite useful in my case.
Thank you also very much for the tips on initialising the learning process! I will give it a go and hopefully will be successful.
