Create Custom Grid World Environments
A custom grid world environment is a MATLAB® environment featuring a generic two-dimensional grid with actions, observations, rewards, dynamics, and optional obstacles and terminal states that are mostly left for you to define. As in any grid world environment, the goal of the agent is to move in a way that maximizes its expected discounted cumulative long-term reward.
Grid world environments are a special case of Markov Decision Process (MDP) environments. An MDP is a discrete-time stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of the decision maker. In a grid world environment, the state represents a position in a two-dimensional grid, while the action represents a move that the agent might attempt from the current position to the next. To create a custom MDP environment, see Markov Decision Process (MDP) Environments, createMDP, and rlMDPEnv.
You can use a custom grid world environment to analyze the behavior of different discrete-time agents on custom grid worlds, and to explore reinforcement learning concepts. For example, many common benchmark reinforcement learning problems are grid world problems, and you can study them with Reinforcement Learning Toolbox™ by creating custom grid world environments.
To create a custom grid world environment:
1. Create the grid world object.
2. Configure the grid world object.
3. Use the grid world object to create your environment.
To load a grid world environment with predefined actions, observations, rewards, and dynamics, see Load Predefined Grid World Environments.
Create Grid World Object
You can create your own grid world model using the createGridWorld function. Specify the grid size when creating the GridWorld object.
For example, at the MATLAB command line, type:
GW = createGridWorld(6,6,"Standard")
GW = 
  GridWorld with properties:

                GridSize: [6 6]
            CurrentState: "[1,1]"
                  States: [36×1 string]
                 Actions: [4×1 string]
                       T: [36×36×4 double]
                       R: [36×36×4 double]
          ObstacleStates: [0×1 string]
          TerminalStates: [0×1 string]
    ProbabilityTolerance: 8.8818e-16
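You can inspect the state and action names directly to confirm the layout. For example (the comments show the expected values for this standard 6-by-6 grid):

GW.States(1:3)   % "[1,1]", "[2,1]", "[3,1]": states are listed column-wise
GW.Actions       % ["N";"S";"E";"W"] for standard moves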
Note
The grid world model GW is a GridWorld object, not an environment object. You must later create an rlMDPEnv environment object from GW.
The GridWorld object has these properties.
| Property | Read-Only | Description |
| --- | --- | --- |
| GridSize | Yes | Dimensions of the grid world, displayed as a row vector containing two positive integers. The first integer indicates the number of grid rows, and the second indicates the number of grid columns. |
| CurrentState | No | Name of the current state of the environment. This name corresponds to the current agent position in the grid, and it is specified as a string or character vector such as "[1,1]". You can use this property to set the initial state of the environment. For example, the command GW.CurrentState = "[2,4]" places the agent in the cell at the second row and fourth column. If you call the step function on an environment built using GW, the agent starts moving from the state specified in this property. |
| States | Yes | A string vector containing the state names of the grid world, listed in column-wise fashion. For example, a 2-by-2 grid world model GW has the states ["[1,1]"; "[2,1]"; "[1,2]"; "[2,2]"]. |
| Actions | Yes | A string vector containing the list of possible actions that the agent can execute in the grid world environment. You can set the actions when you create the grid world model by using the moves argument. For example, at the MATLAB command line, type GW = createGridWorld(m,n,moves). Here, moves is either "Standard" (north, south, east, and west) or "Kings" (the standard moves plus the four diagonal moves). |
| T | No | State transition matrix, specified as a 3-D array in which every row of each page contains nonnegative numbers that add up to either one or zero. The entry T(s,s',a) contains the probability of moving from state s to state s' when the agent executes action a. When you create a grid world object, the default transition matrix contains standard deterministic transitions corresponding to the four or eight actions that the agent can execute. Specifically, the default transition matrix is such that any attempted move in any direction results in the agent moving one cell in that direction with a probability of one, except for any attempted move outside the grid, which results in the agent keeping its current position. For example, for a 5-by-5 deterministic grid world object GW, you can extract the transitions for the north action by typing northStateTransition = GW.T(:,:,1). Here, a 1 in row s and column s' indicates that attempting to move north from state s moves the agent, with certainty, to state s'. Since each number in a row represents the probability of moving into a specific cell, all the numbers along a row must add up to either one or zero. |
| R | No | Reward transition matrix, specified as a 3-D array. Each entry of the reward transition matrix specifies the reward that the agent obtains when moving from the current state s to the next state s' using action a, that is, r = R(s,s',a). When you create a grid world object, the reward matrix is zero. Set up R so that the agent receives the appropriate reward after each move. |
| ObstacleStates | No | States that cannot be reached in the grid world, specified as a string vector of state names. For example, GW.ObstacleStates = ["[3,3]";"[3,4]";"[3,5]";"[4,3]"]; specifies four obstacle states, represented by black squares in the environment plot. When you set obstacle states, update the transition matrix T so that any attempted move into an obstacle state results in the agent keeping its current position. |
| TerminalStates | No | Terminal states in the grid world, specified as a string vector of state names. For example, GW.TerminalStates = "[5,5]"; specifies a single terminal state, represented by a blue square in the environment plot. When you set terminal states, review the transition and reward matrices of GW and update them as needed, for example to assign the reward that the agent receives when it reaches a terminal state. |
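For example, this short sketch (the specific cells and reward value are illustrative) configures obstacles, a terminal state, and a terminal reward for the 6-by-6 GridWorld object GW created above:

% Block off three cells and mark the bottom-right cell as terminal.
GW.ObstacleStates = ["[3,3]";"[3,4]";"[3,5]"];
GW.TerminalStates = "[6,6]";

% Reward any transition into the terminal state.
% state2idx converts state names into linear state indices.
R = GW.R;
R(:,state2idx(GW,GW.TerminalStates),:) = 10;
GW.R = R;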
Configure Grid World Object
After creating your GridWorld object, you need to configure its transition matrix to make sure it represents your desired dynamics. You also need to configure its reward matrix to make sure the agent gets the appropriate rewards for its moves.

Because each row of each page of the transition matrix must always sum to one, you cannot modify the transition matrix entries in place one at a time. Instead, assign the default matrix to a temporary variable in the workspace, modify the variable entries appropriately, and then reassign the modified variable to the transition matrix of your GridWorld object.
For example, create a GridWorld object with five rows and five columns.
gw = createGridWorld(5,5);
Assign the transition matrix to a temporary variable, and modify the entries. For example, for any action taken in state 6, set the probability of reaching state 10 to one.

T = gw.T;

% Zero out the existing transitions from state 6.
T(6,:,:) = 0;

% For any action, set the probability of reaching state 10 to 1.
T(6,10,1) = 1;
T(6,10,2) = 1;
T(6,10,3) = 1;
T(6,10,4) = 1;
Finally, reassign the modified variable to the transition matrix of your GridWorld object.

gw.T = T;
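As a quick sanity check (this snippet is a suggestion, not part of the required workflow), verify that every row of every page of the modified matrix still sums to either one or zero:

% Sum over the next-state dimension; each row must total 1 (or 0).
rowSums = squeeze(sum(gw.T,2));
assert(all(abs(rowSums(:)-1) < 1e-12 | rowSums(:) == 0))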
Create Grid World Environment from Grid World Object
After configuring your GridWorld object, use it to create an MDP environment using rlMDPEnv. This step is necessary because the GridWorld object is not an environment object.
For example, if you have the GridWorld object gw in the MATLAB workspace, at the command line, type:
env = rlMDPEnv(gw)
env = 
  rlMDPEnv with properties:

       Model: [1×1 rl.env.GridWorld]
    ResetFcn: []
This command creates the environment env that contains your GridWorld object.
You can access the grid world model through the Model property of the environment.

env.Model

ans = 
  GridWorld with properties:

                GridSize: [5 5]
            CurrentState: "[1,1]"
                  States: [25×1 string]
                 Actions: [4×1 string]
                       T: [25×25×4 double]
                       R: [25×25×4 double]
          ObstacleStates: [0×1 string]
          TerminalStates: [0×1 string]
    ProbabilityTolerance: 8.8818e-16
To use a nondefault reset behavior, set the ResetFcn environment property to the handle of an anonymous function that returns the initial state. For example, set it to the handle of an anonymous function that always returns 2.

env.ResetFcn = @() 2

env = 
  rlMDPEnv with properties:

       Model: [1×1 rl.env.GridWorld]
    ResetFcn: @()2
For more information on the reset function, see Reset Function.
Environment Visualization
As with other grid world environments, you can visualize the environment using the plot function. A red circle represents the current agent position, that is, the environment state. If present, the terminal locations and obstacles are represented by blue and black squares, respectively.
plot(env)
Note
Visualizing the environment during training can provide insight, but it tends to increase training time. For faster training, keep the environment plot closed during training.
Actions
Depending on the Actions property of the underlying GridWorld model, the action channel carries a scalar integer ranging from either 1 to 4 or 1 to 8.
- When Actions is set to "Standard", the integer indicates an (attempted) move in the directions north, south, east, or west, respectively.
- When Actions is set to "Kings", the integer indicates an (attempted) move in the directions north, south, east, west, northeast, northwest, southeast, or southwest, respectively.
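For example, with standard moves you can map the scalar action back to its named direction. This minimal check uses the gw object created earlier:

% With standard moves, Actions is ordered ["N";"S";"E";"W"],
% so the scalar action k corresponds to gw.Actions(k).
gw.Actions(2)   % returns "S": action 2 is an attempted move south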
For more information, see Create Grid World Object.
In either case, the action specification is an rlFiniteSetSpec object. To extract the action specification, use getActionInfo.
actInfo = getActionInfo(env)
actInfo = 
  rlFiniteSetSpec with properties:

       Elements: [4×1 double]
           Name: "MDP Actions"
    Description: [0×0 string]
      Dimension: [1 1]
       DataType: "double"
Observations
As in all grid world environments, the environment observation has a single channel carrying a scalar integer from 1 to the number of environment states. The observation indicates the current agent location (that is, the environment state) in column-wise fashion. The observation specification is an rlFiniteSetSpec object. To extract the observation specification, use getObservationInfo.
obsInfo = getObservationInfo(env)
obsInfo = 
  rlFiniteSetSpec with properties:

       Elements: [25×1 double]
           Name: "MDP Observations"
    Description: [0×0 string]
      Dimension: [1 1]
       DataType: "double"
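Because the observation is a column-wise linear index, you can convert between indices and grid positions using the state2idx and idx2state functions of the GridWorld object. For example (the specific index is illustrative):

% For the 5-by-5 grid world gw, linear index 12 is row 2 of column 3.
stateName = idx2state(gw,12)        % returns "[2,3]"
stateIdx = state2idx(gw,stateName)  % returns 12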
Grid World Dynamics
As for all grid world environments, the transition matrix property T of the underlying GridWorld object determines the dynamics.
The default transition matrix is such that any attempted move in any direction results in the agent moving one cell in that direction with a probability of one, except for any attempted move outside the grid, which results in the agent maintaining its current position.
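For example, you can probe where an attempted move leads under the default dynamics. This small check uses the 5-by-5 gw object and its standard action ordering:

% Next-state probabilities when moving north (action 1) from cell [3,1].
% Under the default deterministic dynamics, all the probability mass
% is on the cell one row up, [2,1].
s = state2idx(gw,"[3,1]");
p = gw.T(s,:,1);       % row vector of next-state probabilities
idx2state(gw,find(p))  % returns "[2,1]"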
For more information, see Create Grid World Object and Configure Grid World Object.
Rewards
As for all grid world environments, the reward matrix property R of the underlying GridWorld object determines the reward.
The default reward matrix contains only zeroes.
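For example, a common reward design penalizes every move and rewards reaching a goal cell. This sketch (the values are illustrative) uses the 5-by-5 gw object:

% Small penalty for every transition, large bonus for reaching "[5,5]".
R = -1*ones(25,25,4);
R(:,state2idx(gw,"[5,5]"),:) = 10;
gw.R = R;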
For more information, see Create Grid World Object and Configure Grid World Object.
Reset Function
The state of a custom grid world environment is initially set to 1, which is equivalent to the string "[1,1]" and represents the most northwestern position of the grid. The default reset function for a custom grid world environment then sets the initial environment state (that is, the initial position of the agent on the grid) randomly.
x0 = reset(env)
x0 = 12
You can write your own reset function to specify a different initial state. For example, to specify that the initial state of the environment is always 3, create a reset function that always returns 3, and set the ResetFcn property to the handle of the function.

env.ResetFcn = @() 3;
A training or simulation function automatically calls the reset function at the beginning of each training or simulation episode.
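If your grid world has obstacle or terminal states, a reset function that starts each episode from a random valid state can be useful. Here is one possible sketch, assuming gw has nonempty ObstacleStates and TerminalStates:

% Indices of states that are neither obstacles nor terminal states.
invalid = state2idx(gw,[gw.ObstacleStates; gw.TerminalStates]);
valid = setdiff(1:numel(gw.States),invalid);

% Pick a random valid state at the start of each episode.
env.ResetFcn = @() valid(randi(numel(valid)));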
Create a Default Agent for this Environment
The environment observation and action specifications allow you to create an agent (with a discrete action space) that works with your environment. For example, create a default actor-critic (AC) agent.
acAgent = rlACAgent(obsInfo,actInfo)
acAgent = 
  rlACAgent with properties:

            AgentOptions: [1×1 rl.option.rlACAgentOptions]
    UseExplorationPolicy: 1
         ObservationInfo: [1×1 rl.util.rlFiniteSetSpec]
              ActionInfo: [1×1 rl.util.rlFiniteSetSpec]
              SampleTime: 1
If needed, modify the agent options using dot notation.
acAgent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-3;
acAgent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-3;
You can now use both the environment and the agent as arguments for the built-in functions train and sim, which train or simulate the agent within the environment.
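For example, a training and simulation setup might look like the following sketch (the option values are illustrative starting points):

% Configure and run training; stop after a fixed number of episodes.
trainOpts = rlTrainingOptions( ...
    MaxEpisodes=200, ...
    MaxStepsPerEpisode=50, ...
    Verbose=false, ...
    Plots="training-progress");
trainStats = train(acAgent,env,trainOpts);

% Simulate the trained agent for one episode.
simOpts = rlSimulationOptions(MaxSteps=50);
experience = sim(env,acAgent,simOpts);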
You can also create and train agents for this environment interactively using the Reinforcement Learning Designer app. For an example, see Design and Train Agent Using Reinforcement Learning Designer.
For more information on creating agents, see Reinforcement Learning Agents.
Step Function
As in other MATLAB environments, you can also call the environment step function to return the next observation, reward, and an is-done scalar indicating whether the environment has reached a final state.
For example, call the step function with an action input of 2 to move the agent south.

[xn,rn,id] = step(env,2)

xn = 4

rn = 0

id = 
  logical
   0
The environment step and reset functions allow you to create a custom training or simulation loop. For more information on custom training loops, see Train Reinforcement Learning Policy Using Custom Training Loop.
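As a minimal sketch of such a loop (the random policy and step cap are illustrative):

% Simulate one episode with a random policy, capping the episode length.
obs = reset(env);
totalReward = 0;
for k = 1:100
    act = randi(4);   % random standard action: 1=N, 2=S, 3=E, 4=W
    [obs,rwd,isdone] = step(env,act);
    totalReward = totalReward + rwd;
    if isdone
        break
    end
end
totalReward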
See Also
Functions
createGridWorld | createMDP

Objects
rlMDPEnv | rlNumericSpec | rlFiniteSetSpec | rlFunctionEnv | rlMultiAgentFunctionEnv | rlTurnBasedFunctionEnv | SimulinkEnvWithAgent