Create Custom Grid World Environments

A custom grid world environment is a MATLAB® environment featuring a generic two-dimensional grid with actions, observations, rewards, dynamics, and optional obstacles and terminal states that are mostly left for you to define. As in any grid world environment, the goal of the agent is to move in a way to maximize its expected discounted cumulative long-term reward.

Grid world environments are a special case of Markov Decision Process (MDP) environments. An MDP is a discrete time stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of the decision maker. In a grid world environment, the state represents a position in a two-dimensional grid, while the action represents a move from the current position to the next, which an agent might attempt. To create a custom MDP environment, see Markov Decision Process (MDP) Environments, createMDP, and rlMDPEnv.

Basic 5-by-5 grid world with the agent indicated by a red circle in the top left corner, the terminal location indicated by a blue square in the bottom right corner, and four obstacle squares indicated by black squares in the middle.

You can use a custom grid world environment to analyze the behavior of different discrete-time agents on custom grid worlds, and to explore reinforcement learning concepts. For example, many common benchmark reinforcement learning problems are grid world problems, and you can study them with Reinforcement Learning Toolbox™ by creating custom grid world environments.

To create a custom grid world environment:

  1. Create the grid world object.

  2. Configure the grid world object.

  3. Use the grid world object to create your environment.

To load a grid world environment with predefined actions, observations, rewards, and dynamics, see Load Predefined Grid World Environments.

Create Grid World Object

You can create your own grid world model using the createGridWorld function. Specify the grid size when creating the GridWorld object.

For example, at the MATLAB command line, type:

GW = createGridWorld(6,6,"Standard")
GW = 

  GridWorld with properties:

                GridSize: [6 6]
            CurrentState: "[1,1]"
                  States: [36×1 string]
                 Actions: [4×1 string]
                       T: [36×36×4 double]
                       R: [36×36×4 double]
          ObstacleStates: [0×1 string]
          TerminalStates: [0×1 string]
    ProbabilityTolerance: 8.8818e-16

Note

The grid world model GW is a GridWorld object, not an environment object. You must later create an rlMDPEnv environment object from GW.

The GridWorld object has these properties.

For each property, the listing below indicates whether the property is read-only.

GridSize (read-only)

Dimensions of the grid world, displayed as a row vector containing two positive integers. The first integer m is the number of rows in the grid, and the second integer n is the number of columns in the grid.

CurrentState (writable)

Name of the current state of the environment. This name corresponds to the current agent position in the grid, and it is specified as a string or character vector such as "[a,b]". Here, a and b are two positive integers, no greater than m and n respectively, that indicate the row (a) and the column (b) corresponding to the agent position on the grid. Specifying this property in any other format results in an error when the environment step function is executed on an environment built using GW. By default, this property is set to the string "[1,1]".

You can use this property to set the initial state of the environment. For example, the command GW.CurrentState = "[2,4]"; sets the current position of the agent to the cell located in the second row and the fourth column of the grid. Because states are numbered column-wise, on an 8-by-7 grid this position corresponds to environment state number 26, using the formula 8*(4-1) + 2 = 26.

If you call the step function on an environment built using GW, the environment executes the function from the state indicated by CurrentState. Note that every time the environment reset function is called, the environment state is reset according to the specific code in the reset function. For more information on step and reset functions, see Create Custom Environment Using Step and Reset Functions.
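As a minimal sketch (the variable names here are illustrative, not part of the toolbox), you can convert between a grid position [a,b] and the corresponding column-wise state index on an m-by-n grid as follows.

% Minimal sketch: column-wise conversion between the grid position [a,b]
% and the state index on an m-by-n grid.
m = 8; n = 7;                            % grid dimensions
a = 2; b = 4;                            % row and column of the agent position
stateIdx = m*(b-1) + a                   % returns 26, as in the example above
col = ceil(stateIdx/m);                  % recover the column from the index
row = stateIdx - m*(col-1);              % recover the row from the index
stateName = sprintf("[%d,%d]",row,col)   % returns "[2,4]"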

States (read-only)

A string vector containing the state names of the grid world, using the same "[a,b]" format as the CurrentState property. Because this property is read-only, you cannot assign to it. For example, for a 2-by-2 grid world model GW, the States property contains these names, listed column-wise.

GW.States
ans = 

  4×1 string array

    "[1,1]"
    "[2,1]"
    "[1,2]"
    "[2,2]"
Actions (read-only)

A string vector containing the list of possible actions that the agent can execute in the grid world environment. You can set the actions when you create the grid world model by using the moves argument.

For example, at the MATLAB command line, type:

GW = createGridWorld(m,n,moves)

Here, m and n are integers as specified in GridSize, and moves is a string that can be either "Standard" or "Kings".

The value of moves determines the Actions property as follows.

  • "Standard" — GW.Actions is ["N";"S";"E";"W"], indicating that the agent can attempt to move north, south, east, and west from its current grid position. The step function of the environment built using GW encodes these moves using integers from 1 to 4. For example, step(env,3) indicates that the agent attempts to move east from its current position.

  • "Kings" — GW.Actions is ["N";"S";"E";"W";"NE";"NW";"SE";"SW"], indicating that the agent can attempt to move north, south, east, west, northeast, northwest, southeast, and southwest, respectively, from its current grid position. The step function of the environment built using GW encodes these moves using integers from 1 to 8, so that, for example, step(env,8) indicates that the agent attempts to move southwest from its current position.
T (writable)

State transition matrix, specified as a 3-D array in which every row of each page contains nonnegative numbers that must add up to one.

The state transition matrix T is a probability matrix in which each entry indicates the likelihood of the agent moving from the current state s to any possible next state s' by performing action a.

Dimensions of the transition matrix

T can be denoted as

T(s,s',a) = probability(s'|s,a)

T is:

  • A K-by-K-by-4 array, if moves is specified as "Standard". Here, K = m*n.

  • A K-by-K-by-8 array, if moves is specified as "Kings".

When you create a grid world object, the default transition matrix contains standard deterministic transitions corresponding to the four or eight actions that the agent can execute. Specifically, the default transition matrix is such that any attempted move in any direction results in the agent moving one cell in that direction with probability of one, except for any attempted move outside the grid, which results in the agent keeping its current position.

For example, consider a 5-by-5 deterministic grid world object GW with the agent in cell [3,1]. View the state transition matrix for the north direction.

northStateTransition = GW.T(:,:,1)

Here, the number 1 in the last dimension encodes the attempted move north, as specified in the Actions property.

Basic five-by-five grid world showing the agent position that moves north

In this figure, the value of northStateTransition(3,2) is 1. This value indicates that when the agent is in the position [3,1], following the action 'N', the agent moves to the position [2,1] with a probability of 1 (and with a probability of 0 to the other cells specified in the same row).

Note

Since each number in one row represents the probability of moving into a specific cell, all the numbers along a row must add up to either one or zero.

R (writable)

Reward transition matrix, specified as a 3-D array. R determines how much reward the agent receives after performing an action in the environment.

Each entry of the reward transition matrix specifies the reward that the agent obtains when moving from the current state s to any possible next state s' by performing action a:

r = R(s,s',a).

R has the same shape and size as the state transition matrix T. Specifically, R is:

  • A K-by-K-by-4 array, if moves is specified as "Standard". Here, K = m*n.

  • A K-by-K-by-8 array, if moves is specified as "Kings".

When you create a grid world object, the reward matrix contains only zeros.

Set up R so that the agent receives a reward after every action. For example, you can set up a positive reward if the agent transitions over obstacle states and when it reaches the terminal state. You can also set up a reward of –1 for every action the agent takes, independent of the current state and next state.
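Here is a minimal sketch, assuming a 5-by-5 grid world object GW created with "Standard" moves and with terminal state "[5,5]" (column-wise state index 25). It penalizes every move and rewards any transition into the terminal state.

% Minimal sketch: reward of -1 for every transition and +10 for any
% transition into the terminal state "[5,5]" (state index 25).
nS = numel(GW.States);     % number of states (25 on a 5-by-5 grid)
nA = numel(GW.Actions);    % number of actions (4 with "Standard" moves)
R = -1*ones(nS,nS,nA);     % default reward of -1 per step
R(:,25,:) = 10;            % reward for any move that lands in state 25
GW.R = R;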

ObstacleStates (writable)

ObstacleStates are states that cannot be reached in the grid world, specified as a string vector. Consider this 5-by-5 grid world object GW.

Basic 5-by-5 grid world with the agent indicated by a red circle in the top left corner, the terminal location indicated by a blue square in the bottom right corner, and four obstacle squares indicated by black squares in the middle.

This syntax specifies the obstacle states, represented by black squares in the figure.

GW.ObstacleStates = ["[3,3]";"[3,4]";"[3,5]";"[4,3]"];

When you set obstacle states, the transition matrix T automatically updates so that if the agent attempts to move into an obstacle, its resulting position is the same as its current position, with a probability of one.
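As a minimal check of this behavior (a sketch, assuming the 5-by-5 object GW and the obstacle states set with the previous command), you can inspect the page of T that corresponds to the action "S".

% Minimal sketch: from cell [2,3] (column-wise state 12), an attempted
% move south toward the obstacle cell [3,3] (state 13) leaves the agent
% in state 12 with probability one.
southPage = GW.T(:,:,2);   % page 2 corresponds to the action "S"
southPage(12,12)           % expected to return 1
southPage(12,13)           % expected to return 0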

TerminalStates (writable)

TerminalStates are the final states in the grid world, specified as a string vector. Consider the picture of the previous 5-by-5 grid world model GW. The blue cell is the terminal state and you can specify it with this command.

GW.TerminalStates = "[5,5]";

When you set terminal states, the transition and reward matrices of GW are not automatically updated. However, suppose you use GW to create the MDP environment env. Then, when the agent moves into a terminal state as a result of a call to the env step function with an appropriate action input, the step function returns an is-done value of true as its third output argument. Because is-done becomes true, when you use train or sim, the training or simulation episode stops when the agent reaches a terminal state.

Configure Grid World Object

After creating your GridWorld object, configure its transition matrix to make sure it represents your desired dynamics. Also configure its reward matrix to make sure the agent receives the appropriate reward for its moves.

Because each row of each page of the transition matrix must always sum to one, you cannot modify the transition matrix entries in place one at a time. Instead, assign the default matrix to a temporary variable in the workspace, modify the entries of the variable, and then reassign the modified variable to the transition matrix of your GridWorld object.

For example, create a GridWorld object with five rows and five columns.

gw = createGridWorld(5,5);

Extract the default transition matrix, which already contains the standard transition dynamics.

T = gw.T;

Modify the temporary matrix so that from state 6 any action leads to state 10.

% Zero out the existing transitions from state 6.
T(6,:,:) = 0;
% For every action, set the probability of reaching state 10 to 1.
T(6,10,:) = 1;

Update the transition matrix of your GridWorld object.

gw.T = T;
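After modifying the transition matrix, you can check that every row of every page still sums to one (or zero), within the probability tolerance of the object. This check is a minimal sketch, not a step required by the toolbox.

% Minimal sketch: verify that each row of each page of the modified
% transition matrix sums to one (or zero) within ProbabilityTolerance.
rowSums = sum(gw.T,2);                 % 25-by-1-by-4 array of row sums
tol = gw.ProbabilityTolerance;
allRowsValid = all(abs(rowSums-1) < tol | abs(rowSums) < tol,"all")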

Create Grid World Environment from Grid World Object

After configuring your GridWorld object, use it to create an MDP environment using rlMDPEnv. This step is necessary because the GridWorld object is not an environment object.

For example, if you have the GridWorld object gw in the MATLAB workspace, at the command line, type:

env = rlMDPEnv(gw)
env = 

  rlMDPEnv with properties:

       Model: [1×1 rl.env.GridWorld]
    ResetFcn: []

This command creates the environment env that contains your GridWorld object.

env.Model
ans = 

  GridWorld with properties:

                GridSize: [5 5]
            CurrentState: "[1,1]"
                  States: [25×1 string]
                 Actions: [4×1 string]
                       T: [25×25×4 double]
                       R: [25×25×4 double]
          ObstacleStates: [0×1 string]
          TerminalStates: [0×1 string]
    ProbabilityTolerance: 8.8818e-16
If necessary, you can set your own reset function. For example, to make sure the agent always starts from state number 2, set the ResetFcn environment property to the handle of an anonymous function that always returns 2.
env.ResetFcn = @() 2
env = 

  rlMDPEnv with properties:

       Model: [1×1 rl.env.GridWorld]
    ResetFcn: @()2

For more information on the reset function, see Reset Function.

Environment Visualization

As with other grid world environments, you can visualize the environment using the plot function. A red circle represents the current agent position, that is, the environment state. If present, the terminal locations and obstacles are represented by blue and black squares, respectively.

plot(env)

Note

Visualizing the environment during training can provide insight, but it tends to increase training time. For faster training, keep the environment plot closed during training.

Actions

Depending on the Actions property of the underlying GridWorld model, the action channel carries a scalar integer ranging either from 1 to 4 or from 1 to 8.

  • When the grid world model is created with moves set to "Standard", the integer indicates an (attempted) move in the direction north, south, east, or west, respectively.

  • When the grid world model is created with moves set to "Kings", the integer indicates an (attempted) move in the direction north, south, east, west, northeast, northwest, southeast, or southwest, respectively.

For more information, see Create Grid World Object.

In either case, the action specification is an rlFiniteSetSpec object. To extract the action specification, use getActionInfo.

actInfo = getActionInfo(env)
actInfo = 

  rlFiniteSetSpec with properties:

       Elements: [4×1 double]
           Name: "MDP Actions"
    Description: [0×0 string]
      Dimension: [1 1]
       DataType: "double"

Observations

As in all grid world environments, the environment observation has a single channel carrying a scalar integer from 1 to the number of environment states. The observation indicates the current agent location (that is, the environment state), with states numbered column-wise. The observation specification is an rlFiniteSetSpec object. To extract the observation specification, use getObservationInfo.

obsInfo = getObservationInfo(env)
obsInfo = 

  rlFiniteSetSpec with properties:

       Elements: [25×1 double]
           Name: "MDP Observations"
    Description: [0×0 string]
      Dimension: [1 1]
       DataType: "double"

Grid World Dynamics

As for all grid world environments, the transition matrix property T of the underlying GridWorld object determines the dynamics.

The default transition matrix is such that any attempted move in any direction results in the agent moving one cell in that direction with a probability of one, except for any attempted move outside the grid, which results in the agent maintaining its current position.

For more information, see Create Grid World Object and Configure Grid World Object.

Rewards

As for all grid world environments, the reward matrix property R of the underlying GridWorld object determines the reward.

The default reward matrix contains only zeroes.

For more information, see Create Grid World Object and Configure Grid World Object.

Reset Function

The state of a custom grid world environment is initially set to 1, which is equivalent to the string "[1,1]" and represents the most northwestern position of the grid. The default reset function for a custom grid world environment then sets the initial environment state (that is, the initial position of the agent on the grid) randomly.

x0 = reset(env)
x0 =

    12

You can write your own reset function to specify a different initial state. For example, to specify that the initial state of the environment is always 3, create a reset function that always returns 3, and set the ResetFcn property to the handle of the function.

env.ResetFcn = @() 3;

A training or simulation function automatically calls the reset function at the beginning of each training or simulation episode.
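As a more flexible sketch, assuming your environment env wraps a grid world object gw for which obstacle and terminal states are defined, you can set ResetFcn to an anonymous function that starts each episode from a random state that is neither an obstacle nor a terminal state.

% Minimal sketch: reset to a random valid (non-obstacle, non-terminal)
% state of the grid world object gw.
allStates = (1:numel(gw.States))';
isBlocked = ismember(gw.States,[gw.ObstacleStates; gw.TerminalStates]);
validStates = allStates(~isBlocked);
env.ResetFcn = @() validStates(randi(numel(validStates)));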

Create a Default Agent for this Environment

The environment observation and action specifications allow you to create an agent (with discrete action space) that works with your environment. For example, create a default AC agent.

acAgent = rlACAgent(obsInfo,actInfo)
acAgent = 

  rlACAgent with properties:

            AgentOptions: [1×1 rl.option.rlACAgentOptions]
    UseExplorationPolicy: 1
         ObservationInfo: [1×1 rl.util.rlFiniteSetSpec]
              ActionInfo: [1×1 rl.util.rlFiniteSetSpec]
              SampleTime: 1

If needed, modify the agent options using dot notation.

acAgent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-3;
acAgent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-3;

You can now use both the environment and the agent as arguments for the built-in functions train and sim, which train or simulate the agent within the environment.
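For example, the following is a minimal sketch of a training call followed by a simulation. The option values shown are illustrative, not recommended settings.

% Minimal sketch: train the default AC agent on the custom grid world
% environment, then simulate it for one episode.
trainOpts = rlTrainingOptions( ...
    MaxEpisodes=200, ...
    MaxStepsPerEpisode=50, ...
    StopTrainingCriteria="AverageReward", ...
    StopTrainingValue=10);
trainResults = train(acAgent,env,trainOpts);

simOpts = rlSimulationOptions(MaxSteps=50);
experience = sim(env,acAgent,simOpts);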

You can also create and train agents for this environment interactively using the Reinforcement Learning Designer app. For an example, see Design and Train Agent Using Reinforcement Learning Designer.

For more information on creating agents, see Reinforcement Learning Agents.

Step Function

As in other MATLAB environments, you can also call the environment step function to return the next observation, the reward, and an is-done scalar indicating whether the environment has reached a terminal state.

For example, assuming the current environment state is 3 (the cell in the third row and first column of the grid), call the step function with an action input of 2 to move the agent south.

[xn,rn,id] = step(env,2)
xn =

     4


rn =

     0


id =

  logical

   0

The environment step and reset functions allow you to create a custom training or simulation loop. For more information on custom training loops, see Train Reinforcement Learning Policy Using Custom Training Loop.
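For example, the following is a minimal sketch of a custom simulation loop that uses reset and step, together with getAction to query the agent policy. The names assume the agent and environment created in the previous sections, and the step cap is illustrative.

% Minimal sketch: simulate up to 50 steps with a custom loop,
% accumulating the reward and stopping at a terminal state.
obs = reset(env);
totalReward = 0;
for stepCt = 1:50
    act = getAction(acAgent,{obs});          % getAction returns a cell array
    [obs,reward,isDone] = step(env,act{1});  % apply the scalar action
    totalReward = totalReward + reward;
    if isDone
        break
    end
end
totalReward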
