TF 2.0 for Reinforcement Learning

Download the notebook or follow along.

In [1]:
import gym

Introduction to the OpenAI Gym Interface

OpenAI has been developing the gym library to help reinforcement learning researchers get started with pre-implemented environments. In the lesson on Markov decision processes, we explicitly implemented $\mathcal{S}, \mathcal{A}, \mathcal{P}$ and $\mathcal{R}$ using matrices and tensors in numpy.

Recall the environment and agent that we discussed in the introduction. When we specified an MDP and a policy, we were abstractly representing the environment and the agent. gym gives us access to predefined environments that are more meaningful than our essentially random environments.

Thus far we have been using discrete observation spaces (i.e., our state is representable by an integer). In keeping with this, we will start by considering a gym environment that also has a discrete observation space: FrozenLake-v0.

From the documentation:

The agent controls the movement of a character in a grid world. Some tiles of the grid are walkable, and others lead to the agent falling into the water. Additionally, the movement direction of the agent is uncertain and only partially depends on the chosen direction. The agent is rewarded for finding a walkable path to a goal tile.

The grid is $4 \times 4$, giving us a total of $16$ states.

SFFF       (S: starting point, safe)
FHFH       (F: frozen surface, safe)
FFFH       (H: hole, fall to your doom)
HFFG       (G: goal, where the frisbee is located)
In [2]:
env = gym.make("FrozenLake-v0")

We can inspect information about gym environments. Every environment has an observation_space (corresponding to $\mathcal{S}$) and an action_space (corresponding to $\mathcal{A}$). There are many categories of spaces available, but the two that are most common and most important are:

  1. Discrete: When an observation space or action space is Discrete, states or actions are represented by integers. This is what we used in our notebook on Markov decision processes.
    • Every Discrete space has an attribute n corresponding to the number of discrete elements in the space (i.e., the number of states or the number of actions).
  2. Box: Some observation or action spaces instead take on real values and can be of various shapes.
    • Every Box space has an attribute shape corresponding to the dimensions of the space. For example, an observation space with shape $(84, 84, 3)$ might correspond to an $84 \times 84 \times 3$ RGB image. (Both n and shape are illustrated after the next cell.)
In [3]:
print(f'observation space: \n{env.observation_space} \naction space: \n{env.action_space}')
observation space: 
Discrete(16) 
action space: 
Discrete(4)
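
Both spaces here are Discrete, so we can read their sizes off the n attribute. As a short sketch, we also create a second environment (CartPole-v0, used purely for illustration) whose observation space is a Box, to show the shape attribute:

n_states = env.observation_space.n    # 16 states in FrozenLake-v0
n_actions = env.action_space.n        # 4 actions: left, down, right, up
print(n_states, n_actions)

# A Box space exposes its dimensions via shape instead of n.
box_env = gym.make("CartPole-v0")
print(box_env.observation_space.shape)  # a Box of shape (4,)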

Most environments have a finite time limit corresponding to the time horizon $T$. An environment becomes terminal if we exceed this time horizon, or if something else happens that ends the episode. In the context of an MDP, a terminal state is one where every action returns us to the same state. This often happens when our agent "dies".
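
For environments registered with a time limit, gym records it on the environment's spec. A minimal check (assuming the limit was registered for this environment; the attribute may be None otherwise):

print(env.spec.max_episode_steps)  # FrozenLake-v0 is registered with a 100-step limit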

Environments, when created via gym.make, start out terminal. Whenever an environment is terminal, we need to reset it:

In [4]:
env.reset()
Out[4]:
0

We can see that reset returned a value: the initial state. Resetting the environment provides the initial state $s_0$ without requiring us to take an initial action.

In [5]:
observation = env.reset()

In order to generate the next state and rewards, we call the env's step function. The step function accepts an action, and returns four things:

  1. The next state
  2. The reward for the transition
  3. A boolean indicating whether or not the environment is terminal
  4. Additional info that may be relevant for logging but is not intended to be used by the agent.

Since we do not have a way of generating actions yet, we can call the sample method of the env's action_space:

In [6]:
observation_next, reward, done, info = env.step(env.action_space.sample())
print(f'observation: \n{observation} \nreward: \n{reward} \ndone: \n{done} \ninfo: \n{info}')
observation: 
0 
reward: 
0.0 
done: 
False 
info: 
{'prob': 0.3333333333333333}

While we can log trajectories, we can get a better idea of how the environment is changing by calling its render function. Unfortunately, routing the output of this function to a Jupyter notebook is not trivial, so for most environments it will pop up in a new window. FrozenLake-v0, however, renders as text, so its output appears inline below.

In [7]:
env.render()
  (Left)
SFFF
FHFH
FFFH
HFFG

Let's consider an agent whose policy is to choose actions uniformly at random. We can emulate this by using env.action_space.sample() as our action.

Then the basic agent-environment interaction loop using gym looks something like this:

In [8]:
T = 100
observation = env.reset()
for t in range(T):
    action = env.action_space.sample()
    observation, reward, done, info = env.step(action)
    if done:
        observation = env.reset()
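
To get a feel for how well this random policy does, we can extend the same loop to accumulate the reward collected in each episode. A short sketch (the step budget and variable names are our own choices; for FrozenLake the average return is simply the fraction of episodes that reach the goal, since the only reward is 1.0 at the goal):

returns = []
episode_return = 0.0
observation = env.reset()
for t in range(10000):
    observation, reward, done, info = env.step(env.action_space.sample())
    episode_return += reward
    if done:
        # Episode ended (goal, hole, or time limit): record its return and start over.
        returns.append(episode_return)
        episode_return = 0.0
        observation = env.reset()
print(sum(returns) / len(returns))  # average return of the random policy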

We will be using gym for its wide range of available environments. As we will see in later lessons, we can also extend the functionality of gym environments by using Wrappers.
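
As a quick preview of that idea, here is a minimal sketch of a Wrapper built on gym's RewardWrapper base class (the scaling factor is purely illustrative):

class ScaledRewardWrapper(gym.RewardWrapper):
    """Multiplies every reward returned by step by a constant factor."""
    def reward(self, reward):
        return 10.0 * reward

wrapped_env = ScaledRewardWrapper(gym.make("FrozenLake-v0"))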
