Download the notebook or follow along.
Introduction to the OpenAI Gym Interface
OpenAI has been developing the gym library to help reinforcement learning researchers get started with pre-implemented environments. In the lesson on Markov decision processes, we explicitly implemented $\mathcal{S}, \mathcal{A}, \mathcal{P}$ and $\mathcal{R}$ using matrices and tensors in numpy.
import gym
Recall the environment and agent that we discussed in the introduction. When we specified an MDP and a policy, we were abstractly representing the environment and the agent. gym gives us access to predefined environments that are more meaningful than our essentially random environments.
Thus far we have been using discrete observation spaces (i.e., our state is representable by an integer). In keeping with this, we will start by considering a gym environment that also has a discrete observation space: FrozenLake-v0.
From the documentation:
The agent controls the movement of a character in a grid world. Some tiles of the grid are walkable, and others lead to the agent falling into the water. Additionally, the movement direction of the agent is uncertain and only partially depends on the chosen direction. The agent is rewarded for finding a walkable path to a goal tile.
The grid is $4 \times 4$, giving us a total of $16$ states.
SFFF (S: starting point, safe)
FHFH (F: frozen surface, safe)
FFFH (H: hole, fall to your doom)
HFFG (G: goal, where the frisbee is located)
env = gym.make("FrozenLake-v0")
We can inspect information about gym environments. Every environment has an observation_space (corresponding to $\mathcal{S}$) and an action_space (corresponding to $\mathcal{A}$). There are many categories of spaces available, but the two that are most common and most important are:
- Discrete: When observation spaces or action spaces are discrete, they expect integers. This is what we used in our notebook on Markov decision processes. Every Discrete space has an attribute n corresponding to the number of discrete elements in the space (i.e., the number of states or the number of actions).
- Box: Some observation or action spaces instead take on real values, and can be of various shapes. Every Box space has an attribute shape corresponding to the dimensions of the space. For example, an observation space with shape $(84, 84, 3)$ might correspond to an $84 \times 84 \times 3$ RGB image.
print(f'observation space: \n{env.observation_space} \naction space: \n{env.action_space}')
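Since both of FrozenLake's spaces are Discrete, we can read their sizes from the n attribute. The snippet below is a small sketch of this; CartPole-v0 is pulled in only as a convenient illustration of a Box space with a shape attribute:
# Both of FrozenLake's spaces are Discrete, so each exposes an attribute n
print(env.observation_space.n)  # 16 states for the 4 x 4 grid
print(env.action_space.n)       # 4 actions, one per movement direction

# A Box space, such as CartPole's observation space, exposes shape instead
print(gym.make("CartPole-v0").observation_space.shape)  # (4,)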
Most environments have a finite time limit corresponding to the time horizon $T$. An environment can become terminal if we exceed this time horizon, or if something else happens that causes the environment to end. In the context of an MDP, this is a state where any action we take returns us to the same state. This often happens when our agent "dies".
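For environments registered with gym, this time limit is recorded on the environment's spec. A quick way to check it (assuming the standard registration of FrozenLake-v0, which caps episodes at 100 steps):
# Time horizon T enforced by gym's built-in time limit
print(env.spec.max_episode_steps)  # 100 for FrozenLake-v0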
Environments, when created via gym.make, start out terminal. Whenever an environment is terminal, we need to reset it:
env.reset()
We can see that this call returned something: resetting the environment provides the initial state $s_0$ without requiring an initial action.
observation = env.reset()
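For FrozenLake the observation is simply an integer indexing one of the $16$ grid cells, read row by row, so the value returned by reset should be the index of the starting tile. A quick sanity check (assuming the S tile is state 0):
# The initial observation is an integer in {0, ..., 15}
print(observation)                                   # expected: 0, the S tile
print(env.observation_space.contains(observation))   # True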
In order to generate the next state and rewards, we call the env's step function. The step function accepts an action, and returns four things:
- The next state
- The reward for the transition
- A boolean indicating whether or not the environment is terminal
- Additional info that may be relevant for logging but is not intended to be used by the agent.
Since we do not have a way of generating actions yet, we can call the sample method of the env's action_space:
observation_next, reward, done, info = env.step(env.action_space.sample())
print(f'observation_next: \n{observation_next} \nreward: \n{reward} \ndone: \n{done} \ninfo: \n{info}')
While we can log trajectories, we can get a better idea of how the environment is changing by calling its render function. Unfortunately, routing the output of this function to a Jupyter notebook is not trivial, so for most environments it will pop up in a new window.
env.render()
Let's consider an agent whose policy is to choose actions uniformly at random. We can emulate this by using env.action_space.sample() as our action.
Then the basic agent-environment interaction loop using gym looks something like this:
T = 100
observation = env.reset()
for t in range(T):
    action = env.action_space.sample()
    observation, reward, done, info = env.step(action)
    if done:
        observation = env.reset()
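To get a rough sense of what this loop produces, we can tally the reward a uniformly random policy collects over complete episodes. FrozenLake only gives a reward of $1$ for reaching the goal, so the average return estimates the random agent's success rate. The episode count below is an arbitrary choice for this sketch:
# Estimate the success rate of the uniformly random policy
n_episodes = 1000
successes = 0
for episode in range(n_episodes):
    observation = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
    successes += reward  # the final reward is 1 only if the goal was reached
print(successes / n_episodes)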
We will be using gym for its wide range of available environments. As we will see in later lessons, we can also extend the functionality of gym environments by using Wrappers.
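To give a flavour of what is coming, here is a minimal sketch of a Wrapper that counts the number of steps taken; the class name and its steps attribute are invented for this example:
class StepCounter(gym.Wrapper):
    # Minimal example wrapper: counts environment steps per episode.
    def __init__(self, env):
        super().__init__(env)
        self.steps = 0

    def reset(self, **kwargs):
        self.steps = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        self.steps += 1
        return self.env.step(action)

wrapped_env = StepCounter(gym.make("FrozenLake-v0"))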