TF 2.0 for Reinforcement Learning

Download the notebook or follow along.

In [1]:
import gym
import numpy as np

Gym Wrappers

In this lesson, we will be learning about the extremely powerful feature of wrappers made available to us courtesy of OpenAI's gym. Wrappers allow us to add functionality to environments, such as modifying observations and rewards to be fed to our agent. It is common in reinforcement learning to preprocess observations in order to make them easier to learn from. A common example, when using image-based inputs, is to scale all pixel values to lie between $0$ and $1$ rather than between $0$ and $255$, as is standard for RGB images.

The gym.Wrapper class inherits from the gym.Env class, which defines environments according to the OpenAI API for reinforcement learning. Subclassing gym.Wrapper requires defining an __init__ method that accepts the environment to be extended as a parameter and passes it to the parent class's __init__.

In [2]:
class BasicWrapper(gym.Wrapper):
    def __init__(self, env):
        super().__init__(env)
        self.env = env
        
    def step(self, action):
        next_state, reward, done, info = self.env.step(action)
        # modify ...
        return next_state, reward, done, info
In [3]:
env = BasicWrapper(gym.make("CartPole-v0"))
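
The wrapped environment can now be reset and stepped exactly like the unwrapped one; a minimal sketch of its usage might look like:

s_t = env.reset()
a_t = env.action_space.sample()          # sample a random action
s_t, r_t, done, info = env.step(a_t)     # step through the wrapper's step method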

We can modify specific aspects of the environment by using subclasses of gym.Wrapper that override how the environment processes observations, rewards, and actions.

The following three classes provide this functionality:

  1. gym.ObservationWrapper: Used to modify the observations returned by the environment. To do this, override the observation method of the wrapper. This method accepts a single parameter (the observation to be modified) and returns the modified observation.
  2. gym.RewardWrapper: Used to modify the rewards returned by the environment. To do this, override the reward method of the wrapper. This method accepts a single parameter (the reward to be modified) and returns the modified reward.
  3. gym.ActionWrapper: Used to modify the actions passed to the environment. To do this, override the action method of the wrapper. This method accepts a single parameter (the action to be modified) and returns the modified action.
In [4]:
class ObservationWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super().__init__(env)
    
    def observation(self, obs):
        # modify obs
        return obs
    
class RewardWrapper(gym.RewardWrapper):
    def __init__(self, env):
        super().__init__(env)
    
    def reward(self, rew):
        # modify rew
        return rew
    
class ActionWrapper(gym.ActionWrapper):
    def __init__(self, env):
        super().__init__(env)
    
    def action(self, act):
        # modify act
        return act
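
For instance, the pixel-scaling preprocessing mentioned at the start of this lesson fits naturally into an ObservationWrapper. The class below is an illustrative sketch (the name ScaledObservationWrapper is ours, not part of gym) and assumes the wrapped environment emits arrays of values in $[0, 255]$:

class ScaledObservationWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super().__init__(env)
    
    def observation(self, obs):
        # scale pixel values from [0, 255] down to [0, 1]
        return np.asarray(obs, dtype=np.float32) / 255.0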

Wrappers can be used to modify how an environment works to meet the preprocessing criteria of published papers. The OpenAI Baselines implementations include wrappers that reproduce the preprocessing used in the original DQN paper and subsequent DeepMind publications.
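
One of the simpler preprocessing steps from that line of work is reward clipping, which restricts every reward to $[-1, 1]$. A minimal sketch using the RewardWrapper pattern above (illustrative only, not the Baselines implementation) might look like:

class ClippedRewardWrapper(gym.RewardWrapper):
    def __init__(self, env):
        super().__init__(env)
    
    def reward(self, rew):
        # clip rewards to the interval [-1, 1]
        return np.clip(rew, -1.0, 1.0)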

Here we define a wrapper that takes an environment with a gym.spaces.Discrete observation space and generates a new environment with a one-hot encoding of the discrete states, for use in, for example, neural networks.

In [5]:
class DiscreteToBoxWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super().__init__(env)
        assert isinstance(env.observation_space, gym.spaces.Discrete), \
            "Should only be used to wrap Discrete envs."
        self.n = self.observation_space.n
        self.observation_space = gym.spaces.Box(0, 1, (self.n,))
    
    def observation(self, obs):
        new_obs = np.zeros(self.n)
        new_obs[obs] = 1
        return new_obs
In [6]:
env = DiscreteToBoxWrapper(gym.make("FrozenLake-v0"))
T = 10
s_t = env.reset()
for t in range(T):
    a_t = env.action_space.sample()
    s_t, r_t, done, info = env.step(a_t)
    print(s_t)
    if done:
        s_t = env.reset()
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
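
We can also verify that the wrapper advertises the new observation space while the underlying environment still reports the original discrete one (the exact printed representation varies across gym versions):

print(env.observation_space)      # a Box space with shape (16,)
print(env.env.observation_space)  # Discrete(16)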

Going Beyond the Wrapper Class

It is possible to apply the concept of wrappers beyond what is defined here to add functionality to the environment, such as providing auxiliary observation functions that allow for multiple preprocessing streams to occur.

In more complex applications of deep reinforcement learning, evaluating the policy can take significantly longer than stepping the environment. This means that the majority of computational time is spent choosing actions, which makes data collection slow. Since deep reinforcement learning is extremely data intensive (often requiring millions of timesteps of experience to achieve good performance), we should prioritize rapidly acquiring data.

The following class accepts a function that returns an environment, and returns a vectorized version of the environment. It essentially generates $n$ copies of the environment. Its step function expects a vector of $n$ actions, and returns vectors of $n$ next states, $n$ rewards, $n$ done flags, and $n$ infos.

In [7]:
class VectorizedEnvWrapper(gym.Wrapper):
    def __init__(self, make_env, num_envs=1):
        super().__init__(make_env())
        self.num_envs = num_envs
        self.envs = [make_env() for env_index in range(num_envs)]
    
    def reset(self):
        return np.asarray([env.reset() for env in self.envs])
    
    def reset_at(self, env_index):
        return self.envs[env_index].reset()
    
    def step(self, actions):
        next_states, rewards, dones, infos = [], [], [], []
        for env, action in zip(self.envs, actions):
            next_state, reward, done, info = env.step(action)
            next_states.append(next_state)
            rewards.append(reward)
            dones.append(done)
            infos.append(info)
        return np.asarray(next_states), np.asarray(rewards), \
            np.asarray(dones), np.asarray(infos)
In [8]:
num_envs = 128
env = VectorizedEnvWrapper(lambda: gym.make("CartPole-v0"), num_envs=num_envs)
T = 10
observations = env.reset()
for t in range(T):
    actions = np.random.randint(env.action_space.n, size=num_envs)
    observations, rewards, dones, infos = env.step(actions)  
    for i in range(len(dones)):
        if dones[i]:
            observations[i] = env.reset_at(i)
print(observations.shape)
print(rewards.shape)
print(dones.shape)
(128, 4)
(128,)
(128,)
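
Because the vectorized environment returns a batch of observations, the policy itself can also be evaluated in a single batched call rather than once per environment, which is where the speedup discussed above comes from. As a hypothetical sketch, here a random linear map stands in for a neural network policy:

n_actions = env.action_space.n
W = np.random.randn(4, n_actions)       # random linear "policy"; CartPole observations have 4 dimensions
logits = observations @ W               # one batched call produces shape (num_envs, n_actions)
actions = np.argmax(logits, axis=-1)    # one action per environment
observations, rewards, dones, infos = env.step(actions)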
