TF 2.0 for Reinforcement Learning


Download the notebook or follow along.

In [1]:
import random
import gym
import numpy as np
from collections import deque
import tensorflow as tf
import pandas as pd
import seaborn as sns
sns.set()

Deep $Q$ networks (DQN)

In previous notebooks, we have seen how we can use tensorflow and autodifferentiation to do tabular $Q$-learning in the context of a regression problem. While this technique is powerful for environments with small (finite) observation spaces $\mathcal{S}$ and action spaces $\mathcal{A}$, we run into problems when our observation space is continuous (or even just large!).

Tabular $Q$-learning is only guaranteed to converge if all state-action pairs are visited infinitely many times. In practice, this means visiting each pair a very large number of times to get a reasonable approximation of the $Q$-function. However, when the observation space becomes large (such as with image inputs), it is likely that we encounter each state-action pair at most once, so tabular $Q$-learning is not guaranteed to converge.

Instead, we want a technique that can estimate $Q$-values such that similar states produce similar outputs. This would allow us to learn from some state-action pairs, and then generalize to other unseen state-action pairs. By using a differentiable function approximator, we get this kind of behaviour. Recall that $Q$-learning is a regression problem, meaning any kind of regression model could work - even a linear regression. However, the most popular model used by deep reinforcement learning researchers is the deep neural network.
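To make this concrete, here is a minimal sketch (not used in the rest of this notebook) of the simplest possible function approximator, a linear model. The observation dimension and action count are made up for illustration; the deep networks used below simply add hidden layers.

# A minimal sketch: a linear Q-function approximator.
# The observation dimension (4) and action count (2) are hypothetical.
state_dim, num_actions = 4, 2
linear_Q = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(state_dim,)),
    tf.keras.layers.Dense(num_actions, activation="linear")
])
states = np.random.randn(8, state_dim)  # a batch of 8 states
q_values = linear_Q(states)             # shape (8, num_actions)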

If you are familiar with supervised learning in deep learning, you may be familiar with techniques like dropout, batch normalization, and activity regularization. So far, these kinds of techniques have not proven especially useful in the context of reinforcement learning. Instead, fully-connected neural networks with a small number of rectified linear hidden units tend to perform best. (Pieter Abbeel notes that even simple linear feedback control can perform well in complex environments; fully-connected networks with ReLU activations act as piecewise linear feedback controllers, which helps explain their success.)

When using neural networks, rather than passing many state-action pairs to the network and predicting a scalar $Q(s_t, a_t)$, we pass only the state $s_t$ and produce a vectorized output $\vec{Q}(s_t)$, where each entry in the vector is the predicted $Q$-value for one of the actions available to the agent. Note that this necessitates that $\mathcal{A}$ is finite (and generally small), a limitation of DQN that we will overcome later in the section on policy gradients.

[Figure: a $Q$-network maps a state $s_t$ to a vector $\vec{Q}(s_t)$ with one entry per action.]
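As a small illustration (with made-up numbers), here is how a greedy action and the scalar $Q(s_t, a_t)$ are recovered from the vectorized output; the Agent class defined below does the same thing with np.argmax and tf.one_hot.

# Hypothetical Q-values for a batch of 2 states in an environment with 3 actions.
q = np.array([[1.0, 3.0, 2.0],
              [0.5, 0.1, 0.4]])
greedy_actions = np.argmax(q, axis=1)       # greedy action per state -> [1, 0]
taken_actions = np.array([2, 0])            # actions that were actually taken
q_sa = q[np.arange(len(q)), taken_actions]  # Q(s_t, a_t) per state -> [2.0, 0.5]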

In this notebook, we make use of tensorflow's keras API to build neural networks. We also take advantage of batched environments to accelerate data collection. The keras API builds neural networks that process inputs in batches. This means that if the observation space of a single environment has shape $84 \times 84 \times 3$, then the network expects inputs of shape $B \times 84 \times 84 \times 3$, where $B$ is the number of inputs in the batch.

When running the tabular $Q$-learning agent in tensorflow in the previous notebook, runtime was considerably slower than for the simple numpy-based agent: the time spent evaluating and updating the policy dominated the time spent simulating steps in the environment. Ideally, the split should be roughly 50% policy evaluation and 50% environment stepping. By using batched environments, we can even this out, and we also collect more data per unit of wall-clock time, which accelerates learning. This differs from most keras tutorials, where a single environment is used and inputs are reshaped to trick keras into treating them like a batch. Increasing the number of environments can also stabilize training by diversifying the data collected at each point in time.
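The 50/50 split is a rule of thumb rather than a hard requirement. If you want to see where your own setup spends its time, a rough measurement along these lines works (this sketch assumes CartPole-v0 and a small throwaway network; it is not used elsewhere in the notebook):

import time

_env = gym.make("CartPole-v0")
_obs = _env.reset()
_net = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=_env.observation_space.shape),
    tf.keras.layers.Dense(24, activation="relu"),
    tf.keras.layers.Dense(_env.action_space.n)
])

# time 100 environment steps
t0 = time.perf_counter()
for _ in range(100):
    _, _, done, _ = _env.step(_env.action_space.sample())
    if done:
        _env.reset()
env_time = time.perf_counter() - t0

# time 100 forward passes on a batch of one observation
t0 = time.perf_counter()
for _ in range(100):
    _net(_obs[None, :])
net_time = time.perf_counter() - t0

print(env_time, net_time)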

Target Networks and Experience Replay

In this notebook, we are going to implement two techniques required for stabilizing training: target networks and experience replay.

Target Networks

Neural networks are continuous (and differentiable) functions of their inputs, meaning that similar inputs produce similar outputs. In most environments, consecutive states are often similar to each other (with small changes occurring as a result of the actions chosen). In deep $Q$-learning, we are solving a regression problem with a moving target: we are trying to predict our own output, and the target involves a maximization over that output:

$$ L(\theta) = \frac{1}{2} \left( r_t + (1-d_t)\gamma \max_{a_{t+1}} \left( Q_\theta(s_{t+1}, a_{t+1}) \right) - Q_\theta(s_t, a_t) \right)^2 $$

Minimizing our prediction error, in general, will tend to increase the value of our prediction for $Q(s_t)$ because of this maximization step. This poses a problem. Consider the following sequence of events:

  1. The agent is in a state $s_t$.
  2. The agent chooses an action $a_t$ that maximizes $Q(s_t)$.
  3. The state transitions from $s_t$ to $s_{t+1}$, giving a reward of $r_t$ and a terminal flag $d_t$.
  4. Using the transition $(s_t, a_t, r_t, s_{t+1}, d_t)$, the agent minimizes $L(\theta)$.

Since $s_{t+1}$ and $s_t$ are temporally close, they tend to be similar. During step 4, we update our prediction for $Q(s_t)$ using a maximization over all possible next values of $Q(s_{t+1})$, which tends to increase the predicted value of $Q(s_t)$. When the agent repeats this cycle, its predictions for $Q(s_{t+1})$ will already be higher, because the values for $Q(s_t)$ are higher and $s_t \approx s_{t+1}$.

To handle this, we introduce a target network with parameters $\theta^-$, which is used when computing the TD-target:

$$ r_t + (1-d_t)\gamma \max_{a_{t+1}} \left( Q_{\theta^-}(s_{t+1}, a_{t+1}) \right) $$

This way, the act of updating $\theta$ to minimize $L(\theta)$ has no effect on our regression targets. To keep the predictions made by the target network $Q_{\theta^-}$ somewhat in line with those of the actual $Q$-network, we synchronize the two sets of parameters every fixed number of timesteps.

Experience Replay

The vanilla $Q$-learning algorithm (and DQN, as described so far) trains only on the most recent transition. While this makes the agent good at predicting recent $Q$-values, it can cause it to perform worse on older or uncommon transitions. To stabilize training, we instead store transitions in a memory and sample random batches from it, so that the agent sees a good mix of experiences at each training step.

In [2]:
class ReplayBuffer:
    def __init__(self, size=1000000):
        # a deque automatically discards the oldest transitions once `size` is exceeded
        self.memory = deque(maxlen=size)

    def remember(self, s_t, a_t, r_t, s_t_next, d_t):
        # store one (vectorized) transition
        self.memory.append((s_t, a_t, r_t, s_t_next, d_t))

    def sample(self, num=32):
        # uniform random sample without replacement, capped at the buffer size
        num = min(num, len(self.memory))
        return random.sample(self.memory, num)
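A quick usage sketch (with hypothetical shapes for 2 environments and 4-dimensional observations): note that each entry stored in the buffer is an entire vectorized transition, so a single sampled element already contains one experience per environment.

buf = ReplayBuffer(size=100)
s = np.zeros((2, 4))              # batch of current states
s_next = np.ones((2, 4))          # batch of next states
a = np.array([0, 1])              # one action per environment
r = np.array([1.0, 1.0])          # one reward per environment
d = np.array([False, True])       # one terminal flag per environment
buf.remember(s, a, r, s_next, d)
for s_t, a_t, r_t, s_t_next, d_t in buf.sample(num=1):
    print(s_t.shape, a_t.shape)   # (2, 4) (2,)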
In [3]:
class Agent:
    def __init__(self, state_shape, num_actions, num_envs, alpha=0.001, gamma=0.95, epsilon_i=1.0, epsilon_f=0.01, n_epsilon=0.1, hidden_sizes = []):
        self.epsilon_i = epsilon_i
        self.epsilon_f = epsilon_f
        self.n_epsilon = n_epsilon
        self.epsilon = epsilon_i
        self.gamma = gamma

        self.num_actions = num_actions
        self.num_envs = num_envs

        # online Q-network
        self.Q = tf.keras.models.Sequential()
        self.Q.add(tf.keras.layers.Input(shape=state_shape))
        for size in hidden_sizes:
            self.Q.add(tf.keras.layers.Dense(size, activation='relu', use_bias=False, kernel_initializer='he_uniform', dtype='float64'))
        self.Q.add(tf.keras.layers.Dense(self.num_actions, activation="linear", use_bias=False, kernel_initializer='zeros', dtype='float64'))
        
        # target network (same architecture, separate parameters)
        self.Q_ = tf.keras.models.Sequential()
        self.Q_.add(tf.keras.layers.Input(shape=state_shape))
        for size in hidden_sizes:
            self.Q_.add(tf.keras.layers.Dense(size, activation='relu', use_bias=False, kernel_initializer='he_uniform', dtype='float64'))
        self.Q_.add(tf.keras.layers.Dense(self.num_actions, activation="linear", use_bias=False, kernel_initializer='zeros', dtype='float64'))
        
        self.optimizer = tf.keras.optimizers.Adam(alpha)  
    
    def synchronize(self):
        self.Q_.set_weights(self.Q.get_weights())

    def act(self, s_t):
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.num_actions, size=self.num_envs)
        return np.argmax(self.Q(s_t), axis=1)
    
    def decay_epsilon(self, n):
        self.epsilon = max(
            self.epsilon_f, 
            self.epsilon_i - (n/self.n_epsilon)*(self.epsilon_i - self.epsilon_f))

    def update(self, s_t, a_t, r_t, s_t_next, d_t):
        with tf.GradientTape() as tape:
            Q_next = tf.stop_gradient(tf.reduce_max(self.Q_(s_t_next), axis=1)) # note we use Q_ 
            Q_pred = tf.reduce_sum(self.Q(s_t)*tf.one_hot(a_t, self.num_actions, dtype=tf.float64), axis=1)
            loss = tf.reduce_mean(0.5*(r_t + (1-d_t)*self.gamma*Q_next - Q_pred)**2)
        grads = tape.gradient(loss, self.Q.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.Q.trainable_variables))
In [4]:
class DiscreteToBoxWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super().__init__(env)
        assert isinstance(env.observation_space, gym.spaces.Discrete), \
            "Should only be used to wrap Discrete envs."
        self.n = self.observation_space.n
        self.observation_space = gym.spaces.Box(0, 1, (self.n,))
    
    def observation(self, obs):
        new_obs = np.zeros(self.n)
        new_obs[obs] = 1
        return new_obs
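CartPole-v0 (used below) already has a Box observation space, so this wrapper is not strictly needed there; it exists so the same agent can be run on Discrete-observation environments. A brief sketch, assuming FrozenLake-v0 is available in your gym version:

# One-hot encodes the integer state of a Discrete-observation environment.
fl_env = DiscreteToBoxWrapper(gym.make("FrozenLake-v0"))
obs = fl_env.reset()
print(fl_env.observation_space.shape, obs.shape)  # (16,) (16,)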
In [5]:
class VectorizedEnvWrapper(gym.Wrapper):
    def __init__(self, make_env, num_envs=1):
        super().__init__(make_env())
        self.num_envs = num_envs
        self.envs = [make_env() for env_index in range(num_envs)]
    
    def reset(self):
        return np.asarray([env.reset() for env in self.envs])
    
    def reset_at(self, env_index):
        return self.envs[env_index].reset()
    
    def step(self, actions):
        next_states, rewards, dones, infos = [], [], [], []
        for env, action in zip(self.envs, actions):
            next_state, reward, done, info = env.step(action)
            next_states.append(next_state)
            rewards.append(reward)
            dones.append(done)
            infos.append(info)
        return np.asarray(next_states), np.asarray(rewards), \
            np.asarray(dones), np.asarray(infos)
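A short sanity check of the wrapper (a sketch, assuming CartPole-v0): reset returns a stacked batch of observations, and step takes one action per environment.

venv = VectorizedEnvWrapper(lambda: gym.make("CartPole-v0"), num_envs=4)
states = venv.reset()                                    # shape (4, 4)
actions = np.random.randint(venv.action_space.n, size=venv.num_envs)
next_states, rewards, dones, infos = venv.step(actions)
print(states.shape, rewards.shape, dones.shape)          # (4, 4) (4,) (4,)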
In [6]:
def plot(data, window=100):
    sns.lineplot(
        data=data.rolling(window=window).mean()[window-1::window]
    )
In [7]:
def train(env_name, T=20000, num_envs=32, batch_size=32, sync_every=100, hidden_sizes=[24, 24], alpha=0.001, gamma=0.95):
    env = VectorizedEnvWrapper(lambda: gym.make(env_name), num_envs)
    state_shape = env.observation_space.shape
    num_actions = env.action_space.n
    agent = Agent(state_shape, num_actions, num_envs, alpha=alpha, hidden_sizes=hidden_sizes, gamma=gamma)
    rewards = []
    buffer = ReplayBuffer()
    episode_rewards = np.zeros(env.num_envs)  # running episode return per environment
    s_t = env.reset()
    for t in range(T):
        if t%sync_every == 0:
            agent.synchronize()
        
        a_t = agent.act(s_t)
        s_t_next, r_t, d_t, info = env.step(a_t)
        buffer.remember(s_t, a_t, r_t, s_t_next, d_t)
        s_t = s_t_next
        # each sampled element is a full vectorized transition (one per environment)
        for batch in buffer.sample(batch_size):
            agent.update(*batch)
        agent.decay_epsilon(t/T)
        episode_rewards += r_t

        for i in range(env.num_envs):
            if d_t[i]:
                rewards.append(episode_rewards[i])
                episode_rewards[i] = 0
                s_t[i] = env.reset_at(i)
            
    plot(pd.DataFrame(rewards), window=10)
    return agent
In [8]:
train("CartPole-v0", T=20000, num_envs=32, batch_size=1)
WARNING: Logging before flag parsing goes to stderr.
W0615 21:46:40.499374 4411987392 deprecation.py:323] From /anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/ops/math_grad.py:1205: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Out[8]:
<__main__.Agent at 0x1a2449d668>
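One way to sanity-check the result is a purely greedy rollout with exploration turned off. A minimal sketch, assuming a single (unvectorized) CartPole-v0 environment and the Agent returned by train above:

def evaluate(agent, env_name="CartPole-v0", episodes=5):
    # Greedy rollouts: bypasses agent.act, which expects a batch of num_envs states.
    env = gym.make(env_name)
    returns = []
    for _ in range(episodes):
        s, done, total = env.reset(), False, 0.0
        while not done:
            a = int(np.argmax(agent.Q(s[None, :]), axis=1)[0])
            s, r, done, _ = env.step(a)
            total += r
        returns.append(total)
    return returns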
