TF 2.0 for Reinforcement Learning


In [1]:
import numpy as np
import seaborn as sns
import pandas as pd
import gym
import tensorflow as tf
sns.set()

Function Approximation in Tensorflow

Function approximation is a technique for learning an unknown function $y$ by constructing an approximation to it, $\hat{y}$. A differentiable function approximator is one whose output is a differentiable function of its parameters. There are many differentiable function approximators; you may have heard of linear regression and logistic regression. Abstractly, we can characterize a function approximator by its set of parameters $\theta$. For example, in a simple quadratic regression of the form $$ \hat{y} = a x^2 + b x + c $$ we would have $\theta = \left[ a, b, c \right]$.
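For instance, with $\theta = \left[2, -1, 4\right]$ (the same values we will use to generate data below), the approximator evaluated at $x = 1$ gives $\hat{y} = 2(1)^2 - 1(1) + 4 = 5$.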


Gradient Descent and Loss Functions

When a function approximator is differentiable, we have an additional tool at our disposal: gradient descent. Gradient descent is a tool for optimizing any differentiable function. The gradient of a function $f$ with respect to some parameters $\theta$ is denoted $\nabla_\theta f$, and is a vector of partial derivatives of $f$ with respect to each variable in $\theta$. For example, $\nabla_\theta \hat{y}$ would be $$ \nabla_\theta \hat{y} = \begin{bmatrix} x^2 \\ x \\ 1 \end{bmatrix} $$
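For instance, at $x = 2$ this gradient evaluates to $\left[4, 2, 1\right]$: a small change in $a$ moves $\hat{y}$ four times as much as the same change in $b$, and the output is always equally sensitive to $c$.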

Imagine some dataset $(X, Y)$ that we are interested in modelling, and we have a suspicion that the relationship between $X$ and $Y$ is quadratic. Then we can use a function like the one described above ($ax^2 + bx + c$) to try to model the dataset.

Given a data point $(x, y)$, we want the difference between our function approximator's output $\hat{y}$ and the true value $y$ to be small. We use the squared error to measure this difference: $\left(y - \hat{y} \right)^2$. Our goal is to minimize the average squared error over all data points.

We define a loss function with respect to the model parameters $\theta$ that we want to minimize. In accordance with our description above, we get: $$ L(\theta) = \frac{1}{N} \sum_{(x, y) \in (X, Y)} \left(\hat{y} - y \right)^2 $$ where $N$ is the size of the dataset, the sum runs over the paired data points, and $\hat{y} = ax^2 + bx + c$.
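In numpy, with prediction and target arrays Y_hat and Y, this loss is simply np.mean((Y_hat - Y)**2); the tensorflow version later in this notebook uses the analogous tf.reduce_mean((Y - Y_hat)**2).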

We can iteratively modify our parameters $\theta$ so as to minimize this loss function. If we take the gradient of the loss function with respect to $\theta$ and subtract it from $\theta$, we get new values of $\theta$ for which $L(\theta)$ is smaller. This gives us gradient descent: $$ \theta \gets \theta - \alpha \nabla_\theta L(\theta) $$ where $\alpha$ is a factor called the learning rate that determines how much we change the parameters with each application of gradient descent. We perform gradient descent for a given number of epochs, until we are satisfied that our function provides a good approximation.

In this case, we can easily analytically compute this: $$ \nabla_\theta L(\theta) = \frac{1}{N} \sum_{(x, y) \in (X, Y)} 2(\hat{y} - y) \nabla_{\theta} \hat{y} $$

where $\nabla_\theta \hat{y}$ is as described above.
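Before implementing the full training loop, we can sanity-check this analytic gradient against a central finite-difference estimate. This is just a sketch on a made-up three-point dataset (the names loss, X_check, Y_check and theta_check are ours, not part of the notebook's code):

def loss(theta, X, Y):
    Y_hat = theta[0]*X**2 + theta[1]*X + theta[2]
    return np.mean((Y_hat - Y)**2)

X_check = np.array([-1.0, 0.0, 1.0])
Y_check = np.array([7.0, 4.0, 5.0])  # exactly 2x^2 - x + 4 at these points
theta_check = np.array([1.0, 1.0, 1.0])

# Analytic gradient, as derived above.
Y_hat = theta_check[0]*X_check**2 + theta_check[1]*X_check + theta_check[2]
nabla_theta_y_hat = np.array([X_check**2, X_check, np.ones_like(X_check)])
analytic = np.mean(2*(Y_hat - Y_check)*nabla_theta_y_hat, axis=1)

# Central finite differences, perturbing one parameter at a time.
eps = 1e-6
numeric = np.array([
    (loss(theta_check + eps*np.eye(3)[i], X_check, Y_check)
     - loss(theta_check - eps*np.eye(3)[i], X_check, Y_check)) / (2*eps)
    for i in range(3)
])

print(analytic, numeric)  # the two estimates should match closely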

In [2]:
def plot(X, Y, Y_hat):
    # Scatter the data (X, Y) and overlay the model's predictions Y_hat.
    data = pd.DataFrame({"X": X, "Y": Y, "Y_hat": Y_hat})
    sns.scatterplot(x="X", y="Y", data=data)
    sns.lineplot(x="X", y="Y_hat", data=data)
In [3]:
N = 50
a, b, c = 2, -1, 4  # true parameters
X = np.linspace(-1, 1, N)
Y = a*X**2 + b*X + c + np.random.randn(N)*0.1  # noisy quadratic samples
In [4]:
theta = np.random.rand(3) # a, b, c estimates
Y_hat = theta[0]*X**2 + theta[1]*X + theta[2]
In [5]:
plot(X, Y, Y_hat)
In [6]:
def learn(X, Y, theta, alpha=1e-1, epochs=100):
    for e in range(epochs):
        Y_hat = theta[0]*(X**2) + theta[1]*X + theta[2]  # current predictions
        nabla_theta_y_hat = np.array([
            X**2,
            X,
            np.ones_like(X)  # avoids relying on the global N
        ])
        # Analytic gradient of the loss, as derived above.
        nabla_theta_L = np.mean(2*(Y_hat - Y)*nabla_theta_y_hat, axis=1)

        # Gradient descent update.
        theta = theta - alpha*nabla_theta_L
    return theta
In [7]:
learned_theta = learn(X, Y, theta)
Y_hat = learned_theta[0]*X**2 + learned_theta[1]*X + learned_theta[2]
In [8]:
plot(X, Y, Y_hat)

Tensorflow and Autodifferentiation

It can be extremely cumbersome to manually compute the derivatives of our functions, and it becomes much harder when our functions are highly composed, combining the results of many intermediate calculations to produce a result. Instead, we can rely on libraries that provide autodifferentiation. There are many approaches to autodifferentiation; tensorflow 2 and pytorch both use the notion of a 'gradient tape', which records operations as they execute and then differentiates through them. Most guides, resources and open-source implementations that you will find today are built with tensorflow 1.x, a python library that allows us to define complex computation graphs with built-in differentiation. This guide uses tensorflow 2.0-alpha instead, since over the next year (2019) tensorflow will migrate to the 2.x API and old code will become obsolete.
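As a minimal sketch of how the tape works, consider differentiating $y = x^2$ at $x = 3$: we record the computation inside a tf.GradientTape context, then ask the tape for the gradient of the output with respect to a tf.Variable.

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x**2                 # recorded by the tape
dy_dx = tape.gradient(y, x)  # 2x = 6.0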

In [9]:
N = 50
a, b, c = 2, -1, 4
X = np.linspace(-1, 1, N)
Y = a*X**2 + b*X + c + np.random.randn(N)*0.1
In [10]:
theta = tf.Variable(tf.random.normal(shape=(3,)), name="theta")
Y_hat = theta[0]*(X**2) + theta[1]*X + theta[2]
In [11]:
plot(X, Y, Y_hat.numpy())
In [12]:
def learn(X, Y, theta, alpha=1e-1, epochs=100):
    optimizer = tf.keras.optimizers.SGD(alpha)  # applies the gradient descent update
    for e in range(epochs):
        with tf.GradientTape() as tape:
            Y_hat = theta[0]*(X**2) + theta[1]*X + theta[2]  # our prediction
            L_theta = tf.reduce_mean((Y - Y_hat)**2)  # same loss as before
        grads = tape.gradient(L_theta, [theta])  # autodifferentiation!
        optimizer.apply_gradients(zip(grads, [theta]))
    return theta
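With plain SGD, apply_gradients performs exactly the update rule we derived earlier, $\theta \gets \theta - \alpha \nabla_\theta L(\theta)$. As a sketch, the last two lines of the loop could equivalently be written without an optimizer:

grads = tape.gradient(L_theta, [theta])
theta.assign_sub(alpha * grads[0])  # theta <- theta - alpha * gradient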
In [13]:
theta = learn(X, Y, theta)
Y_hat = theta[0]*(X**2) + theta[1]*X + theta[2] # our prediction
In [14]:
plot(X, Y, Y_hat.numpy())
