TF 2.0 for Reinforcement Learning


Download the notebook or follow along.

In [1]:
import numpy as np

Markov Processes

A Markov Process is a tuple $ \langle \mathcal{S}, \mathcal{P} \rangle $ where $\mathcal{S}$ is a set of states (called the observation space or state space) that the agent can be in, and $ \mathcal{P} : \mathcal{S}^2 \to \left[ 0, 1 \right]$ is a function describing the probability of transitioning from one state to another: $$ \mathcal{P}(s_t, s_{t+1}) = \mathbb{P} \left[s_{t+1} \vert s_t \right] $$ Markov processes are used to model stochastic sequences of states $s_0, s_1, \dots$ satisfying the Markov property: $$ \mathbb{P} \left[ s_{t+1} \vert s_t \right] = \mathbb{P} \left[ s_{t+1} \vert s_0, s_1, \dots, s_t \right] $$ that is, the probability of transitioning from state $s_t$ to state $s_{t+1}$ depends only on the current state $s_t$ and not on any earlier states.

In [2]:
def generate_state_transition_matrix(num_states):
    P = np.random.rand(num_states, num_states)
    for i in range(num_states): # convert each row to a probability distribution by dividing by the total
        P[i] /= sum(P[i])
    return P
In [3]:
num_states = 4
P = generate_state_transition_matrix(num_states)
print(f'P: \n{P}')
P: 
[[0.43365075 0.43263699 0.09132303 0.04238923]
 [0.19473963 0.39793109 0.13873556 0.26859372]
 [0.17437947 0.33206503 0.29333298 0.20022252]
 [0.08424129 0.22288814 0.38442623 0.30844435]]

Trajectories

A trajectory (denoted $\tau$) for a Markov process is a (potentially infinite) sequence of states $$ \tau = \langle s_0, s_1, \dots, s_T \rangle $$ The probability of generating particular trajectories depends on the underlying dynamics of the process, which is determined by $\mathcal{P}$.
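For a finite trajectory, this probability is just the product of the individual transition probabilities, $\mathbb{P}\left[\tau\right] = \prod_{t=0}^{T-1} \mathcal{P}(s_t, s_{t+1})$. A minimal sketch of that computation (trajectory_probability is an illustrative helper, not defined elsewhere in this lesson):

def trajectory_probability(P, tau):
    # Multiply the transition probabilities along the trajectory:
    # P(tau) = P(s_0, s_1) * P(s_1, s_2) * ... * P(s_{T-1}, s_T)
    prob = 1.0
    for s_t, s_t_next in zip(tau[:-1], tau[1:]):
        prob *= P[s_t, s_t_next]
    return prob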

In [4]:
def generate_trajectory(P, T=100, s_0 = 0):
    num_states = len(P)
    s_t = s_0
    tau = np.zeros(T, dtype=np.int32)
    tau[0] = s_t
    
    for t in range(1, T):
        # Choose the next state using P[s_t] as the state 
        # transition probability distribution.
        s_t = np.random.choice(np.arange(num_states), p=P[s_t])
        tau[t] = s_t
        
    return tau
In [5]:
T = 100
s_0 = 0
tau = generate_trajectory(P, T, s_0)
print(f'tau: \n{tau}')
tau: 
[0 3 3 3 1 3 1 3 3 1 3 3 3 1 3 3 3 1 2 2 3 1 3 3 3 2 3 3 1 1 2 0 0 0 0 0 3
 3 1 1 1 2 2 2 1 3 1 2 0 1 1 3 3 0 1 1 2 1 2 3 2 1 0 1 0 0 0 1 1 2 1 0 0 1
 0 0 1 1 0 0 0 3 1 3 0 1 0 2 2 3 1 3 2 2 1 3 1 1 3 3]

Markov Reward Processes

A Markov Reward Process is an extension of a Markov Process that allows us to associate rewards with states. Formally, it is a tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R} \rangle$ that associates with each state transition $\langle s_t, s_{t+1} \rangle$ some reward $$ \mathcal{R}(s_t, s_{t+1}) = \mathbb{E}\left[r_t \vert s_t, s_{t+1} \right] $$ which is often simplified to $\mathcal{R}(s_t)$, the reward of being in a particular state $s_t$. For the purpose of this lesson, and in essentially all implementations, we make this simplification. The reward $r_t$ can be thought of as a measure of how good a certain state $s_t$ is.

In [6]:
def generate_reward_matrix(num_states, mu=0, sigma=1):
    return mu + np.random.randn(num_states)*sigma
In [7]:
num_states = 4
mu = 0
sigma = 4
P = generate_state_transition_matrix(num_states)
R = generate_reward_matrix(num_states, mu, sigma)
print(f'P: \n{P} \nR: \n{R}')
P: 
[[0.55250389 0.13600516 0.11977055 0.19172041]
 [0.08087023 0.21183225 0.56513127 0.14216626]
 [0.24059129 0.1108827  0.32471353 0.32381248]
 [0.36334941 0.43350296 0.05657077 0.14657686]] 
R: 
[ 3.35937682  0.43814485  8.04725463 -5.48089562]
In [8]:
def generate_trajectory(P, R, T=100, s_0 = 0):
    num_states = len(P)
    s_t = s_0
    tau = np.zeros(T, dtype=np.int32)
    tau[0] = s_t
    rewards = np.zeros(T)
    rewards[0] = R[s_0]
    
    for t in range(1, T):
        s_t = np.random.choice(np.arange(num_states), p=P[s_t])
        tau[t] = s_t
        rewards[t] = R[s_t]
        
    return tau, rewards
In [9]:
tau, rewards = generate_trajectory(P, R, 100, 0)
print(f'tau: \n{tau} \nrewards: \n{rewards}')
tau: 
[0 3 0 0 0 0 0 0 0 0 0 0 1 2 3 1 3 1 2 3 1 1 2 0 1 3 1 2 0 0 0 3 0 0 3 1 0
 1 2 3 0 3 3 0 3 1 1 2 3 0 0 2 1 1 0 0 2 2 3 3 3 1 2 0 0 3 1 2 2 3 0 3 1 2
 2 0 1 2 2 2 0 0 0 0 0 0 0 0 0 0 3 2 2 2 0 0 0 0 3 1] 
rewards: 
[ 3.35937682 -5.48089562  3.35937682  3.35937682  3.35937682  3.35937682
  3.35937682  3.35937682  3.35937682  3.35937682  3.35937682  3.35937682
  0.43814485  8.04725463 -5.48089562  0.43814485 -5.48089562  0.43814485
  8.04725463 -5.48089562  0.43814485  0.43814485  8.04725463  3.35937682
  0.43814485 -5.48089562  0.43814485  8.04725463  3.35937682  3.35937682
  3.35937682 -5.48089562  3.35937682  3.35937682 -5.48089562  0.43814485
  3.35937682  0.43814485  8.04725463 -5.48089562  3.35937682 -5.48089562
 -5.48089562  3.35937682 -5.48089562  0.43814485  0.43814485  8.04725463
 -5.48089562  3.35937682  3.35937682  8.04725463  0.43814485  0.43814485
  3.35937682  3.35937682  8.04725463  8.04725463 -5.48089562 -5.48089562
 -5.48089562  0.43814485  8.04725463  3.35937682  3.35937682 -5.48089562
  0.43814485  8.04725463  8.04725463 -5.48089562  3.35937682 -5.48089562
  0.43814485  8.04725463  8.04725463  3.35937682  0.43814485  8.04725463
  8.04725463  8.04725463  3.35937682  3.35937682  3.35937682  3.35937682
  3.35937682  3.35937682  3.35937682  3.35937682  3.35937682  3.35937682
 -5.48089562  8.04725463  8.04725463  8.04725463  3.35937682  3.35937682
  3.35937682  3.35937682 -5.48089562  0.43814485]

Return and Discounted Return

If we think of an agent as living in a Markov Reward Process, and we think of the reward $r_t$ as measuring the 'goodness' of certain states $s_t$, then we are interested in trajectories that maximize the total reward collected along the way (think of it as planning a trip that visits the best tourist spots). We call the cumulative reward over a trajectory the return $R_t$: $$ \begin{align} R_t &= r_t + r_{t+1} + r_{t+2}+ \dots + r_T \\ &= \sum_{k=t}^T r_k \end{align} $$

When $T$ is finite, we say that the trajectory has a finite time horizon and that the environment is episodic (happens in episodes).

In general, we say that Markov Reward Processes have an infinite time horizon (i.e., $T = \infty$). Episodic environments are just a special case in which $s_{t+1} = s_t$ (and $r_t = 0$) for all $t > T$.

If we want to find trajectories that maximize $R_t$ over an infinite time horizon, the math can get a little hairy without a guarantee that $R_t$ converges. As a result, we discount rewards exponentially over time, which (for bounded rewards) guarantees convergence. This line of reasoning leads us to the discounted return $G_t$:

$$ \begin{align} G_t &= r_{t} + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \\ &= \sum_{k=t}^\infty \gamma^{k-t} r_k \end{align} $$

where $\gamma$ is a discount factor between $0$ and $1$ (often close to $1$).

We sometimes refer to both the discounted and undiscounted return as just "return" for brevity, and write $G_t$ even where, for episodic environments, it might be more natural to write $R_t$. In fact, it is not hard to see that $R_t$ is just $G_t$ with $r_t = 0$ for $t > T$ and $\gamma = 1$.
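For a finite rewards array like the ones generated above, we can compute $G_t$ at every time step by working backwards from the end, using $G_T = r_T$ and $G_t = r_t + \gamma G_{t+1}$. A minimal sketch (discounted_return is an illustrative helper, not defined elsewhere in this lesson; set gamma=1.0 to recover the undiscounted return $R_t$):

def discounted_return(rewards, gamma=0.99):
    # Work backwards: G_T = r_T, then G_t = r_t + gamma * G_{t+1}.
    G = np.zeros(len(rewards))
    G[-1] = rewards[-1]
    for t in reversed(range(len(rewards) - 1)):
        G[t] = rewards[t] + gamma * G[t + 1]
    return G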


Value Function

We can use the expected value of $G_t$ to determine the value of being in a certain state $s_t$:

\begin{align} V(s_t) &= \mathbb{E}\left[ G_t \big\vert s_t \right] \\ &= \mathbb{E} \left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \big\vert s_t \right] \end{align}

Essentially, the value answers the question: given a Markov Reward Process with state transition dynamics $\mathcal{P}$, what should we expect $G_t$ to be from this state onwards?

We can decompose $V(s_t)$ into two parts: the immediate reward $r_t$ and the discounted value of being in the next state $s_{t+1}$:

\begin{align} \begin{split} V(s_t) &= \mathbb{E} \left[ G_t \big\vert s_t \right] \\ &= \mathbb{E} \left[ r_{t} + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \big\vert s_t \right] \\ &= \mathbb{E}\left[r_{t} + \gamma (r_{t+1} + \gamma r_{t+2} + \dots) \big\vert s_t \right] \\ &= \mathbb{E}\left[ r_{t} + \gamma G_{t+1} \big\vert s_t \right] \\ &= \mathbb{E} \left[ r_{t} + \gamma V(s_{t+1}) \big\vert s_t \right] \\ \end{split} \end{align}

This last form of $V(s_t)$ is known as the Bellman Equation.
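One brute-force way to estimate $V(s_t)$, given that we can sample trajectories, is to generate many of them starting from $s_t$ and average their discounted returns. A rough sketch using the generate_trajectory function defined above (monte_carlo_value and its parameters are illustrative, not defined elsewhere in this lesson):

def monte_carlo_value(P, R, s_0=0, gamma=0.99, num_episodes=500, T=1000):
    # Average the discounted return over many sampled trajectories starting
    # at s_0; the trajectories are truncated at T, but gamma**T is negligible.
    discounts = gamma ** np.arange(T)
    returns = np.zeros(num_episodes)
    for i in range(num_episodes):
        _, rewards = generate_trajectory(P, R, T, s_0)
        returns[i] = np.sum(discounts * rewards)
    return returns.mean()

This works, but it needs many full trajectories per state. The TD(0) algorithm below instead learns an estimate for every state from a single stream of experience.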


TD(0)

Say we are interested in learning to predict $V(s_t)$. We can use a simple method called TD(0) to approximate $V(s_t)$ when our observation space $\mathcal{S}$ is discrete (i.e., finite). We begin by initializing a vector $V$ with zeros, which will serve as our initial predictions for $V(s_t)$. Technically speaking, $V$ can be seeded with arbitrary values and is still guaranteed to converge to the right predictions given enough data (and a suitable learning rate); all zeros is simply a neutral initial guess.

According to the Bellman Equation, we can use

$$ r_t + \gamma V(s_{t+1}) $$

(which we call the TD-target) as an estimator for $V(s_t)$; it would be an unbiased one if $V(s_{t+1})$ were the true value. This is because the Bellman Equation gives us:

$$ V(s_t) = \mathbb{E} \left[ r_{t} + \gamma V(s_{t+1}) \big\vert s_t \right] $$

Then we can modify our estimate of $V(s_t)$ to more closely match $r_t + \gamma V(s_{t+1})$ according to a learning rate $\alpha$ as follows:

$$ V(s_t) \gets V(s_t) + \alpha \left( r_t + \gamma V(s_{t+1}) - V(s_t) \right) $$

where $r_t + \gamma V(s_{t+1}) - V(s_t)$ (the TD-error) is the difference between our current prediction $V(s_t)$ and the slightly better estimate $r_t + \gamma V(s_{t+1})$. By adding this difference to our current prediction, scaled by the learning rate $\alpha$, our estimate $V(s_t)$ slowly converges to the correct value.

Let's use the TD(0) algorithm to attempt to learn the value function of this Markov reward process.

In [10]:
def TD_0(P, R, T=100, s_0=0, gamma=0.99, alpha=0.1):
    num_states = len(P)
    V = np.zeros(num_states)
    s_t = s_0
    r_t = R[s_t]
    
    for t in range(1, T):
        # Sample the next state, then nudge V[s_t] towards the
        # TD-target r_t + gamma*V[s_t_next].
        s_t_next = np.random.choice(np.arange(num_states), p=P[s_t])
        V[s_t] = V[s_t] + alpha*(r_t + gamma*V[s_t_next] - V[s_t])
        s_t = s_t_next
        r_t = R[s_t]
    return V
In [11]:
V = TD_0(P, R, 10000, 0, 0.99, 0.1)
print(f'V: \n{V}')
V: 
[189.26912044 188.4107677  194.03474576 181.27714373]
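As a sanity check, for a small MRP with known $\mathcal{P}$ and $\mathcal{R}$ the Bellman Equation is just a linear system, $V = \mathcal{R} + \gamma \mathcal{P} V$, which we can solve directly and compare with the TD(0) estimates above (this cell is illustrative and its output is not shown; the two sets of values should be in the same ballpark):

# Solve (I - gamma * P) V = R exactly for the value function.
gamma = 0.99
V_exact = np.linalg.solve(np.eye(num_states) - gamma * P, R)
print(f'V_exact: \n{V_exact}')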

Markov Decision Processes

A Markov Decision Process (MDP) is an extension of a Markov Reward Process that allows state transitions to be conditional upon some action. Formally, it is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle $ where $\mathcal{A}$ is the set of actions available to the agent in a state $s_t$. The reward is now conditional upon both the state $s_t$ we are in and the action $a_t$ we take:

$$ \mathcal{R}(s_t, a_t) = \mathbb{E} \left[ r_t \vert s_t, a_t \right] $$

Also, $\mathcal{P}$ is the probability of transitioning to state $s_{t+1}$ given that the current state is $s_t$ and the current action is $a_t$:

$$ \mathcal{P}(s_t, a_t, s_{t+1}) = \mathbb{P} \left[ s_{t+1} \vert s_t, a_t \right] $$

Whereas in an MRP the probability of generating trajectories depends only on the dynamics of the underlying Markov process, in an MDP trajectories also depend on the actions of the agent.

Below is an example of an MDP with four states and two actions.

[figure: MDP diagram (missing)]

In this case, $\mathcal{P}$ is a tensor of shape $4 \times 2 \times 4$ where $\mathcal{P}(i, j, k)$ is the probability of transitioning from state $i$ to state $k$ given action $j$. Furthermore, $\mathcal{R}$ is a matrix of shape $4 \times 2$ where $\mathcal{R}(i, j)$ is the reward associated with choosing action $j$ in state $i$.


Policies

Our goal is to define a policy $\pi$, which can be thought of as a set of rules for choosing actions based on the state. Typically, we represent $\pi$ as a probability distribution: $$ \pi: \mathcal{S} \times \mathcal{A} \to \left[ 0, 1 \right] $$

where we sample $a_t$ from the distribution $$ a_t \sim \pi(\cdot \vert s_t) $$ For discrete action spaces and observation spaces, we can think of this as a table of size $\lvert \mathcal{S} \rvert \times \lvert \mathcal{A} \rvert$ where $\pi_{i,j}$ is the probability of choosing action $j$ in state $i$.

In [12]:
def generate_state_transition_matrix(num_states, num_actions):
    P = np.random.rand(num_states, num_actions, num_states)
    for i in range(num_states):
        for j in range(num_actions):
            P[i, j] /= sum(P[i, j])
    return P
In [13]:
def generate_reward_matrix(num_states, num_actions, mu=0, sigma=1):
    return mu + np.random.randn(num_states, num_actions)*sigma
In [14]:
def generate_policy(num_states, num_actions):
    pi = np.random.rand(num_states, num_actions)
    for i in range(num_states):
        pi[i] /= sum(pi[i])
    return pi
In [15]:
num_states = 4
num_actions = 2 

P = generate_state_transition_matrix(num_states, num_actions)
R = generate_reward_matrix(num_states, num_actions)
pi = generate_policy(num_states, num_actions)
print(f'P: \n{P} \nR: \n{R} \npi: \n{pi}')
P: 
[[[0.17042536 0.34201973 0.10786943 0.37968548]
  [0.16186722 0.38207381 0.16551557 0.29054341]]

 [[0.09679337 0.42781783 0.11451198 0.36087682]
  [0.22995667 0.34899076 0.35566587 0.0653867 ]]

 [[0.4094171  0.07837153 0.28083541 0.23137597]
  [0.14460938 0.19997512 0.29981354 0.35560195]]

 [[0.07871477 0.24551736 0.38380149 0.29196638]
  [0.13551819 0.47594282 0.19870079 0.18983819]]] 
R: 
[[ 0.06807389  1.06863631]
 [-0.1527366  -0.64913176]
 [ 1.10558323  0.02134114]
 [ 2.00332665 -0.63635805]] 
pi: 
[[0.83715368 0.16284632]
 [0.8570525  0.1429475 ]
 [0.18023624 0.81976376]
 [0.05224557 0.94775443]]
In [16]:
def generate_trajectory(P, R, pi, T=100, s_0 = 0):
    num_states = len(P)
    num_actions = len(P[0])
    
    s_t = s_0
    tau = np.zeros(T, dtype=np.int32)
    tau[0] = s_t
    
    # Choose the initial action by sampling from the policy pi[s_t].
    a_t = np.random.choice(np.arange(num_actions), p=pi[s_t])
    actions = np.zeros(T, dtype=np.int32)
    actions[0] = a_t 
    
    rewards = np.zeros(T)
    rewards[0] = R[s_t, a_t]
    
    for t in range(1, T):
        # Transition according to P[s_t, a_t], then sample the next action from pi[s_t].
        s_t = np.random.choice(np.arange(num_states), p=P[s_t, a_t])
        a_t = np.random.choice(np.arange(num_actions), p=pi[s_t])
        tau[t] = s_t
        actions[t] = a_t
        rewards[t] = R[s_t, a_t]

    return tau, actions, rewards
In [17]:
tau, actions, rewards = generate_trajectory(P, R, pi)
print(f'tau: \n{tau} \nactions: \n{actions} \nrewards: \n{rewards}')
tau: 
[0 0 2 0 2 1 3 1 1 3 1 1 3 0 3 1 3 0 1 1 3 1 3 1 3 2 3 2 2 1 1 2 2 3 3 3 2
 0 1 0 0 3 1 1 3 1 3 1 0 1 1 3 1 3 1 3 3 1 2 3 2 3 1 1 0 0 3 2 3 3 1 3 1 1
 3 2 2 3 1 3 1 1 2 3 1 1 1 3 3 3 0 0 1 1 2 2 3 2 3 1] 
actions: 
[0 0 0 0 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 0 1 0 1 0 1 1 1 1 1 0 0 1 1 1 1 1 1
 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 0 0 1 1 0 1 1 1 1 0 1 0 0 0 1 1 1 0 1 0 0
 1 1 1 1 0 1 1 0 1 1 0 0 0 1 1 1 1 1 0 0 0 1 1 1 1 0] 
rewards: 
[ 0.06807389  0.06807389  1.10558323  0.06807389  0.02134114 -0.1527366
 -0.63635805 -0.1527366  -0.1527366  -0.63635805 -0.1527366  -0.1527366
 -0.63635805  0.06807389 -0.63635805 -0.1527366  -0.63635805  0.06807389
 -0.1527366  -0.1527366  -0.63635805 -0.1527366  -0.63635805 -0.1527366
 -0.63635805  0.02134114 -0.63635805  0.02134114  0.02134114 -0.1527366
 -0.1527366   0.02134114  0.02134114 -0.63635805 -0.63635805 -0.63635805
  0.02134114  0.06807389 -0.1527366   0.06807389  0.06807389 -0.63635805
 -0.64913176 -0.1527366  -0.63635805 -0.1527366  -0.63635805 -0.64913176
  0.06807389 -0.1527366  -0.1527366  -0.63635805 -0.1527366   2.00332665
 -0.1527366  -0.63635805 -0.63635805 -0.1527366   0.02134114 -0.63635805
  0.02134114 -0.63635805 -0.1527366  -0.64913176  0.06807389  0.06807389
  2.00332665  0.02134114 -0.63635805 -0.63635805 -0.1527366  -0.63635805
 -0.1527366  -0.1527366  -0.63635805  0.02134114  0.02134114 -0.63635805
 -0.1527366  -0.63635805 -0.64913176 -0.1527366   0.02134114 -0.63635805
 -0.1527366  -0.1527366  -0.1527366  -0.63635805 -0.63635805 -0.63635805
  1.06863631  1.06863631 -0.1527366  -0.1527366   1.10558323  0.02134114
 -0.63635805  0.02134114 -0.63635805 -0.1527366 ]
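The same TD(0) update can also be used to evaluate a fixed policy $\pi$ in this MDP: sample actions from $\pi$, collect rewards conditioned on the chosen action, and bootstrap exactly as before. A minimal sketch (TD_0_policy_evaluation, its reward convention $r_t = \mathcal{R}(s_t, a_t)$, and its hyperparameters are illustrative, not defined elsewhere in this lesson):

def TD_0_policy_evaluation(P, R, pi, T=10000, s_0=0, gamma=0.99, alpha=0.1):
    num_states, num_actions = R.shape
    V = np.zeros(num_states)
    s_t = s_0
    for t in range(T):
        # Sample an action from the policy, observe the reward for (s_t, a_t),
        # transition according to P[s_t, a_t], then apply the TD(0) update.
        a_t = np.random.choice(np.arange(num_actions), p=pi[s_t])
        r_t = R[s_t, a_t]
        s_t_next = np.random.choice(np.arange(num_states), p=P[s_t, a_t])
        V[s_t] = V[s_t] + alpha*(r_t + gamma*V[s_t_next] - V[s_t])
        s_t = s_t_next
    return V

V_pi = TD_0_policy_evaluation(P, R, pi)
print(f'V_pi: \n{V_pi}')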

Summary

  1. Markov Decision Processes are mathematical models of the world that capture how the world changes based on underlying state dynamics and the decisions of agents. They also measure the goodness of states by associating a reward with each state (or state-action pair).
  2. We can use the TD(0) algorithm to learn the value function $V$ for discrete state spaces.
