Reinforcement Learning 101

Reinforcement Learning (RL) is a framework and collection of approaches for creating agents that can interact with environments to achieve goals. Some natural questions to ask are: what exactly is an environment, what is an agent, and what does it mean to achieve a goal? We will answer all these questions, starting with defining what an environment is.

Environment is modeled as a Markov Decision Process (MDP)

We model an environment as a Markov Decision Process (MDP), which is defined as a 5-tuple $MDP = \{S,A,T,R,\gamma\}$, where $S$ is the set of states, $A$ is the set of actions, $T(s'|s,a)$ is the transition dynamics (the probability of the next state given the current state and action), $R(r|s,a)$ is the reward function (the distribution of rewards given the current state and action), and $\gamma \in [0,1]$ is the discount factor. To understand why an MDP is a useful model of an environment, let's define what an agent is (the entity that interacts with the environment).
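To make this concrete, here is a minimal sketch of how the 5-tuple could be stored for a tabular MDP. The two-state, two-action MDP and all of its numbers are made up for illustration; they do not come from the text, and expected rewards stand in for a full reward distribution.

```python
import numpy as np

# A tiny hypothetical MDP with 2 states and 2 actions, purely for illustration.
n_states, n_actions = 2, 2

# T[s, a, s'] = probability of landing in state s' after taking action a in state s.
T = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions out of state 0
    [[0.5, 0.5], [0.1, 0.9]],   # transitions out of state 1
])

# R[s, a] = expected reward for taking action a in state s
# (a simplification: we store expected rewards instead of a full distribution R(r|s,a)).
R = np.array([
    [0.0, 1.0],
    [0.5, 2.0],
])

gamma = 0.99  # discount factor

# Sanity check: every T[s, a, :] is a valid probability distribution over next states.
assert np.allclose(T.sum(axis=-1), 1.0)
```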

An agent is modeled as a policy

A policy $\pi(a|s) = Pr(a|s): S \times A \rightarrow \mathbb{R}$ is the (conditional) probability of taking action $a$ from state $s$. Intuitively, an agent's behavior is determined by how it acts in different situations. A policy $\pi(a|s)$ captures exactly this, because we can sample actions from it given a state: $a \sim \pi(*|s)$.
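As a sketch, a stochastic tabular policy can be stored as a matrix of probabilities $\pi(a|s)$ and sampled from with NumPy; the policy values below are hypothetical.

```python
import numpy as np

# pi[s, a] = probability of taking action a in state s (each row sums to 1).
# The numbers are made up, just to have something to sample from.
pi = np.array([
    [0.7, 0.3],   # behavior in state 0
    [0.4, 0.6],   # behavior in state 1
])

def sample_action(pi, s, rng=None):
    """Sample a ~ pi(*|s) for a tabular policy stored as a probability matrix."""
    rng = rng or np.random.default_rng()
    return rng.choice(pi.shape[1], p=pi[s])

a = sample_action(pi, s=0)   # an action drawn from pi(*|s=0)
```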

Policies generate trajectories by interacting with environment

When an agent (policy) interacts with an environment (MDP) for $T$ timesteps, it generates a trajectory $\tau_T$. More formally, we define a trajectory as a sequence of $T$ steps, where each step contains a state, action, and reward: $$ \tau_{T} = \{s_0,a_0,r_0,s_1,a_1,r_1,...,s_{T-1},a_{T-1},r_{T-1}, s_{T}\} $$ Note that our definition of a trajectory $\tau_{T}$ has $T+1$ states in it, but only $T$ actions and rewards. Here is why: we can think of the environment at time $t=0$ as being in some start state $s_{0}$. The agent then samples an action $a_{0} \sim \pi(*|s_0)$, and the environment produces a reward $r_0$ and next state $s_1$; we call this a step. Each step generates an action, a reward, and a (next) state, so if we take $T$ steps starting from an initial state, we get $T+1$ states and $T$ actions and rewards.
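Here is a minimal rollout sketch that produces exactly such a trajectory, assuming the hypothetical tabular `T`, `R`, and `pi` arrays from the sketches above (with expected rewards standing in for a full reward distribution):

```python
import numpy as np

def rollout(T, R, pi, s0, n_steps, rng=None):
    """Run policy pi in the tabular MDP (T, R) for n_steps steps from state s0,
    returning a trajectory with n_steps + 1 states and n_steps actions/rewards."""
    rng = rng or np.random.default_rng()
    states, actions, rewards = [s0], [], []
    s = s0
    for _ in range(n_steps):
        a = rng.choice(pi.shape[1], p=pi[s])         # a_t ~ pi(*|s_t)
        r = R[s, a]                                  # r_t (expected reward here)
        s_next = rng.choice(T.shape[2], p=T[s, a])   # s_{t+1} ~ T(*|s_t, a_t)
        actions.append(a)
        rewards.append(r)
        states.append(s_next)
        s = s_next
    return states, actions, rewards

# e.g. states, actions, rewards = rollout(T, R, pi, s0=0, n_steps=10)
```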

Trajectories have a cumulative discounted reward

To formalize the idea that some trajectories are "better", we want to be able to associate a scalar number with each trajectory, where "better" trajectories are assigned higher numbers. The best trajectory has the biggest number! We formalize this by defining a function $G: \mathbb{T} \rightarrow \mathbb{R}$ that maps a trajectory $\tau_{T} \in \mathbb{T}$ to a single number representing its cumulative discounted rewards: $$ G(\tau_T) = \sum_{t=0}^{T-1} \gamma^{t} \cdot r_{t} $$ Cumulative discounted rewards $G(\tau)$ are crucial for defining what we mean by goals. Although trajectories $\tau$ with high $G(\tau)$ are "good", we also have to account for how likely each trajectory is.
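A one-line helper makes the definition of $G(\tau)$ concrete (a sketch; the example rewards are made up):

```python
def discounted_return(rewards, gamma):
    """G(tau) = sum over t of gamma^t * r_t, for rewards [r_0, ..., r_{T-1}]."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Example (made-up rewards): 1.0 + 0.9 * 0.0 + 0.9**2 * 2.0 = 2.62
assert abs(discounted_return([1.0, 0.0, 2.0], gamma=0.9) - 2.62) < 1e-9
```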

Probability of trajectory depends on MDP and policy

We can formalize our aforementioned intuition for how an agent interacts with an environment by defining the probability of a trajectory in a highly factorized way. We won't go into details here (it is best described using graphical models), but our intuition is equivalent to making a bunch of conditional independence assumptions that are implied by our definitions of the transition dynamics, reward function, and policy all being functions of only the current state (and action). This is what the word Markov in MDP refers to.

For a specific environment $MDP$ and policy $\pi$, we can define the probability of a trajectory $\tau_{T}$ as: $$ Pr_{\pi,T,R}(\tau_{T}) = Pr(s_0) \cdot \prod_{t=1}^{T} \pi(a_{t-1} | s_{t-1}) \cdot R(r_{t-1}|s_{t-1},a_{t-1}) \cdot T(s_t | s_{t-1}, a_{t-1}) $$ Now that we have defined the (cumulative discounted) value of a trajectory, as well as the probability of a trajectory, we can define a goal via the reward hypothesis.
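As a sketch, this factorization can be evaluated directly for a tabular MDP. Here we assume deterministic rewards (so the $R(r_{t-1}|s_{t-1},a_{t-1})$ factor equals 1 and drops out) and work in log space to avoid numerical underflow; `p0` is a hypothetical start-state distribution.

```python
import numpy as np

def trajectory_log_prob(T, pi, p0, states, actions):
    """Log-probability of a trajectory's states and actions under policy pi,
    transition dynamics T, and start-state distribution p0.  Rewards are
    assumed deterministic, so the R(r|s,a) factor is 1 and drops out."""
    logp = np.log(p0[states[0]])                                   # log Pr(s_0)
    for t in range(len(actions)):
        logp += np.log(pi[states[t], actions[t]])                  # log pi(a_t | s_t)
        logp += np.log(T[states[t], actions[t], states[t + 1]])    # log T(s_{t+1} | s_t, a_t)
    return logp
```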

Goals are maximizing Cumulative Discounted Expected (CDE) rewards

To quote Rich Sutton, the reward hypothesis is defined as:

"That all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward)."

We can formalize this statement as the following optimization problem: find the policy $\pi$ from a space of policies $\Pi$ that maximizes the Cumulative Discounted Expected rewards (CDE): \begin{aligned} \pi^{*} &= \arg \max_{\pi \in \Pi} \mathbb{E}_{\tau \sim Pr(*)}[G(\tau)] \end{aligned} In order to find optimal policies, it helps to understand the CDE rewards of a policy from different states. We formalize this with the value function $V^{\pi}(s)$.
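One way to make the objective concrete is a Monte Carlo estimate of $\mathbb{E}_{\tau \sim Pr(*)}[G(\tau)]$: sample many trajectories and average their discounted returns. This is a sketch assuming the hypothetical tabular `T`, `R`, `pi`, and start-state distribution `p0` from the earlier sketches, again with expected rewards.

```python
import numpy as np

def estimate_cde_rewards(T, R, pi, p0, gamma, n_episodes=1000, n_steps=50, rng=None):
    """Monte Carlo estimate of E_{tau ~ Pr(*)}[G(tau)] for a policy pi:
    sample start states from p0, roll pi out for n_steps, and average the
    discounted returns of the sampled trajectories."""
    rng = rng or np.random.default_rng()
    returns = []
    for _ in range(n_episodes):
        s = rng.choice(len(p0), p=p0)                 # s_0 ~ Pr(s_0)
        g, discount = 0.0, 1.0
        for _ in range(n_steps):
            a = rng.choice(pi.shape[1], p=pi[s])      # a_t ~ pi(*|s_t)
            g += discount * R[s, a]                   # accumulate gamma^t * r_t
            discount *= gamma
            s = rng.choice(T.shape[2], p=T[s, a])     # s_{t+1} ~ T(*|s_t, a_t)
        returns.append(g)
    return float(np.mean(returns))
```

Comparing this estimate across candidate policies is a (very naive) way to search for $\pi^{*}$.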

Value functions represent CDE rewards from a state

The value function $V^{\pi}(s)$ is defined as the CDE rewards conditioned on running the policy $\pi$ from a given state $s$: $$ V^{\pi}(s) = \mathbb{E}_{\tau \sim Pr(*)}[G(\tau) | s_0 = s] $$ Given a policy $\pi$ and an MDP, estimating $V^{\pi}(s)$ is called policy evaluation. One way to understand why the value function is useful (i.e., why evaluating a policy is useful) is to rewrite our objective from above with it: \begin{aligned} \pi^{*} &= \arg \max_{\pi \in \Pi} \mathbb{E}_{\tau \sim Pr(*)}[G(\tau)] \\ &= \arg \max_{\pi \in \Pi} \mathbb{E}_{s \sim d(*)}[V^{\pi}(s)] \end{aligned} where $d(*)$ is the distribution over starting states $s_0$. Notice how in the first line, the expectation is over a distribution of trajectories, while in the second line, the expectation is over a distribution of (starting) states. This means that the cumulative discounted expected rewards of a policy (over a distribution of trajectories) is equal to the expected value function of a policy (over the distribution of starting states).
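When the MDP is fully known, one standard way to do policy evaluation (not covered in the text above) is iterative policy evaluation with the Bellman expectation backup. Here is a minimal infinite-horizon sketch, assuming the hypothetical tabular `T`, `R`, and `pi` arrays from the earlier sketches:

```python
import numpy as np

def policy_evaluation(T, R, pi, gamma, tol=1e-8):
    """Iterative policy evaluation for a tabular MDP with expected rewards R[s, a]:
    repeatedly apply the Bellman expectation backup
        V(s) <- sum_a pi(a|s) * ( R[s, a] + gamma * sum_s' T[s, a, s'] * V(s') )
    until V stops changing; this converges to V^pi in the infinite-horizon case."""
    V = np.zeros(T.shape[0])
    while True:
        q = R + gamma * (T @ V)          # q[s, a] = R[s, a] + gamma * E_{s'}[V(s')]
        V_new = (pi * q).sum(axis=1)     # average over actions with pi(a|s)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```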