Reinforcement Learning 101
Reinforcement Learning (RL) is a framework and collection of approaches for creating agents that can interact with environments to achieve goals. Some natural questions to ask are:
- What is an environment?
- What is an agent?
- What does it mean for an agent to interact with an environment?
- What is a goal?
- In what ways can we make an agent achieve a goal by interacting with an environment?
An environment is modeled as a Markov Decision Process (MDP)
We model an environment as a Markov Decision Process (MDP), which is defined as a 5-tuple $MDP = \{S,A,T,R,\gamma\}$ (a small numerical sketch follows this list), where
- $S$ is the set of states
- $A$ is the set of actions
- $T(s^{'}|s,a) = Pr(s^{'} | s, a) : S \times A \times S \rightarrow [0,1]$ is the transition dynamics, and defines the (conditional) probability of transitioning to a state $s^{'} \in S$ after executing action $a \in A$ from state $s \in S$.
- $R(r | s, a) = Pr(r | s, a) : S \times A \times \mathbb{R} \rightarrow [0,1]$ is the reward function, and defines the (conditional) probability of receiving a scalar reward $r$ for taking action $a$ from state $s$.
- When the reward is a deterministic function of the state and action, you may instead see the reward function written not as a conditional probability but as a function $R(s,a) : S \times A \rightarrow \mathbb{R}$ that maps states and actions directly to a scalar. Our probabilistic formulation generalizes this setting and makes certain future definitions clearer, as will be seen shortly.
- $\gamma \in [0,1]$ is the discount factor; it determines how much less valuable rewards become when they are received further in the future (this will be explained in more detail shortly).
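To make the 5-tuple concrete, here is a minimal Python sketch of a hypothetical 2-state, 2-action MDP represented as plain NumPy arrays. The array names `T`, `R`, and `gamma`, the specific numbers, and the choice of a deterministic reward are all illustrative assumptions, not part of any standard library.

```python
import numpy as np

# Toy MDP with |S| = 2 states and |A| = 2 actions (all numbers are made up).
n_states, n_actions = 2, 2

# T[s, a, s'] = Pr(s' | s, a); each (s, a) slice is a distribution over next states.
T = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # transitions from state 0 under actions 0 and 1
    [[0.5, 0.5], [0.1, 0.9]],  # transitions from state 1 under actions 0 and 1
])

# Deterministic reward R(s, a): the special case mentioned in the bullet above.
R = np.array([
    [0.0, 1.0],  # rewards for (s=0, a=0) and (s=0, a=1)
    [0.5, 2.0],  # rewards for (s=1, a=0) and (s=1, a=1)
])

gamma = 0.99  # discount factor in [0, 1]

# Sanity check: every T[s, a] row must be a valid probability distribution.
assert np.allclose(T.sum(axis=-1), 1.0)
```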
An agent is modeled as a policy
A policy $\pi(a|s) = Pr(a|s): S \times A \rightarrow [0,1]$ is the (conditional) probability of taking an action $a$ from a state $s$ (a small sketch follows below).
- Defining the policy to be deterministic (i.e., the agent's behavior isn't stochastic as a function of state) is just a special case of our probabilistic definition.
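Continuing the toy example, a tabular stochastic policy can be sketched as a row-stochastic matrix `pi[s, a] = Pr(a | s)`. The matrix values and the helper name `sample_action` are illustrative assumptions.

```python
import numpy as np

# pi[s, a] = Pr(a | s): each row is a distribution over actions for one state.
pi = np.array([
    [0.7, 0.3],  # action probabilities in state 0
    [0.4, 0.6],  # action probabilities in state 1
])
assert np.allclose(pi.sum(axis=-1), 1.0)

def sample_action(pi, s, rng):
    """Sample an action a ~ pi(a | s)."""
    return rng.choice(pi.shape[1], p=pi[s])

# A deterministic policy is the special case where every row of pi is one-hot.
```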
Policies generate trajectories by interacting with the environment
When an agent (policy) interacts with an environment (MDP) for $T$ timesteps, it generates a trajectory $\tau_T$. More formally, we define a trajectory as a sequence of $T$ steps, where each step contains a state, action, and reward: $$ \tau_{T} = \{s_0,a_0,r_0,s_1,a_1,r_1,...,s_{T-1},a_{T-1},r_{T-1}, s_{T}\} $$
- This definition of a trajectory (where it ends with a state) is somewhat arbitrary; it is just consistent with our following definitions. A rollout sketch follows below.
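Here is a rough sketch of how a rollout might be generated, assuming the toy arrays `T`, `R`, and `pi` from the sketches above plus a made-up starting-state distribution `d`. The function name `rollout` and the trajectory representation (a list of `(s, a, r)` tuples ending with a final `(s_T,)`) are illustrative choices, not a standard API.

```python
import numpy as np

def rollout(pi, T, R, d, horizon, rng):
    """Generate {s_0, a_0, r_0, ..., s_{T-1}, a_{T-1}, r_{T-1}, s_T} for T = horizon steps."""
    s = rng.choice(len(d), p=d)  # s_0 ~ d(s), the starting state distribution
    traj = []
    for t in range(horizon):
        a = rng.choice(pi.shape[1], p=pi[s])        # a_t ~ pi(a | s_t)
        r = R[s, a]                                 # r_t (deterministic reward in this toy)
        s_next = rng.choice(T.shape[2], p=T[s, a])  # s_{t+1} ~ T(s' | s_t, a_t)
        traj.append((s, a, r))
        s = s_next
    traj.append((s,))  # the trajectory ends with the final state s_T
    return traj

# Example usage (assuming T, R, pi from the earlier sketches):
# d = np.array([1.0, 0.0])  # always start in state 0
# tau = rollout(pi, T, R, d, horizon=5, rng=np.random.default_rng(0))
```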
Trajectories have a cumulative discounted reward
To formalize the idea that some trajectories are "better", we want to associate a scalar number with each trajectory, with "better" trajectories assigned higher numbers. The best trajectory has the biggest number! We formalize this by defining a function $G: \mathbb{T} \rightarrow \mathbb{R}$ that maps a trajectory $\tau_T \in \mathbb{T}$ to a single number representing the cumulative discounted rewards: $$ G(\tau_T) = \sum_{t=0}^{T-1} \gamma^{t} \cdot r_{t} $$
- When the discount factor $\gamma=1$, this is just the cumulative reward of the trajectory. Since $\gamma$ is between $0$ and $1$ and is raised to the power $t$, we are (exponentially) discounting future rewards, which is why we call $G(\tau)$ the cumulative discounted rewards. A sketch of computing $G(\tau)$ follows below.
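A direct translation of $G(\tau_T)$ into code, assuming the trajectory representation from the rollout sketch above (a list of `(s, a, r)` steps with a trailing final-state tuple); the function name is an illustrative choice.

```python
def cumulative_discounted_reward(traj, gamma):
    """G(tau) = sum_{t=0}^{T-1} gamma^t * r_t over the (s, a, r) steps of the trajectory."""
    return sum(gamma ** t * r for t, (_s, _a, r) in enumerate(traj[:-1]))

# With gamma = 1 this reduces to the plain (undiscounted) cumulative reward.
```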
Probability of trajectory depends on MDP and policy
We can formalize our aforementioned intuition for how an agent interacts with an environment by defining the probability of a trajectory in a highly factorized way. We won't go into details here (it is best described using graphical models), but our intuition is equivalent to making a bunch of conditional independence assumptions that are implied by our definitions of the transition dynamics, reward function, and policy all being functions of the current state (and action). This is what the word Markov in MDP refers to.

For a specific environment $MDP$ and policy $\pi$, we can define the probability of a trajectory $\tau_{T}$ as (a small numerical sketch follows the list below): $$ Pr_{\pi,T,R}(\tau_{T}) = Pr(s_0) \cdot \prod_{t=1}^{T} \pi(a_{t-1} | s_{t-1}) \cdot R(r_{t-1}|s_{t-1},a_{t-1}) \cdot T(s_t | s_{t-1}, a_{t-1}) $$
- We add $\pi$, $T$, and $R$ to the subscript of $Pr(\tau)$ to emphasize that this probability depends on the distributions of the policy, transition dynamics, and reward function, but when that is not relevant, we just refer to it as $Pr(\tau)$.
- $Pr(s_0) = d(s_0): S \rightarrow [0,1]$ is called the starting state distribution, and is a distribution over starting states (i.e., how likely a state $s$ is to be the state the agent starts in). This (marginal) distribution can depend on (or be independent of) the policy and transition dynamics.
- Homework Problem: Getting the time indices to all line up takes care! Confirm that our definition of $Pr(\tau_{T})$ doesn't have an "index out of bounds" error.
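As a sanity check on the factorization (and on the time indices), here is a sketch that computes the log-probability of a trajectory under the toy model above. The function name and trajectory format are the same illustrative assumptions as before; because the toy reward is deterministic, the $R(r|s,a)$ factor contributes probability 1 and is omitted.

```python
import numpy as np

def log_prob_trajectory(traj, pi, T, d):
    """log Pr(tau) = log d(s_0) + sum_t [ log pi(a_t | s_t) + log T(s_{t+1} | s_t, a_t) ].

    The R(r | s, a) factor is omitted because the toy reward is deterministic,
    so the observed reward has probability 1.
    """
    s0 = traj[0][0]
    logp = np.log(d[s0])
    for t in range(len(traj) - 1):   # the final entry only holds s_T
        s, a, _r = traj[t]
        s_next = traj[t + 1][0]
        logp += np.log(pi[s, a]) + np.log(T[s, a, s_next])
    return logp
```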
Goals are maximizing Cumulative Discounted Expected (CDE) rewards
To quote Rich Sutton, the reward hypothesis states: "That all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward)."
We can formalize this statement as the following optimization problem: find the policy $\pi$ from a space of policies $\Pi$ that maximizes the Cumulative Discounted Expected rewards (CDE): $$ \pi^{*} = \arg \max_{\pi \in \Pi} \mathbb{E}_{\tau \sim Pr_{\pi,T,R}(\tau)}[G(\tau)] $$
- We've done all the work to define every term in this optimization problem. Note that $Pr(\tau)$ depends on $\pi$, $T$, and $R$, while $G(\tau)$ just depends on $\gamma$ (confirm this by looking at the definitions of $G(\tau)$ and $Pr(\tau)$ and seeing what is required to compute them once you already have $\tau$). To make this dependence explicit, we write $Pr_{\pi,T,R}(\tau)$, which we leverage shortly.
- We call $\pi^{*}$ the optimal policy, and "solving the RL problem" amounts to finding the optimal policy (or at least getting as "close" to optimal as possible). The optimal policy need not be unique (i.e., there may be multiple optimal policies for a single MDP). A brute-force sketch for the toy MDP follows below.
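To tie everything together, here is a sketch that estimates the CDE objective by Monte Carlo (averaging $G(\tau)$ over sampled rollouts) and then "solves" the toy MDP by brute force over all deterministic policies. It reuses the hypothetical `rollout` and `cumulative_discounted_reward` helpers sketched above; the enumeration is only feasible because the toy state and action spaces are tiny, and it is not how practical RL algorithms find $\pi^{*}$.

```python
import numpy as np

def estimate_cde(pi, T, R, d, gamma, horizon, n_rollouts, rng):
    """Monte Carlo estimate of E_{tau ~ Pr_pi}[G(tau)]: average returns over sampled rollouts."""
    returns = [
        cumulative_discounted_reward(rollout(pi, T, R, d, horizon, rng), gamma)
        for _ in range(n_rollouts)
    ]
    return float(np.mean(returns))

def brute_force_optimal_policy(T, R, d, gamma, horizon, n_rollouts, rng):
    """Enumerate every deterministic policy of the tiny MDP and keep the best-scoring one."""
    n_states, n_actions = R.shape
    best_pi, best_value = None, -np.inf
    for assignment in np.ndindex(*([n_actions] * n_states)):  # one action index per state
        pi = np.eye(n_actions)[list(assignment)]              # one-hot rows => deterministic policy
        value = estimate_cde(pi, T, R, d, gamma, horizon, n_rollouts, rng)
        if value > best_value:
            best_pi, best_value = pi, value
    return best_pi, best_value
```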