Advantage in Reinforcement Learning

In Reinforcement Learning (RL), we define the advantage $A^{\pi}(s,a)$ of a policy $\pi$ as: $$ A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s) $$ One way to interpret the advantage is as the relative value of taking a particular action $a$ in state $s$ and then following the policy (which is exactly what $Q^{\pi}(s,a)$ represents), compared to the expected value of simply following the policy from state $s$ (which is exactly what $V^{\pi}(s)$ represents).
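
To make this concrete, here is a minimal sketch in NumPy for a hypothetical tabular setting, where $Q^{\pi}$ and the policy $\pi$ are given as arrays and $V^{\pi}$ is derived from them (all the numbers below are made up for illustration):

```python
import numpy as np

# Hypothetical tabular estimates for a toy MDP with 3 states and 2 actions.
# Q[s, a] is the value of taking action a in state s and then following pi.
Q = np.array([[1.0, 0.5],
              [0.2, 0.8],
              [0.0, 0.0]])
# pi[s, a] is the probability the policy assigns to action a in state s.
pi = np.array([[0.5, 0.5],
               [0.3, 0.7],
               [0.5, 0.5]])

# V(s) = E_{a ~ pi}[Q(s, a)]: the expected value of following the policy.
V = (pi * Q).sum(axis=1)

# Advantage: how much better each action is than the policy's average behavior.
A = Q - V[:, None]
print(A)
```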

Advantage is equal to expected TD error

Another way of interpreting the advantage is as the expected TD error: $$ \begin{aligned} A^{\pi}(s,a) &= Q^{\pi}(s,a) - V^{\pi}(s) &\quad& \text{(definition of advantage)} \\ &= \mathbb{E}_{s',r \sim T,R}[r + \gamma \cdot V^{\pi}(s')] - V^{\pi}(s) &\quad& \text{(Bellman expansion of $Q^{\pi}$)} \\ &= \mathbb{E}_{s',r \sim T,R}[r + \gamma \cdot V^{\pi}(s') - V^{\pi}(s)] &\quad& \text{(linearity of expectation)} \end{aligned} $$ where $T$, $R$, and $\gamma$ are the transition dynamics, reward function, and discount factor of the MDP. The quantity $r + \gamma \cdot V^{\pi}(s') - V^{\pi}(s)$ is the (one-step) TD error.
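
As a sketch of how this identity can be used, the snippet below estimates the advantage of a single fixed state-action pair by averaging sampled one-step TD errors, and compares the result against the exact expectation. The transition probabilities, rewards, and value estimates are all hypothetical stand-ins for $T$, $R$, and $V^{\pi}$:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# Hypothetical one-step model for one fixed (s, a) pair.
next_states = np.array([0, 1])    # possible s'
probs = np.array([0.4, 0.6])      # T(s' | s, a)
rewards = np.array([1.0, -0.5])   # R(s, a, s')
V = np.array([2.0, 1.5, 0.0])     # hypothetical V^pi estimates
s = 2                             # current state

# Sample transitions and average the one-step TD errors:
# delta = r + gamma * V(s') - V(s); their mean estimates A^pi(s, a).
idx = rng.choice(len(next_states), size=100_000, p=probs)
td_errors = rewards[idx] + gamma * V[next_states[idx]] - V[s]
print(td_errors.mean())  # Monte Carlo estimate of the advantage

# Exact value for comparison: E[r + gamma * V(s')] - V(s).
print((probs * (rewards + gamma * V[next_states])).sum() - V[s])
```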

Expected Advantage is 0

If we take the expectation of the advantage over the distribution of actions from the policy (in a given state $s$), we get the following: $$ \begin{aligned} \mathbb{E}_{a \sim \pi(\cdot|s)}[A^{\pi}(s,a)] &= \mathbb{E}_{a \sim \pi(\cdot|s)}[Q^{\pi}(s,a) - V^{\pi}(s)] &\quad& \text{(expectation applied to advantage definition)} \\ &= \mathbb{E}_{a \sim \pi(\cdot|s)}[Q^{\pi}(s,a)] - \mathbb{E}_{a \sim \pi(\cdot|s)}[V^{\pi}(s)] &\quad& \text{(linearity of expectation)} \\ &= V^{\pi}(s) - \mathbb{E}_{a \sim \pi(\cdot|s)}[V^{\pi}(s)] &\quad& \text{(since $\mathbb{E}_{a \sim \pi(\cdot|s)}[Q^{\pi}(s,a)] = V^{\pi}(s)$)} \\ &= V^{\pi}(s) - V^{\pi}(s) &\quad& \text{($V^{\pi}(s)$ doesn't depend on $a$, so the expectation drops)} \\ &= 0 \end{aligned} $$ This means that the expected advantage (under the policy) is 0, which makes sense under the first interpretation of advantage (the relative value of an action compared to the value of following the policy): better-than-average and worse-than-average actions must balance out under the policy's own action distribution.
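
A quick numerical check of this fact, assuming random (hypothetical) Q-values and a random policy: when $V^{\pi}$ is defined consistently as the policy-weighted average of $Q^{\pi}$, the policy-weighted average of the advantages comes out to zero in every state:

```python
import numpy as np

rng = np.random.default_rng(1)

# Random hypothetical Q-values and a random policy over 4 states, 3 actions.
Q = rng.normal(size=(4, 3))
pi = rng.dirichlet(np.ones(3), size=4)   # each row sums to 1

V = (pi * Q).sum(axis=1)                 # V(s) = E_{a ~ pi}[Q(s, a)]
A = Q - V[:, None]                       # advantage

# Expected advantage under the policy, per state (zero up to float error).
print((pi * A).sum(axis=1))              # ~ [0, 0, 0, 0]
```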