In Reinforcement Learning (RL), we define the advantage of a policy as:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$

One way to interpret the advantage is as the relative value of taking a certain action $a$ from a state $s$, and then following the policy $\pi$ (this is exactly what $Q^\pi(s, a)$ represents), as compared to the expected value of following the policy from state $s$ (this is exactly what the value function $V^\pi(s)$ represents).
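
As a minimal numerical sketch of this definition, here is the advantage computed for a single state with three actions. The Q-values and policy probabilities are hypothetical numbers chosen purely for illustration.

```python
import numpy as np

# Hypothetical Q-values and action probabilities for one state s.
Q_s = np.array([1.0, 2.5, 0.5])   # Q^pi(s, a) for each action a
pi_s = np.array([0.2, 0.5, 0.3])  # pi(a | s), sums to 1

V_s = pi_s @ Q_s                  # V^pi(s) = E_{a ~ pi}[Q^pi(s, a)]
A_s = Q_s - V_s                   # A^pi(s, a)

print(V_s)  # 1.6
print(A_s)  # [-0.6, 0.9, -1.1]: how much better/worse each action is than the policy's average
```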

Advantage is equal to expected TD error

Another way of interpreting the advantage is as the expected (one-step) TD error. Since $Q^\pi(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[ r(s, a) + \gamma V^\pi(s') \right]$ (the Bellman expectation equation), we have:

$$A^\pi(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[ r(s, a) + \gamma V^\pi(s') - V^\pi(s) \right] = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[ \delta \right]$$

Where $P$, $r$, and $\gamma$ are the transition dynamics, reward function, and discount factor for the MDP, and $\delta = r(s, a) + \gamma V^\pi(s') - V^\pi(s)$ is the (one-step) TD error.
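
To make this identity concrete, here is a small sketch on a randomly generated tabular MDP (all quantities below are made up for illustration): it solves for $V^\pi$ exactly, then checks that the Monte Carlo average of the one-step TD error for a fixed $(s, a)$ matches $Q^\pi(s, a) - V^\pi(s)$.

```python
import numpy as np

# A made-up tabular MDP, purely for illustration.
rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, s'], rows sum to 1
r = rng.normal(size=(nS, nA))                  # r(s, a)
pi = rng.dirichlet(np.ones(nA), size=nS)       # pi(a | s)

# Solve the Bellman equation V = r_pi + gamma * P_pi V exactly for V^pi.
r_pi = (pi * r).sum(axis=1)
P_pi = np.einsum("sa,sap->sp", pi, P)
V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

# Advantage from the definition: A(s, a) = Q(s, a) - V(s).
Q = r + gamma * (P @ V)
s, a = 0, 1
advantage = Q[s, a] - V[s]

# Monte Carlo estimate of the expected TD error E_{s'}[r + gamma V(s') - V(s)].
next_states = rng.choice(nS, size=200_000, p=P[s, a])
expected_td_error = np.mean(r[s, a] + gamma * V[next_states] - V[s])

print(advantage, expected_td_error)  # close, up to sampling noise
```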

Expected Advantage is 0

If we take the expectation of the advantage over the distribution of actions from the policy (in a given state $s$), we get the following:

$$\mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[ A^\pi(s, a) \right] = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[ Q^\pi(s, a) \right] - V^\pi(s) = V^\pi(s) - V^\pi(s) = 0$$

This means that the expected advantage (under the policy) is 0, which makes sense given the first interpretation of the advantage: the relative value of an action compared to the expected value of following the policy.
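
A quick numerical check of this fact, using hypothetical Q-values and a hypothetical policy: $V^\pi$ is computed as the policy-weighted average of $Q^\pi$, and the policy-weighted average of the advantage comes out to zero in every state.

```python
import numpy as np

# Check that E_{a ~ pi(.|s)}[A^pi(s, a)] = 0 in every state.
# The Q-values and policy are hypothetical, chosen only for illustration.
rng = np.random.default_rng(1)
nS, nA = 5, 3
Q = rng.normal(size=(nS, nA))             # stand-in for Q^pi(s, a)
pi = rng.dirichlet(np.ones(nA), size=nS)  # pi(a | s), rows sum to 1

V = (pi * Q).sum(axis=1)                  # V^pi(s) = E_{a ~ pi}[Q^pi(s, a)]
A = Q - V[:, None]                        # A^pi(s, a)

expected_A = (pi * A).sum(axis=1)         # E_{a ~ pi}[A^pi(s, a)] per state
print(np.allclose(expected_A, 0.0))       # True (up to floating point error)
```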