In Reinforcement Learning (RL), we define the advantage of a policy as:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$

One way to interpret the advantage is as the relative value of taking a certain action $a$ from a state $s$, and then following the policy $\pi$ (this is exactly what $Q^\pi(s, a)$ represents), as compared to the expected value of following the policy from state $s$ (this is exactly what the value function $V^\pi(s)$ represents).
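
As a minimal numerical sketch of this definition, here is the advantage computed for a single state with three actions. The Q-values and policy probabilities are hypothetical numbers chosen purely for illustration.

```python
import numpy as np

# Hypothetical Q-values and action probabilities for one state s.
Q_s = np.array([1.0, 2.5, 0.5])   # Q^pi(s, a) for each action a
pi_s = np.array([0.2, 0.5, 0.3])  # pi(a | s), sums to 1

V_s = pi_s @ Q_s                  # V^pi(s) = E_{a ~ pi}[Q^pi(s, a)]
A_s = Q_s - V_s                   # A^pi(s, a)

print(V_s)  # 1.6
print(A_s)  # [-0.6, 0.9, -1.1]: how much better/worse each action is than the policy's average
```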

Advantage is equal to expected TD error

Another way of interpreting the advantage is as the expected (one-step) TD error. Since $Q^\pi(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[ r(s, a) + \gamma V^\pi(s') \right]$ (the Bellman expectation equation), we have:

$$A^\pi(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[ r(s, a) + \gamma V^\pi(s') - V^\pi(s) \right] = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[ \delta \right]$$

Where $P$, $r$, and $\gamma$ are the transition dynamics, reward function, and discount factor for the MDP, and $\delta = r(s, a) + \gamma V^\pi(s') - V^\pi(s)$ is the (one-step) TD error.
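
To make this identity concrete, here is a small sketch on a randomly generated tabular MDP (all quantities below are made up for illustration): it solves for $V^\pi$ exactly, then checks that the Monte Carlo average of the one-step TD error for a fixed $(s, a)$ matches $Q^\pi(s, a) - V^\pi(s)$.

```python
import numpy as np

# A made-up tabular MDP, purely for illustration.
rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, s'], rows sum to 1
r = rng.normal(size=(nS, nA))                  # r(s, a)
pi = rng.dirichlet(np.ones(nA), size=nS)       # pi(a | s)

# Solve the Bellman equation V = r_pi + gamma * P_pi V exactly for V^pi.
r_pi = (pi * r).sum(axis=1)
P_pi = np.einsum("sa,sap->sp", pi, P)
V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

# Advantage from the definition: A(s, a) = Q(s, a) - V(s).
Q = r + gamma * (P @ V)
s, a = 0, 1
advantage = Q[s, a] - V[s]

# Monte Carlo estimate of the expected TD error E_{s'}[r + gamma V(s') - V(s)].
next_states = rng.choice(nS, size=200_000, p=P[s, a])
expected_td_error = np.mean(r[s, a] + gamma * V[next_states] - V[s])

print(advantage, expected_td_error)  # close, up to sampling noise
```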

Expected Advantage is 0

If we take the expectation of the advantage over the distribution of actions from the policy (in a given state $s$), we get the following:

$$\mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[ A^\pi(s, a) \right] = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[ Q^\pi(s, a) \right] - V^\pi(s) = V^\pi(s) - V^\pi(s) = 0$$

This means that the expected advantage (under the policy) is 0, which makes sense given the first interpretation of the advantage: the relative value of an action compared to the expected value of following the policy.
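
A quick numerical check of this fact, using hypothetical Q-values and a hypothetical policy: $V^\pi$ is computed as the policy-weighted average of $Q^\pi$, and the policy-weighted average of the advantage comes out to zero in every state.

```python
import numpy as np

# Check that E_{a ~ pi(.|s)}[A^pi(s, a)] = 0 in every state.
# The Q-values and policy are hypothetical, chosen only for illustration.
rng = np.random.default_rng(1)
nS, nA = 5, 3
Q = rng.normal(size=(nS, nA))             # stand-in for Q^pi(s, a)
pi = rng.dirichlet(np.ones(nA), size=nS)  # pi(a | s), rows sum to 1

V = (pi * Q).sum(axis=1)                  # V^pi(s) = E_{a ~ pi}[Q^pi(s, a)]
A = Q - V[:, None]                        # A^pi(s, a)

expected_A = (pi * A).sum(axis=1)         # E_{a ~ pi}[A^pi(s, a)] per state
print(np.allclose(expected_A, 0.0))       # True (up to floating point error)
```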