Behavior cloning to score-based policy gradient
This note makes explicit the connection between supervised learning (i.e., behavior cloning) and reinforcement learning (i.e., score-based policy gradients).
We start by defining behavior cloning, which assumes access to expert demonstrations, and calculate the gradient for maximizing its likelihood. Next, we introduce a dataset of bad demonstrations, on which we (additionally) try to minimize likelihood. Then, we introduce reward/advantage-weighted regression, which replaces the explicit good/bad labels with returns/advantages. The result ends up looking like the standard score-based policy gradient.
Supervised learning - Behavior cloning with maximum log likelihood
We assume there is:
- An expert policy $\pi_E(a \mid s)$ that defines a distribution over actions conditioned on states,
- A dataset $\mathcal{D} = \{(s_i, a_i)\}_{i=1}^{n}$ of $n$ state-action pairs generated by sampling states from some (arbitrary) state distribution $d(s)$ and querying the expert for actions in those states, so that state-action samples in the dataset are jointly distributed according to $p(s, a) = d(s)\,\pi_E(a \mid s)$.
The maximum log-likelihood objective is to find a policy $\pi_\theta$ that maximizes the expected log likelihood of state-action pairs under the state-action distribution $p(s, a)$:

$$\max_\theta \; \mathbb{E}_{(s, a) \sim p}\left[\log \pi_\theta(a \mid s)\right]$$
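As a concrete sketch of this objective, here is the behavior-cloning loss and its gradient for a tabular softmax policy $\pi_\theta(a \mid s) = \mathrm{softmax}(\theta_s)$ over discrete states and actions. The tabular parameterization and all variable names are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def bc_loss_and_grad(theta, states, actions):
    """Negative expected log-likelihood of expert actions, and its gradient.

    theta:   (n_states, n_actions) logits of the tabular policy
    states:  (n,) int array of states s_i sampled from d(s)
    actions: (n,) int array of expert actions a_i ~ pi_E(a | s_i)
    """
    probs = softmax(theta[states])                    # (n, n_actions)
    n = len(states)
    loss = -np.log(probs[np.arange(n), actions]).mean()
    # Gradient of -log softmax per sample is (probs - onehot(action));
    # accumulate per-state contributions into the parameter table.
    err = probs.copy()
    err[np.arange(n), actions] -= 1.0
    grad = np.zeros_like(theta)
    np.add.at(grad, states, err / n)
    return loss, grad
```

A gradient step `theta -= lr * grad` increases the likelihood of the expert's actions in the visited states, which is exactly the maximum-likelihood update above.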
Ultimately, we are finding a policy that maximizes the likelihood of the actions the expert generates, weighted according to an arbitrary state distribution $d(s)$.
Positive and negative examples, and a single dataset to include them all
Let’s assume we also have a dataset $\mathcal{D}^- = \{(s_j, a_j)\}_{j=1}^{m}$ of “bad” state-action pairs, where the state-action pairs are sampled from a joint distribution that has (a potentially different) starting state distribution $d^-(s)$ and actions generated by a “bad” policy $\pi_B$, resulting in a joint distribution that factors in a similar way to the good dataset: $p^-(s, a) = d^-(s)\,\pi_B(a \mid s)$.
We can now define an objective where, in addition to maximizing the likelihood of expert actions, we additionally minimize the likelihood of the bad examples:

$$\max_\theta \; \mathbb{E}_{(s, a) \sim p}\left[\log \pi_\theta(a \mid s)\right] - \mathbb{E}_{(s, a) \sim p^-}\left[\log \pi_\theta(a \mid s)\right]$$
Since the only difference between maximizing and minimizing is whether we multiply the gradient of the log-likelihood by positive or negative one, we can rewrite this objective over a single dataset that combines the positive and negative datasets $\mathcal{D}$ and $\mathcal{D}^-$ (therefore containing $n + m$ state-action pairs) and add a label $y \in \{+1, -1\}$ to each transition to denote whether it’s from the good or bad dataset.
Since this dataset is just a union of the good and bad datasets (with an additional label for which dataset each pair came from) and has size $n + m$, we can calculate the probability of a state-action-label datapoint in the new dataset by taking into account what proportion of the dataset is good vs. bad data:

$$p(s, a, y) = \frac{n}{n+m}\,p(s, a)\,\mathbb{1}[y = +1] + \frac{m}{n+m}\,p^-(s, a)\,\mathbb{1}[y = -1]$$
Now we can write an objective using the single dataset, using the label to handle the sign by multiplying it with the log-likelihood, so the loss is one expectation/sum instead of two:

$$\max_\theta \; \mathbb{E}_{(s, a, y) \sim p(s, a, y)}\left[y \log \pi_\theta(a \mid s)\right]$$
Expanding the expectation over the two values of $y$ recovers the earlier two-dataset loss up to a reweighting of the data: $\mathbb{E}_{(s, a, y)}\left[y \log \pi_\theta(a \mid s)\right] = \frac{n}{n+m}\,\mathbb{E}_{p}\left[\log \pi_\theta(a \mid s)\right] - \frac{m}{n+m}\,\mathbb{E}_{p^-}\left[\log \pi_\theta(a \mid s)\right]$, i.e., the same objective as before, with each term weighted by the proportion of good vs. bad data.
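The single-dataset objective with signed labels can be sketched in code for the same tabular softmax policy as before (the parameterization and variable names are illustrative assumptions):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def labeled_loss_and_grad(theta, states, actions, labels):
    """Signed negative log-likelihood: maximize for y=+1, minimize for y=-1.

    labels: (n,) array with entries in {+1, -1}
    """
    probs = softmax(theta[states])
    n = len(states)
    logp = np.log(probs[np.arange(n), actions])
    loss = -(labels * logp).mean()
    # Per-sample gradient of -y * log softmax is y * (probs - onehot(action)).
    err = probs.copy()
    err[np.arange(n), actions] -= 1.0
    grad = np.zeros_like(theta)
    np.add.at(grad, states, (labels[:, None] * err) / n)
    return loss, grad
```

Numerically, this single-expectation loss equals the good-dataset term weighted by $n/(n+m)$ plus the (sign-flipped) bad-dataset term weighted by $m/(n+m)$, matching the reweighting argument above.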
Reward-weighted regression (RWR)
References
- original paper (Relative Entropy Policy Search)
- advantage-weighted regression paper: more interpretable / up-to-date imo
At each iteration $k$, reward-weighted regression (RWR) solves the following regression problem:

$$\pi_{k+1} = \arg\max_\pi \; \mathbb{E}_{s \sim d^{\pi_k}(s)}\,\mathbb{E}_{a \sim \pi_k(a \mid s)}\left[\exp\!\left(\tfrac{1}{\beta} R(s, a)\right) \log \pi(a \mid s)\right]$$
Where:
- $\pi_k$ represents the policy at the $k$th iteration of the algorithm,
- $R(s, a)$ is the return (and can be interpreted as an empirical sample from $Q^{\pi_k}(s, a)$),
- $d^{\pi_k}(s)$ is the unnormalized discounted state distribution induced by the policy $\pi_k$.
“The RWR update can be interpreted as solving a maximum likelihood problem that fits a new policy to samples collected under the current policy $\pi_k$, where the likelihood of each action is weighted by the exponentiated return received for that action, with a temperature parameter $\beta$.”
To clarify how similar this is to the “Positive and negative examples, and a single dataset to include them all” loss and policy gradient, we will:
- Define the weight $w(s, a) = \exp\!\left(\tfrac{1}{\beta} R(s, a)\right)$, so that the (exponentiated) returns are used like a “label” that gives more weight to maximizing the likelihood of actions based on their value.
- Write the optimization problem above as a loss at each iteration, and give its gradient w.r.t. the policy parameters $\theta$:

$$\mathcal{L}_k(\theta) = -\,\mathbb{E}_{s \sim d^{\pi_k},\, a \sim \pi_k}\left[w(s, a) \log \pi_\theta(a \mid s)\right], \qquad \nabla_\theta \mathcal{L}_k(\theta) = -\,\mathbb{E}_{s \sim d^{\pi_k},\, a \sim \pi_k}\left[w(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)\right]$$
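The per-iteration RWR loss and gradient can be sketched with the same tabular softmax policy used for behavior cloning; here `states`/`actions` stand for samples drawn under the current policy $\pi_k$, and `returns` holds the return received for each pair (all names and the tabular parameterization are illustrative assumptions):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def rwr_loss_and_grad(theta, states, actions, returns, beta=1.0):
    """Return-weighted negative log-likelihood and its gradient."""
    probs = softmax(theta[states])
    n = len(states)
    w = np.exp(returns / beta)                        # exponentiated returns
    logp = np.log(probs[np.arange(n), actions])
    loss = -(w * logp).mean()
    # Same (probs - onehot) gradient as behavior cloning, but each sample
    # is scaled by its exponentiated return instead of a +1/-1 label.
    err = probs.copy()
    err[np.arange(n), actions] -= 1.0
    grad = np.zeros_like(theta)
    np.add.at(grad, states, (w[:, None] * err) / n)
    return loss, grad
```

Structurally this is the labeled behavior-cloning loss with the $\pm 1$ label replaced by a per-sample scalar weight, which is the point of the comparison.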
The biggest differences between the previous objective and this one are:
- The distribution of the expectation: RWR is on-policy, whereas the previous approach used an arbitrary state distribution and assumed actions came from the expert policy. Every time we update the policy, RWR collects a new dataset, whereas in the supervised learning setting we used the same dataset for every policy update.
- In our newest objective, the “label” (the exponentiated return) is state-action conditioned and is derived from interaction samples with the environment produced by the policy, whereas before it was just a provided label from an expert. Also, the original label was just $+1$ or $-1$, whereas the new label can be an arbitrary scalar.
- todo: provide justification from the original paper for this objective (maximizing returns while staying close to the original policy, i.e., KL-constrained policy improvement)
- mention that this is on-policy, but we can do this in an off-policy fashion by incorporating importance sampling (mentioned in the AWR paper, Section 3.2).
Advantage-weighted regression (AWR)
Advantage-weighted regression extends the RWR objective above in two key ways:
- Rather than operate on on-policy data, it instead learns from off-policy data (a different state-action distribution for the expectation). More specifically, it starts with an empty replay buffer and adds data from each iteration of the policy (let $\mathcal{B}_k$ represent the replay buffer at iteration $k$).
- It replaces the returns with an estimate of the advantage, $\hat{A}(s, a) = R(s, a) - V_\phi(s)$, where $R(s, a)$ is the return of a trajectory from the dataset containing the mixture of rollouts from past policies, and $V_\phi$ is a value function that is fitted with empirical returns from $\mathcal{B}_k$.
- todo: give details on how this objective is motivated from a policy improvement objective instead of direct return maximization.
To make the relationship to Reward-weighted regression (RWR) clear, we can set $w(s, a) = \exp\!\left(\tfrac{1}{\beta} \hat{A}(s, a)\right)$ and define the loss/gradient for each iteration:

$$\mathcal{L}_k(\theta) = -\,\mathbb{E}_{(s, a) \sim \mathcal{B}_k}\left[w(s, a) \log \pi_\theta(a \mid s)\right], \qquad \nabla_\theta \mathcal{L}_k(\theta) = -\,\mathbb{E}_{(s, a) \sim \mathcal{B}_k}\left[w(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)\right]$$
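An AWR-style iteration can be sketched in the same tabular setting: fit a simple per-state value estimate from empirical returns in the buffer, form advantage estimates, and reuse the weighted log-likelihood loss with exponentiated advantages as weights. The tabular value fit (a per-state mean of Monte Carlo returns) and all names are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fit_value(states, returns, n_states):
    """Tabular value estimate: mean empirical return observed in each state."""
    totals = np.zeros(n_states)
    counts = np.zeros(n_states)
    np.add.at(totals, states, returns)
    np.add.at(counts, states, 1.0)
    return totals / np.maximum(counts, 1.0)

def awr_loss_and_grad(theta, states, actions, returns, beta=1.0):
    """Advantage-weighted negative log-likelihood over replay-buffer data."""
    probs = softmax(theta[states])
    n = len(states)
    V = fit_value(states, returns, theta.shape[0])
    adv = returns - V[states]                          # advantage estimate
    w = np.exp(adv / beta)                             # exponentiated advantage
    logp = np.log(probs[np.arange(n), actions])
    loss = -(w * logp).mean()
    err = probs.copy()
    err[np.arange(n), actions] -= 1.0
    grad = np.zeros_like(theta)
    np.add.at(grad, states, (w[:, None] * err) / n)
    return loss, grad
```

Compared to the RWR sketch, the only change is the weight: exponentiated advantages $R(s, a) - V_\phi(s)$ instead of exponentiated raw returns, computed over the replay buffer rather than fresh on-policy samples.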
References
- original DDPG paper (connects DDPG to stochastic policy gradient)
- REPPO
  - Establishes the pathwise policy gradient
  - Proof that the optimal policy to extract from a value function with an entropy penalty is the softmax.
- From policy gradient to actor-critic methods
- pretty sparse; does go over BC (regression and max likelihood) to RWR and policy gradients, but isn’t formal about the distributions or how to make them look exactly the same, and doesn’t use much math notation.
- Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems
- good information on AWR (ctrl+f AWR for info)
TODOS:
- provide code for each loss and gradient.
- provide the empirical (sample-based) estimates of each expectation, since those are what the code implements.