Behavior cloning to score-based policy gradient
This note makes explicit the connection between supervised learning (i.e., behavior cloning) and reinforcement learning (i.e., score-based policy gradients).
We start by defining behavior cloning, which assumes access to expert demonstrations, and calculate the gradient for maximizing its likelihood. Next, we introduce a dataset of bad demonstrations, on which we (additionally) try to minimize likelihood. Then, we introduce reward/advantage-weighted regression, which replaces the explicit good/bad labels with returns/advantages. The result ends up looking like the standard score-based policy gradient.
Supervised learning - Behavior cloning with maximum log likelihood
We assume there is:
- An expert policy $\pi_E(a \mid s)$ that defines a distribution over actions conditioned on states,
- A dataset $\mathcal{D} = \{(s_i, a_i)\}_{i=1}^{n}$ of $n$ state-action pairs generated by sampling states from some (arbitrary) state distribution $d(s)$ and querying the expert for actions in those states, so that state-action samples in the dataset are jointly distributed according to $p(s, a) = d(s)\,\pi_E(a \mid s)$.
The maximum log-likelihood objective is to find a policy $\pi_\theta$ that maximizes the expected log likelihood of state-action pairs under the state-action distribution $p(s, a)$:

$$\max_\theta \; \mathbb{E}_{(s, a) \sim p}\left[\log \pi_\theta(a \mid s)\right]$$
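As a concrete sketch of this objective, here is the behavior-cloning loss and its gradient for a tabular softmax policy $\pi_\theta(a \mid s) = \mathrm{softmax}(\theta_s)$ over discrete states and actions. The tabular parameterization and all variable names are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def bc_loss_and_grad(theta, states, actions):
    """Negative expected log-likelihood of expert actions, and its gradient.

    theta:   (n_states, n_actions) logits of the tabular policy
    states:  (n,) int array of states s_i sampled from d(s)
    actions: (n,) int array of expert actions a_i ~ pi_E(a | s_i)
    """
    probs = softmax(theta[states])                    # (n, n_actions)
    n = len(states)
    loss = -np.log(probs[np.arange(n), actions]).mean()
    # Gradient of -log softmax per sample is (probs - onehot(action));
    # accumulate per-state contributions into the parameter table.
    err = probs.copy()
    err[np.arange(n), actions] -= 1.0
    grad = np.zeros_like(theta)
    np.add.at(grad, states, err / n)
    return loss, grad
```

A gradient step `theta -= lr * grad` increases the likelihood of the expert's actions in the visited states, which is exactly the maximum-likelihood update above.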
Ultimately, we are finding a policy that maximizes the likelihood of the actions the expert generates, weighted according to an arbitrary state distribution $d(s)$.
Positive and negative examples, and a single dataset to include them all
Let’s assume we also have a dataset $\mathcal{D}^- = \{(s_j, a_j)\}_{j=1}^{m}$ of “bad” state-action pairs, where the state-action pairs are sampled from a joint distribution that has (a potentially different) starting state distribution $d^-(s)$ and actions generated by a “bad” policy $\pi_B$, resulting in a joint distribution that factors in a similar way to the good dataset: $p^-(s, a) = d^-(s)\,\pi_B(a \mid s)$.
We can now define an objective where, in addition to maximizing the likelihood of expert actions, we additionally minimize the likelihood of the bad examples:

$$\max_\theta \; \mathbb{E}_{(s, a) \sim p}\left[\log \pi_\theta(a \mid s)\right] - \mathbb{E}_{(s, a) \sim p^-}\left[\log \pi_\theta(a \mid s)\right]$$
Since the only difference between maximizing and minimizing is whether we multiply the gradient of the log-likelihood by positive or negative one, we can rewrite this objective over a single dataset that combines the positive and negative datasets $\mathcal{D}$ and $\mathcal{D}^-$ (therefore containing $n + m$ state-action pairs) and add a label $y \in \{+1, -1\}$ to each transition to denote whether it’s from the good or bad dataset.
Since this dataset is just a union of the good and bad datasets (with an additional label for which dataset each pair came from) and has size $n + m$, we can calculate the probability of a state-action-label datapoint in the new dataset by taking into account what proportion of the dataset is good vs. bad data:

$$p(s, a, y) = \frac{n}{n+m}\,p(s, a)\,\mathbb{1}[y = +1] + \frac{m}{n+m}\,p^-(s, a)\,\mathbb{1}[y = -1]$$
Now we can write an objective using the single dataset, using the label to handle the sign by multiplying it with the log-likelihood, so the loss is one expectation/sum instead of two:

$$\max_\theta \; \mathbb{E}_{(s, a, y) \sim p(s, a, y)}\left[y \log \pi_\theta(a \mid s)\right]$$
Expanding the expectation over the two values of $y$ recovers the earlier two-dataset loss up to a reweighting of the data: $\mathbb{E}_{(s, a, y)}\left[y \log \pi_\theta(a \mid s)\right] = \frac{n}{n+m}\,\mathbb{E}_{p}\left[\log \pi_\theta(a \mid s)\right] - \frac{m}{n+m}\,\mathbb{E}_{p^-}\left[\log \pi_\theta(a \mid s)\right]$, i.e., the same objective as before, with each term weighted by the proportion of good vs. bad data.
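The single-dataset objective with signed labels can be sketched in code for the same tabular softmax policy as before (the parameterization and variable names are illustrative assumptions):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def labeled_loss_and_grad(theta, states, actions, labels):
    """Signed negative log-likelihood: maximize for y=+1, minimize for y=-1.

    labels: (n,) array with entries in {+1, -1}
    """
    probs = softmax(theta[states])
    n = len(states)
    logp = np.log(probs[np.arange(n), actions])
    loss = -(labels * logp).mean()
    # Per-sample gradient of -y * log softmax is y * (probs - onehot(action)).
    err = probs.copy()
    err[np.arange(n), actions] -= 1.0
    grad = np.zeros_like(theta)
    np.add.at(grad, states, (labels[:, None] * err) / n)
    return loss, grad
```

Numerically, this single-expectation loss equals the good-dataset term weighted by $n/(n+m)$ plus the (sign-flipped) bad-dataset term weighted by $m/(n+m)$, matching the reweighting argument above.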
Reward-weighted regression (RWR)
References
- original paper (Relative Entropy Policy Search)
- advantage-weighted regression paper: more interpretable / up-to-date imo
At each iteration $k$, reward-weighted regression (RWR) solves the following regression problem:

$$\pi_{k+1} = \arg\max_\pi \; \mathbb{E}_{s \sim d^{\pi_k}(s)}\,\mathbb{E}_{a \sim \pi_k(a \mid s)}\left[\exp\!\left(\tfrac{1}{\beta} R(s, a)\right) \log \pi(a \mid s)\right]$$
Where:
- $\pi_k$ represents the policy at the $k$th iteration of the algorithm,
- $R(s, a)$ is the return (and can be interpreted as an empirical sample from $Q^{\pi_k}(s, a)$),
- $d^{\pi_k}(s)$ is the unnormalized discounted state distribution induced by the policy $\pi_k$.
“The RWR update can be interpreted as solving a maximum likelihood problem that fits a new policy to samples collected under the current policy $\pi_k$, where the likelihood of each action is weighted by the exponentiated return received for that action, with a temperature parameter $\beta$.”
To clarify how similar this is to the “Positive and negative examples, and a single dataset to include them all” loss and policy gradient, we will:
- Define the weight $w(s, a) = \exp\!\left(\tfrac{1}{\beta} R(s, a)\right)$, so that the (exponentiated) returns are used like a “label” that gives more weight to maximizing the likelihood of actions based on their value.
- Write the optimization problem above as a loss at each iteration, and give its gradient w.r.t. the policy parameters $\theta$:

$$\mathcal{L}_k(\theta) = -\,\mathbb{E}_{s \sim d^{\pi_k},\, a \sim \pi_k}\left[w(s, a) \log \pi_\theta(a \mid s)\right], \qquad \nabla_\theta \mathcal{L}_k(\theta) = -\,\mathbb{E}_{s \sim d^{\pi_k},\, a \sim \pi_k}\left[w(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)\right]$$
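The per-iteration RWR loss and gradient can be sketched with the same tabular softmax policy used for behavior cloning; here `states`/`actions` stand for samples drawn under the current policy $\pi_k$, and `returns` holds the return received for each pair (all names and the tabular parameterization are illustrative assumptions):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def rwr_loss_and_grad(theta, states, actions, returns, beta=1.0):
    """Return-weighted negative log-likelihood and its gradient."""
    probs = softmax(theta[states])
    n = len(states)
    w = np.exp(returns / beta)                        # exponentiated returns
    logp = np.log(probs[np.arange(n), actions])
    loss = -(w * logp).mean()
    # Same (probs - onehot) gradient as behavior cloning, but each sample
    # is scaled by its exponentiated return instead of a +1/-1 label.
    err = probs.copy()
    err[np.arange(n), actions] -= 1.0
    grad = np.zeros_like(theta)
    np.add.at(grad, states, (w[:, None] * err) / n)
    return loss, grad
```

Structurally this is the labeled behavior-cloning loss with the $\pm 1$ label replaced by a per-sample scalar weight, which is the point of the comparison.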
The biggest differences between the previous objective and this one are:
- The distribution of the expectation: RWR is on-policy, whereas the previous approach used an arbitrary state distribution and assumed actions came from the expert policy. Every time we update the policy, RWR collects a new dataset, whereas in the supervised learning setting we used the same dataset for every policy update.
- In our newest objective, the “label” (the exponentiated return) is state-action conditioned and is derived from interaction samples with the environment produced by the policy, whereas before it was just a provided label from an expert. Also, the original label was just $+1$ or $-1$, whereas the new label can be an arbitrary scalar.
- todo: provide justification from the original paper for this objective (maximizing returns while staying close to the original policy, i.e., KL-constrained policy improvement)
- mention that this is on-policy, but we can do this in an off-policy fashion by incorporating importance sampling (mentioned in the AWR paper, Section 3.2).
Advantage-weighted regression (AWR)
Advantage-weighted regression extends the RWR objective above in two key ways:
- Rather than operate on on-policy data, it instead learns from off-policy data (a different state-action distribution for the expectation). More specifically, it starts with an empty replay buffer and adds data from each iteration of the policy (let $\mathcal{B}_k$ represent the replay buffer at iteration $k$).
- It replaces the returns with an estimate of the advantage, $\hat{A}(s, a) = R(s, a) - V_\phi(s)$, where $R(s, a)$ is the return of a trajectory from the dataset containing the mixture of rollouts from past policies, and $V_\phi$ is a value function that is fitted with empirical returns from $\mathcal{B}_k$.
- todo: give details on how this objective is motivated from a policy improvement objective instead of direct return maximization.
To make the relationship to Reward-weighted regression (RWR) clear, we can set $w(s, a) = \exp\!\left(\tfrac{1}{\beta} \hat{A}(s, a)\right)$ and define the loss/gradient for each iteration:

$$\mathcal{L}_k(\theta) = -\,\mathbb{E}_{(s, a) \sim \mathcal{B}_k}\left[w(s, a) \log \pi_\theta(a \mid s)\right], \qquad \nabla_\theta \mathcal{L}_k(\theta) = -\,\mathbb{E}_{(s, a) \sim \mathcal{B}_k}\left[w(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)\right]$$
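An AWR-style iteration can be sketched in the same tabular setting: fit a simple per-state value estimate from empirical returns in the buffer, form advantage estimates, and reuse the weighted log-likelihood loss with exponentiated advantages as weights. The tabular value fit (a per-state mean of Monte Carlo returns) and all names are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fit_value(states, returns, n_states):
    """Tabular value estimate: mean empirical return observed in each state."""
    totals = np.zeros(n_states)
    counts = np.zeros(n_states)
    np.add.at(totals, states, returns)
    np.add.at(counts, states, 1.0)
    return totals / np.maximum(counts, 1.0)

def awr_loss_and_grad(theta, states, actions, returns, beta=1.0):
    """Advantage-weighted negative log-likelihood over replay-buffer data."""
    probs = softmax(theta[states])
    n = len(states)
    V = fit_value(states, returns, theta.shape[0])
    adv = returns - V[states]                          # advantage estimate
    w = np.exp(adv / beta)                             # exponentiated advantage
    logp = np.log(probs[np.arange(n), actions])
    loss = -(w * logp).mean()
    err = probs.copy()
    err[np.arange(n), actions] -= 1.0
    grad = np.zeros_like(theta)
    np.add.at(grad, states, (w[:, None] * err) / n)
    return loss, grad
```

Compared to the RWR sketch, the only change is the weight: exponentiated advantages $R(s, a) - V_\phi(s)$ instead of exponentiated raw returns, computed over the replay buffer rather than fresh on-policy samples.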
References
- original DDPG paper (connects DDPG to stochastic policy gradient)
- REPPO
  - Establishes the pathwise policy gradient
  - Proof that the optimal policy to extract from a value function with an entropy penalty is the softmax.
- From policy gradient to actor-critic methods
- pretty sparse; does go over BC (regression and max likelihood) to RWR and policy gradients, but isn’t formal about the distributions or how to make them look exactly the same, and doesn’t use much math notation.
- Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems
- good information on AWR (ctrl+f AWR for info)
TODOS:
- provide code for each loss and gradient.
- provide the empirical (sample-based) estimates of each expectation, since those are what the code implements.