Summary
We will consider taking the gradients of three different expressions w.r.t. the parameters $\theta$:

$$J_1(\theta) = \mathbb{E}_{x \sim P(x)}\left[F_\theta(x)\right], \qquad J_2(\theta) = \mathbb{E}_{x \sim P_\theta(x)}\left[F(x)\right], \qquad J_3(\theta) = \mathbb{E}_{x \sim P_\theta(x)}\left[F_\theta(x)\right]$$

All three of these expressions are expected values over the same two terms, $F$ and $P$ (representing the integrand and probability distribution respectively), but the difference between these expressions is which of these terms depends on $\theta$.
This note derives the gradients for each of these objectives, with final forms shown below:

$$\nabla_\theta J_1(\theta) = \mathbb{E}_{x \sim P(x)}\left[\nabla_\theta F_\theta(x)\right]$$

$$\nabla_\theta J_2(\theta) = \mathbb{E}_{x \sim P_\theta(x)}\left[F(x)\, \nabla_\theta \log P_\theta(x)\right] \quad \text{(REINFORCE)}$$

$$\nabla_\theta J_3(\theta) = \mathbb{E}_{x \sim P_\theta(x)}\left[\nabla_\theta F_\theta(x)\right] + \mathbb{E}_{x \sim P_\theta(x)}\left[F_\theta(x)\, \nabla_\theta \log P_\theta(x)\right]$$
Motivation
- In $J_1$, only the argument to the expected value, $F_\theta(x)$, depends on $\theta$.
- Example: In supervised learning, $\theta$ represents the parameters to a model, $P(x)$ is a fixed dataset (and therefore does not change as we change our model parameters), and $F_\theta(x)$ is a loss function for a sample $x$ from the dataset.
- Calculating $\nabla_\theta J_1$ lets us use gradient descent to iteratively update the model weights to better fit the dataset.
- In $J_2$, only the probability distribution of the expected value depends on $\theta$.
- Example: In reinforcement learning, the value function for a policy $\pi_\theta$ in a given state is an expected value. The probability of a trajectory $\tau$ depends on the policy parameters $\theta$, and the integrand is the cumulative discounted reward of a trajectory and therefore does not depend on the policy parameters $\theta$.
- Calculating $\nabla_\theta J_2$ is called the policy gradient and lets us use gradient ascent to iteratively update the policy parameters to maximize cumulative discounted expected rewards.
- In $J_3$, both the probability and the integrand depend on $\theta$:
- Example: In maximum entropy reinforcement learning, in addition to maximizing returns, we also want to maximize the entropy of the policy, $\mathcal{H}(\pi_\theta)$. So now both the probability distribution and the integrand depend on the policy parameters $\theta$.
- Similar to the $J_2$ setting, calculating $\nabla_\theta J_3$ is the policy gradient for a "fancier" objective than the standard RL one, and lets us iteratively update the policy parameters $\theta$.
We will now address in turn how to calculate each of these gradients.
1) $J_1$: Integrand $F$ depends on parameters $\theta$
Numerical calculation
Because only the integrand depends on $\theta$, we can move the gradient inside the expectation:

$$\nabla_\theta J_1(\theta) = \nabla_\theta \mathbb{E}_{x \sim P(x)}\left[F_\theta(x)\right] = \mathbb{E}_{x \sim P(x)}\left[\nabla_\theta F_\theta(x)\right]$$

We can then approximate this gradient by using the law of large numbers: use the empirical average to estimate the expectation by taking an average of the gradients over the dataset:

$$\nabla_\theta J_1(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta F_\theta(x_i), \qquad x_i \sim P(x)$$
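As a quick sanity check of this estimator, here is a minimal numpy sketch under assumed toy choices (the distribution, loss, and all variable names below are hypothetical, not from the note): a fixed data distribution $P(x) = \mathcal{N}(3, 1)$ and integrand $F_\theta(x) = (\theta - x)^2$, whose true gradient is $2(\theta - \mathbb{E}[x])$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: data x ~ P(x) = N(3, 1) is fixed (does not depend on theta);
# the "loss" is F_theta(x) = (theta - x)^2, so the true gradient of
# E[F_theta(x)] w.r.t. theta is 2 * (theta - E[x]) = 2 * (theta - 3).
theta = 1.0
x = rng.normal(3.0, 1.0, size=100_000)  # samples from the fixed dataset

# Move the gradient inside the expectation and average per-sample gradients:
# grad_theta F_theta(x) = 2 * (theta - x)
grad_estimate = np.mean(2.0 * (theta - x))

true_grad = 2.0 * (theta - 3.0)
print(grad_estimate, true_grad)  # estimate should be close to -4.0
```

With 100k samples the Monte Carlo estimate lands very close to the analytic value, which is exactly the averaging step described above.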
2) $J_2$: Probability $P$ depends on parameters $\theta$
For $J_1$, we just calculated the gradient directly and ended up with an expression that was itself an expectation, which made estimating it straightforward using Monte Carlo estimates.
Let's try to take the gradient of $J_2$ in a way similar to how we did $J_1$:

$$\nabla_\theta J_2(\theta) = \nabla_\theta \int F(x)\, P_\theta(x)\, dx = \int F(x)\, \nabla_\theta P_\theta(x)\, dx$$

Because $\nabla_\theta P_\theta(x)$ isn't necessarily a probability distribution, we can't do the same step as we did in $J_1$ where we got an expectation immediately. If we can analytically calculate $\nabla_\theta P_\theta(x)$ and the integral (or summation), then in theory we can calculate $\nabla_\theta J_2$ directly.
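As a concrete illustration of the analytic route (a hypothetical toy example, not from the original note): let $P_\theta$ be a Bernoulli distribution with $P_\theta(x{=}1) = \theta$, and let $F$ take fixed values $F(1)$ and $F(0)$. Then the summation is small enough to evaluate directly:

$$J_2(\theta) = \theta\, F(1) + (1 - \theta)\, F(0), \qquad \nabla_\theta J_2(\theta) = F(1) - F(0)$$

When the integral or summation is not tractable like this, we need a different approach.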
But if we can't calculate $\nabla_\theta J_2$ analytically, then we have two options:
- The REINFORCE estimator uses the log-derivative trick on the derivation above to recover an expectation, which we can then estimate numerically using the law of large numbers, just like in $J_1$.
- We can ALWAYS use the REINFORCE estimator and get an unbiased estimate of the gradient $\nabla_\theta J_2$, but it can have high variance.
- The reparameterization trick rewrites $J_2$ into an equivalent expectation that only has the model parameters in the integrand (instead of in the probability distribution), which means we can just apply the same techniques we did for $J_1$. This is done by reparameterizing the probability distribution and the integrand. This isn't always possible (i.e., it requires $P_\theta$ to belong to a simple reparameterizable family, like Gaussians), but when possible, it can provide a low-variance estimator for the gradient.
REINFORCE Estimator
Let us continue the derivation for $\nabla_\theta J_2$ from above by using the log-derivative trick, $\nabla_\theta P_\theta(x) = P_\theta(x)\, \nabla_\theta \log P_\theta(x)$:

$$\nabla_\theta J_2(\theta) = \int F(x)\, \nabla_\theta P_\theta(x)\, dx = \int F(x)\, P_\theta(x)\, \nabla_\theta \log P_\theta(x)\, dx = \mathbb{E}_{x \sim P_\theta(x)}\left[F(x)\, \nabla_\theta \log P_\theta(x)\right]$$

By using the log-derivative trick, we got back a $P_\theta(x)$, which we could then use to rewrite the integral as an expectation. Notice that our new expectation still has the probability distribution depend on $\theta$, but that is fine because it is exactly equal to what we want: $\nabla_\theta J_2(\theta)$.
Numerical calculation
This expectation is different from $J_2$ itself, but we can estimate it numerically the same way via the law of large numbers:

$$\nabla_\theta J_2(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} F(x_i)\, \nabla_\theta \log P_\theta(x_i), \qquad x_i \sim P_\theta(x)$$
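A minimal numpy sketch of this estimator, under assumed toy choices (distribution and integrand below are hypothetical): $P_\theta(x) = \mathcal{N}(\theta, 1)$ and $F(x) = x^2$, so $J_2(\theta) = \theta^2 + 1$ and the true gradient is $2\theta$. For a unit-variance Gaussian, the score function is $\nabla_\theta \log P_\theta(x) = x - \theta$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: P_theta(x) = N(theta, 1) and F(x) = x^2, so
# J_2(theta) = E[x^2] = theta^2 + 1 and the true gradient is 2 * theta.
theta = 1.5
x = rng.normal(theta, 1.0, size=1_000_000)

# Score function of a unit-variance Gaussian: grad_theta log P_theta(x) = x - theta
score = x - theta

# REINFORCE estimate: average of F(x) * score over samples from P_theta
grad_estimate = np.mean(x**2 * score)

print(grad_estimate)  # should be close to 2 * theta = 3.0
```

Note how many samples it takes to get a tight estimate here; the per-sample quantity $F(x)\,(x - \theta)$ has a much larger variance than its mean, which is the high-variance issue mentioned above.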
Reparameterization trick
The reparameterization trick also rewrites $\nabla_\theta J_2$ as an expectation, but rather than continuing from our original derivation, it reparametrizes the probability distribution and the integrand: it makes a new probability distribution that doesn't depend on the parameters $\theta$, and introduces them into the integrand by using a deterministic transformation from the simple distribution to the original one.
Let's assume that the probability distribution $P_\theta(x)$ belongs to a simple parameterized family of probability distributions, like the family of Gaussians: $P_\theta(x) = \mathcal{N}(x; \mu, \sigma^2)$, meaning the parameters $\theta = (\mu, \sigma)$ are the mean and standard deviation of $x$, and $P_\theta$ is a Gaussian distribution.
The change of variables formula tells us that we can rewrite the probability density function using a different probability density function that doesn't depend on the parameters, $\mathcal{N}(\epsilon; 0, 1)$, where $\epsilon = \frac{x - \mu}{\sigma}$, as long as we account for the volume expansion of the pdf due to the transformation (i.e., the determinant of the Jacobian). For example, we know that (see Example: Gaussians for derivation):

$$\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sigma}\, \mathcal{N}(\epsilon; 0, 1)$$

where $\epsilon = \frac{x - \mu}{\sigma}$.
What's important about this second way of writing $P_\theta(x)$ is that the probability density function we are using no longer depends on $\theta$ (since the mean and std are 0 and 1 respectively). However, the value $\epsilon$ we evaluate at, and how we transform the density (by dividing by $\sigma$), now do depend on $\theta$, whereas we did not have that in the first way of writing it: we just plugged in $x$ and returned the pdf of a Gaussian parameterized by $\mu$ and $\sigma$.
Note that we can relate the infinitesimals $dx$ and $d\epsilon$ through the Jacobian:

$$dx = \sigma\, d\epsilon$$

Because of this relationship, we can now rewrite $J_2$ using the probability density function that doesn't depend on the parameters, which makes it easy to take the gradient:

$$J_2(\theta) = \int F(x)\, P_\theta(x)\, dx = \int F(\mu + \sigma\epsilon)\, \frac{1}{\sigma}\, \mathcal{N}(\epsilon; 0, 1)\, \sigma\, d\epsilon = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, 1)}\left[F(\mu + \sigma\epsilon)\right]$$
By reparameterizing the expectation to use the random variable $\epsilon$, we moved the parameters $\theta$ from the probability distribution into the integrand. We could use other simple distributions instead of a Gaussian, as long as we can sample from them and transform the likelihoods between the simple distribution and the original distribution (see Change of Variables).
Numerical calculation
We can now numerically estimate this gradient the same way we did for $J_1$: since only the integrand depends on $\theta$, we can move the gradient into the expectation and then estimate it numerically with the law of large numbers:

$$\nabla_\theta J_2(\theta) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, 1)}\left[\nabla_\theta F(\mu + \sigma\epsilon)\right] \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta F(\mu + \sigma\epsilon_i), \qquad \epsilon_i \sim \mathcal{N}(0, 1)$$
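Here is the same toy problem as in the REINFORCE sketch (again hypothetical: $P_\theta(x) = \mathcal{N}(\theta, 1)$, $F(x) = x^2$, true gradient $2\theta$), but estimated via reparameterization: sample $\epsilon \sim \mathcal{N}(0, 1)$, transform deterministically to $x = \theta + \epsilon$, and average $\nabla_\theta F(\theta + \epsilon) = 2(\theta + \epsilon)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same hypothetical setup as the REINFORCE sketch: P_theta(x) = N(theta, 1)
# and F(x) = x^2, so the true gradient of J_2 is 2 * theta.
theta = 1.5
eps = rng.normal(0.0, 1.0, size=1_000_000)  # eps ~ N(0, 1): no dependence on theta
x = theta + eps                             # deterministic transform of eps

# Gradient moved inside the expectation: d/dtheta F(theta + eps) = 2 * (theta + eps)
reparam_grads = 2.0 * x
grad_estimate = np.mean(reparam_grads)

# For comparison, the REINFORCE per-sample gradients on the same draws:
reinforce_grads = x**2 * (x - theta)

print(grad_estimate)  # close to 2 * theta = 3.0
print(np.var(reparam_grads) < np.var(reinforce_grads))  # reparam wins on variance here
```

On this toy problem the per-sample reparameterized gradients have a much smaller variance than the REINFORCE ones, which illustrates why the trick is preferred when the distribution family allows it.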
- todo: show the computational graph for the reparameterization trick, and how the deterministic transformation means the gradient does not have to backprop through stochastic sampling.
- todo: prove this is lower variance than REINFORCE. Discuss the importance of variance/bias trade-offs (potentially another blog post).
3) $J_3$: Probability $P$ and integrand $F$ both depend on parameters $\theta$
For $J_3$, let's follow a similar derivation for directly calculating the gradient, like we did for $J_1$ and in the REINFORCE estimator for $J_2$. Applying the product rule and then the log-derivative trick:

$$\nabla_\theta J_3(\theta) = \nabla_\theta \int F_\theta(x)\, P_\theta(x)\, dx = \int P_\theta(x)\, \nabla_\theta F_\theta(x)\, dx + \int F_\theta(x)\, \nabla_\theta P_\theta(x)\, dx$$

$$= \mathbb{E}_{x \sim P_\theta(x)}\left[\nabla_\theta F_\theta(x)\right] + \mathbb{E}_{x \sim P_\theta(x)}\left[F_\theta(x)\, \nabla_\theta \log P_\theta(x)\right]$$

As we can see, to calculate $\nabla_\theta J_3$, where both the probability and the integrand depend on the parameters $\theta$, we just need to sum the gradients for the two cases where only $F$ or only $P$ depends on $\theta$.
Numerical calculation
This means that when we want to numerically calculate $\nabla_\theta J_3$, we can calculate the left term using the approach we discussed for $J_1$, and for the right term, we can either use the REINFORCE estimator or the reparameterization trick:

$$\nabla_\theta J_3(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left[\nabla_\theta F_\theta(x_i) + F_\theta(x_i)\, \nabla_\theta \log P_\theta(x_i)\right], \qquad x_i \sim P_\theta(x)$$
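A minimal numpy sketch of the combined estimator, under assumed toy choices (everything below is hypothetical): $P_\theta(x) = \mathcal{N}(\theta, 1)$ and $F_\theta(x) = \theta x$, so $J_3(\theta) = \theta\, \mathbb{E}[x] = \theta^2$ and the true gradient is $2\theta$. Each of the two terms contributes exactly $\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: P_theta(x) = N(theta, 1) and F_theta(x) = theta * x,
# so J_3(theta) = theta * E[x] = theta^2 and the true gradient is 2 * theta.
theta = 2.0
x = rng.normal(theta, 1.0, size=1_000_000)

# Left term (J_1-style): grad_theta F_theta(x) = x, averaged over samples
term_F = np.mean(x)

# Right term (REINFORCE): F_theta(x) * grad_theta log P_theta(x)
#                       = theta * x * (x - theta)
term_P = np.mean(theta * x * (x - theta))

grad_estimate = term_F + term_P
print(grad_estimate)  # close to 2 * theta = 4.0
```

Both Monte Carlo terms use the same samples from $P_\theta$, so in practice the two contributions come out of a single batch.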
Acknowledgements
I’d like to thank Zoheb Anjum for providing useful feedback on the note (fixing math and typos).
References
- Gregory blog
- Shows the derivation for the derivative of an expected value with the parameter in the distribution by using the product rule, which is an awesome way to describe how you can just "calculate the gradient for a probability distribution."
- REINFORCE vs Reparameterization Trick