Summary
We will consider taking the gradients of three different expressions w.r.t. the parameters $\theta$:

$$J_1(\theta) = \mathbb{E}_{x \sim P(x)}\left[F_\theta(x)\right], \qquad J_2(\theta) = \mathbb{E}_{x \sim P_\theta(x)}\left[F(x)\right], \qquad J_3(\theta) = \mathbb{E}_{x \sim P_\theta(x)}\left[F_\theta(x)\right]$$

All three of these expressions are expected values over the same two terms, $F$ and $P$ (representing the integrand and probability distribution respectively), but the difference between these expressions is which of these terms depends on $\theta$.
This note derives the gradients for each of these objectives, with final forms shown below:

$$\nabla_\theta J_1(\theta) = \mathbb{E}_{x \sim P(x)}\left[\nabla_\theta F_\theta(x)\right]$$

$$\nabla_\theta J_2(\theta) = \mathbb{E}_{x \sim P_\theta(x)}\left[F(x)\, \nabla_\theta \log P_\theta(x)\right] \quad \text{(REINFORCE)}$$

$$\nabla_\theta J_3(\theta) = \mathbb{E}_{x \sim P_\theta(x)}\left[\nabla_\theta F_\theta(x)\right] + \mathbb{E}_{x \sim P_\theta(x)}\left[F_\theta(x)\, \nabla_\theta \log P_\theta(x)\right]$$
Motivation
- In $J_1$, only the argument to the expected value, $F_\theta(x)$, depends on $\theta$.
- Example: In supervised learning, $\theta$ represents the parameters to a model, $P(x)$ is a fixed dataset (and therefore does not change as we change our model parameters), and $F_\theta(x)$ is a loss function for a sample $x$ from the dataset.
- Calculating $\nabla_\theta J_1$ lets us use gradient descent to iteratively update the model weights to better fit the dataset.
- In $J_2$, only the probability distribution of the expected value depends on $\theta$.
- Example: In reinforcement learning, the value function for a policy $\pi_\theta$ in a given state is an expected value. The probability of a trajectory $\tau$ depends on the policy parameters $\theta$, and the integrand is the cumulative discounted reward of a trajectory and therefore does not depend on the policy parameters $\theta$.
- Calculating $\nabla_\theta J_2$ is called the policy gradient and lets us use gradient ascent to iteratively update the policy parameters to maximize cumulative discounted expected rewards.
- In $J_3$, both the probability and the integrand depend on $\theta$:
- Example: In maximum entropy reinforcement learning, in addition to maximizing returns, we also want to maximize the entropy of the policy, $\mathcal{H}(\pi_\theta)$. So now both the probability distribution and the integrand depend on the policy parameters $\theta$.
- Similar to the $J_2$ setting, calculating $\nabla_\theta J_3$ is the policy gradient for a "fancier" objective than the standard RL one, and lets us iteratively update the policy parameters $\theta$.
We will now address in turn how to calculate each of these gradients.
1) $J_1$: Integrand $F$ depends on parameters $\theta$
Numerical calculation
Because only the integrand depends on $\theta$, we can move the gradient inside the expectation:

$$\nabla_\theta J_1(\theta) = \nabla_\theta \mathbb{E}_{x \sim P(x)}\left[F_\theta(x)\right] = \mathbb{E}_{x \sim P(x)}\left[\nabla_\theta F_\theta(x)\right]$$

We can then approximate this gradient by using the law of large numbers: use the empirical average to estimate the expectation by taking an average of the gradients over the dataset:

$$\nabla_\theta J_1(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta F_\theta(x_i), \qquad x_i \sim P(x)$$
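As a quick sanity check of this estimator, here is a minimal numpy sketch under assumed toy choices (the distribution, loss, and all variable names below are hypothetical, not from the note): a fixed data distribution $P(x) = \mathcal{N}(3, 1)$ and integrand $F_\theta(x) = (\theta - x)^2$, whose true gradient is $2(\theta - \mathbb{E}[x])$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: data x ~ P(x) = N(3, 1) is fixed (does not depend on theta);
# the "loss" is F_theta(x) = (theta - x)^2, so the true gradient of
# E[F_theta(x)] w.r.t. theta is 2 * (theta - E[x]) = 2 * (theta - 3).
theta = 1.0
x = rng.normal(3.0, 1.0, size=100_000)  # samples from the fixed dataset

# Move the gradient inside the expectation and average per-sample gradients:
# grad_theta F_theta(x) = 2 * (theta - x)
grad_estimate = np.mean(2.0 * (theta - x))

true_grad = 2.0 * (theta - 3.0)
print(grad_estimate, true_grad)  # estimate should be close to -4.0
```

With 100k samples the Monte Carlo estimate lands very close to the analytic value, which is exactly the averaging step described above.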
2) $J_2$: Probability $P$ depends on parameters $\theta$
For $J_1$, we just calculated the gradient directly and ended up with an expression that was itself an expectation, which made estimating it straightforward using Monte Carlo estimates.
Let's try to take the gradient of $J_2$ in a way similar to how we did $J_1$:

$$\nabla_\theta J_2(\theta) = \nabla_\theta \int F(x)\, P_\theta(x)\, dx = \int F(x)\, \nabla_\theta P_\theta(x)\, dx$$

Because $\nabla_\theta P_\theta(x)$ isn't necessarily a probability distribution, we can't do the same step as we did in $J_1$ where we got an expectation immediately. If we can analytically calculate $\nabla_\theta P_\theta(x)$ and the integral (or summation), then in theory we can calculate $\nabla_\theta J_2$ directly.
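As a concrete illustration of the analytic route (a hypothetical toy example, not from the original note): let $P_\theta$ be a Bernoulli distribution with $P_\theta(x{=}1) = \theta$, and let $F$ take fixed values $F(1)$ and $F(0)$. Then the summation is small enough to evaluate directly:

$$J_2(\theta) = \theta\, F(1) + (1 - \theta)\, F(0), \qquad \nabla_\theta J_2(\theta) = F(1) - F(0)$$

When the integral or summation is not tractable like this, we need a different approach.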
But if we can't calculate $\nabla_\theta J_2$ analytically, then we have two options:
- The REINFORCE estimator uses the log-derivative trick on the derivation above to recover an expectation, which we can then estimate numerically using the law of large numbers, just like in $J_1$.
- We can ALWAYS use the REINFORCE estimator and get an unbiased estimate of the gradient $\nabla_\theta J_2$, but it can have high variance.
- The reparameterization trick rewrites $J_2$ into an equivalent expectation that only has the model parameters in the integrand (instead of in the probability distribution), which means we can just apply the same techniques we did for $J_1$. This is done by reparameterizing the probability distribution and the integrand. This isn't always possible (i.e., it requires $P_\theta$ to belong to a simple reparameterizable family, like Gaussians), but when possible, it can provide a low-variance estimator for the gradient.
REINFORCE Estimator
Let us continue the derivation for $\nabla_\theta J_2$ from above by using the log-derivative trick, $\nabla_\theta P_\theta(x) = P_\theta(x)\, \nabla_\theta \log P_\theta(x)$:

$$\nabla_\theta J_2(\theta) = \int F(x)\, \nabla_\theta P_\theta(x)\, dx = \int F(x)\, P_\theta(x)\, \nabla_\theta \log P_\theta(x)\, dx = \mathbb{E}_{x \sim P_\theta(x)}\left[F(x)\, \nabla_\theta \log P_\theta(x)\right]$$

By using the log-derivative trick, we got back a $P_\theta(x)$, which we could then use to rewrite the integral as an expectation. Notice that our new expectation still has the probability distribution depend on $\theta$, but that is fine because it is exactly equal to what we want: $\nabla_\theta J_2(\theta)$.
Numerical calculation
This expectation is different from $J_2$ itself, but we can estimate it numerically the same way via the law of large numbers:

$$\nabla_\theta J_2(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} F(x_i)\, \nabla_\theta \log P_\theta(x_i), \qquad x_i \sim P_\theta(x)$$
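A minimal numpy sketch of this estimator, under assumed toy choices (distribution and integrand below are hypothetical): $P_\theta(x) = \mathcal{N}(\theta, 1)$ and $F(x) = x^2$, so $J_2(\theta) = \theta^2 + 1$ and the true gradient is $2\theta$. For a unit-variance Gaussian, the score function is $\nabla_\theta \log P_\theta(x) = x - \theta$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: P_theta(x) = N(theta, 1) and F(x) = x^2, so
# J_2(theta) = E[x^2] = theta^2 + 1 and the true gradient is 2 * theta.
theta = 1.5
x = rng.normal(theta, 1.0, size=1_000_000)

# Score function of a unit-variance Gaussian: grad_theta log P_theta(x) = x - theta
score = x - theta

# REINFORCE estimate: average of F(x) * score over samples from P_theta
grad_estimate = np.mean(x**2 * score)

print(grad_estimate)  # should be close to 2 * theta = 3.0
```

Note how many samples it takes to get a tight estimate here; the per-sample quantity $F(x)\,(x - \theta)$ has a much larger variance than its mean, which is the high-variance issue mentioned above.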
Reparameterization trick
The reparameterization trick also rewrites $\nabla_\theta J_2$ as an expectation, but rather than continuing from our original derivation, it reparametrizes the probability distribution and the integrand: it makes a new probability distribution that doesn't depend on the parameters $\theta$, and introduces them into the integrand by using a deterministic transformation from the simple distribution to the original one.
Let's assume that the probability distribution $P_\theta(x)$ belongs to a simple parameterized family of probability distributions, like the family of Gaussians: $P_\theta(x) = \mathcal{N}(x; \mu, \sigma^2)$, meaning the parameters $\theta = (\mu, \sigma)$ are the mean and standard deviation of $x$, and $P_\theta$ is a Gaussian distribution.
The change of variables formula tells us that we can rewrite the probability density function using a different probability density function that doesn't depend on the parameters, $\mathcal{N}(\epsilon; 0, 1)$, where $\epsilon = \frac{x - \mu}{\sigma}$, as long as we account for the volume expansion of the pdf due to the transformation (i.e., the determinant of the Jacobian). For example, we know that (see Example: Gaussians for derivation):

$$\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sigma}\, \mathcal{N}(\epsilon; 0, 1)$$

where $\epsilon = \frac{x - \mu}{\sigma}$.
What's important about this second way of writing $P_\theta(x)$ is that the probability density function we are using no longer depends on $\theta$ (since the mean and std are 0 and 1 respectively). However, the value $\epsilon$ we evaluate at, and how we transform the density (by dividing by $\sigma$), now do depend on $\theta$, whereas we did not have that in the first way of writing it: we just plugged in $x$ and returned the pdf of a Gaussian parameterized by $\mu$ and $\sigma$.
Note that we can relate the infinitesimals $dx$ and $d\epsilon$ through the Jacobian:

$$dx = \sigma\, d\epsilon$$

Because of this relationship, we can now rewrite $J_2$ using the probability density function that doesn't depend on the parameters, which makes it easy to take the gradient:

$$J_2(\theta) = \int F(x)\, P_\theta(x)\, dx = \int F(\mu + \sigma\epsilon)\, \frac{1}{\sigma}\, \mathcal{N}(\epsilon; 0, 1)\, \sigma\, d\epsilon = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, 1)}\left[F(\mu + \sigma\epsilon)\right]$$
By reparameterizing the expectation to use the random variable $\epsilon$, we moved the parameters $\theta$ from the probability distribution into the integrand. We could use other simple distributions instead of a Gaussian, as long as we can sample from them and transform the likelihoods between the simple distribution and the original distribution (see Change of Variables).
Numerical calculation
We can now numerically estimate this gradient the same way we did for $J_1$: since only the integrand depends on $\theta$, we can move the gradient into the expectation and then estimate it numerically with the law of large numbers:

$$\nabla_\theta J_2(\theta) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, 1)}\left[\nabla_\theta F(\mu + \sigma\epsilon)\right] \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta F(\mu + \sigma\epsilon_i), \qquad \epsilon_i \sim \mathcal{N}(0, 1)$$
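Here is the same toy problem as in the REINFORCE sketch (again hypothetical: $P_\theta(x) = \mathcal{N}(\theta, 1)$, $F(x) = x^2$, true gradient $2\theta$), but estimated via reparameterization: sample $\epsilon \sim \mathcal{N}(0, 1)$, transform deterministically to $x = \theta + \epsilon$, and average $\nabla_\theta F(\theta + \epsilon) = 2(\theta + \epsilon)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same hypothetical setup as the REINFORCE sketch: P_theta(x) = N(theta, 1)
# and F(x) = x^2, so the true gradient of J_2 is 2 * theta.
theta = 1.5
eps = rng.normal(0.0, 1.0, size=1_000_000)  # eps ~ N(0, 1): no dependence on theta
x = theta + eps                             # deterministic transform of eps

# Gradient moved inside the expectation: d/dtheta F(theta + eps) = 2 * (theta + eps)
reparam_grads = 2.0 * x
grad_estimate = np.mean(reparam_grads)

# For comparison, the REINFORCE per-sample gradients on the same draws:
reinforce_grads = x**2 * (x - theta)

print(grad_estimate)  # close to 2 * theta = 3.0
print(np.var(reparam_grads) < np.var(reinforce_grads))  # reparam wins on variance here
```

On this toy problem the per-sample reparameterized gradients have a much smaller variance than the REINFORCE ones, which illustrates why the trick is preferred when the distribution family allows it.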
- todo: show the computational graph for the reparameterization trick, and how the deterministic transformation means the gradient does not have to backprop through stochastic sampling.
- todo: prove this is lower variance than REINFORCE. Discuss the importance of variance/bias trade-offs (potentially another blog post).
3) $J_3$: Probability $P$ and integrand $F$ both depend on parameters $\theta$
For $J_3$, let's follow a similar derivation for directly calculating the gradient, like we did for $J_1$ and in the REINFORCE estimator for $J_2$. Applying the product rule and then the log-derivative trick:

$$\nabla_\theta J_3(\theta) = \nabla_\theta \int F_\theta(x)\, P_\theta(x)\, dx = \int P_\theta(x)\, \nabla_\theta F_\theta(x)\, dx + \int F_\theta(x)\, \nabla_\theta P_\theta(x)\, dx$$

$$= \mathbb{E}_{x \sim P_\theta(x)}\left[\nabla_\theta F_\theta(x)\right] + \mathbb{E}_{x \sim P_\theta(x)}\left[F_\theta(x)\, \nabla_\theta \log P_\theta(x)\right]$$

As we can see, to calculate $\nabla_\theta J_3$, where both the probability and the integrand depend on the parameters $\theta$, we just need to sum the gradients for the two cases where only $F$ or only $P$ depends on $\theta$.
Numerical calculation
This means that when we want to numerically calculate $\nabla_\theta J_3$, we can calculate the left term using the approach we discussed for $J_1$, and for the right term, we can either use the REINFORCE estimator or the reparameterization trick:

$$\nabla_\theta J_3(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left[\nabla_\theta F_\theta(x_i) + F_\theta(x_i)\, \nabla_\theta \log P_\theta(x_i)\right], \qquad x_i \sim P_\theta(x)$$
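A minimal numpy sketch of the combined estimator, under assumed toy choices (everything below is hypothetical): $P_\theta(x) = \mathcal{N}(\theta, 1)$ and $F_\theta(x) = \theta x$, so $J_3(\theta) = \theta\, \mathbb{E}[x] = \theta^2$ and the true gradient is $2\theta$. Each of the two terms contributes exactly $\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: P_theta(x) = N(theta, 1) and F_theta(x) = theta * x,
# so J_3(theta) = theta * E[x] = theta^2 and the true gradient is 2 * theta.
theta = 2.0
x = rng.normal(theta, 1.0, size=1_000_000)

# Left term (J_1-style): grad_theta F_theta(x) = x, averaged over samples
term_F = np.mean(x)

# Right term (REINFORCE): F_theta(x) * grad_theta log P_theta(x)
#                       = theta * x * (x - theta)
term_P = np.mean(theta * x * (x - theta))

grad_estimate = term_F + term_P
print(grad_estimate)  # close to 2 * theta = 4.0
```

Both Monte Carlo terms use the same samples from $P_\theta$, so in practice the two contributions come out of a single batch.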
Acknowledgements
I’d like to thank Zoheb Anjum for providing useful feedback on the note (fixing math and typos).
References
- Gregory blog
- Shows the derivation for the derivative of an expected value with the parameter in the distribution by using the product rule, which is an awesome way to describe how you can just "calculate the gradient for a probability distribution."
- REINFORCE vs Reparameterization Trick