Policy Gradient#

Where can policy gradients be applied?#

Policy gradients are specifically used to update policy networks, i.e. networks that output a probability for each action.
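As a concrete picture, here is a minimal sketch of such a network in PyTorch (the state size, hidden width, and number of actions below are made-up placeholders):

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps a state vector to a probability for each action."""
    def __init__(self, state_dim=4, n_actions=2):   # sizes are illustrative only
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 32),
            nn.ReLU(),
            nn.Linear(32, n_actions),
            nn.Softmax(dim=-1),   # probabilities over actions, summing to 1
        )

    def forward(self, state):
        return self.net(state)

pi = PolicyNetwork()
state = torch.randn(4)    # a dummy state
print(pi(state))          # e.g. tensor([0.48, 0.52], grad_fn=...)
```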

Warning

Incoming math!

The policy gradient in simple terms#

First, we have a policy network \( \pi \), the current state \( s \), the reward for each action \( r_i \), and the probability of each action \( \pi(s)_i \).

The expected future reward (value function) of this state is thus

\[ G = \sum_i \pi(s)_i r_i \]
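For example, with three actions and made-up values for \( \pi(s)_i \) and \( r_i \), \( G \) is just the probability-weighted sum of the rewards:

```python
import torch

probs = torch.tensor([0.2, 0.5, 0.3])     # pi(s)_i for three actions (made up)
rewards = torch.tensor([1.0, -1.0, 2.0])  # r_i for each action (made up)

G = (probs * rewards).sum()  # G = sum_i pi(s)_i * r_i
print(G)                     # tensor(0.3000): 0.2*1.0 + 0.5*(-1.0) + 0.3*2.0
```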

To maximize the value function of this state, we want the gradient of the expected future reward with respect to the policy network's parameters, \( \nabla G \), which is

\[ \nabla \sum_i \pi(s)_i r_i \]

Because the rewards are scalars fixed by the environment and do not depend on the policy network's parameters, they can be pulled out of the gradient:

\[ \nabla G = \sum_i r_i \nabla \pi(s)_i \]
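We can sanity-check this step with autograd on toy numbers (the logits below stand in for the policy network's parameters): differentiating \( G \) directly and summing \( r_i \nabla \pi(s)_i \) give the same gradient.

```python
import torch

logits = torch.tensor([0.1, -0.4, 0.7], requires_grad=True)   # toy policy parameters
rewards = torch.tensor([1.0, -1.0, 2.0])                       # fixed by the environment

# Left-hand side: gradient of G = sum_i pi(s)_i * r_i, taken directly.
probs = torch.softmax(logits, dim=-1)
grad_G = torch.autograd.grad((probs * rewards).sum(), logits)[0]

# Right-hand side: sum_i r_i * grad pi(s)_i, with the rewards pulled out.
grad_sum = torch.zeros_like(grad_G)
for i in range(len(rewards)):
    probs = torch.softmax(logits, dim=-1)   # fresh graph for each grad call
    grad_sum += rewards[i] * torch.autograd.grad(probs[i], logits)[0]

print(torch.allclose(grad_G, grad_sum))     # True
```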

Since \( \pi(s)_i \) is the probability of action \( i \) (which is usually non-zero), we can multiply and divide by it in the previous equation:

\[ \nabla G = \sum_i r_i \frac{ \nabla \pi(s)_i }{ \pi(s)_i } \pi(s)_i \]

We notice that this now has the form of an expectation over actions sampled from \( \pi(s) \)! So the equation reduces to:

\[ \nabla G = E_{i \sim \pi(s)}\left[ r_i \frac{ \nabla \pi(s)_i }{ \pi(s)_i } \right] \]
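The payoff of the expectation form is that \( \nabla G \) can be estimated by sampling actions from \( \pi(s) \) instead of summing over every action. A rough sanity check with the same toy numbers as before, comparing the sampled estimate to the exact gradient:

```python
import torch

torch.manual_seed(0)
logits = torch.tensor([0.1, -0.4, 0.7], requires_grad=True)   # toy policy parameters
rewards = torch.tensor([1.0, -1.0, 2.0])                       # made-up rewards

# Exact gradient: sum_i r_i * grad pi(s)_i.
probs = torch.softmax(logits, dim=-1)
exact = torch.autograd.grad((probs * rewards).sum(), logits)[0]

# Pre-compute grad pi(s)_i for every action i.
grads = []
for i in range(len(rewards)):
    probs = torch.softmax(logits, dim=-1)   # fresh graph for each grad call
    grads.append(torch.autograd.grad(probs[i], logits)[0])

# Monte Carlo estimate: average r_i * grad pi(s)_i / pi(s)_i over actions i ~ pi(s).
probs = torch.softmax(logits, dim=-1).detach()
dist = torch.distributions.Categorical(probs=probs)
estimate = torch.zeros_like(exact)
n_samples = 20_000
for _ in range(n_samples):
    i = dist.sample()
    estimate += rewards[i] * grads[i] / probs[i]
estimate /= n_samples

print(exact)      # the two should agree up to sampling noise
print(estimate)
```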

Using the identity \( \nabla \log \pi(s)_i = \frac{ \nabla \pi(s)_i }{ \pi(s)_i } \) (the chain rule applied to the logarithm), this is equivalent to

\[ \nabla G = E_{i \sim \pi(s)}\left[ r_i \nabla \log \pi(s)_i \right] \]
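A quick autograd check of that identity on toy logits:

```python
import torch

logits = torch.tensor([0.1, -0.4, 0.7], requires_grad=True)   # toy policy parameters
i = 2   # any action index

# grad log pi(s)_i ...
probs = torch.softmax(logits, dim=-1)
grad_log = torch.autograd.grad(torch.log(probs[i]), logits)[0]

# ... equals grad pi(s)_i / pi(s)_i.
probs = torch.softmax(logits, dim=-1)
grad_ratio = torch.autograd.grad(probs[i], logits)[0] / probs[i].detach()

print(torch.allclose(grad_log, grad_ratio))   # True
```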

Voila! This is the reason you see log-probability quite often in policy gradients.
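And this is how it shows up in code: a single policy-gradient update typically samples an action, multiplies its log-probability by the reward, and lets autograd produce \( r_i \nabla \log \pi(s)_i \). A minimal sketch, assuming PyTorch, with stand-in values for the state and reward:

```python
import torch
import torch.nn as nn

# Toy policy: 4-dim state -> probabilities over 2 actions (sizes are placeholders).
policy = nn.Sequential(
    nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2), nn.Softmax(dim=-1)
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

state = torch.randn(4)          # stand-in for an environment state
dist = torch.distributions.Categorical(probs=policy(state))
action = dist.sample()          # i ~ pi(s)
reward = torch.tensor(1.0)      # stand-in for r_i from the environment

# Minimizing -r_i * log pi(s)_i follows the gradient E[ r_i * grad log pi(s)_i ].
loss = -reward * dist.log_prob(action)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```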