Policy Gradient
Where can policy gradients be applied?
Policy gradients are used specifically to update policy networks, i.e. networks that output a probability for each action.
Warning
Incoming math!
The policy gradient in simple terms
First, we have a policy network \( \pi \), the current state \( s \), the reward for each action \( r_i \), and the probability of each action \( \pi(s)_i \).
The expected future reward (value function) of this state is thus

\[
G = \sum_i \pi(s)_i \, r_i
\]
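As a concrete illustration, here is a minimal Python sketch of this sum for a discrete action space (the probabilities and rewards are made-up values for this example):

```python
import numpy as np

# Hypothetical 3-action example: action probabilities produced by the
# policy network for state s, and the reward the environment gives per action.
pi_s = np.array([0.2, 0.5, 0.3])   # pi(s)_i, sums to 1
r    = np.array([1.0, -0.5, 2.0])  # r_i, fixed by the environment

# Expected future reward of this state: G = sum_i pi(s)_i * r_i
G = np.dot(pi_s, r)
print(G)  # 0.2*1.0 + 0.5*(-0.5) + 0.3*2.0 = 0.55
```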
To maximize the value function of this state, we wish to find the gradient of the expected future reward \( \nabla G \), which is

\[
\nabla G = \nabla \sum_i \pi(s)_i \, r_i
\]
Because the rewards are scalars determined by the environment, they can be pulled out of the gradient:

\[
\nabla G = \sum_i r_i \, \nabla \pi(s)_i
\]
Since \( \pi(s)_i \) is the probability of each action (which is usually non-zero), we can multiply and divide each term by it:

\[
\nabla G = \sum_i \pi(s)_i \, r_i \, \frac{\nabla \pi(s)_i}{\pi(s)_i}
\]
We notice that this looks like the formula for an expectation! So the equation reduces to

\[
\nabla G = \mathbb{E}_{i \sim \pi(s)} \left[ r_i \, \frac{\nabla \pi(s)_i}{\pi(s)_i} \right]
\]
which, since \( \nabla \log \pi(s)_i = \frac{\nabla \pi(s)_i}{\pi(s)_i} \), is equivalent to

\[
\nabla G = \mathbb{E}_{i \sim \pi(s)} \left[ r_i \, \nabla \log \pi(s)_i \right]
\]
Voila! This is the reason you see log-probability quite often in policy gradients.
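In practice, automatic differentiation gives you this gradient for free: you minimize a surrogate loss of the form \( -r_i \log \pi(s)_i \) for sampled actions, and backpropagation produces exactly the log-probability gradient above. Below is a minimal PyTorch sketch of one such update; the network size, state, and reward value are invented for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical tiny policy network: 4-dim state -> probabilities over 3 actions.
policy = nn.Sequential(nn.Linear(4, 3), nn.Softmax(dim=-1))
optimizer = torch.optim.SGD(policy.parameters(), lr=0.01)

state = torch.randn(4)             # made-up state for this example
probs = policy(state)              # pi(s)_i for each action
dist = torch.distributions.Categorical(probs)

action = dist.sample()             # i ~ pi(s)
reward = 1.0                       # r_i, would normally come from the environment

# Surrogate loss: minimizing -r_i * log pi(s)_i follows the policy gradient
# E[r_i * grad log pi(s)_i] derived above.
loss = -reward * dist.log_prob(action)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Sampling the action from \( \pi(s) \) is what turns the expectation into a Monte Carlo estimate: averaging this update over many sampled actions approximates \( \nabla G \).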