Actor Critic#

What is actor critic?#

Actor critic sounds cool, but it’s nothing special. Remember that policy networks output probabilities? Those probabilities are generated by taking a softmax (see the previous chapters about softmax) over a vector of scalars, usually called logits. Here’s an interesting fact about logits: because the softmax is applied to the logits, adding the same scalar to all of them doesn’t change the softmax output!

Warning

Incoming math!

Because the definition of softmax is

\[ \frac{e^{x_i}}{\sum_j e^{x_j}} \]

you’ll notice that adding a scalar \( s \) to every \( x_i \) gives

\[ \frac{e^{x_i + s}}{\sum_j e^{x_j + s}} \]

which can be written as

\[ \frac{e^{x_i} e^s}{\sum_j e^{x_j} e^s} \]

which reduces to the original definition of softmax because the \( e^s \) factors cancel:

\[ \frac{e^{x_i}}{\sum_j e^{x_j}} \]
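If you’d rather convince yourself numerically, here’s a tiny sketch in numpy (the logit values and the shift of 5.0 are arbitrary, chosen just for illustration):

```python
import numpy as np

def softmax(x):
    # Subtracting the max is the usual numerical-stability trick;
    # by the same shift-invariance argument it doesn't change the result.
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

logits = np.array([1.0, 2.0, 3.0])
shifted = logits + 5.0  # add the same scalar s to every logit

print(softmax(logits))   # e.g. [0.090 0.245 0.665]
print(softmax(shifted))  # identical output
assert np.allclose(softmax(logits), softmax(shifted))
```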

What actor critic does is essentially this: it uses a policy network (the actor) to generate logits, and uses a value network (the critic) to generate a scalar estimate of how good the current state is. It then subtracts that scalar from the observed return and calls the result the advantage. The scalar acts as a baseline, much like the shift \( s \) in the softmax example above: it doesn’t change what the policy is pushed toward in expectation (just as shifting the logits doesn’t change the softmax output), but it reduces the variance of the gradient estimates and makes training more robust.
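To make that concrete, here’s a minimal sketch of an advantage-weighted loss, assuming a PyTorch-style setup; `policy_net`, `value_net`, and the argument names are hypothetical placeholders, not from any particular library:

```python
import torch
import torch.nn.functional as F

def actor_critic_loss(policy_net, value_net, states, actions, returns):
    """One way to form an advantage-weighted actor critic loss (a sketch).

    states:  (B, obs_dim) batch of observations
    actions: (B,) integer actions that were taken
    returns: (B,) observed rewards-to-go for those actions
    """
    logits = policy_net(states)             # actor: one logit per action
    values = value_net(states).squeeze(-1)  # critic: one scalar per state

    # Advantage = observed return minus the critic's scalar baseline.
    # Subtracting the baseline doesn't change the expected gradient,
    # it only reduces its variance (detach so the policy loss doesn't
    # push gradients into the critic).
    advantages = returns - values.detach()

    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    policy_loss = -(advantages * chosen).mean()  # actor objective
    value_loss = F.mse_loss(values, returns)     # critic regression
    return policy_loss + value_loss
```

Note that the critic only ever produces that one scalar per state; everything the policy actually outputs still comes from the softmax over the actor’s logits.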

Proximal Policy Optimization#

PPO is one of the most famous actor critic methods, developed by OpenAI. PPO is an on-policy method that reuses each collected batch for several gradient updates. It follows the principle that the data-collecting agent (which stays frozen while the batch is used) should not be too different from the agent being trained (updated). The PPO algorithm optimizes the model in the following steps:

  1. Collect a batch of trajectories (data).

  2. Re-weight each reward by the probability ratio between the trained agent and the data-collecting agent (because of how policy based methods take the expectation of future reward, see the policy section).

  3. Clip the re-scaling ratio, because the new model shouldn’t be too different from the old model (which would increase variance); see the sketch after this list.

  4. Apply the usual policy gradient update.

  5. Repeat.
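Here is a minimal sketch of the clipped loss used in steps 2 and 3, again assuming a PyTorch-style setup; the function and argument names are made up for illustration, and a clip range of 0.2 is a common default:

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective, covering steps 2 and 3 above (a sketch).

    new_log_probs: log pi_new(a|s) for the actions in the batch
    old_log_probs: log pi_old(a|s) recorded when the data was collected
    advantages:    advantage estimates for those actions
    clip_eps:      how far the ratio may move away from 1
    """
    # Step 2: re-weight by the probability ratio pi_new / pi_old.
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Step 3: clip the ratio so the new policy can't stray too far from
    # the data-collecting policy, then take the more pessimistic term.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```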