Q Learning#

What is a Q function?#

A Q function is a close relative of a value function. In RL, a value function \( v(s) \) tries to approximate the total future reward the agent will collect starting from a state \( s \). A Q function does essentially the same thing, but with a twist.

Remember that in RL, a state transition happens when an agent, standing in a state \( s^1 \), takes an action \( a \) and ends up in another state \( s^2 \). Instead of directly approximating the value of \( s^2 \), which is \( v(s^2) \), we can use a Q function that takes \( s^1 \) and \( a \) as parameters, and define \( Q(s^1, a) = v(s^2) \) given that \( (s^1, a) \rightarrow s^2 \) (this simplified definition ignores the immediate reward; it reappears in the update rule below).

So, compared to a value function \( v(s) \), which takes in a state \( s \) and outputs a single scalar, a Q function \( Q(s) \) takes in \( s \) and outputs a vector of estimated future rewards, one entry per possible action, corresponding to the states we would end up in after taking each of those actions.
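As a minimal sketch (the state names, action set, and numbers are all made up for illustration), a tabular Q function can be stored as one vector of action values per state, whereas a value function stores a single scalar per state:

```python
import numpy as np

# Hypothetical toy setup: 3 states, 2 actions ("left", "right").
ACTIONS = ["left", "right"]

# A value function stores one scalar per state.
v = {"s1": 0.7, "s2": 1.2, "s3": 0.0}

# A Q function stores one value per (state, action) pair,
# i.e. a vector of action values per state.
Q = {
    "s1": np.array([0.5, 1.2]),   # Q(s1, left), Q(s1, right)
    "s2": np.array([1.0, 0.3]),
    "s3": np.array([0.0, 0.0]),
}

print(v["s1"])                            # scalar value of s1
print(Q["s1"])                            # whole vector of action values for s1
print(ACTIONS[int(np.argmax(Q["s1"]))])   # best action in s1: "right"
```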

How are Q functions better?#

Q functions shine when the set of possible actions is fixed and known in advance, and when we can compute a batch of values faster than we can compute them one by one. Because we already know all the possible actions, a single evaluation of \( Q(s) \) gives us the values of every action at once, whereas a value function would have to be evaluated separately for each possible next state, which costs precious computation time.
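To make the computational point concrete, here is a sketch in which the linear "networks", feature sizes, and transitions are invented stand-ins: evaluating the Q function gives every action's value in one matrix multiplication, while the value-function route needs a separate evaluation for each candidate next state.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS = 4, 3

# Hypothetical linear "networks": the weights are random stand-ins.
W_q = rng.normal(size=(N_ACTIONS, STATE_DIM))   # Q network: state -> vector of action values
w_v = rng.normal(size=STATE_DIM)                # value network: state -> scalar

def q_values(s):
    # One pass gives the value of every action at once.
    return W_q @ s

def v_value(s):
    return w_v @ s

s = rng.normal(size=STATE_DIM)

# Q-function route: one evaluation, then pick the best action.
print(q_values(s))                               # all N_ACTIONS values in one call

# Value-function route: we need the next state for each action (made-up here),
# and one evaluation of v per candidate next state.
next_states = [s + rng.normal(size=STATE_DIM) for _ in range(N_ACTIONS)]
print([v_value(s2) for s2 in next_states])       # N_ACTIONS separate evaluations
```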

Q learning in simple terms.#

Warning

Incoming math!

In Q-learning, we often make use of a strategy called \( \epsilon \)-greedy. Given an \( \epsilon \) in the range \( [0,1] \), with probability \( \epsilon \) we select an action uniformly at random; otherwise we greedily choose the action with the highest Q-value among all possible actions.
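A minimal sketch of \( \epsilon \)-greedy action selection over a vector of Q-values (the function name and the numbers are mine, not from any particular library):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: uniform random action
    return int(np.argmax(q_values))               # exploit: best current estimate

rng = np.random.default_rng(0)
q_s = np.array([0.5, 1.2, -0.1])                  # hypothetical Q(s) vector
print(epsilon_greedy(q_s, epsilon=0.1, rng=rng))
```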

Remember that the Q function \( Q(s) \) outputs the estimated future reward for each action we can take in \( s \) (equivalently, for each state that action would lead to). If our Q function is accurate enough, \( \arg\max_a Q(s, a) \) is the best action among all possible actions.

Imagine that we are in a state \( s^1 \) and, after taking the action \( a = \arg\max_a Q(s^1, a) \), we arrive at the state \( s^2 \) and receive the reward \( R_a \). If our Q function is accurate, then we have

\[ Q(s^1, a) = R_a + \gamma \max_{a'} Q(s^2, a') \]

because of the recursive equation for the return in RL,

\[ G_t = R_a + \gamma G_{t+1} \]

where \( G_t \) is the total discounted reward collected from time \( t \) onward and \( R_a \) is the reward we just received.
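To spell out the substitution (assuming, as Q-learning does, that we act greedily from \( s^2 \) onward): \( Q(s^1, a) \) estimates \( G_t \), and \( \max_{a'} Q(s^2, a') \) estimates \( G_{t+1} \), so

\[ Q(s^1, a) \approx G_t = R_a + \gamma G_{t+1} \approx R_a + \gamma \max_{a'} Q(s^2, a') \]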

So the update rule is simple. We want to update

\[ Q(s^1, a) \]

to be as close to

\[ R_a + \gamma \max_{a'} Q(s^2, a') \]

as possible (but not vice versa! The reason is a little bit complicated; see Sutton and Barto's Reinforcement Learning: An Introduction).

Then we have the update rule:

\[ Q(s^1, a) \leftarrow Q(s^1, a) - \eta \, \nabla L\big(Q(s^1, a),\; R_a + \gamma \max_{a'} Q(s^2, a')\big) \]

where \( L \) is the loss function and \( \eta \) is the learning rate. The gradient is taken with respect to \( Q(s^1, a) \) only; the target \( R_a + \gamma \max_{a'} Q(s^2, a') \) is treated as a constant (this is the "not vice versa" point from above).
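Putting everything together, here is a minimal tabular sketch of the whole loop. The toy chain environment, its rewards, and every hyperparameter below are invented for illustration; with a squared-error loss \( L(q, t) = \tfrac{1}{2}(q - t)^2 \), the gradient update above reduces to the classic tabular rule \( Q(s^1, a) \leftarrow Q(s^1, a) + \eta \big( R_a + \gamma \max_{a'} Q(s^2, a') - Q(s^1, a) \big) \), which is what the code implements.

```python
import numpy as np

# Hypothetical toy environment: a chain of 5 states, actions 0 = left, 1 = right.
# Reaching the rightmost state gives reward +1 and ends the episode.
N_STATES, N_ACTIONS = 5, 2

def step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, reward, s2 == N_STATES - 1

rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, N_ACTIONS))      # tabular Q function: one row per state
eta, gamma, epsilon = 0.1, 0.9, 0.1      # made-up hyperparameters

for episode in range(500):
    s = 0
    for t in range(100):                 # cap episode length
        # epsilon-greedy action selection (ties broken at random)
        if rng.random() < epsilon:
            a = int(rng.integers(N_ACTIONS))
        else:
            a = int(rng.choice(np.flatnonzero(Q[s] == Q[s].max())))
        s2, r, done = step(s, a)

        # TD target R_a + gamma * max_a' Q(s^2, a'); treated as a constant.
        target = r + gamma * np.max(Q[s2]) * (not done)

        # With L(q, t) = 0.5 * (q - t)^2, the update
        # Q(s, a) <- Q(s, a) - eta * dL/dq becomes:
        Q[s, a] -= eta * (Q[s, a] - target)

        s = s2
        if done:
            break

print(np.round(Q, 2))   # learned action values for the toy chain
```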