Q Learning#
What is a Q function?#
A Q function is a more elaborate version of a value function. In RL, a value function \( v(s) \) tries to approximate the total future reward an agent will collect starting from a state \( s \). A Q function does essentially the same thing, but with a twist.
Remember that in RL, a state transition happens when an agent, standing in a state \( s^1 \), takes an action \( a \) and ends up in another state \( s^2 \)? Instead of directly approximating the value of \( s^2 \), which is \( v(s^2) \), we could use a Q function that takes \( s^1 \) and \( a \) as parameters, and define \( Q(s^1, a) = v(s^2) \) given that \( (s^1, a) \rightarrow s^2 \) (ignoring the immediate reward and discounting for the moment).
So, compared to a value function \( v(s) \), which takes a state \( s \) as input and outputs a single scalar, a Q function that takes \( s \) as input outputs a vector of estimated rewards, one entry for each possible action, corresponding to the states we would end up in after taking each of those actions.
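To make the interface difference concrete, here is a minimal sketch with a made-up three-state, three-action environment (the tables, state names, and the helpers `value_fn` / `q_fn` are all hypothetical, purely for illustration):

```python
import numpy as np

# A value function maps a state to ONE scalar: the estimated future reward
# collected from that state onwards.
value_table = {"s1": 0.5, "s2": 1.2, "s3": -0.3}

def value_fn(state):
    return value_table[state]          # a single number

# A Q function maps a state to a VECTOR with one entry per action: the
# estimated future reward for taking each action from that state.
q_table = {
    "s1": np.array([0.4, 1.2, 0.1]),   # actions: left, right, stay
    "s2": np.array([0.0, 0.9, 1.5]),
    "s3": np.array([-0.2, 0.3, 0.0]),
}

def q_fn(state):
    return q_table[state]              # one value per possible action

print(value_fn("s1"))                  # 0.5
print(q_fn("s1"))                      # [0.4 1.2 0.1]
print(int(q_fn("s1").argmax()))        # 1, the index of the best action
```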
How are Q functions better?#
Q functions shine when the set of possible actions is fixed and known in advance, and when we can generate a batch of values faster than we can generate them one by one. Because we already know all the possible actions, a single evaluation of the Q function produces the whole vector of action values at once, whereas a value function has to be evaluated on each candidate next state one by one, which costs precious computation time.
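To sketch why this matters in the function-approximation setting, the snippet below contrasts the two interfaces as small neural networks (the layer sizes and the names `ValueNet` / `QNet` are arbitrary assumptions, not a specific published architecture):

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 4, 3            # assumed sizes, just for illustration

class ValueNet(nn.Module):
    """v(s): one scalar per state."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, state):
        return self.net(state)         # shape (batch, 1)

class QNet(nn.Module):
    """Q(s): one value per action, all from a single forward pass."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

    def forward(self, state):
        return self.net(state)         # shape (batch, N_ACTIONS)

state = torch.randn(1, STATE_DIM)

# To rank actions with ValueNet we would have to simulate each of the
# N_ACTIONS next states and run N_ACTIONS separate forward passes.
# With QNet, one forward pass already scores every action:
q_values = QNet()(state)               # shape (1, N_ACTIONS)
best_action = q_values.argmax(dim=1)
```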
Q-learning in simple terms#
Warning
Incoming math!
In Q-learning, we often make use of an algorithm called \( \epsilon \)-greedy. Given an \( \epsilon \) in the range \( [0,1] \), with probability \( \epsilon \) we select an action at random; otherwise we greedily choose the action with the highest estimated value among all possible actions.
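A minimal sketch of \( \epsilon \)-greedy selection, assuming `q_values` is the vector our Q function outputs for the current state:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=np.random.default_rng()):
    """With probability epsilon pick a random action; otherwise pick the
    action with the highest estimated value."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit

action = epsilon_greedy(np.array([0.4, 1.2, 0.1]), epsilon=0.1)
```

In practice, \( \epsilon \) is usually kept large early in training so the agent explores, and decayed toward a small value as the Q estimates improve.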
Remember that the Q function \( Q(s) \) outputs the estimated future reward for each action we can take from \( s \)? Supposing our Q function is accurate enough, \( \arg\max_a Q(s, a) \) is the best action among all possible actions.
Imagine that we are in a state \( s^1 \). After taking the action \( a^1 = \arg\max_a Q(s^1, a) \), we receive a reward \( r^1 \) and arrive at the state \( s^2 \). If our Q function is accurate, then we have

\[ Q(s^1, a^1) = r^1 + \gamma \max_a Q(s^2, a) \]

because of the Bellman equation

\[ Q(s, a) = r + \gamma \max_{a'} Q(s', a'), \]

where \( \gamma \in [0, 1] \) is the discount factor. So the update rule is easy. We want to update

\[ Q(s^1, a^1) \]

to be as close to

\[ r^1 + \gamma \max_a Q(s^2, a) \]

as possible (but not vice versa! The reason is a little bit complicated; see Sutton and Barto's Reinforcement Learning: An Introduction).

Then we have the update rule:

\[ \theta \leftarrow \theta - \alpha \, \nabla_\theta \, L\!\left( Q_\theta(s^1, a^1),\; r^1 + \gamma \max_a Q_\theta(s^2, a) \right), \]

where \( L \) is the loss function, \( \theta \) are the parameters of the Q function, and \( \alpha \) is the learning rate.
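Putting the pieces together, here is a sketch of one such update with a small neural Q function (the network shape, the discount factor `GAMMA`, the mean-squared-error loss, and the Adam optimizer are common but arbitrary choices, not something this post prescribes):

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 4, 3, 0.99   # assumed sizes and discount factor

# A small Q network: state in, one value per action out.
q_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                      # the loss function L

def q_learning_step(s1, a1, r1, s2):
    """One update: pull Q(s1, a1) toward r1 + GAMMA * max_a Q(s2, a)."""
    prediction = q_net(s1)[0, a1]           # Q(s1, a1)
    with torch.no_grad():                   # the "not vice versa" part: no
        target = r1 + GAMMA * q_net(s2).max()   # gradient flows through the
                                                # target side of the loss
    loss = loss_fn(prediction, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example transition with random data, just to show the call signature:
s1, s2 = torch.randn(1, STATE_DIM), torch.randn(1, STATE_DIM)
q_learning_step(s1, a1=1, r1=0.5, s2=s2)
```

Note how the target is computed under `torch.no_grad()`: that is exactly the "update one side toward the other, but not vice versa" asymmetry mentioned above.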