Reinforcement Learning
Imagine you want to go river fishing and you are given a fishing rod. You have no fish and no fishing skills, and you can't afford to learn from an expert right now. All you can do is cast your rod and wait for the results. After some waiting you might think of moving to a spot along the river where the water is deeper, so there are more fish and you have a better chance of catching one. By catching your first fish there, you find out that moving to such spots isn't a bad idea. So by waiting a long time in bad fishing spots (a small negative reward at each time step), you learn to stop fishing in those areas, and by catching fish in good spots (usually a terminal state with a high reward), you try to figure out what makes that place good (in our example, the depth of the river).
In the previous doc we saw that in an MDP we want to find the optimal actions, i.e. the actions that maximize the expected sum of rewards over time. The main difference between an MDP and RL is that in RL we are unaware of R(s,a,s') and T(s,a,s'), so we must actually take an action to receive its reward.
We can think of solving an MDP as an offline method, because we already know T and R; in RL we have to actually perform actions in the environment, so it is an online method.
This type of learning (model-based learning) assumes an approximate model of our problem and tries to solve it in two iterative steps:
1) Get samples from the environment. For each starting state s and action a, count the outcome states s'. Using maximum likelihood (normalizing these counts) we estimate T(s,a,s'). We also record R(s,a,s') for each sample.
2) Solve the learned MDP. In this step we treat the estimated T(s,a,s') as fixed and use the same approach we used to solve MDP problems, e.g. value iteration.
We repeat these steps for the example above. At each step we take some samples from the environment and fit them to our model, which in this case is an expectation over all possible transitions between states s and s', to find the best-fitting T(s,a,s') for this step. In the example above, for each state s we set T(s,a,s') = count(s,a,s') / count(s,a); for instance, for s = C after episode 4 we have T(C,east,D) = 3/4. Note that each sample also gives us the resulting s' and the corresponding reward R(s,a,s').
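As a minimal sketch of this counting step (not the author's code), the snippet below estimates T and R from a made-up list of observed transitions (s, a, s', r), chosen so that the counts reproduce the 3/4 figure quoted above:

```python
from collections import defaultdict

# Hypothetical observed transitions (s, a, s', r); chosen so the counts
# reproduce T(C, east, D) = 3/4 from the example above.
transitions = [
    ("C", "east", "D", -1),
    ("C", "east", "D", -1),
    ("C", "east", "D", -1),
    ("C", "east", "A", -1),
]

counts = defaultdict(int)   # counts[(s, a, s')]
totals = defaultdict(int)   # totals[(s, a)]
rewards = {}                # last observed reward for each (s, a, s')

for s, a, s_next, r in transitions:
    counts[(s, a, s_next)] += 1
    totals[(s, a)] += 1
    rewards[(s, a, s_next)] = r

# Maximum-likelihood estimate: T(s, a, s') = count(s, a, s') / count(s, a)
T = {key: n / totals[key[:2]] for key, n in counts.items()}

print(T[("C", "east", "D")])   # 0.75, i.e. 3/4
```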
In this approach we rely only on counting similar sample outcomes to estimate the T function. For example, suppose we want to find the average age of a group of people: we can either estimate the age distribution from the samples and take its expectation (model-based), or simply average the sampled ages directly (model-free).
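As a sketch of this distinction with made-up sample ages, both routes below arrive at the same estimate of the expected age:

```python
from collections import Counter

ages = [20, 22, 22, 25, 20, 30]   # hypothetical sampled ages

# Model-based: estimate the distribution P(a) first, then take its expectation.
p_hat = {a: c / len(ages) for a, c in Counter(ages).items()}
avg_model_based = sum(a * p for a, p in p_hat.items())

# Model-free: average the samples directly, without building any model.
avg_model_free = sum(ages) / len(ages)

print(avg_model_based, avg_model_free)   # both are ~23.17
```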
In model-free learning we take an iterative approach to finding the optimal policy. We assume we are given an initial policy $\pi_0$ to evaluate, and then there are two approaches:
The first approach is direct evaluation: we run different episodes from different starting points and follow the agent along its path to the terminal state (in this case D), then evaluate the $V^\pi$ function. Suppose we want to calculate $V^\pi(B)$. Tracking all the episodes that start from B and end in the terminal state, we average the returns observed over those episodes (in this case episodes 1 and 2): $V^\pi(B) = \big( (-1-1+10) + (-1-1+10) \big) / 2 = 8$. Similarly: $V^\pi(C) = \big( 3 \cdot (-1+10) + (-1-10) \big) / 4 = 4$.
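A minimal sketch of direct evaluation: for every state, average the returns observed after visiting it. The episodes below are hypothetical, written so that they reproduce the numbers above (with $\gamma = 1$):

```python
from collections import defaultdict

# Each episode is a list of (state, reward received when leaving it) pairs, gamma = 1.
episodes = [
    [("B", -1), ("C", -1), ("D", 10)],
    [("B", -1), ("C", -1), ("D", 10)],
    [("E", -1), ("C", -1), ("D", 10)],
    [("E", -1), ("C", -1), ("A", -10)],
]

returns = defaultdict(list)
for episode in episodes:
    rewards = [r for _, r in episode]
    for i, (s, _) in enumerate(episode):
        # Return from state s = sum of rewards from that point to the end of the episode.
        returns[s].append(sum(rewards[i:]))

V = {s: sum(g) / len(g) for s, g in returns.items()}
print(V["B"], V["C"])   # 8.0 and 4.0, matching the calculation above
```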
The problems with direct evaluation are:
In the above example, due to an insufficient number of episodes, we find different values for $V^\pi(B)$ and $V^\pi(C)$ even though the two states are symmetric. This means we have to run a great many episodes to reach the desired output, because direct evaluation ignores the connections between states.
The second approach starts from the Bellman update for a fixed policy:
$V^\pi_0(s) = 0$
$V^\pi_{k+1}(s) = \sum_{s'} T(s,\pi(s),s')\,[R(s,\pi(s),s') + \gamma V^\pi_k(s')]$
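When T and R are known (the MDP setting from the previous doc) this update can be run directly; below is a minimal sketch on a made-up two-state chain, just to show the iteration:

```python
# Iterative policy evaluation when T and R are known (pure MDP setting).
# The tiny chain B -> C -> D, the policy, and the rewards below are made up.
gamma = 1.0
policy = {"B": "east", "C": "east"}
T = {("B", "east"): {"C": 1.0}, ("C", "east"): {"D": 1.0}}   # T(s, pi(s), s')
R = {("B", "east", "C"): -1.0, ("C", "east", "D"): 10.0}     # R(s, pi(s), s')

V = {"B": 0.0, "C": 0.0, "D": 0.0}   # V_0 = 0; D is terminal

for _ in range(50):   # apply the Bellman update repeatedly
    new_V = dict(V)
    for s, a in policy.items():
        new_V[s] = sum(p * (R[(s, a, s2)] + gamma * V[s2])
                       for s2, p in T[(s, a)].items())
    V = new_V

print(V)   # {'B': 9.0, 'C': 10.0, 'D': 0.0}
```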
This approach does take the connections between states into account. The question remains: how do we implement this Bellman update using only experience from the environment, when T and R are unknown?
At step $k+1$ we estimate $V^\pi_{k+1}(s) = \frac{1}{n} \sum_i \text{sample}_i$, where every $\text{sample}_i$ is collected starting from state s.
Big idea (temporal difference learning): learn from each experience, weighted by a coefficient $\alpha$.
But what if we want to learn from every sample equally? To implement that, we maintain a visit counter for the current state s and use $\alpha=\frac{1}{\text{counter}+1}$ when updating, which makes the estimate an exact running average of the samples.
Sample of V(s): $\text{sample} = R(s,\pi(s),s') + \gamma V^\pi(s')$
Update to V(s): $V^\pi(s) \leftarrow V^\pi(s) + \alpha\,(\text{sample} - V^\pi(s))$
If we want to emphasize the importance of recent samples we can use a fixed $\alpha$ between 0 and 1.
$V^\pi(B) = (1-\tfrac{1}{2})\cdot 0 + \tfrac{1}{2}\,(-2+1\cdot 0) = -1$
$V^\pi(C) = (1-\tfrac{1}{2})\cdot 0 + \tfrac{1}{2}\,(-2+1\cdot 8) = 3$
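A minimal sketch of this temporal-difference update, reproducing the two updates above (assuming $\alpha = \tfrac{1}{2}$, $\gamma = 1$, the observed transitions B → C and C → D each with reward −2, and an initial V(D) of 8, as implied by the numbers):

```python
def td_update(V, s, r, s_next, alpha=0.5, gamma=1.0):
    """One temporal-difference update: V(s) <- V(s) + alpha * (sample - V(s))."""
    sample = r + gamma * V[s_next]
    V[s] += alpha * (sample - V[s])

# Initial values implied by the calculation above: V(D) starts at 8, the rest at 0.
V = {"B": 0.0, "C": 0.0, "D": 8.0}

td_update(V, "B", -2, "C")   # observe B --east, reward -2--> C
td_update(V, "C", -2, "D")   # observe C --east, reward -2--> D

print(V["B"], V["C"])        # -1.0 and 3.0, matching the updates above
```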
Problems with temporal difference learning: although this method is based on simple Bellman-equation updates, if we want to turn the learned values into a new policy we are sunk, since choosing the best action from V requires a one-step lookahead through T and R. We solve this problem by learning Q-values instead of value functions.
$\pi(s) = \underset{a}{\operatorname{argmax}}\ Q(s,a)$
$Q(s,a) = \sum_{s'} T(s,a,s')\,[R(s,a,s')+\gamma V(s')]$
Using Q-values gives us an RL method called Q-Learning, which is a sample-based Q-value iteration method.
Q-Learning eventually converges to the optimal policy even if you act sub-optimally while learning; this is called off-policy learning.
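A minimal sketch of the tabular Q-learning update (the experience tuples and action names are made up for illustration):

```python
from collections import defaultdict

Q = defaultdict(float)   # Q[(s, a)], defaults to 0 for unseen pairs
actions = ["east", "west"]

def q_learning_update(s, a, r, s_next, alpha=0.5, gamma=1.0):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Hypothetical experience tuples fed to the learner as they arrive.
q_learning_update("C", "east", -1, "D")
q_learning_update("B", "east", -1, "C")
print(Q[("C", "east")], Q[("B", "east")])   # -0.5 -0.5
```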
In this method we still have to act in order to find the optimal value functions, but the policy is not fixed and may change during training.
There are two new terms defined in active RL: exploration and exploitation. Exploration refers to trying actions that have rarely been taken, because they might yield a bigger reward. Exploitation refers to the situation where, having tried almost everything, we keep taking the currently optimal action in each state.
The simplest way of choosing between the two is an $\epsilon$-greedy policy: with probability $\epsilon$ we take a random action (exploration) and with probability $(1-\epsilon)$ we take the current policy's action (exploitation).
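A minimal sketch of ε-greedy action selection over a Q-table such as the one in the earlier sketch:

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit (greedy action)."""
    if random.random() < epsilon:
        return random.choice(actions)             # exploration
    return max(actions, key=lambda a: Q[(s, a)])  # exploitation: argmax_a Q(s, a)

# e.g. a = epsilon_greedy(Q, "C", ["east", "west"], epsilon=0.1)
```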
There is another, more principled way of doing this: we can count how many times we have tried each action. If we have not tried it much, we should try it more often; if we have tried it a lot and it does not yield a good outcome, we should simply stop taking it.
Therefore our Q value update changes as below:
$Q(s,a) \leftarrow R(s,a,s') + \gamma \max_{a'} f\big(Q(s',a'),\,N(s',a')\big)$
In the above equation f takes two inputs, v and n, and outputs f(v,n) = v + k/n, where k is a fixed constant, v is the optimistic utility (in this case the Q-value), and n = N(s',a') is the number of times we have taken action a' in state s'. This means that when n is low those actions get tried more often, but in the long run the choice is dominated by the utility itself.
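A minimal sketch of the Q update with this exploration function (the constant k, the tables, and the n = 0 handling are illustrative assumptions):

```python
from collections import defaultdict

Q = defaultdict(float)   # learned Q-values
N = defaultdict(int)     # visit counts N(s, a)
k = 1.0                  # fixed exploration constant (illustrative value)

def exploration_f(v, n):
    """Optimistic utility f(v, n) = v + k / n; unvisited pairs (n = 0) are treated as n = 1."""
    return v + k / max(n, 1)

def q_update_with_exploration(s, a, r, s_next, actions, alpha=0.5, gamma=1.0):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' f(Q(s',a'), N(s',a')) - Q(s,a))."""
    N[(s, a)] += 1
    best = max(exploration_f(Q[(s_next, a2)], N[(s_next, a2)]) for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
```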
Almost all of the RL algorithms described above eventually reach the optimal policy, but the ones that reach it later incur a greater regret.
Since the number of states in real-life RL environments is very large, we cannot explore all of them. However, there are often symmetric or similar states that lead to the same outcome. In generalizing, we try to find a way to avoid recomputing values for such states. One way of doing this is to use a linear model for the value function instead of the value function itself: we learn features of each state and take a weighted sum of these features as the final value function.
$V(s) = w_1f_1(s) + w_2f_2(s) + ... + w_nf_n(s)$
$Q(s,a) = w_1f_1(s,a) + w_2f_2(s,a) + ... + w_nf_n(s,a)$
There is a tradeoff here between doing less computation and miscalculating the values of states that look similar (share features) but actually have different values.
In this method, instead of updating the Q-values, we update the weights $w_i$.
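A minimal sketch of a linear Q-function over features; the feature extractor here is entirely made up, since real features are domain-specific:

```python
# Linear value approximation: Q(s, a) = w_1 f_1(s, a) + ... + w_n f_n(s, a).
def features(s, a):
    """Hypothetical features of a (state, action) pair; real features are domain-specific."""
    return [float(s), 1.0 if a == "east" else -1.0, 1.0]   # e.g. position, direction, bias

def q_value(w, s, a):
    """Weighted sum of the features: the approximate Q-value."""
    return sum(wk * fk for wk, fk in zip(w, features(s, a)))

w = [0.0, 0.0, 0.0]   # one weight per feature, shared across all states and actions
```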
Assume we define the sum of squared errors (SSE) as below:
$\text{total error} = \sum_i \big(y_i - \hat{y}_i\big)^2 = \sum_i \Big( y_i - \sum_k w_k f_k(x_i) \Big)^2$
Imagine we have only one point x, with features f(x), target value y, and weights w:
$\text{error}(w) = \frac{1}{2} \Big( y - \sum_k w_k f_k(x) \Big)^2$
$\frac{\partial\, \text{error}(w)}{\partial w_m} = -\Big( y - \sum_k w_k f_k(x) \Big) f_m(x)$
$w_m \leftarrow w_m + \alpha \Big( y - \sum_k w_k f_k(x) \Big) f_m(x)$
Then the approximate Q update will be:
$w_m \leftarrow w_m + \alpha \big[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \big] f_m(s,a)$
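A minimal sketch of this approximate Q-learning weight update, reusing the hypothetical `features` and `q_value` helpers from the sketch above:

```python
def approx_q_update(w, s, a, r, s_next, actions, alpha=0.1, gamma=1.0):
    """w_m <- w_m + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)] * f_m(s,a)."""
    target = r + gamma * max(q_value(w, s_next, a2) for a2 in actions)
    difference = target - q_value(w, s, a)   # the TD error
    return [wm + alpha * difference * fm for wm, fm in zip(w, features(s, a))]

# One update from a hypothetical transition (s=1, a="east", r=-1, s'=2):
w = approx_q_update([0.0, 0.0, 0.0], 1, "east", -1, 2, ["east", "west"])
print(w)   # [-0.1, -0.1, -0.1]
```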