Reinforcement learning is a domain in machine learning that introduces the concept of an agent learning optimal strategies in complex environments. The agent learns from its actions, which result in rewards, based on the environment’s state. Reinforcement learning is a challenging topic and differs significantly from other areas of machine learning.
What is remarkable about reinforcement learning is that the same algorithms can be used to enable the agent to adapt to completely different, unknown, and complex conditions.
Note. To fully understand the concepts included in this article, it is highly recommended to be familiar with dynamic programming and Monte Carlo methods discussed in previous articles.
- In part 2, we explored the dynamic programming (DP) approach, where the agent iteratively updates V- / Q-functions and its policy based on previous calculations, replacing them with new estimates.
- In parts 3 and 4, we introduced Monte Carlo (MC) methods, where the agent learns from experience acquired by sampling episodes.
Temporal-difference (TD) learning algorithms, on which we will focus in this article, combine principles from both of these approaches:
- Similar to DP, TD algorithms update estimates based on the information of previous estimates. As seen in part 2, a state can be updated using the current estimates of other states rather than waiting for their final values, a technique known as bootstrapping, which is a key feature of DP.
- Similar to MC, TD algorithms do not require knowledge of the environment’s dynamics because they learn from experience as well.
This article is based on Chapter 6 of the book “Reinforcement Learning: An Introduction” by Richard S. Sutton and Andrew G. Barto. I highly appreciate the efforts of the authors who contributed to the publication of this book.
As we already know, Monte Carlo algorithms learn from experience by generating an episode and observing rewards for every visited state. State updates are performed only after the episode ends.
Temporal-difference algorithms operate similarly, with one key difference: they do not wait until the end of an episode to update states. Instead, each state is updated n time steps after it was visited (n is a parameter of the algorithm). During these n time steps, the algorithm accumulates the observed rewards and uses them to update the previously visited state.
A temporal-difference algorithm that performs state updates after n time steps is denoted as TD(n).
The simplest version of TD performs updates in the next time step (n = 1), known as one-step TD.
At the end of the previous part, we introduced the constant-α MC algorithm. It turns out that the pseudocode for one-step TD is almost identical, except for the state update, as shown below:
Since TD methods do not wait until the end of the episode and make updates using current estimates, they are said to use bootstrapping, like DP algorithms.
The expression in the brackets in the update formula is called TD error:
In this equation, γ is the discount factor, which takes values between 0 and 1 and determines how strongly future rewards are weighted relative to the immediate reward.
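Putting the pieces together, the one-step TD update and the TD error can be written in standard notation as:

V(S[t]) ← V(S[t]) + α · (R[t + 1] + γ · V(S[t + 1]) − V(S[t]))

TD error: δ[t] = R[t + 1] + γ · V(S[t + 1]) − V(S[t])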
TD error plays an important role. As we will see later, TD algorithms can be adapted based on the form of TD error.
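To make the procedure concrete, here is a minimal sketch of tabular one-step TD prediction in Python. The environment interface (env.reset(), env.step(action)) and the policy function are hypothetical placeholders used only for illustration:

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=0.9):
    """Tabular one-step TD estimation of V for a fixed policy."""
    V = defaultdict(float)  # state values, initialized to 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD error: reward + discounted next-state value minus the current estimate
            td_error = reward + gamma * V[next_state] * (not done) - V[state]
            # the update is performed immediately, without waiting for the episode to end
            V[state] += alpha * td_error
            state = next_state
    return V
```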
At first sight, it might seem unclear how using only the current transition reward and the values of the current and next states can be beneficial for finding an optimal strategy. It becomes easier to understand with an example.
Let us imagine a simplified version of the famous “Copa America” soccer tournament, which regularly takes place in South America. In our version, our team faces 6 opponents in the same order in every Copa America tournament. Though the setup is not real, we will omit complex details to make the example easier to understand.
We would like to create an algorithm that will predict our team’s total goal difference after a sequence of matches. The table below shows the team’s results obtained in a recent edition of the Copa America.
To get a better sense of the data, let us visualize the results. The initial algorithm estimates are shown by the yellow line in the diagram below, and the obtained cumulative goal difference (last table column) is depicted in black.
Roughly speaking, our objective is to update the yellow line so that it adapts better to changes based on the most recent match results. For that, we will compare how the constant-α Monte Carlo and one-step TD algorithms cope with this task.
Constant-α Monte Carlo
The Monte Carlo method calculates the cumulative reward G of the episode, which in our case is the total goal difference after all matches (+3). Then, every state is updated proportionally to the difference between the total episode reward and the current state’s value.
For instance, let us take the state after the third match against Peru (we will use the learning rate α = 0.5):
- The initial state’s value is v = 1.2 (yellow point corresponding to Peru).
- The cumulative reward is G = 3 (black dashed line).
- The difference between the two values, G – v = 1.8, is then multiplied by α = 0.5, which gives an update step of Δ = 0.9 (red arrow corresponding to Peru).
- The new state’s value becomes v = v + Δ = 1.2 + 0.9 = 2.1 (red point corresponding to Peru).
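The same arithmetic in a few lines of Python, with the values taken from the example above:

```python
alpha = 0.5   # learning rate
v = 1.2       # initial estimate of the state after the third match
G = 3.0       # total goal difference at the end of the tournament

delta = alpha * (G - v)   # update step: 0.5 * 1.8 = 0.9
v_new = v + delta         # new estimate: 1.2 + 0.9 = 2.1
```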
One-step TD
For the example demonstration, we will take the total goal difference after the fourth match against Chile.
- The initial state’s value is v[t] = 1.5 (yellow point corresponding to Chile).
- The next state’s value is v[t+1]= 2.1 (yellow point corresponding to Ecuador).
- The difference between consecutive state values is v[t+1]–v[t] = 0.6 (yellow arrow corresponding to Chile).
- Since our team won against Ecuador 5 : 0, the transition reward from state t to t + 1 is R = 5 (black arrow corresponding to Ecuador).
- The TD error measures how much larger the obtained reward is compared to the difference between consecutive state values. In our case, TD error = R – (v[t+1] – v[t]) = 5 – 0.6 = 4.4 (red transparent arrow corresponding to Chile).
- The TD error is multiplied by the learning rate α = 0.5, which leads to the update step β = 2.2 (red arrow corresponding to Chile).
- The new state’s value is v[t] = v[t] + β = 1.5 + 2.2 = 3.7 (red point corresponding to Chile).
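For comparison, the corresponding one-step TD computation in Python, using the article’s goal-difference formulation of the TD error and the values above:

```python
alpha = 0.5    # learning rate
v_t = 1.5      # estimate of the state after the fourth match (Chile)
v_next = 2.1   # estimate of the state after the fifth match (Ecuador)
R = 5.0        # goal difference obtained in the fifth match (5 : 0)

td_error = R - (v_next - v_t)      # 5 - 0.6 = 4.4
v_t_new = v_t + alpha * td_error   # 1.5 + 2.2 = 3.7
```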
Comparison
Convergence
We can clearly see that the Monte Carlo algorithm pushes the initial estimates towards the episode’s return. At the same time, one-step TD uses bootstrapping and updates every estimate with respect to the next state’s value and the immediate reward, which generally makes it quicker to adapt to changes.
For instance, let us take the state after the first match. We know that in the second match our team lost to Argentina 0 : 3. However, both algorithms react absolutely differently to this scenario:
- Despite the negative result, Monte Carlo only considers the overall goal difference after all matches and pushes the current state’s value up, which is not logical.
- One-step TD, on the other hand, takes the obtained result into account and instantly updates the state’s value down.
This example demonstrates that, in the long term, one-step TD performs more adaptive updates, leading to a better convergence rate than Monte Carlo.
In theory, TD methods are guaranteed to converge to the correct value function for a given policy, provided the learning step size is reduced appropriately over time.
Update
- Monte Carlo requires the episode to end before state updates can be made.
- One-step TD allows updating a state immediately after receiving the action’s reward.
In many cases, this update aspect is a significant advantage of TD methods because, in practice, episodes can be very long. In that case, in Monte Carlo methods, the entire learning process is delayed until the end of an episode. That is why TD algorithms learn faster.
After covering the basics of TD learning, we can now move on to concrete algorithm implementations. In the following sections we will focus on the three most popular TD variations:
- Sarsa
- Q-learning
- Expected Sarsa
As we learned in the introduction to Monte Carlo methods in part 3, to find an optimal strategy, we need to estimate the state-action function Q rather than the value function V. To accomplish this effectively, we adjust the problem formulation by treating state-action pairs as states themselves. Sarsa is an algorithm that operates on this principle.
To perform state updates, Sarsa uses the same formula as for one-step TD defined above, but this time with Q-function values in place of the V-function:
The Sarsa name is derived from its update rule, which uses 5 variables in the order: (S[t], A[t], R[t + 1], S[t + 1], A[t + 1]).
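Written out in the same notation as the one-step TD update above, the Sarsa update is:

Q(S[t], A[t]) ← Q(S[t], A[t]) + α · (R[t + 1] + γ · Q(S[t + 1], A[t + 1]) − Q(S[t], A[t]))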
Sarsa control operates similarly to Monte Carlo control, updating the current policy greedily with respect to the Q-function using ε-soft or ε-greedy policies.
Sarsa is an on-policy method because it updates Q-values based on the current policy followed by the agent.
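Below is a minimal sketch of Sarsa control in Python. The tabular environment interface and the ε-greedy helper are assumptions made for illustration, not part of any specific library:

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, eps):
    """Choose a random action with probability eps, otherwise the greedy one."""
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def sarsa(env, n_states, n_actions, num_episodes, alpha=0.1, gamma=0.9, eps=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, n_actions, eps)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, n_actions, eps)
            # on-policy target: bootstraps on the action actually chosen by the policy
            target = reward + gamma * Q[next_state, next_action] * (not done)
            Q[state, action] += alpha * (target - Q[state, action])
            state, action = next_state, next_action
    return Q
```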
Q-learning is one of the most popular algorithms in reinforcement learning. It is almost identical to Sarsa except for the small change in the update rule:
The only difference is that we replaced the next Q-value with the maximum Q-value over the actions available in the next state. In practice, this substitution makes Q-learning more performant than Sarsa in most problems.
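Written out, the Q-learning update is:

Q(S[t], A[t]) ← Q(S[t], A[t]) + α · (R[t + 1] + γ · maxₐ Q(S[t + 1], a) − Q(S[t], A[t]))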
At the same time, if we carefully observe the formula, we can notice that the entire expression is derived from the Bellman optimality equation. From this perspective, the Bellman equation guarantees that the iterative updates of Q-values lead to their convergence to optimal Q-values.
Q-learning is an off-policy algorithm: it updates Q-values based on the best possible decision that can be taken without considering the behaviour policy used by the agent.
Expected Sarsa is an algorithm derived from Q-learning. Instead of using the maximum Q-value, it calculates the expected Q-value of the next state, based on the probabilities of taking each action under the current policy.
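In notation, Expected Sarsa replaces the maximum with an expectation under the current policy π:

Q(S[t], A[t]) ← Q(S[t], A[t]) + α · (R[t + 1] + γ · Σₐ π(a | S[t + 1]) · Q(S[t + 1], a) − Q(S[t], A[t]))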
Compared to normal Sarsa, Expected Sarsa requires more computations but in return, it takes into account more information at every update step. As a result, Expected Sarsa mitigates the impact of transition randomness when selecting the next action, particularly during the initial stages of learning. Therefore, Expected Sarsa offers the advantage of greater stability across a broader range of learning step-sizes α than normal Sarsa.
Expected Sarsa is an on-policy method but can be adapted to an off-policy variant simply by employing separate behaviour and target policies for data generation and learning respectively.
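To make the difference between the three update rules concrete, here is a small sketch that computes their respective targets for a single transition. It assumes a tabular Q stored as a NumPy array and an ε-greedy policy; the function name and parameters are illustrative:

```python
import numpy as np

def one_step_targets(Q, next_state, next_action, reward, gamma=0.9, eps=0.1):
    """Return the Sarsa, Q-learning and Expected Sarsa targets for one transition."""
    n_actions = Q.shape[1]

    # Sarsa: bootstrap on the action actually taken in the next state
    sarsa_target = reward + gamma * Q[next_state, next_action]

    # Q-learning: bootstrap on the best action in the next state
    q_learning_target = reward + gamma * np.max(Q[next_state])

    # Expected Sarsa: bootstrap on the expectation under the eps-greedy policy
    probs = np.full(n_actions, eps / n_actions)
    probs[np.argmax(Q[next_state])] += 1.0 - eps
    expected_sarsa_target = reward + gamma * np.dot(probs, Q[next_state])

    return sarsa_target, q_learning_target, expected_sarsa_target
```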
Up until this point, we have been discussing a set of algorithms, all of which utilize the maximization operator during greedy policy updates. However, in practice, taking the max over all values leads to an overestimation of those values. This issue particularly arises at the start of the learning process, when Q-values are initialized randomly. Consequently, calculating the maximum over these initial noisy values often results in an upward bias.
For instance, imagine a state S where the true Q-values for every action are equal to Q(S, a) = 0. Due to random initialization, some of the initial estimates will fall below zero while others will be above zero, as the short simulation after this list illustrates.
- The maximum of true values is 0.
- The maximum of random estimates is a positive value (which is called maximization bias).
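A short simulation makes this bias visible. All true Q-values below are zero, so the true maximum is zero, yet the maximum over noisy estimates is clearly positive on average (the number of actions and the noise scale are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 10, 100_000

# noisy, individually unbiased estimates of Q-values whose true values are all 0
noisy_estimates = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_actions))

print(noisy_estimates.mean())              # close to 0: each estimate is unbiased
print(noisy_estimates.max(axis=1).mean())  # clearly positive: the maximum is biased upward
```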
Example
Let us consider an example from the Sutton and Barto book where maximization bias becomes a problem. We are dealing with the environment shown in the diagram below where C is the initial state, A and D are terminal states.
The transition reward from C to either B or D is 0. However, transitioning from B to A results in a reward sampled from a normal distribution with a mean of -0.1 and variance of 1. In other words, this reward is negative on average but can occasionally be positive.
Basically, in this environment the agent faces a binary choice: whether to move left or right from C. The expected return is clear in both cases: the left trajectory results in an expected return G = -0.1, while the right path yields G = 0. Clearly, the optimal strategy consists of always going to the right side.
On the other hand, if we fail to address the maximization bias, then the agent is very likely to prioritize the left direction during the learning process. Why? The maximum calculated from the normal distribution will result in positive updates to the Q-values in state B. As a result, when the agent starts from C, it will greedily choose to move to B rather than to D, whose Q-value remains at 0.
To gain a deeper understanding of why this happens, let us perform several calculations using the following parameters:
- learning rate α = 0.1
- discount rate γ = 0.9
- all initial Q-values are set to 0.
Iteration 1
In the first iteration, the Q-values for going to B and to D are both equal to 0. Let us break the tie by arbitrarily choosing B. Then, the Q-value of the pair (C, ←) is updated. For simplicity, let us assume that the maximum value drawn from the defined distribution is a finite value of 3. In reality, this value lies above the 99th percentile of our distribution:
The agent then moves to A with the sampled reward R = -0.3.
Iteration 2
The agent reaches the terminal state A, and a new episode begins. Starting from C, the agent faces the choice of whether to go to B or D. Under our conditions, with an ε-greedy strategy, the agent will almost certainly opt for going to B:
An analogous update is then performed on the pair (C, ←). Consequently, its Q-value only gets bigger:
Despite sampling a negative reward R = -0.4 and pushing the value of B further down, this does not alter the situation, because the maximum always remains at 3.
The second iteration terminates, and it has only made the agent prioritize the left direction even more over the right one. As a result, the agent will keep making its initial moves from C to the left, believing this to be the optimal choice when, in fact, it is not.
One of the most elegant solutions for eliminating maximization bias is the double learning algorithm, which symmetrically uses two Q-function estimates.
Suppose we need to determine the maximizing action and its corresponding Q-value to perform an update. The double learning approach operates as follows:
- Use the first function Q₁ to find the maximizing action a⁎ = argmaxₐQ₁(a).
- Use the second function Q₂ to estimate the value of the chosen action a⁎.
Both functions Q₁ and Q₂ can also be used in the reverse order.
In double learning, only one estimate Q (not both) is updated on every iteration.
While the first Q-function selects the best action, the second Q-function provides its unbiased estimation.
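As a sketch, a single double Q-learning update can look as follows in Python, following the symmetric formulation from the Sutton and Barto book (as noted above, the roles of Q₁ and Q₂ can be swapped); the function signature is illustrative:

```python
import numpy as np

def double_q_learning_step(Q1, Q2, state, action, reward, next_state, done,
                           alpha=0.1, gamma=0.9):
    """Perform one double Q-learning update, modifying only one of the two estimates."""
    if np.random.rand() < 0.5:
        # Q1 selects the maximizing action, Q2 evaluates it, Q1 is updated
        best = int(np.argmax(Q1[next_state]))
        target = reward + gamma * Q2[next_state, best] * (not done)
        Q1[state, action] += alpha * (target - Q1[state, action])
    else:
        # Q2 selects the maximizing action, Q1 evaluates it, Q2 is updated
        best = int(np.argmax(Q2[next_state]))
        target = reward + gamma * Q1[next_state, best] * (not done)
        Q2[state, action] += alpha * (target - Q2[state, action])
```

For action selection, the two estimates are typically combined, for example by acting ε-greedily with respect to Q₁ + Q₂, which is exactly what the maze example below does.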
Example
We will look at an example of how double learning is applied to Q-learning.
Iteration 1
To illustrate how double learning operates, let us consider a maze where the agent can move one step in any of four directions during each iteration. Our objective is to update the Q-function using the double Q-learning algorithm. We will use the learning rate α = 0.1 and the discount rate γ = 0.9.
For the first iteration, the agent starts at cell S = A2 and, following the current policy, moves one step right to S’ = B2 with the reward of R = 2.
We assume that we have to use the second update equation in the pseudocode shown above. Let us rewrite it:
Since our agent moves to state S’ = B2, we need to use its Q-values. Let us look at the current Q-table of state-action pairs including B2:
We need to find an action for S’ = B2 that maximizes Q₁ and ultimately use the respective Q₂-value for the same action.
- The maximum Q₁-value is achieved by taking the ← action (q = 1.2, red circle).
- The corresponding Q₂-value for the action ← is q = 0.7 (yellow circle).
Let us rewrite the update equation in a simpler form:
Assuming that the initial estimate Q₂(A2, →) = 0.5, we can insert values and perform the update:
Iteration 2
The agent is now located at B2 and has to choose the next action. Since we are dealing with two Q-functions, we have to find their sum:
Depending on the type of our policy, we sample the next action from a distribution. For instance, if we use an ε-greedy policy with ε = 0.08, the action distribution will have the following form:
We will suppose that we have sampled the ↑ action, which under this policy is chosen with probability 1 – ε + ε/4 = 0.94. That means the agent will next move to the cell S’ = B3. The reward it receives is R = -3.
For this iteration, we assume that we have sampled the first update equation for the Q-function. Let us break it down:
We need to know Q-values for all actions corresponding to B3. Here they are:
Since this time we use the first update equation, we take the maximum Q₂-value (red circle) and use the respective Q₁-value (yellow circle). Then we can rewrite the equation in a simplified form:
After making all value substitutions, we can calculate the final result:
We have looked at an example of double Q-learning, which mitigates the maximization bias in the Q-learning algorithm. This double learning approach can also be extended to the Sarsa and Expected Sarsa algorithms.
Instead of choosing which update equation to use with probability p = 0.5 on each iteration, double learning can also be adapted to alternate between the two equations deterministically.
Despite their simplicity, temporal-difference methods are amongst the most widely used techniques in reinforcement learning today. What is also interesting is that they are extensively applied in other prediction problems as well, such as time series analysis, stock prediction, and weather forecasting.
So far, we have been discussing only a specific case of TD learning when n = 1. As we will see in the next article, it can be beneficial to set n to higher values in certain situations.
We have not covered it yet, but it turns out that control for TD algorithms can be implemented via actor-critic methods which will be discussed in this series in the future. For now, we have only reused the idea of GPI introduced in dynamic programming algorithms.
All images unless otherwise noted are by the author.