The fundamental goal of reinforcement learning is to train an agent to make sequential decisions that maximize its cumulative reward over time. This involves the agent learning an optimal policy that maps each state to an action so as to achieve the highest possible return. Mathematically, this process is framed using the concept of expected return, which quantifies the total future reward an agent can anticipate from any given state. This learning paradigm is both fascinating and increasingly important, as it forms the basis for training AI agents to perform a wide array of complex tasks. In this article, we explore the basic principles of reinforcement learning to lay the groundwork for understanding the remarkable work in this field.
The Core Components
In reinforcement learning, the problem is defined within an environment, which is composed of a set of possible states $\mathcal{S}$. At any given time, an agent resides in a state $s$ and selects an action $a$ from its available options $\mathcal{A}(s)$. Upon executing an action $a$, the environment responds in two ways: it transitions the agent to a new state $s'$ based on a transition probability $p(s'|s, a)$, and it provides the agent with a numerical reward $r$.
This framework, comprising states, actions, transition probabilities, and rewards, is formally known as a Markov Decision Process (MDP) and provides the mathematical foundation for reinforcement learning. The design of the reward signal is crucial, as it directly defines the agent's goal and guides its learning process. For example, to train an agent to escape a maze as quickly as possible, we could assign a negative reward (a penalty) for each step taken. This incentivizes the agent to find the shortest path, as more steps accumulate a larger penalty. Conversely, if we want an agent to survive in a game for as long as possible, we would provide a positive reward for each step it survives, encouraging it to prolong the game.
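To make these components concrete, here is a minimal sketch of a maze-escape environment in Python. The 2x2 grid, the action names, and the -1 per-step reward are illustrative assumptions for this article, not part of any library:

```python
# A minimal, illustrative maze MDP: states, actions, transitions, and rewards.
# The 2x2 grid layout and the -1 step penalty are assumptions for this sketch.

STATES = [(0, 0), (0, 1), (1, 0), (1, 1)]   # grid cells; (1, 1) is the exit
ACTIONS = ["up", "down", "left", "right"]
GOAL = (1, 1)

def step(state, action):
    """Deterministic transition: return (next_state, reward, done)."""
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    dr, dc = moves[action]
    candidate = (state[0] + dr, state[1] + dc)
    next_state = candidate if candidate in STATES else state  # walls block movement
    reward = -1                  # per-step penalty encourages the shortest escape path
    done = next_state == GOAL
    return next_state, reward, done
```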
Making Decisions: Policies and Value Functions
The agent's decision-making strategy in any given state is encapsulated by its policy $\pi$. To determine whether a policy is effective, the agent needs a way to estimate the long-term value of its actions, not just the immediate reward. This crucial role belongs to value functions, which quantify the expected future reward.
There are two primary types of value functions:
- State-value function $v_\pi(s)$: This function estimates the expected return (total cumulative reward) the agent will receive starting from a particular state $s$ and following policy $\pi$ thereafter.
- Action-value function (Q-value, $q_\pi(s, a)$): This function estimates the expected return for taking a specific action $a$ in state $s$ and then following policy $\pi$.
These functions are central to policy improvement, as they provide a measure of long-term success. The action-value function, in particular, is essential because it allows the agent to directly compare the expected outcomes of its available actions. The primary objective of most reinforcement learning algorithms is to use these value estimates to iteratively refine the current policy until it converges upon the optimal policy $\pi_*$, which guarantees the maximum expected cumulative reward from any state.
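As a small illustration of how an action-value estimate supports decision-making, the sketch below derives a greedy action from a dictionary-based Q-table; the Q-values shown are made-up numbers:

```python
# Deriving a greedy policy from an action-value estimate (dict-based Q-table).
# The Q-values below are invented purely for illustration.
Q = {
    ((0, 0), "right"): -3.0,
    ((0, 0), "down"):  -2.0,   # higher value -> preferred action in state (0, 0)
}

def greedy_action(Q, state, actions):
    """Pick the action with the highest estimated return in `state`."""
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

print(greedy_action(Q, (0, 0), ["right", "down"]))  # -> "down"
```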
Reinforcement Learning Algorithms
To find the optimal policy $\pi_*$, various algorithms have been developed, each approaching the problem differently. These algorithms are often categorized by whether they require a complete model of the environment.
Dynamic Programming (DP)
Dynamic Programming (DP) represents the family of model-based algorithms. These methods require a complete and perfect mathematical description of the environment, specifically the full transition probabilities $p(s'|s, a)$ and the reward function $r(s, a)$. Using this model, DP techniques like Value Iteration and Policy Iteration apply the Bellman equation to iteratively compute the optimal value functions and, in turn, find the optimal policy $\pi_*$.
While DP methods offer a guarantee of finding the optimal solution, their reliance on a perfect model is a significant limitation. In most real-world scenarios, this model is either inaccessible or too complex to define explicitly. This significant drawback necessitates the use of model-free approaches, which can learn solely through experience and interaction.
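For readers who prefer code, here is a compact Value Iteration sketch written against the toy maze model from the earlier example; the discount factor and convergence tolerance are illustrative choices:

```python
# Value Iteration for a known, deterministic model (model-based setting).
# Reuses the STATES, ACTIONS, GOAL, and step() from the toy maze sketch above.
GAMMA = 0.9

def value_iteration(states, actions, step, goal, tol=1e-6):
    """Sweep all states until the value estimates stop changing."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s == goal:
                continue                      # terminal state keeps value 0
            # Bellman optimality backup: best one-step lookahead over actions
            best = max(r + GAMMA * V[s2]
                       for s2, r, _ in (step(s, a) for a in actions))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

# Example usage: V = value_iteration(STATES, ACTIONS, step, GOAL)
```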
Monte Carlo (MC) Methods
In contrast to model-based approaches, Monte Carlo (MC) methods are a foundational model-free technique that learns purely from direct interaction with the environment. Instead of requiring a predefined model, MC methods derive their value estimates from complete episodes of experience. Specifically, after an episode concludes, the agent calculates the actual cumulative return achieved. This actual return then serves as the target used to update the value estimate for every state-action pair visited during that run, typically by averaging it with previous episode returns.
This fundamental model-free nature makes MC methods highly flexible for real-world problems where the environment dynamics are unknown. However, their dependence on complete episodes is a major drawback. Learning can be slow and inefficient for tasks with very long episodes, and crucially, MC methods are inherently unsuitable for continuous (non-episodic) tasks where termination is never guaranteed. This limitation points directly to the need for algorithms capable of learning before an episode is complete.
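The sketch below illustrates the idea with an every-visit Monte Carlo estimate of the action-value function; it assumes episodes are available as lists of (state, action, reward) tuples and uses an illustrative discount factor:

```python
# Every-visit Monte Carlo estimation of Q(s, a) by averaging episode returns.
# `episodes` is assumed to be a list of [(state, action, reward), ...] trajectories.
from collections import defaultdict

GAMMA = 0.9

def mc_q_estimate(episodes):
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    Q = {}
    for episode in episodes:
        G = 0.0
        # Walk the episode backwards so G accumulates the discounted return.
        for state, action, reward in reversed(episode):
            G = reward + GAMMA * G
            returns_sum[(state, action)] += G
            returns_cnt[(state, action)] += 1
            Q[(state, action)] = returns_sum[(state, action)] / returns_cnt[(state, action)]
    return Q
```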
Temporal-Difference (TD) Learning
The limitations of MC methods directly motivate the development of Temporal-Difference (TD) Learning. Like Monte Carlo, TD learning is model-free and learns directly from experience, but its key innovation is its ability to learn from incomplete episodes. TD methods do not wait for the episode's end. Instead, they update their value estimates after every step using a process called bootstrapping, which means the current state's value is updated based on the estimated value of the immediate next state, rather than the final empirical return. This capability makes TD learning more efficient and enables its use in continuous, non-episodic tasks.
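A minimal sketch of this bootstrapping idea for a state-value estimate is shown below; the learning rate and discount factor are illustrative, and unseen states default to a value of 0:

```python
# One-step TD(0) update: bootstrap from the estimated value of the next state.
# ALPHA and GAMMA are illustrative hyperparameter choices.
ALPHA, GAMMA = 0.1, 0.9

def td0_update(V, state, reward, next_state):
    """Move V[state] toward the bootstrapped target r + gamma * V[next_state]."""
    target = reward + GAMMA * V.get(next_state, 0.0)
    V[state] = V.get(state, 0.0) + ALPHA * (target - V.get(state, 0.0))
    return V
```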
Two of the most pivotal TD algorithms are SARSA and Q-Learning. Both aim to find the optimal action-value function $q_*(s, a)$, but they differ in their update strategy:
- SARSA (On-Policy): This algorithm learns the value function for the policy currently being executed. Its update rule uses the action $A_{t+1}$ actually taken in the subsequent state $S_{t+1}$, meaning it evaluates the performance of the agent's current behavior policy.
- Q-Learning (Off-Policy): This algorithm learns the value function for the optimal policy $\pi_*$ independently of the policy being executed. Its update rule uses the maximum possible Q-value in the subsequent state, effectively learning about the optimal strategy even while the agent is exploring sub-optimally.
The principles of these algorithms are central to modern deep reinforcement learning. For instance, the Deep Q-Network (DQN) replaces the traditional Q-table with a deep neural network that acts as a function approximator. This allows the agent to generalize its value estimates across states it has never explicitly visited, enabling DQNs to successfully tackle environments with massive or continuous state spaces and paving the way for many of the notable breakthroughs in modern AI.
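Below is a minimal sketch, assuming PyTorch, of how a small neural network can stand in for the Q-table. The architecture, hyperparameters, and single-transition update are illustrative only; a practical DQN additionally relies on experience replay and a separate target network:

```python
# Minimal sketch of a neural Q-function in PyTorch, standing in for a Q-table.
# Layer sizes, learning rate, and the single gradient step are illustrative;
# a practical DQN also needs experience replay and a target network.
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_step(state, action, reward, next_state, done):
    """One gradient step toward the Q-Learning target r + gamma * max_a Q(s', a).

    `state` and `next_state` are assumed to be float tensors of shape (state_dim,).
    """
    q_value = q_net(state)[action]                        # Q(s, a) from the network
    with torch.no_grad():                                 # target is treated as fixed
        target = reward + gamma * (1 - done) * q_net(next_state).max()
    loss = nn.functional.mse_loss(q_value, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```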
Under the Hood: The Core Equations
The core theoretical definition of value in reinforcement learning rests on the Bellman Expectation Equation, which defines the long-term value of a state or action under a given policy $\pi$. For the action-value function $q_\pi(s, a)$, the equation is:
\begin{align}
q_\pi(s, a) = \sum_{s',r} p(s', r | s, a) [r + \gamma \sum_{a'} \pi(a'|s') q_\pi(s', a')]
\end{align}
This equation states that the long-term value of taking action $a$ in state $s$ is the expectation over all possible outcomes, calculated as the immediate reward $r$ plus the expected value of the next state $s'$, discounted by $\gamma$, all weighted by the probability $p(s', r | s, a)$ of that specific outcome.
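The backup can be translated into code almost line by line, assuming access to a model that enumerates (probability, next state, reward) outcomes and a policy stored as action probabilities; both data structures here are illustrative:

```python
# Direct translation of the Bellman expectation backup for q_pi(s, a).
# `model[(s, a)]` is assumed to map to a list of (prob, next_state, reward) tuples,
# and `pi[(s_next, a_next)]` to the policy's action probabilities.
GAMMA = 0.9

def bellman_backup_q(q, model, pi, s, a, actions):
    total = 0.0
    for prob, s_next, reward in model[(s, a)]:
        # Expected value under pi in the next state: sum_a' pi(a'|s') * q(s', a')
        next_value = sum(pi[(s_next, a_next)] * q[(s_next, a_next)]
                         for a_next in actions)
        total += prob * (reward + GAMMA * next_value)
    return total
```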
However, this formulation requires a perfect environment model to provide the $p(s', r | s, a)$ term. This is the critical juncture where model-free methods diverge. Since they don't have the environment dynamics, model-free methods must forgo the full expectation, instead using a sample-based target to update their value estimates. The structure of this target defines the algorithm:
- The Monte Carlo (MC) Target is the full, actual return $G_t$ calculated at the end of an episode (both targets are sketched in code after this list):
\begin{align}
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
\end{align}
- The Temporal-Difference (TD) Target is a one-step bootstrapped estimate. For example, the target for SARSA is the immediate reward plus the discounted estimated value of the next action actually taken:
\begin{align}
R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})
\end{align}
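Both targets can be computed with a few lines of code. The sketch below assumes rewards are plain Python numbers, Q-values live in a dictionary, and the discount factor is an illustrative 0.9:

```python
# Computing the two sample-based targets described above.
GAMMA = 0.9

def mc_return(rewards):
    """Full Monte Carlo return G_t = R_{t+1} + gamma*R_{t+2} + ... (needs a finished episode)."""
    G = 0.0
    for reward in reversed(rewards):
        G = reward + GAMMA * G
    return G

def sarsa_target(Q, reward, next_state, next_action):
    """One-step TD target: R_{t+1} + gamma * Q(S_{t+1}, A_{t+1})."""
    return reward + GAMMA * Q.get((next_state, next_action), 0.0)

print(mc_return([-1, -1, -1, 10]))  # discounted return of a short, made-up episode
```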
In both cases, an estimated value function is iteratively updated toward the sampled target. Through this iterative process, the estimated action-value function can be shown to converge toward the true theoretical value. This guaranteed convergence, however, is contingent upon satisfying specific conditions, most notably a properly decaying learning rate and sufficient exploration (e.g., using an $\epsilon$-greedy strategy) throughout the learning period.
While both MC and TD methods rely on this principle of iterative updates, the efficiency and flexibility of Temporal-Difference learning have made it especially influential. The key to understanding Temporal-Difference methods is answering the immediate question: How is the sampled TD target used to refine the current Q-value estimate?
The answer lies in calculating an error signal. This is called the Temporal-Difference (TD) Error, which is the difference between the TD Target (our one-step lookahead estimate) and our current Q-value (our old estimate). For an on-policy algorithm like SARSA, the term inside the brackets in the following update rule is precisely this TD Error:
\begin{align}
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]
\end{align}
In this formula, the TD Error is multiplied by a learning rate $\alpha$, which is a hyperparameter that controls the magnitude of the update. This is critical for ensuring stable learning and avoiding overshooting the true value. This process of iteratively updating an estimate by taking a small step in the direction of an error-corrected target is the guiding principle that underpins all Temporal-Difference methods.
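A tabular version of this SARSA update is sketched below; the dictionary-based Q-table, the default value of 0.0 for unseen pairs, and the hyperparameters are illustrative assumptions:

```python
# Tabular SARSA update implementing the rule above (Q stored in a dict).
ALPHA, GAMMA = 0.1, 0.9

def sarsa_update(Q, s, a, reward, s_next, a_next):
    td_target = reward + GAMMA * Q.get((s_next, a_next), 0.0)
    td_error = td_target - Q.get((s, a), 0.0)          # the TD Error from the formula
    Q[(s, a)] = Q.get((s, a), 0.0) + ALPHA * td_error  # small step toward the target
    return Q
```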
Building on the foundation of on-policy methods, we now introduce off-policy learning. This powerful paradigm allows the agent to evaluate or learn about a target policy (typically the optimal one) while executing a separate behavior policy (used for exploration).
The most seminal off-policy algorithm, Q-Learning, employs an update rule that looks nearly identical to SARSA's but contains one critical distinction within the TD Target:
\begin{align}
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]
\end{align}
The inclusion of the $\max_a$ operator means the Q-Learning target uses the maximum possible Q-value from the next state. This represents the best action that could be selected, contrasting sharply with SARSA, which uses the Q-value of the action actually performed.
The practical implication is profound: an on-policy method like SARSA evaluates its current, exploratory strategy, whereas an off-policy method like Q-Learning directly learns the value of the optimal policy $\pi_*$. This decoupling allows the agent to discover the optimal solution efficiently, regardless of the cautious or random nature of its exploration strategy.
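The corresponding tabular Q-Learning update is sketched below; it differs from the SARSA sketch only in taking the maximum over next actions:

```python
# Tabular Q-Learning update: identical to SARSA except the max over next actions.
ALPHA, GAMMA = 0.1, 0.9

def q_learning_update(Q, s, a, reward, s_next, actions):
    best_next = max(Q.get((s_next, a_next), 0.0) for a_next in actions)  # max_a Q(s', a)
    td_error = reward + GAMMA * best_next - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + ALPHA * td_error
    return Q
```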
Conclusion
The journey from defining an environment to dissecting the Q-Learning update rule is not merely a theoretical exercise; it marks the transition from being a passive consumer of AI to becoming an informed practitioner. Understanding the core principles of reinforcement learning (RL), the interplay between policies, value functions, and reward signals, reveals the logic behind agentic behavior, enabling it to be understood, evaluated, and improved. As agentic systems become increasingly prevalent, mastering these core concepts equips practitioners with the strategic insight needed to design, troubleshoot, and responsibly guide the complex AI systems that will shape our future.
Sheng Fu Huang - Monex Insight