Basic Knowledge of Reinforcement Learning before PPO

Feb. 23, 2025 - Feb. 24, 2025 · Qiyao Wang #RL

Claim：本文参考 notes on Reinforcement Learning 和 Hung-yi Lee 总结记录学习，记录难点和分析，最基础的内容不赘述。

Basic Concept

What's the Reinforcement Learning: A formal definition

Reinforcement learning is a framework for solving control tasks (also called decision problems) by building agents that learn from the environment by interacting with it through trial and error and receiving rewards (positive or negative) as unique feedback.

Translation: 强化学习是一种通过构建与环境交互并从试错中学习的代理来解决控制任务（也称为决策问题）的框架。代理通过试错与环境交互，并根据其行为收到正向或负向的奖励作为唯一的反馈。

Difference between Observations and States

Observations 和 States 都表示 Agent 从环境中获取的信息，在概念上二者几乎一致，仅在信息的覆盖范围上存在差异，在实现上一般会进行区分。二者的区别如下：

State $s$: is a complete description of the state of the world (there is no hidden information). 全部信息，无隐藏内容。
Observation $o$: is a partial description of the state. 部分信息。

Two main approaches for solving RL problems

RL 的目标是得到一个好的 Agent，即得到一个最优的 Policy Model $\pi^*$，能够根据当前的环境的 State 做出良好的决策。Policy Model $\pi$ 的优化目标是最大化期望回报（expected return）。

训练 Agent 得到最优 policy $\pi^*$ 的方法有两种：

Policy-Based Methods: 直接地训练模型在给定 State 的情况下，选择适合的 Action。
Value-Based Methods: 训练模型能够判断更有价值的 State，并且采取适合的 Action，来达到更有价值的 State。

Policy-Based Methods

对于 Policy-Based Methods，其目标是直接学一个 policy function。对于 Action 是离散的， policy model $\pi$ 给定 State 返回适合的 Action

$$a_t = \pi(s_t)$$

对于 Action 是连续的 policy model $\pi$ 实际上返回的是一个 Action distribution，适合的 Action 应该能够从该分布中被采样出来

$$a_t\sim\pi(\cdot|s_t)$$

Value-Based Methods

对于 Value-Based Methods，其目标是能够预测在某一 State $s$ 下的 accumulated expected return （state-value function $V_{\pi}(s)$）或某一 State $s$ 下采取 Action $a$ 后的 accumulated expected return (action-value function $Q_{\pi}(s,a)$)。

The state-value function

$$V_{\pi}(s) = \mathbb{E}_{\pi}[G_t|S_t=s]$$

state-value function 的含义是估计在 State $s$ 的条件下，累积折扣回报 $G_t=\sum_{k=0}^∞\gamma^kR_{t+k+1}$ 的期望值大小。

The action-value function

$$Q_{\pi}(s,a) = \mathbb{E}_{\pi}[G_t|S_t=s,A_t=a]$$

action-value function 实际上是 state-action-value function，即在给定 State $s$ 和 Action $a$ 的条件下，累积折扣回报 $G_t$ 的期望值大小。由此可见，Value-Based Method 的估计目标都是：$G_t$，即累积折扣回报的期望值大小。

Relation between V and Q Function

从定义看，$V_\pi$ 和 $Q_\pi$ 都是对累积折扣回报 $G_t$ 的期望值估计，但是二者所站角度不同。$Q_\pi$ 相较于 $V_\pi$ 而言，要考虑 Action $a$ 条件的影响，而 $V_\pi$ 则不关心 Action，从 $Q_\pi$ 得到 $V_\pi$ 的公式如下

$$V_{\pi}(s)=\sum_{a\in\mathcal A}\pi(a|s)Q_\pi(s,a)$$

即，对 $Q_\pi$ 中的 Action 进行期望的计算，排除其中选择某一 Action 的影响，而是在 Action Space 下的期望值。而 $Q_\pi(s,a)$ 中的 $a$ 其实也仅考虑了一步 State $s$ 下的 Action，对于之后的 State 并不强制其 Action，可借此与 $V_\pi$ 关联

$$Q_\pi(s,a) = \sum_{s'\in\mathcal S}P(s'|s,a)[R(s,a,s')+\gamma V_\pi(s')]$$

其中，State $s$ 和该状态下的 Action $a$ 固定，而后续的 State 不确定，会从 State Space $\mathcal{S}$ 中进行采样，因此对于 State $s$ 采取 Action $a$ 后的下一个状态 $s'$ 需要结合状态转移函数 $P(s'|s,a)$ 进行计算。$R(s,a,s')$ 则为从 State $s$, Action $a$ 转移为 State $s'$ 后环境给予的 Reward 值，$V_\pi(s')$ 则表示在 State $s'$ 条件下，未来的累积折扣期望值。

Bellman Equation

Bellman Equation 与上述从 $V_\pi$ 得到 $Q_\pi$ 的公式较为类似，以一种动态规划/递归的形式，将在当前状态下的 expected return 拆解为当前立即得到的 Reward 与对未来（下一状态）的 expected return 的估计。

Bellman Equation of the state-value function

$$V_\pi(s)=\mathbb{E}_\pi[R_{t+1}+\gamma*V_\pi(S_{t+1})|S_t=s]$$

Bellman Equation of the action-value function

$$Q_\pi(s,a)=\mathbb{E}_\pi[R_{t+1}+\gamma*Q_{\pi}(S_{t+1},A_{t+1})|S_t=s,A_t=a]$$

Closed-form Solution

上述方程，无论是对于 $V$ 还是 $Q$ Function 的思想都是一致的，Bellman Equation 为这两类 Value Function 提供了简化计算的方法。将 $V$ 与 $Q$ 关系中的式子融合，可以得到

$$V_\pi(s)=\sum_{a\in\mathcal A}\pi(a|s)\sum_{s'\in\mathcal S}P(s'|s,a)[R(s,a,s')+\gamma V_\pi(s')]$$

将上面的公式转化为矩阵形式，可以先看一下每个元素的含义，其中 $P(s'|s,a)$ 为状态转移的概率，$R(s,a,s')$ 为当前状态下的 reward，那么下面将这些元素矩阵化，将有以下内容

转移概率矩阵 $P_\pi\in\mathbb{R}^{|\mathcal{S}|\times|\mathcal{S}|}$： ${P}_\pi$ 代表在给定策略 $\pi$ 下，从状态 $s$ 到状态 $s'$ 的转移概率。它是通过对所有可能的动作 $a$ 进行加权求和得到的。
$$P_\pi(s,s')=\sum_{a\in\mathcal A}\pi(a|s)\cdot P(s'|s,a)$$
奖励向量 $\mathbf{r}_\pi\in\mathbb{R}^{|\mathcal{S}|\times1}$：表示在状态 $s$ 下，按照策略 $\pi$ 选择动作，经过状态转移得到的即时奖励的期望值
$$R_\pi(s) = \sum_{a\in\mathcal A}\sum_{s'\in\mathcal{S}}\pi(a|s)P(s'|s,a)R(s,a,s')$$

因此，矩阵形式为

$$\mathbf{v}_\pi=\mathbf{r}_\pi+\gamma P_\pi\mathbf{v}_\pi$$

经过矩阵运算，当 $I-\gamma P_\pi$ 可逆时，$\mathbf{v}_\pi=(I-\gamma P_\pi)^{-1}\mathbf{r}_\pi$ 为 Bellman Equation 的唯一解。

基于 Gershgorin circle theorem 可以证明 $I-\gamma P_\pi$ 可逆，此处略，因此 Bellman Equation 具有唯一的闭合解，这为之后的基于对 Bellman Equation 优化的方法提供了基础。

Optimal action-value

基于贝尔曼最优方程 (The Bellman Optimality Equation, BOE)，能够推导出最优的价值函数的形式解，推导最优的 policy model $\pi^*$，也就是说，BOE 的解，对应了最优的 state-value $V^*$ 和 policy $\pi^*$，基于这两者能够得到最优的 action-value $Q^*$。

$$\begin{aligned} V^*(s)&=\max_aQ^*(s,a)\\ &=\max_a\sum_{s'\in\mathcal{S}}P(s'|s,a)[R(s,a,s')+\gamma V^*(s')] \end{aligned}$$

$$\begin{aligned} Q^*(s)&=\sum_{s'\in\mathcal{S}}P(s'|s,a)[R(s,a,s')+\gamma V^*(s')]\\ &=\sum_{s'\in\mathcal{S}}P(s'|s,a)[R(s,a,s')+\gamma \max_aQ^*(s,a)] \end{aligned}$$

第一次了解 Bellman Equation，此部分逻辑不清，需梳理。

Monte Carlo & TD Learning

Monte Carlo

使用蒙特卡罗算法进行 value function 的参数更新时，对于累积折扣回报 $G_t$ 的计算，一直到整个 episode 结束，这样的估计会使得 Agent 能力较为精准，但是数据的方差大，训练不稳定。

$$V(S_t)\leftarrow V(S_t)+\alpha[G_t-V(S_t)]$$

Temporal Difference Learning

使用 TD Learning 来进行 value function 的参数更新，在每一步都进行，而不关注 $G_t$，关注当前步的即时奖励 $R_{t+1}$ 和下一状态的 value。这也称为 bootstrapping，因为 TD Learning 基于现有 value function 估计 $V(S_{t+1})$，而不是整个 episode 的累积奖励 $G_t$。

$$V(S_t)\leftarrow V(S_t)+\alpha[R_{t+1}+\gamma V(S_{t+1})-V(S_t)]$$

其中 $R_{t+1}+\gamma V(S_{t+1})$ 称为 TD Target。对于 action-value function，将上述公式的 $V$ function 换为 $Q$ function 即可。

Q-Learning

Q-Learning 是一种 TD Learning 的扩展（特殊形式），可以直接估计最优的 action-value，并且找到最优的 policy。

$$Q(S_t,A_t)\leftarrow Q(S_t,A_t)+\alpha[R_{t+1}+\gamma\max_a Q(S_{t+1},a)-Q(S_t,A_t)]$$

与 TD Learning 相比，Q-Learning 要求其 TD Target 的第二项必须是下一个 State 的折扣估计的最优的 Q-value。但其实这个 $a=\arg\max Q$ 不好得到。Q-Learning 也是一种解决 BOE 的随机逼近算法。具体算法描述不再赘述，如其中对 DQN 的 Fixed 等见 Hung-yi Lee PPT。

Off-policy and On-policy

Q-Learning 与其他 TD Learning 算法不同的是，它是一种 off-policy 的算法。

RL 中包含两类 policy，一种是 behavior policy，另一个是 target policy。其中 behavior policy 用于生成训练数据，而 target policy 则被参数更新至收敛。

当 behavior policy 与 target policy 一致时，即生成训练数据与训练模型是同一个模型进行时，则称为 on-policy 的算法；当生成数据与被训练模型不同时，则为 off-policy，此时先通过 behavior policy 与环境进行大量交互，生成大批的训练数据后，再对 target policy 进行训练。

Policy Gradient

对于 Policy-Based Method 而言，更希望以一种无需学习 value function 的方式来直接优化 policy，得到最优的 policy model $\pi^*$。其中 policy model $\pi_\theta$ 以 State $s_t$ 作为输入，输出则为 actions 的概率分布。与有监督学习的梯度下降类似的是，可以使用 Policy-Gradient 来优化 policy model。

优化目标：使用 gradient ascent（梯度上升）maximize 参数化 policy model $\pi_\theta$ 的参数 $\theta$，来获得更大的期望累积回报 expected cumulative reward $J(\theta)$。

$$\begin{aligned} J(\theta)&=\mathbb{E}_{\tau\sim\pi}[R(\tau)]\\ &=\sum_{\tau}P(\tau;\theta)R(\tau) \end{aligned}$$

$$P(\tau;\theta)=\prod_{t=0}P(s_{t+1}|s_t,a_t)\pi_\theta(a_t|s_t)$$

$$\theta\leftarrow\theta+\alpha * \nabla J(\theta)$$

The Policy Gradient Theorem

优化 $J(\theta)$ 存在一些问题。第一，无法计算所有策略，依据大数定理利用采样的策略进行估计的优化；第二，其中与状态转移概率相关，此概率由环境决定，难以微分。可以使用 Policy Gradient Theorem 来重新表述。

$$\begin{aligned} \nabla_\theta J(\theta)&=\nabla_\theta\sum_{\tau}P(\tau;\theta)R(\tau)\\ &=\sum_\tau\nabla_\theta P(\tau;\theta)R(\tau)\\ &=\sum_\tau P(\tau;\theta)\frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}R(\tau)\\ &\xlongequal{\nabla \log f(x) = \frac{\nabla f(x)}{f(x)}}\sum_\tau P(\tau;\theta)\nabla_\theta\log P(\tau;\theta)R(\tau)\\ &=\sum_\tau P(\tau;\theta)\nabla_\theta\{\log \mu(s_0) + \sum_{t=0}^H [ \log P(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t)]\}R(\tau)\\ &=\sum_\tau P(\tau;\theta)\{\nabla_\theta \log \mu(s_0) + \sum_{t=0}^H[\nabla_\theta \log P(s_{t+1} \mid s_t, a_t) + \nabla_\theta \log \pi_\theta(a_t \mid s_t)]\}R(\tau)\\ &= \sum_\tau P(\tau;\theta) \sum_{t=0}^H \nabla_\theta \log \pi_\theta(a_t \mid s_t)R(\tau)\\ &=\mathbb{E}_{\pi_\theta}[\nabla_\theta\log \pi_\theta(a_t \mid s_t)R(\tau)] \end{aligned}$$

上述的优化目标 $J(\theta)$ 最终就不再受环境影响。则在 Monte Carlo Reinforce 中，在一个训练循环中，首先利用 policy model $\pi_\theta$ 收集一个 batch（大小为 $m$）的 episodes/trajectories，之后利用这个 batch 的数据来计算 gradient 优化 $\pi_\theta$，由于为 Monte Carlo，利用的是一个完整的 episode。

$$\nabla_\theta J(\theta)\approx \frac{1}{m}\sum_{i=1}^m\sum_{t=0}\nabla_\theta\log \pi_\theta(a_t^{(i)}\mid s_t^{(i)})R(\tau^{(i)})$$

进一步优化，一个直观的想法：Action 只对它之后的状态产生影响。因此可以去除当前步之前的奖励，即不使用一整个 episode 的 reward $R(\tau)$，而是 step $t$ 之后的累积回报。

$$\nabla_\theta J(\theta)=\mathbb{E}_{\tau\sim\theta}[\sum_{t=0}^T\nabla_\theta\log\pi_\theta(a_t\mid s_t)\sum_{t'=t}^T R(s_{t'},a_{t'},s_{t'+1})]$$

$$\hat{R}_t=\sum_{t'=t}^T R(s_{t'},a_{t'},s_{t'+1})$$

通常会在 $\hat{R}_t$ 中增加一项 baseline $b(s_t)$ 来使得奖励有正有负，便于优化。

$$\hat{R}_t=\sum_{t'=t}^T R(s_{t'},a_{t'},s_{t'+1}-b(s_t))$$

Advantages and Disadvantages of policy-gradient methods

Advantages over value-based methods

Policy-gradient methods can learn a stochastic policy while value functions can't.
Policy-gradient methods are more effective in high-dimensional action spaces and continuous actions spaces.
Policy-gradient methods have better convergence properties.

Disadvantages

Frequently, it converges to a local maximum instead of a global optimum.
Training goes slower, step by step which needs longer time.
Have high variance.

Actor-Critic Methods

Actor-Critic Method 融合了 value-based 和 policy-based methods，通过减少方差帮助模型训练更稳定。其中 Actor 其实就是 Agent，或者说是 Policy-Based Method 中的 $\pi_\theta$；而 Critic 相当于是 Value-Based Method 中的 Q Function，来衡量 action 的好坏。

Advantage Actor-Critic (A2C)

A2C 中的学习目标：A policy model $\pi_\theta(s)$；A value function $\hat{q}_w(s,a)$。具体的优化过程如下：

在每一步 timestep $t$，从环境中得到当前的状态 $S_t$
Actor Model 将 $S_t$ 作为输入，输出一个合适的 Action $A_t$
Critic Model 将 $S_t, A_t$ 共同作为输入，输出在该状态下采取该行动将得到的 Q-value 的估计
在环境中执行 $A_t$ 将会进行状态转移，得到一个 Reward $R_{t+1}$，环境的状态也会转移至 $S_{t+1}$
Actor Model 使用 Critic Model 给出的 Q-value 值进行参数的更新
$$\Delta \theta = \alpha\nabla_\theta(\log \pi_\theta(s,a))\hat{q}_w(s,a)$$
Actor Model 经历过参数更新后，会根据新状态 $S_{t+1}$ 给出对应的合适的 Action $A_{t+1}$
Critic Model 则利用 Q-Learning 的呃方式，更新其参数
$$\Delta w=\beta(r_{t+1}+\gamma\hat{q}_w(s_{t+1},a_{t+1})-\hat{q}_w(s_t,a_t))\nabla_w \hat{q}_w(s_t,a_t)$$

进一步地，将 Q-Function 改为 Advantage Function 来稳定模型的训练

$$\begin{aligned} A(s,a)&=Q(s,a)-V(s)\\ &=r+\gamma V(s')-V(s) \end{aligned}$$

这样只需要估计具有更小方差的 state value 就可以进行 Critic Model 的优化，可以将 $\Delta w$ 中的 $r_{t+1}+\gamma\hat{q}_w(s_{t+1},a_{t+1})-\hat{q}_w(s_t,a_t)$ 替换为 $A(s,a)$。

End. 后续将从 PPO 的前身 TRPO 开启 PPO。

Reference

Contact

There may be some errors present. If you find any, please feel free to contact me at wangqiyao@mail.dlut.edu.cn. I would appreciate it!