% ============================================================
\section{Module C: RL-Based Adaptive Intervention Policy}
\label{sec:moduleC}
% ============================================================

\subsection{Problem Formulation}

We formulate the selection of intervention actions as a Markov decision process (MDP).
Given the detection result $D_t$ at time step $t$ together with contextual information,
the policy $\pi$ outputs an intervention action $a_t$:

\begin{equation}
a_t = \pi(s_t),\quad s_t = f(D_t,\ e_{H_\text{pool}},\ e_{P_\text{pool}},\ t_\text{norm})
\end{equation}
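
\subsubsection{Action Space}

The set of intervention actions, $\mathcal{A} = \{\text{PASS, WARN, REWRITE, REJECT, CRISIS}\}$, is defined as follows:

\begin{itemize}
\item \textbf{PASS}: let the reply through without intervention (for safe content).
\item \textbf{WARN}: send the user a gentle reminder (for mildly inappropriate content).
\item \textbf{REWRITE}: rewrite the AI reply to remove risky content (for medium-to-high risk).
\item \textbf{REJECT}: reject the current reply and request regeneration (for high-risk content that cannot be rewritten).
\item \textbf{CRISIS}: crisis guidance that forcibly inserts psychological support resources and real-world help information (for R1 crisis scenarios).
\end{itemize}

These five actions cover the full range of intervention responses used in real platform operation,
and they differ greatly in cost and benefit: PASS is the least intrusive, and CRISIS is the strongest intervention.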

\subsubsection{State Space}

The state vector $s_t \in \mathbb{R}^{2065}$ is the concatenation of the following components (dimensions in parentheses):

\begin{equation}
s_t = [d_\text{score}(1)\ |\ l^{\text{det}}_{\text{onehot}}(5)\ |\ c_\text{primary\_probs}(10)\ |\ e_{H_\text{pool}}(1024)\ |\ e_{P_\text{pool}}(1024)\ |\ t_\text{norm}(1)]
\end{equation}

Here $d_\text{score}$ is the risk probability output by the detector;
$l^{\text{det}}_{\text{onehot}}$ is the one-hot encoding of the risk level predicted by the detector;
$c_\text{primary\_probs}$ is the softmax probability over the 10 primary risk categories;
$e_{H_\text{pool}}$ and $e_{P_\text{pool}}$ are the MacBERT pooled embeddings of the dialogue history and the persona setting;
and $t_\text{norm}$ is the normalized index of the current turn.

Note that the state vector strictly uses the detector's \textit{predicted} values
rather than ground-truth annotations, so that training conditions match deployment conditions.
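
As a quick consistency check on the stated dimensionality, the component sizes listed above sum to
\[
1 + 5 + 10 + 1024 + 1024 + 1 = 2065,
\]
which matches $s_t \in \mathbb{R}^{2065}$. The two pooled embeddings contribute $2048$ of these coordinates (about $99.2\%$), so the detector outputs and the turn index occupy only $17$ dimensions.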

\subsection{Reward Function Design}

The reward function $r(s_t, a_t)$ combines the following objective components:

\begin{equation}
r = w_1 \cdot r_\text{safety} - w_2 \cdot r_\text{fneg} + w_3 \cdot r_\text{crisis} - w_4 \cdot r_\text{over} - w_5 \cdot r_\text{ux}
\end{equation}

\begin{itemize}
\item $r_\text{safety}$: safety gain, a positive reward for applying an appropriate intervention to high-risk content ($w_1=2.0$).
\item $r_\text{fneg}$: missed-detection (false-negative) penalty, a strong penalty when an L3/L4 sample receives PASS ($w_2=3.0$).
\item $r_\text{crisis}$: crisis-guidance reward, an extra reward when CRISIS is triggered in an R1 crisis scenario ($w_3=4.0$).
\item $r_\text{over}$: over-refusal penalty, applied when safe content receives REWRITE or a stronger intervention ($w_4=1.5$).
\item $r_\text{ux}$: experience cost, the user-experience loss incurred by strong intervention actions ($w_5=0.5$).
\end{itemize}

This multi-objective reward explicitly models the trade-off between ``safety assurance'' and ``user experience'',
which prevents the policy from collapsing into aggressive refusal (REJECT for everything) or passive pass-through (PASS for everything).
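
To make the weighting concrete, consider a simplified reading in which every component is a $0/1$ indicator and both REJECT and CRISIS count as strong interventions for $r_\text{ux}$. These are assumptions made only for illustration, since the magnitudes of the individual components are not specified here. Under them, three representative cases evaluate to
\begin{align*}
\text{L4 sample, PASS:}\quad & r = -w_2 = -3.0, \\
\text{R1 crisis sample, CRISIS:}\quad & r = w_1 + w_3 - w_5 = 2.0 + 4.0 - 0.5 = 5.5, \\
\text{safe sample, REJECT:}\quad & r = -w_4 - w_5 = -1.5 - 0.5 = -2.0.
\end{align*}
In this reading a missed high-risk sample costs more than an over-refusal ($3.0$ versus $2.0$), and a correctly triggered CRISIS earns the largest reward, which matches the ordering of priorities implied by the weights.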

\subsection{Policy Network}

The actor-critic network takes the state vector $s_t \in \mathbb{R}^{2065}$ as input:

\begin{equation}
\text{StateEncoder}:\ \mathbb{R}^{2065} \to \mathbb{R}^{256}
\quad \text{(2-layer MLP + LayerNorm + GELU)}
\end{equation}

Both the actor head and the critic head take the 256-dimensional hidden representation as input
and output the logits over the 5 actions and the state-value estimate, respectively.
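
One way to write this architecture explicitly is sketched below. The placement of LayerNorm before GELU, the symbols $W_i$, $b_i$, $W_\pi$, $b_\pi$, $w_V$, $b_V$, and the use of single linear heads are illustrative assumptions; only the input and output sizes come from the description above.
\begin{align*}
h_t &= \text{StateEncoder}(s_t) = \phi\big(W_2\,\phi(W_1 s_t + b_1) + b_2\big) \in \mathbb{R}^{256},
\qquad \phi(x) = \text{GELU}(\text{LayerNorm}(x)), \\
\pi_\theta(a \mid s_t) &= \text{softmax}(W_\pi h_t + b_\pi)_a, \qquad W_\pi \in \mathbb{R}^{5 \times 256}, \\
V_\theta(s_t) &= w_V^\top h_t + b_V, \qquad w_V \in \mathbb{R}^{256}.
\end{align*}
Sharing $h_t$ between the two heads means the critic's value estimate and the actor's logits are computed from the same encoding of the detector outputs and embeddings.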
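
\subsection{Two-Stage Training}

\subsubsection{Stage 1: Behavior Cloning Warm-up (BC)}

Using the recommended action $a_\text{recommend}$ in the dataset as the supervision signal,
we pretrain the policy network with behavior cloning for 5 epochs ($\text{lr}=10^{-3}$, batch size 256).
The BC stage lets the model quickly learn basic intervention patterns that follow the annotation regularities,
avoiding the inefficiency of having PPO explore from a random policy.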
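
A standard objective for this stage is the cross-entropy between the actor's action distribution and the recommended action. The loss is not named in the text, so the form below is an assumption, with $\pi_\theta$ denoting the actor's softmax output and $\mathcal{B}$ a minibatch of $|\mathcal{B}| = 256$ training pairs:
\[
\mathcal{L}_\text{BC}(\theta) = -\frac{1}{|\mathcal{B}|} \sum_{(s,\, a_\text{recommend}) \in \mathcal{B}} \log \pi_\theta(a_\text{recommend} \mid s).
\]
Minimizing $\mathcal{L}_\text{BC}$ maximizes the likelihood of the annotated actions, which gives PPO a starting policy that already reflects the annotation patterns.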

\subsubsection{Stage 2: PPO Reinforcement Learning Optimization}

Starting from the BC warm-up, we apply the PPO algorithm~\cite{schulman2017ppo}
for offline RL optimization on the CompanionRisk-Bench training set:

\begin{table}[ht]
\centering
\caption{PPO training configuration for Module C}
\label{tab:moduleC_train}
\begin{tabular}{ll}
\toprule
Setting & Value \\
\midrule
Total interaction steps & 200,000 \\
Steps per rollout & 2,048 \\
PPO update epochs & 4 \\
Batch size & 256 \\
Learning rate & $3 \times 10^{-4}$ \\
Clipping coefficient $\epsilon$ & 0.2 \\
Entropy coefficient & 0.01 \\
Discount factor $\gamma$ & 0.99 \\
GAE $\lambda$ & 0.95 \\
GPU & 1 $\times$ RTX 5090 (single GPU) \\
\bottomrule
\end{tabular}
\end{table}
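
For reference, the hyperparameters in Table~\ref{tab:moduleC_train} enter the standard PPO formulation~\cite{schulman2017ppo} as follows. The value-loss coefficient $c_v$ is not listed in the table and is left symbolic; $\rho_t$, $\delta_t$, and $\hat{A}_t$ are the usual probability ratio, temporal-difference residual, and GAE advantage estimate, $L^{\text{V}}$ is the squared-error value loss, and $\mathcal{H}$ is the policy entropy.
\begin{align*}
\rho_t(\theta) &= \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}, \\
L^{\text{CLIP}}(\theta) &= \mathbb{E}_t\Big[\min\big(\rho_t(\theta)\,\hat{A}_t,\ \text{clip}(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big], \qquad \epsilon = 0.2, \\
\hat{A}_t &= \sum_{l \ge 0} (\gamma\lambda)^l\, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad \gamma\lambda = 0.99 \times 0.95 = 0.9405, \\
L(\theta) &= L^{\text{CLIP}}(\theta) - c_v\, L^{\text{V}}(\theta) + 0.01\, \mathcal{H}[\pi_\theta(\cdot \mid s_t)].
\end{align*}
If each rollout of 2,048 steps is split into non-overlapping minibatches of 256 (an assumption about how the batch size is applied), one rollout yields $2048/256 = 8$ minibatches and $8 \times 4 = 32$ gradient steps, and the 200,000-step budget corresponds to about $200{,}000 / 2{,}048 \approx 97.7$ rollouts.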

Note that the PPO stage is deliberately run on a single GPU to avoid the CUDA memory-access errors
triggered by \texttt{torch.distributed.barrier()} on the RTX 5090.

\subsection{Experimental Results}

\subsubsection{Main Results}

Table~\ref{tab:moduleC_main} compares Module C with two baseline policies,
Rule-based (REJECT if l\_risk $\geq 3$, otherwise PASS)
and Threshold Baseline (per-action thresholds on the risk score),
and additionally reports a BC-only variant and an LLM-as-judge baseline (Qwen2.5-72B).

\begin{table}[ht]
\centering
\caption{Comparison of Module C intervention policies (test set, $n=1{,}486$)}
\label{tab:moduleC_main}
\begin{tabular}{lccccc}
\toprule
Method & SafetyRecall & OverRefusal & ActionAcc & CrisisPrecision & UX F-score \\
\midrule
Rule-based & 0.908 & 0.000 & -- & -- & 0.952 \\
Threshold & 0.908 & 0.000 & -- & 0.624 & 0.952 \\
BC-only (behavior cloning) & 0.940 & 0.000 & 0.696 & 0.509 & 0.969 \\
LLM-as-judge (Qwen2.5-72B) & 0.397 & 0.211 & 0.374 & 0.250 & 0.528 \\
\midrule
\textbf{Ours (RL)} & \textbf{0.953} & \textbf{0.000} & \textbf{0.706} & 0.571 & \textbf{0.976} \\
\bottomrule
\end{tabular}
\end{table}

\subsubsection{Action Distribution by Risk Level}

Table~\ref{tab:per_level_action} shows the action distribution of the three methods at each risk level,
which makes the fine-grained decisions of the RL policy directly visible.

\begin{table}[ht]
\centering
\caption{Action distribution by risk level (test set, Module C v6, with the inference-time safety floor)}
\label{tab:per_level_action}
\resizebox{\textwidth}{!}{%
\begin{tabular}{llrrrrrr}
\toprule
Method & Level & $n$ & PASS & WARN & REWRITE & REJECT & CRISIS \\
\midrule
\multirow{5}{*}{Rule-based}
& L0 Safe & 237 & 1.000 & 0.000 & 0.000 & 0.000 & 0.000 \\
& L1 Mild & 280 & 0.918 & 0.000 & 0.000 & 0.082 & 0.000 \\
& L2 Moderate & 317 & 0.420 & 0.000 & 0.000 & 0.580 & 0.000 \\
& L3 High & 456 & 0.114 & 0.000 & 0.000 & 0.886 & 0.000 \\
& L4 Critical & 196 & 0.041 & 0.000 & 0.000 & 0.959 & 0.000 \\
\midrule
\multirow{5}{*}{Threshold}
& L0 Safe & 237 & 1.000 & 0.000 & 0.000 & 0.000 & 0.000 \\
& L1 Mild & 280 & 0.843 & 0.075 & 0.082 & 0.000 & 0.000 \\
& L2 Moderate & 317 & 0.044 & 0.375 & 0.552 & 0.000 & 0.028 \\
& L3 High & 456 & 0.009 & 0.105 & 0.739 & 0.000 & 0.147 \\
& L4 Critical & 196 & 0.000 & 0.041 & 0.316 & 0.000 & 0.643 \\
\midrule
\multirow{5}{*}{\textbf{Ours (RL)}}
& L0 Safe & 237 & 1.000 & 0.000 & 0.000 & 0.000 & 0.000 \\
& L1 Mild & 280 & 0.821 & 0.071 & 0.100 & 0.007 & 0.000 \\
& L2 Moderate & 317 & 0.025 & 0.271 & 0.593 & 0.069 & 0.041 \\
& L3 High & 456 & 0.007 & 0.059 & 0.711 & 0.154 & 0.070 \\
& L4 Critical & 196 & 0.000 & 0.005 & 0.214 & 0.474 & 0.306 \\
\bottomrule
\end{tabular}
}
\end{table}
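
The RL policy has two core advantages.
(1) At L2--L3 it mostly chooses REWRITE rather than a blunt REJECT,
balancing safety against user experience.
(2) The PASS rate on L3/L4 samples is at most 0.7\%, and safety\_recall reaches 0.953,
4.5pp above the rule-based baseline (0.908); the rule-based baseline, by contrast, wrongly passes 9.2\% of high-risk samples
because of errors in the detector's level prediction (level\_weighted\_f1 $=0.559$).
At L4, CRISIS accounts for 30.6\% of the RL policy's actions and REJECT for a further 47.4\%,
whereas the rule-based baseline never issues CRISIS and the Threshold baseline never issues REJECT;
the RL policy thus actively identifies the highest-risk scenarios while using the full range of strong actions.
The policy includes an inference-time safety floor: WARN actions on L3/L4 are forcibly mapped to REWRITE.
This is a constrained intervention policy design that ensures high-risk scenarios never receive a merely mild response.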
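
The safety floor can be written as a post-processing map on the action $a_t$ chosen by the policy. In the sketch below, $\hat{l}_t$ denotes the risk level that triggers the floor; taking it to be the detector-predicted level, rather than the annotated level used to group the rows of Table~\ref{tab:per_level_action}, is an assumption, made because the state vector only exposes predicted values.
\[
\tilde{a}_t =
\begin{cases}
\text{REWRITE}, & a_t = \text{WARN} \ \text{and}\ \hat{l}_t \in \{\text{L3}, \text{L4}\}, \\
a_t, & \text{otherwise.}
\end{cases}
\]
The 9.2\% figure for the rule-based baseline can be recovered from the PASS column of Table~\ref{tab:per_level_action}:
\[
\frac{456 \times 0.114 + 196 \times 0.041}{456 + 196} = \frac{60.0}{652} \approx 0.092 = 1 - 0.908 .
\]
Applying the same calculation to PASS and WARN combined for the RL policy gives $(456 \times 0.066 + 196 \times 0.005)/652 \approx 0.048$, that is, $1 - 0.048 \approx 0.952$, which agrees with the reported 0.953 up to rounding of the table entries. This suggests, although it is not stated explicitly, that SafetyRecall counts the L3/L4 samples that receive REWRITE or a stronger action.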

\subsubsection{Ablation Study}

To quantify the contribution of each training stage and reward component,
we run three controlled variants of Module C:
(1) \textbf{BC-only}: behavior-cloning warm start only, skipping the PPO stage;
(2) \textbf{w/o Category Reward}: full BC+PPO training with the category-specific reward terms removed
(i.e., the CRISIS\_R1 reward, the REJECT\_R6R7 reward, the REWRITE\_companion reward, and the
CRISIS\_misuse penalty are disabled, while the alignment signal and the hard safety constraints are kept);
(3) \textbf{Full RL (full model)}: BC+PPO training with all reward components.

\begin{table}[ht]
\centering
\caption{Ablation study of the Module C intervention policy (test set, $n=1{,}486$, with the safety floor constraint).}
\label{tab:moduleC_ablation}
\begin{tabular}{lccccc}
\toprule
Variant & SafetyRecall & OverRefusal & ActionAcc & CrisisPrecision & UX F-score \\
\midrule
BC-only & 0.940 & 0.000 & 0.697 & 0.509 & 0.969 \\
w/o Category Reward & 0.951 & 0.000 & \textbf{0.712} & 0.486 & 0.975 \\
\midrule
\textbf{Full RL (Ours)} & \textbf{0.953} & \textbf{0.000} & 0.706 & \textbf{0.571} & \textbf{0.976} \\
\bottomrule
\end{tabular}
\end{table}
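
The PPO stage raises safety\_recall from 0.940 (BC-only) to 0.953 (+1.3pp),
confirming that reinforcement learning contributes positively to safety recall.
The category-specific rewards slightly lower ActionAcc (0.712$\to$0.706, $-$0.6pp)
but markedly raise CrisisPrecision (0.486$\to$0.571, +8.5pp):
CRISIS\_R1\_BONUS steers the policy toward the CRISIS action on R1 self-harm samples,
while the CRISIS\_misuse penalty suppresses over-responses that misclassify non-crisis content as CRISIS;
together they make the policy's action calibration more precise.
The marginal drop in ActionAcc arises because the category-specific rewards push the policy away from some a\_recommend annotations
(for example, R1 samples annotated as REWRITE are reasonably escalated to CRISIS by the policy),
which is a safety-first design trade-off.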