feat: add paper/ LaTeX draft, English data scripts, update progress docs

- paper/: 22-page LaTeX framework (7/10 sections complete, compiles cleanly)
  main.tex + 10 section files + refs.bib + compiled PDF (329KB)
- code/scripts/: three English dataset generation & merging scripts
  generate_english.py / generate_english_targeted.py / merge_v5.py
- CLAUDE.md: update paper writing status, add paper/ file map entry
- state.md: add section 8 paper writing progress (2026-05-15)
- .gitignore: add LaTeX build artifact exclusion rules

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 11:19:39 +08:00
parent b50cf395ab
commit 804ebd2f77
19 changed files with 3047 additions and 3 deletions
--- a/paper/sections/06_moduleC.tex
+++ b/paper/sections/06_moduleC.tex
@@ -0,0 +1,194 @@
+% ============================================================
+\section{Module C: RL-Based Adaptive Intervention Policy}
+\label{sec:moduleC}
+% ============================================================
+
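+\subsection{Problem Formulation}
+
+We model the choice of intervention action as a Markov decision process (MDP).
+Given the detection result $D_t$ at the current time step $t$ and the contextual information,
+the policy $\pi$ outputs an intervention action $a_t$: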
+
+\begin{equation}
+    a_t = \pi(s_t),\quad s_t = f(D_t,\ e_{H_\text{pool}},\ e_{P_\text{pool}},\ t_\text{norm})
+\end{equation}
+
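+\subsubsection{Action Space}
+
+The set of intervention actions $\mathcal{A} = \{\text{PASS, WARN, REWRITE, REJECT, CRISIS}\}$ is defined as follows:
+
+\begin{itemize}
+    \item \textbf{PASS}: let the reply through with no intervention (for safe content)
+    \item \textbf{WARN}: send the user a gentle notice (for mildly inappropriate content)
+    \item \textbf{REWRITE}: rewrite the AI reply to remove risky content (for medium-to-high risk)
+    \item \textbf{REJECT}: reject the current reply and request regeneration (for high-risk content that cannot be rewritten)
+    \item \textbf{CRISIS}: crisis guidance that forcibly inserts psychological support resources and real-world help-seeking information (for R1 crisis scenarios)
+\end{itemize}
+
+These five actions cover the full spectrum of intervention responses used in real platform operations,
+and they differ widely in cost and benefit: PASS is the least intrusive and CRISIS is the strongest intervention.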
+
+\subsubsection{State Space}
+
+The state vector $s_t \in \mathbb{R}^{2065}$ is the concatenation of the following components:
+
+\begin{equation}
+    s_t = [d_\text{score}(1)\ |\ l^\text{det}_\text{onehot}(5)\ |\ c_\text{primary\_probs}(10)\ |\ e_{H_\text{pool}}(1024)\ |\ e_{P_\text{pool}}(1024)\ |\ t_\text{norm}(1)]
+\end{equation}
+
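+Here $d_\text{score}$ is the risk probability output by the detector,
+$l^\text{det}_\text{onehot}$ is the risk level predicted by the detector (one-hot encoded; the detector's prediction is used rather than the ground truth),
+$c_\text{primary\_probs}$ is the softmax probability over the 10 primary risk categories,
+$e_{H_\text{pool}}$ and $e_{P_\text{pool}}$ are the pooled MacBERT embeddings of the dialogue history and the persona setting,
+and $t_\text{norm}$ is the normalized index of the current turn.
+
+Note that the state vector strictly uses the detector's \textit{predictions}
+rather than the ground-truth labels, so that training conditions match deployment conditions.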
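+
+As a quick consistency check on the stated dimension, using only the component sizes given in the state definition above, the pieces sum to
+\[
+    1 + 5 + 10 + 1024 + 1024 + 1 = 2065,
+\]
+which matches $s_t \in \mathbb{R}^{2065}$.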
+
+\subsection{Reward Function Design}
+
+The reward function $r(s_t, a_t)$ combines the following multi-objective components:
+
+\begin{equation}
+r = w_1 \cdot r_\text{safety} - w_2 \cdot r_\text{fneg} + w_3 \cdot r_\text{crisis} - w_4 \cdot r_\text{over} - w_5 \cdot r_\text{ux}
+\end{equation}
+
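+\begin{itemize}
+    \item $r_\text{safety}$: safety gain, a positive reward when an appropriate intervention is applied to high-risk content ($w_1=2.0$)
+    \item $r_\text{fneg}$: false-negative penalty, a strong penalty when an L3/L4 sample receives PASS ($w_2=3.0$)
+    \item $r_\text{crisis}$: crisis-guidance reward, an extra reward when CRISIS is triggered in an R1 crisis scenario ($w_3=4.0$)
+    \item $r_\text{over}$: over-refusal penalty, applied when safe content receives REWRITE or a stronger intervention ($w_4=1.5$)
+    \item $r_\text{ux}$: experience cost, the user-experience loss caused by strong intervention actions ($w_5=0.5$)
+\end{itemize}
+
+This multi-objective reward explicitly models the trade-off between ``safety assurance'' and ``user experience'',
+and prevents the policy from collapsing into aggressive refusal (REJECT for everything) or passive release (PASS for everything).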
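+
+Substituting the weights listed above into the reward gives
+\[
+    r = 2.0\, r_\text{safety} - 3.0\, r_\text{fneg} + 4.0\, r_\text{crisis} - 1.5\, r_\text{over} - 0.5\, r_\text{ux}.
+\]
+As an illustration only, suppose each component is a $\{0,1\}$ indicator (an assumption made for this example; the component magnitudes are not specified here).
+Passing an L4 sample would then activate only $r_\text{fneg}$ and yield $r = -3.0$,
+whereas applying REWRITE to safe content would activate $r_\text{over}$ and, if REWRITE is counted as a strong intervention, also $r_\text{ux}$, yielding $r = -1.5 - 0.5 = -2.0$.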
+
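+\subsection{Policy Network}
+
+The actor-critic network takes the state vector $s_t \in \mathbb{R}^{2065}$ as input:
+
+\begin{equation}
+    \text{StateEncoder}:\ \mathbb{R}^{2065} \to \mathbb{R}^{256}
+    \quad \text{(2-layer MLP + LayerNorm + GELU)}
+\end{equation}
+
+Both the actor head and the critic head take the 256-dimensional hidden representation as input
+and output the logits over the 5 actions and the state-value estimate, respectively.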
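+
+One minimal way to write the two heads, assuming each is a single linear layer on the hidden representation $h_t = \text{StateEncoder}(s_t)$ (the symbols $W_\pi$, $b_\pi$, $w_V$, $b_V$ are introduced here for illustration, and the actual head architecture may differ), is
+\[
+    \pi(a \mid s_t) = \text{softmax}(W_\pi h_t + b_\pi)_a,\quad W_\pi \in \mathbb{R}^{5 \times 256},
+    \qquad
+    V(s_t) = w_V^\top h_t + b_V,\quad w_V \in \mathbb{R}^{256}.
+\]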
+
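+\subsection{Two-Stage Training}
+
+\subsubsection{Stage 1: Behavior Cloning Warm-up (BC)}
+
+Using the recommended action $a_\text{recommend}$ in the dataset as the supervision signal,
+we pre-train the policy network with behavior cloning for 5 epochs ($\text{lr}=10^{-3}$, batch size 256).
+The BC stage lets the model quickly learn the basic intervention patterns implied by the annotations
+and avoids the inefficiency of PPO exploring from a random policy.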
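+
+A common form of the BC objective for a discrete action space, given here as an assumption since the loss is not stated above, is the cross-entropy against the recommended action:
+\[
+    \mathcal{L}_\text{BC}(\theta) = -\,\mathbb{E}_{(s,\,a_\text{recommend})}\big[\log \pi_\theta(a_\text{recommend} \mid s)\big].
+\]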
+
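+\subsubsection{Stage 2: PPO Reinforcement Learning Optimization}
+
+Starting from the BC warm-up, we use PPO~\cite{schulman2017ppo}
+to perform offline RL optimization on the CompanionRisk-Bench training set: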
+
+\begin{table}[ht]
+\centering
+\caption{Module C PPO training configuration}
+\label{tab:moduleC_train}
+\begin{tabular}{ll}
+\toprule
+Setting & Value \\
+\midrule
+Total interaction steps & 200,000 \\
+Steps per rollout & 2,048 \\
+PPO update epochs & 4 \\
+Batch size & 256 \\
+Learning rate & $3 \times 10^{-4}$ \\
+Clipping coefficient $\epsilon$ & 0.2 \\
+Entropy coefficient & 0.01 \\
+Discount factor $\gamma$ & 0.99 \\
+GAE $\lambda$ & 0.95 \\
+GPU & 1 $\times$ RTX 5090 (single GPU) \\
+\bottomrule
+\end{tabular}
+\end{table}
+
+Note that the PPO stage is forced to run on a single GPU to avoid the CUDA memory-access errors
+triggered by \texttt{torch.distributed.barrier()} on the RTX 5090.
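+
+For reference, the standard PPO clipped surrogate objective from \cite{schulman2017ppo}, with probability ratio $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_\text{old}}(a_t \mid s_t)$ and advantage estimate $\hat{A}_t$, is
+\[
+    L^\text{CLIP}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(\rho_t(\theta)\,\hat{A}_t,\ \text{clip}(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big],
+\]
+and the GAE advantage, in its untruncated form, is $\hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^l \delta_{t+l}$ with $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$.
+With $\epsilon = 0.2$ from Table~\ref{tab:moduleC_train}, the ratio is clipped to $[0.8, 1.2]$.
+Assuming each update epoch makes one pass over the rollout in minibatches of 256 (the usual PPO schedule, not confirmed here), the configuration corresponds to
+\[
+    \frac{200{,}000}{2{,}048} \approx 97.7\ \text{rollouts},\qquad
+    \frac{2{,}048}{256} \times 4 = 32\ \text{gradient steps per rollout}.
+\]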
+
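+\subsection{Experimental Results}
+
+\subsubsection{Main Results}
+
+\todo{Module C v5 results to be filled in for this section. The v3 numbers in the table below are for reference only and will be replaced once v5 is finished.}
+
+Table~\ref{tab:moduleC_main} compares Module C with two baseline policies:
+Rule-based (REJECT if l\_risk$\geq3$, otherwise PASS)
+and Threshold Baseline (per-action thresholds set on the risk score).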
+
+\begin{table}[ht]
+\centering
+\caption{Comparison of Module C intervention policies (test set, $n=1{,}486$)}
+\label{tab:moduleC_main}
+\begin{tabular}{lccccc}
+\toprule
+Method & SafetyRecall & OverRefusal & ActionAcc & CrisisPrecision & UX Fscore \\
+\midrule
+Rule-based & 0.908 & 0.000 & — & — & 0.952 \\
+Threshold & 0.908 & 0.000 & — & 0.624 & 0.952 \\
+LLM-as-judge & \todo{} & \todo{} & \todo{} & \todo{} & \todo{} \\
+\midrule
+\textbf{Ours (RL v5)} & \todo{} & \todo{} & \todo{} & \todo{} & \todo{} \\
+(reference: RL v3) & 1.000 & 0.004 & 0.575 & 0.421 & 0.998 \\
+\bottomrule
+\end{tabular}
+\end{table}
+
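+\subsubsection{Action Distribution by Risk Level}
+
+Table~\ref{tab:per_level_action} shows the action distribution of the three methods at each risk level,
+which illustrates the fine-grained decisions made by the RL policy.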
+
+\begin{table}[ht]
+\centering
+\caption{Action distribution by risk level (test set, v3 results; to be replaced by v5)}
+\label{tab:per_level_action}
+\resizebox{\textwidth}{!}{%
+\begin{tabular}{llrrrrrr}
+\toprule
+Method & Level & $n$ & PASS & WARN & REWRITE & REJECT & CRISIS \\
+\midrule
+\multirow{5}{*}{Rule-based}
+& L0 Safe & 237 & 1.000 & 0.000 & 0.000 & 0.000 & 0.000 \\
+& L1 Mild & 280 & 0.918 & 0.000 & 0.000 & 0.082 & 0.000 \\
+& L2 Moderate & 317 & 0.420 & 0.000 & 0.000 & 0.580 & 0.000 \\
+& L3 High & 456 & 0.114 & 0.000 & 0.000 & 0.886 & 0.000 \\
+& L4 Critical & 196 & 0.041 & 0.000 & 0.000 & 0.959 & 0.000 \\
+\midrule
+\multirow{5}{*}{Threshold}
+& L0 Safe & 237 & 1.000 & 0.000 & 0.000 & 0.000 & 0.000 \\
+& L1 Mild & 280 & 0.843 & 0.075 & 0.082 & 0.000 & 0.000 \\
+& L2 Moderate & 317 & 0.044 & 0.375 & 0.552 & 0.000 & 0.028 \\
+& L3 High & 456 & 0.009 & 0.105 & 0.739 & 0.000 & 0.147 \\
+& L4 Critical & 196 & 0.000 & 0.041 & 0.316 & 0.000 & 0.643 \\
+\midrule
+\multirow{5}{*}{\textbf{Ours (RL v3, reference)}}
+& L0 Safe & 237 & 0.987 & 0.008 & 0.004 & 0.000 & 0.000 \\
+& L1 Mild & 280 & 0.729 & 0.011 & 0.229 & 0.000 & 0.032 \\
+& L2 Moderate & 317 & 0.000 & 0.000 & 0.902 & 0.000 & 0.098 \\
+& L3 High & 456 & 0.000 & 0.000 & 0.871 & 0.000 & 0.129 \\
+& L4 Critical & 196 & 0.000 & 0.000 & 0.633 & 0.000 & 0.367 \\
+\bottomrule
+\end{tabular}
+}
+\end{table}
+
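+The RL policy has two core advantages:
+(1) at levels L2 and L3 it mostly chooses REWRITE rather than a blunt REJECT,
+which balances safety and user experience;
+(2) the PASS rate on L3/L4 samples is 0.0\% and safety recall reaches 1.0,
+whereas the rule-based baseline wrongly passes 9.2\% of high-risk samples
+because of errors in the detector's level prediction (level\_weighted\_f1=0.559).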
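+
+The 9.2\% figure can be recovered from the Rule-based rows of Table~\ref{tab:per_level_action} by weighting the PASS rates of L3 and L4 by their sample counts:
+\[
+    \frac{456 \times 0.114 + 196 \times 0.041}{456 + 196} = \frac{51.984 + 8.036}{652} \approx 0.092.
+\]
+Assuming SafetyRecall is the fraction of L3/L4 samples that are not passed, the rule-based value is $1 - 0.092 = 0.908$, in line with Table~\ref{tab:moduleC_main}.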
+
+\subsubsection{Ablation Study}
+
+\todo{Ablation experiments to be added (BC-only / w/o category-specific reward / after v5 is finished)}