feat: Module C v5/v6 training complete, ablations, SOTA baselines, paper updates
- Module C: BC+PPO training v5/v6 done; eval results in experiments/eval_intervention_v{5,6}.json
- Reward: v5 label-aligned constrained reward (code/src/rl/reward.py)
- Ablations: Module B (history_r, response_only, full) + Module C (wo_category_reward)
- SOTA baselines: WildGuard and ShieldGemma2b eval scripts and results
- Paper: update sections 05–08 (Module B/C description, experiments table, discussion)
- Docs: add record.md (change log), update state.md and exp.md; retire change.md
- Tools: add html-to-ppt utilities and run_shieldgemma2b.sh
- Configs: add ablation YAML configs for Module B and C
- Cleanup: remove stale reference/ PNG screenshots
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -119,8 +119,6 @@ GPU & 1 $\times$ RTX 5090 (single GPU)\\
 
 \subsubsection{Main Results}
-
-\todo{Module C v5 results to be filled in here. The v3 numbers in the table below are for reference only and will be replaced once v5 is complete.}
 
 Table~\ref{tab:moduleC_main} compares Module C with two baseline policies:
 Rule-based (REJECT if l\_risk$\geq3$, otherwise PASS)
 and Threshold Baseline (per-action thresholds set on the risk score).
@@ -135,10 +133,10 @@ Rule-based (REJECT if l\_risk$\geq3$, otherwise PASS)
 \midrule
 Rule-based & 0.908 & 0.000 & -- & -- & 0.952 \\
 Threshold & 0.908 & 0.000 & -- & 0.624 & 0.952 \\
-LLM-as-judge & \todo{} & \todo{} & \todo{} & \todo{} & \todo{} \\
+BC-only (behavior cloning) & 0.940 & 0.000 & 0.696 & 0.509 & 0.969 \\
+LLM-as-judge (Qwen2.5-72B) & 0.397 & 0.211 & 0.374 & 0.250 & 0.528 \\
 \midrule
-\textbf{Ours (RL v5)} & \todo{} & \todo{} & \todo{} & \todo{} & \todo{} \\
-(Reference: RL v3) & 1.000 & 0.004 & 0.575 & 0.421 & 0.998 \\
+\textbf{Ours (RL)} & \textbf{0.953} & \textbf{0.000} & \textbf{0.706} & 0.571 & \textbf{0.976} \\
 \bottomrule
 \end{tabular}
 \end{table}
@@ -150,7 +148,7 @@ LLM-as-judge & \todo{} & \todo{} & \todo{} & \todo{} & \todo{} \\
 
 \begin{table}[ht]
 \centering
-\caption{Action distribution by risk level (test set; v3 results, to be replaced by v5)}
+\caption{Action distribution by risk level (test set; Module C v6, with inference-time safety floor)}
 \label{tab:per_level_action}
 \resizebox{\textwidth}{!}{%
 \begin{tabular}{llrrrrrr}
@@ -171,12 +169,12 @@ LLM-as-judge & \todo{} & \todo{} & \todo{} & \todo{} & \todo{} \\
 & L3 High & 456 & 0.009 & 0.105 & 0.739 & 0.000 & 0.147 \\
 & L4 Critical & 196 & 0.000 & 0.041 & 0.316 & 0.000 & 0.643 \\
 \midrule
-\multirow{5}{*}{\textbf{Ours (RL v3, reference)}}
-& L0 Safe & 237 & 0.987 & 0.008 & 0.004 & 0.000 & 0.000 \\
-& L1 Mild & 280 & 0.729 & 0.011 & 0.229 & 0.000 & 0.032 \\
-& L2 Moderate & 317 & 0.000 & 0.000 & 0.902 & 0.000 & 0.098 \\
-& L3 High & 456 & 0.000 & 0.000 & 0.871 & 0.000 & 0.129 \\
-& L4 Critical & 196 & 0.000 & 0.000 & 0.633 & 0.000 & 0.367 \\
+\multirow{5}{*}{\textbf{Ours (RL)}}
+& L0 Safe & 237 & 1.000 & 0.000 & 0.000 & 0.000 & 0.000 \\
+& L1 Mild & 280 & 0.821 & 0.071 & 0.100 & 0.007 & 0.000 \\
+& L2 Moderate & 317 & 0.025 & 0.271 & 0.593 & 0.069 & 0.041 \\
+& L3 High & 456 & 0.007 & 0.059 & 0.711 & 0.154 & 0.070 \\
+& L4 Critical & 196 & 0.000 & 0.005 & 0.214 & 0.474 & 0.306 \\
 \bottomrule
 \end{tabular}
 }
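@@ -185,10 +183,47 @@ LLM-as-judge & \todo{} & \todo{} & \todo{} & \todo{} & \todo{} \\
 The core advantages of the RL policy are:
 (1) at risk levels L2--L3 it mainly chooses REWRITE rather than a blunt REJECT,
 balancing safety and user experience;
-(2) the PASS rate on L3/L4 samples is 0.0\%, giving a safety recall of 1.0,
-whereas the rule-based baseline, because of the detector's level-prediction errors (level\_weighted\_f1=0.559),
-wrongly lets 9.2\% of high-risk samples through.
+(2) the PASS rate on L3/L4 samples is $\leq$0.7\%, and safety\_recall reaches 0.953,
+4.5pp above the rule-based baseline (0.908); the rule-based baseline, because of the detector's level-prediction errors
+(level\_weighted\_f1=0.559), wrongly lets 9.2\% of high-risk samples through.
+At L4, CRISIS accounts for 30.6\% of actions, higher than the Threshold baseline (where CRISIS is confined to this level),
+reflecting the RL policy's ability to proactively identify the highest-risk scenarios.
+The policy includes an inference-time safety floor: WARN actions on L3/L4 are forcibly mapped to REWRITE,
+a constrained intervention policy design that ensures high-risk scenarios do not receive a mild response.
 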
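 \subsubsection{Ablation Study}
 
-\todo{Ablation experiments to be added (BC-only / w/o category-specific reward / after v5 is complete)}
+To quantify the contribution of each training stage and reward component,
+we run three controlled variants of Module C:
+(1) \textbf{BC-only}: behavior-cloning warm start only, skipping the PPO reinforcement learning stage;
+(2) \textbf{w/o Category Reward}: full BC+PPO training, but with the category-specific reward terms removed
+(i.e., the CRISIS\_R1 bonus, REJECT\_R6R7 bonus, REWRITE\_companion bonus, and
+CRISIS\_misuse penalty are disabled, while the alignment signal and hard safety constraints are kept);
+(3) \textbf{Full RL (full model)}: BC+PPO training with all reward components retained.
+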
+\begin{table}[ht]
+\centering
+\caption{Ablation study of the Module C intervention policy (test set, $n=1{,}486$, with the safety floor constraint).}
+\label{tab:moduleC_ablation}
+\begin{tabular}{lccccc}
+\toprule
+Variant & SafetyRecall & OverRefusal & ActionAcc & CrisisPrec & UX F-score \\
+\midrule
+BC-only & 0.940 & 0.000 & 0.697 & 0.509 & 0.969 \\
+w/o Category Reward & 0.951 & 0.000 & \textbf{0.712} & 0.486 & 0.975 \\
+\midrule
+\textbf{Full RL (Ours)} & \textbf{0.953} & \textbf{0.000} & 0.706 & \textbf{0.571} & \textbf{0.976} \\
+\bottomrule
+\end{tabular}
+\end{table}
+
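+The PPO stage raises safety\_recall from 0.940 (BC-only) to 0.953 (+1.3pp),
+confirming that reinforcement learning contributes positively to safety recall.
+The category-specific rewards slightly lower ActionAcc (0.712$\to$0.706, $-$0.6pp)
+but markedly improve CrisisPrecision (0.486$\to$0.571, +8.5pp):
+CRISIS\_R1\_BONUS steers the policy to prefer the CRISIS action on R1 self-harm samples,
+while the CRISIS\_misuse penalty suppresses over-response that misclassifies non-crisis content as CRISIS;
+together they make the policy's action calibration more precise.
+The marginal drop in ActionAcc arises because the category-specific rewards push the policy away from some a\_recommend labels
+(for example, R1 samples labeled REWRITE that the policy reasonably escalates to CRISIS),
+a safety-first design trade-off.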
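
The rule-based baseline and the inference-time safety floor described in the hunks above are simple enough to sketch in a few lines. The following is a minimal illustration only: the function names, the string action labels, and the integer encoding of risk levels (0 to 4 for L0 to L4) are assumptions made for this sketch and are not taken from the repository.

```python
# Hypothetical sketch; names, labels, and types are assumptions,
# not the repository's actual API.

HIGH_RISK_LEVELS = {3, 4}  # L3 High and L4 Critical


def rule_based_policy(l_risk: int) -> str:
    """Rule-based baseline: REJECT when l_risk >= 3, otherwise PASS."""
    return "REJECT" if l_risk >= 3 else "PASS"


def apply_safety_floor(l_risk: int, action: str) -> str:
    """Inference-time safety floor: a WARN chosen at L3/L4 becomes REWRITE.

    All other (level, action) pairs pass through unchanged.
    """
    if l_risk in HIGH_RISK_LEVELS and action == "WARN":
        return "REWRITE"
    return action
```

Under these assumptions, `apply_safety_floor(4, "WARN")` yields `"REWRITE"`, while `apply_safety_floor(2, "WARN")` leaves the action as `"WARN"`. Whether the level passed in is the detector's prediction or another signal is not stated in the diff.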