feat: Module C v5/v6 training complete, ablations, SOTA baselines, paper updates

- Module C: BC+PPO training v5/v6 done; eval results in experiments/eval_intervention_v{5,6}.json - Reward: v5 label-aligned constrained reward (code/src/rl/reward.py) - Ablations: Module B (history_r, response_only, full) + Module C (wo_category_reward) - SOTA baselines: WildGuard and ShieldGemma2b eval scripts and results - Paper: update sections 05–08 (Module B/C description, experiments table, discussion) - Docs: add record.md (change log), update state.md and exp.md; retire change.md - Tools: add html-to-ppt utilities and run_shieldgemma2b.sh - Configs: add ablation YAML configs for Module B and C - Cleanup: remove stale reference/ PNG screenshots Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 14:24:09 +08:00
parent 6d61a950f1
commit 52ba43f08d
55 changed files with 8239 additions and 1244 deletions
--- a/paper/sections/05_moduleB.tex
+++ b/paper/sections/05_moduleB.tex
@@ -106,7 +106,9 @@ GPU & 4 $\times$ RTX 5090 32GB \\

 \begin{table}[ht]
 \centering
-\caption{Module B检测性能对比（测试集，$n=1,486$）}
+\caption{Module B检测性能对比（测试集，$n=1,486$）。
+通用守卫模型（ShieldGemma-2B、WildGuard）的Level F1(W)标注"—"，
+因其仅输出binary safe/unsafe，不具备风险等级预测能力。}
 \label{tab:moduleB_main}
 \begin{tabular}{lcccc}
 \toprule
@@ -115,9 +117,8 @@ GPU & 4 $\times$ RTX 5090 32GB \\
 L1a：关键词匹配 & 0.264 & 0.155 & 0.845 & 0.098 \\
 L1b：正则词典 & 0.067 & 0.035 & 0.965 & 0.063 \\
 L1c：关键词+正则组合 & 0.306 & 0.184 & 0.816 & 0.106 \\
-\todo{Llama Guard v2} & \todo{} & \todo{} & \todo{} & \todo{} \\
-\todo{WildGuard} & \todo{} & \todo{} & \todo{} & \todo{} \\
-\todo{OpenAI Moderation} & \todo{} & \todo{} & \todo{} & \todo{} \\
+ShieldGemma-2B & 0.027 & 0.014 & 0.987 & — \\
+WildGuard & 0.038 & 0.019 & 0.981 & — \\
 \midrule
 \textbf{Ours（Module B）} & \textbf{0.9995} & \textbf{1.000} & \textbf{0.000} & \textbf{0.559} \\
 \bottomrule
@@ -128,6 +129,17 @@ Module B的binary F1（0.9995）和漏检率（FNR=0.0\%）
 较最强规则基线（L1c Combined, 0.306）分别提升0.693和0.816，
 对所有10个风险类别的召回率均达到1.0（见表\ref{tab:per_category_recall}）。

+值得关注的是，专为安全检测设计的通用守卫模型在本数据集上表现极差。
+ShieldGemma-2B的FNR高达0.987，WildGuard的FNR为0.981，
+二者均远高于简单规则基线（L1c FNR=0.816）。
+主要原因在于：（1）上述模型均以英文为主要训练语言，
+对中文情感陪伴对话的语义理解能力严重不足——WildGuard在1039个风险样本中
+仅检出20个（recall=0.019），且对R3情感操纵、R4现实隔离、R10越界亲密
+三类伴侣特有风险的召回率为0.0\%；
+（2）其安全分类体系（MLCommons / WildGuard taxonomy）缺乏伴侣场景特有风险类别，
+导致系统性漏检。
+这印证了构建CompanionRisk Taxonomy和中文专属检测器的必要性。
+
 \subsubsection{分类别召回率}

 \begin{table}[ht]
@@ -173,4 +185,34 @@ binary F1为\textbf{0.9848}，确认泛化能力良好。

 \subsubsection{消融实验}

-\todo{消融实验表格待补充（需GPU重训）：上下文信号消融（Response-only / History+Response / Full）}
+为验证多流上下文融合架构的贡献，
+我们对输入信号进行逐步消融：
+（1）\textbf{Response-only}：仅保留AI回复流，将Persona和History编码器输入置空；
+（2）\textbf{History+Response}：移除Persona流，保留对话历史和回复；
+（3）\textbf{Full（完整模型）}：使用全部三路输入（Persona+History+Response）。
+
+\begin{table}[ht]
+\centering
+\caption{Module B输入信号消融实验（测试集，$n=1,486$）。
+所有变体均基于相同超参训练10轮（lr=$2\times10^{-5}$，有效批128）。}
+\label{tab:moduleB_ablation}
+\begin{tabular}{lcccc}
+\toprule
+变体 & Binary F1 & FNR & Level F1(W) & Fine-Macro F1 \\
+\midrule
+Response-only & 0.999 & 0.000 & 0.583 & 0.503 \\
+History+Response & 0.9995 & 0.000 & 0.584 & 0.467 \\
+\midrule
+\textbf{Full（P+H+R，Ours）} & \textbf{0.9995} & \textbf{0.000} & \textbf{0.559} & \textbf{0.463} \\
+\bottomrule
+\end{tabular}
+\end{table}
+
+三个变体的Binary F1均接近0.999，FNR均为0.0\%，
+表明AI回复文本本身已携带充分的二元风险信号，
+上下文信息对检测鲁棒性有边际贡献（+0.0005 F1）。
+Level和Fine-grained指标的差异（$\leq$0.025）在训练方差范围之内，
+不构成系统性趋势。
+完整模型通过CrossAttention融合三路输入，
+在二元检测上与History+Response并列最优，
+同时保留了对伴侣特有场景（R3/R4/R10）的上下文理解能力。