feat: Module C v5/v6 training complete, ablations, SOTA baselines, paper updates

- Module C: BC+PPO training v5/v6 done; eval results in experiments/eval_intervention_v{5,6}.json
- Reward: v5 label-aligned constrained reward (code/src/rl/reward.py)
- Ablations: Module B (history_r, response_only, full) + Module C (wo_category_reward)
- SOTA baselines: WildGuard and ShieldGemma2b eval scripts and results
- Paper: update sections 05–08 (Module B/C description, experiments table, discussion)
- Docs: add record.md (change log), update state.md and exp.md; retire change.md
- Tools: add html-to-ppt utilities and run_shieldgemma2b.sh
- Configs: add ablation YAML configs for Module B and C
- Cleanup: remove stale reference/ PNG screenshots

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -35,31 +35,40 @@
\textbf{Detection baselines}:
L1a (keyword matching), L1b (regex dictionary), L1c (combined);
\todo{L2: Llama Guard v2, WildGuard, OpenAI Moderation (to be run)}
L2a (ShieldGemma-2B, binary F1=0.027, FNR=0.987), L2b (WildGuard, binary F1=0.038, FNR=0.981)

\textbf{Intervention baselines}:
Rule-based (REJECT if $l_\text{risk} \geq 3$, PASS otherwise),
Threshold Baseline (maps risk-score thresholds to actions),
\todo{LLM-as-judge (Qwen2.5-72B judging directly, to be run)}
LLM-as-judge (Qwen/Qwen2.5-72B-Instruct judging the intervention action directly in a zero-shot setting, temperature=0)
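
The Rule-based baseline described above amounts to a fixed mapping from the risk level to an action. As a minimal sketch, with $\pi_\text{rule}$ an illustrative name that does not appear in the original text:
\[
\pi_\text{rule}(l_\text{risk}) =
\begin{cases}
\text{REJECT}, & l_\text{risk} \geq 3, \\
\text{PASS}, & \text{otherwise}.
\end{cases}
\]
Such a policy only ever emits two of the available actions, which is one way to read the contrast with a learned policy.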
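\subsection{RQ1: Detection Performance Analysis}

Detailed results are given in Section~\ref{sec:moduleB}, Table~\ref{tab:moduleB_main} and Table~\ref{tab:per_category_recall}.

Module B outperforms the baselines by a wide margin on all metrics.
Notably, general-purpose guard models (\todo{Llama Guard v2, WildGuard})
are expected to show recall on companion-specific risk categories (R3 emotional manipulation, R4 reality isolation, etc.)
that is significantly lower than their overall level,
which demonstrates the need for the CompanionRisk Taxonomy.
Notably, both general-purpose guard models fail severely:
ShieldGemma-2B (FNR=0.987) and WildGuard (FNR=0.981)
reach 0.0\% recall on companion-specific categories such as R3 emotional manipulation, R4 reality isolation, and R10 boundary-crossing intimacy,
and their overall miss rate is even higher than that of the simple keyword-rule baseline (L1c, FNR=0.816).
This result indicates a systematic mismatch between general-purpose safety taxonomies and the Chinese companion scenario,
which Module B (FNR=0.000) closes through its dedicated taxonomy and context-aware architecture.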
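
For reference, assuming FNR follows the standard definition over the binary risky/safe labels (the text does not restate it), it is the complement of recall:
\[
\mathrm{FNR} = \frac{\mathrm{FN}}{\mathrm{TP} + \mathrm{FN}} = 1 - \mathrm{Recall}.
\]
Under that assumption, FNR=0.987 for ShieldGemma-2B corresponds to an overall recall of $1 - 0.987 = 0.013$, and FNR=0.981 for WildGuard to $1 - 0.981 = 0.019$.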
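\subsection{RQ2: Comparison of Intervention Policies}

\todo{Main results for this section to be filled in once Module C v5 is complete.}

Key findings (based on v3 results):
the RL policy outperforms both baseline policies on safety\_recall (1.0 vs 0.908) and
UX F-score (0.998 vs 0.952),
demonstrating the advantage of a learnable intervention policy over fixed rules.
The RL policy (safety\_recall=0.953, UX F-score=0.976)
clearly outperforms all baselines.
LLM-as-judge (Qwen2.5-72B, zero-shot) performs worst (safety\_recall=0.397, over\_refusal=0.211, UX F-score=0.528):
a per-level analysis of the action distribution shows that this model tends to output WARN rather than REWRITE on L3/L4 high-risk content (PASS+WARN account for 63.6\% of L3 high-risk samples),
while misclassifying 11.0\% of safe samples as CRISIS, indicating that, in the companion-specific five-action space,
a zero-shot LLM has systematic difficulty calibrating in both directions between safety and user experience;
this further underlines the need for task-specific reinforcement learning.
The Rule-based (0.908 / 0.952) and Threshold (0.908 / 0.952) baselines, although simple, achieve higher safety\_recall than the zero-shot LLM.
On action\_accuracy, the RL policy (0.706) improves over pure behavior cloning, BC-only (0.696), by 1.0pp (about 1.4\% relative),
which supports the need for the PPO stage in learning fine-grained actions.
BC-only reaches a high safety\_recall (0.940),
but its action\_accuracy and crisis\_precision are both lower than those of the full RL policy,
showing that the reinforcement learning stage effectively improves action precision.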
\subsection{RQ3: Ablation Study}