CompanionGuard-RL/state.md
zhangsiyuan 52ba43f08d feat: Module C v5/v6 training complete, ablations, SOTA baselines, paper updates
- Module C: BC+PPO training v5/v6 done; eval results in experiments/eval_intervention_v{5,6}.json
- Reward: v5 label-aligned constrained reward (code/src/rl/reward.py)
- Ablations: Module B (history_r, response_only, full) + Module C (wo_category_reward)
- SOTA baselines: WildGuard and ShieldGemma2b eval scripts and results
- Paper: update sections 05–08 (Module B/C description, experiments table, discussion)
- Docs: add record.md (change log), update state.md and exp.md; retire change.md
- Tools: add html-to-ppt utilities and run_shieldgemma2b.sh
- Configs: add ablation YAML configs for Module B and C
- Cleanup: remove stale reference/ PNG screenshots

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 14:24:09 +08:00

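# CompanionGuard-RL: Project Status
**Last updated: 2026-05-20 (P2 started: the pre-submission experiment-strengthening assessment is complete; items are to be implemented one by one)**
> Debugging history → `record.md` | Lessons-learned log → `exp.md` | Detailed submission assessment → `C:\Users\张思远\.claude\plans\sci2-3-precious-snail.md`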
---
## Module Status Overview
| Module | Status | Key metrics |
| -------------------------- | ------- | ---------------------------------------------------------------------------- |
| Dataset CompanionRisk-Bench v4 | ✅ Done | 9,896 samples, 14 labels, train/dev/test = 6926/1484/1486 |
| Module B detector v4 | ✅ Done | binary_f1=**0.9995**, FNR=0.00%, level_weighted_f1=0.559 |
| Module B generalization check | ✅ Done | human subset binary_f1=0.9848, no overfitting observed |
| Module C v3 (historical baseline) | ✅ Done | safety_recall=1.0, action_accuracy=0.575, crisis_precision=0.421 |
| Module C v5 (trained) | ⚠️ Partially meets targets | safety_recall=**0.833** ❌ (regression), action_accuracy=**0.717** ✅; caused by the WARN loophole in the reward |
| Module C v6 (final results) | ✅ Meets targets | safety_recall=**0.953** ✅, action_accuracy=**0.706** ✅, safety_ux_fscore=0.976 |
| Paper writing | ✅ Done | P0+P1 all complete; no `\todo{}` left in the paper (the IRB statement is handled separately per journal requirements) |
---
## Module B: Final Results (v4, frozen)
| Metric | Value |
| ------------------------- | ---------- |
| binary_f1 | **0.9995** |
| high_risk_recall | **1.0000** |
| false_negative_rate | **0.0000** |
| level_macro_f1 | 0.5496 |
| level_weighted_f1 | **0.5585** |
| fine_macro_f1 (all 14) | 0.4633 |
| fine_macro_f1 (public 10) | **0.484** |
Paper strategy: report binary_f1 + level_weighted_f1 + fine_macro_f1 (public) as the primary metrics; Module B will not be iterated further.
---
## Module C: Baseline v3 (`eval_intervention_v3.json`)
| Method | safety_recall(L3/L4) | over_refusal | action_accuracy | crisis_precision | safety_ux_fscore |
| ----------------------- | -------------------- | ------------ | --------------- | ---------------- | ---------------- |
| Rule-based (l≥3→REJECT) | 0.908 | 0.000 | — | — | 0.952 |
| Threshold Baseline | 0.908 | 0.000 | — | 0.624 | 0.952 |
| **Ours (RL v2)** | **1.000** | **0.004** | **0.575** | 0.421 | **0.998** |
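
The `safety_ux_fscore` column is not defined in this file. The reported values are consistent with a harmonic mean of `safety_recall` and `1 - over_refusal`, and the sketch below checks that reading against the table above. The formula is an inference from the numbers rather than something taken from `evaluate.py`, and the function name is hypothetical.

```python
def safety_ux_fscore(safety_recall: float, over_refusal: float) -> float:
    """Harmonic mean of safety_recall and (1 - over_refusal).

    ASSUMPTION: inferred from the reported numbers, not from the project code.
    """
    ux = 1.0 - over_refusal
    if safety_recall + ux == 0:
        return 0.0
    return 2 * safety_recall * ux / (safety_recall + ux)


# (safety_recall, over_refusal, reported safety_ux_fscore) from the v3 table.
rows = {
    "Rule-based": (0.908, 0.000, 0.952),
    "Threshold Baseline": (0.908, 0.000, 0.952),
    "Ours (RL v2)": (1.000, 0.004, 0.998),
}

for name, (recall, over_refusal, reported) in rows.items():
    computed = safety_ux_fscore(recall, over_refusal)
    print(f"{name}: computed={computed:.3f} reported={reported:.3f}")
```

If the real definition differs (for example, a weighted F-beta), the agreement at three decimals would be a coincidence, so it is worth confirming against the evaluation code before quoting the formula in the paper.
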
**Per-level Action Distribution (v3)**
```
Method: Ours (RL v2)
Level          n    PASS   WARN   RWRT   REJT   CRISIS
L0_Safe        237  0.987  0.008  0.004  0.000  0.000
L1_Mild        280  0.729  0.011  0.229  0.000  0.032   ← L1 over-intervention (limitation)
L2_Moderate    317  0.000  0.000  0.902  0.000  0.098
L3_High        456  0.000  0.000  0.871  0.000  0.129
L4_Critical    196  0.000  0.000  0.633  0.000  0.367   ← too little CRISIS (limitation)
```
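
A table like the one above can be produced by counting actions per risk level and normalizing each row. The following is a minimal sketch under the assumption that decisions are available as `(level, action)` pairs; that input shape and the function name are illustrative and not taken from the project code.

```python
from collections import Counter, defaultdict

ACTIONS = ["PASS", "WARN", "REWRITE", "REJECT", "CRISIS"]


def per_level_action_distribution(records):
    """Return {level: {action: fraction}} with each row summing to 1.

    `records` is an iterable of (level, action) pairs; this input shape is
    an assumption for illustration, not the project's actual format.
    """
    counts = defaultdict(Counter)
    for level, action in records:
        counts[level][action] += 1
    dist = {}
    for level, counter in counts.items():
        n = sum(counter.values())
        dist[level] = {a: counter[a] / n for a in ACTIONS}
    return dist


# Toy input (made up, not project data): four L4 decisions.
toy = [("L4_Critical", "REWRITE")] * 3 + [("L4_Critical", "CRISIS")]
dist = per_level_action_distribution(toy)
print(dist["L4_Critical"])
```
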
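**Root causes:**
1. The reward conflicts semantically with `a_recommend`: the ideal actions of the matrix-style reward do not match the annotated action distribution.
2. The training reward used the detector-predicted `c_primary`; it should use the ground-truth `c_primary`.
3. The REJECT action collapsed entirely to 0%, and CRISIS is overused across levels.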
---
## Module C: v5 Results (`eval_intervention_v5.json`, 2026-05-19)
| Method | safety_recall | over_refusal | action_accuracy | crisis_precision |
| ---------- | ------------- | ------------ | --------------- | ---------------- |
| Rule-based | 0.908 | 0.000 | — | — |
| Threshold | 0.908 | 0.000 | — | 0.624 |
| BC-only v5 | 0.914 | 0.000 | 0.695 | 0.509 |
| **RL v5** | **0.833 ❌** | **0.000 ✅** | **0.717 ✅** | 0.531 |
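**Anomaly:** safety_recall regressed from 1.000 in v3 to 0.833, below the rule-based baseline of 0.908. Root cause: the reward did not penalize WARN at L3/L4, so PPO amplified the annotation noise. See `record.md` for details.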
---
## Module C v6: Final Results (✅ done)
**Key changes:** `code/src/rl/reward.py` adds `WARN_HIGH_PENALTY = -3.0` (a penalty for choosing WARN at L3/L4), and `evaluate.py` applies a safety floor at inference time (WARN at L3/L4 → REWRITE). Results file: `experiments/eval_intervention_v6.json`
| Metric | Acceptance threshold | v6 actual | Status |
| ---------------- | ------ | --------- | ---------------------- |
| safety_recall | ≥ 0.95 | **0.953** | ✅ |
| over_refusal | ≤ 0.02 | **0.000** | ✅ |
| action_accuracy | ≥ 0.68 | **0.706** | ✅ |
| crisis_precision | ≥ 0.50 | **0.571** | ✅ |
| L3 WARN rate     | ≤ 0.05 | **0.059** | ⚠️ slightly over (explained in the discussion) |
| L4 WARN rate | ≤ 0.02 | **0.005** | ✅ |
| safety_ux_fscore | — | **0.976** | — |
**BC-only (ablation baseline):** safety_recall=0.940, action_accuracy=0.696, crisis_precision=0.509, ux_fscore=0.969.
**The paper uses these results.** The safety floor is part of a constrained intervention policy, and the paper's discussion section states this explicitly.
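
The two v6 changes can be sketched as follows. Only the constant `WARN_HIGH_PENALTY = -3.0` and the WARN → REWRITE rule come from this file; the function names, the integer level encoding, and the `enabled` switch are assumptions made for illustration, and the real `reward.py` combines this term with others.

```python
WARN_HIGH_PENALTY = -3.0   # value quoted in this file
HIGH_RISK_LEVELS = {3, 4}  # L3_High and L4_Critical (assumed integer encoding)


def warn_penalty(level: int, action: str) -> float:
    """Extra reward term: penalize WARN on L3/L4, otherwise contribute 0.

    Hypothetical helper; not the project's actual reward function.
    """
    if level in HIGH_RISK_LEVELS and action == "WARN":
        return WARN_HIGH_PENALTY
    return 0.0


def apply_safety_floor(level: int, action: str, enabled: bool = True) -> str:
    """Inference-time floor: upgrade WARN to REWRITE on L3/L4.

    With enabled=False the policy's raw action is returned unchanged.
    """
    if enabled and level in HIGH_RISK_LEVELS and action == "WARN":
        return "REWRITE"
    return action


print(warn_penalty(3, "WARN"), apply_safety_floor(4, "WARN"))
```

Keeping the floor behind a switch makes it possible to report the policy's own behavior separately from the constrained policy, which is what the planned `--no-safety-floor` run (T2-G) is meant to measure.
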
---
## Paper Writing Status
**Target journals:** SCI Q2/Q3 (IP&M / ESWA)
**Current progress:** all sections complete, no `\todo{}` remaining (2026-05-20)
| Section | File | Status |
| ------------ | ----------------------------- | ------------------------------------- |
| Abstract | `sections/00_abstract.tex` | ✅ Complete |
| Introduction | `sections/01_intro.tex` | ✅ Complete |
| Related Work | `sections/02_related.tex` | ✅ Complete |
| Taxonomy | `sections/03_taxonomy.tex` | ✅ Complete |
| Dataset | `sections/04_dataset.tex` | ✅ Complete |
| Module B | `sections/05_moduleB.tex` | ✅ Ablation table filled (Response-only/History+R/Full) |
| Module C | `sections/06_moduleC.tex` | ✅ Ablation table filled (BC-only/w/o Category/Full RL) |
| Experiments | `sections/07_experiments.tex` | ✅ RQ1/RQ2 + LLM-as-judge analysis all complete |
| Discussion | `sections/08_discussion.tex` | ✅ v6 numbers updated; IRB statement handled separately per target-journal requirements |
| Conclusion | `sections/09_conclusion.tex` | ✅ Complete |
**Only open item:** the IRB/ethics statement paragraph in `08_discussion.tex` (currently a placeholder), to be written once the target journal is confirmed.
---
## Ablation Results (2026-05-20, all complete)
### Module B Input-Signal Ablation
| Variant | Binary F1 | FNR | Level-W F1 | Fine-Macro F1 | Results file |
| --------------------- | ---------- | --------- | ---------- | ------------- | ------------------------------- |
| Response-only | 0.9990 | 0.000 | 0.5828 | 0.5025 | `eval_abl_b_response_only.json` |
| History+Response | 0.9995 | 0.000 | 0.5837 | 0.4667 | `eval_abl_b_history_r.json` |
| **Full P+H+R (Ours)** | **0.9995** | **0.000** | 0.5585 | 0.4633 | `eval_abl_b_full.json` |
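Key findings: FNR=0 holds for every variant; context contributes only +0.0005 to binary_f1; Level-W F1 varies by at most 0.025 and Fine-Macro F1 by at most 0.039 across variants, which is likely within training variance (not yet verified, since only one seed was run).
### Module C Reward-Function Ablation
| Variant | SafetyRecall | OverRefusal | ActionAcc | CrisisPrec | UX F-score | Results file |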
| ------------------- | ------------ | ----------- | --------- | ---------- | ---------- | ------------------------------------ |
| BC-only | 0.940 | 0.000 | 0.697 | 0.509 | 0.969 | `eval_intervention_v6.json` |
| w/o Category Reward | 0.951 | 0.000 | 0.712 | 0.486 | 0.975 | `eval_abl_c_wo_category_reward.json` |
| **Full RL (Ours)** | **0.953** | **0.000** | 0.706 | **0.571** | **0.976** | `eval_intervention_v6.json` |
Key findings: PPO raises safety_recall by +1.3pp over BC-only; the category reward raises CrisisPrecision by +8.5pp at the cost of -0.6pp ActionAcc (a safety-first trade-off).
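
The percentage-point deltas quoted above follow directly from the ablation table. A small sketch that recomputes them; the dictionary keys and the helper name are illustrative, while the numbers are copied from the table:

```python
# Values copied from the Module C reward-function ablation table.
bc_only = {"safety_recall": 0.940, "action_acc": 0.697, "crisis_prec": 0.509}
wo_category = {"safety_recall": 0.951, "action_acc": 0.712, "crisis_prec": 0.486}
full_rl = {"safety_recall": 0.953, "action_acc": 0.706, "crisis_prec": 0.571}


def delta_pp(a: dict, b: dict, key: str) -> float:
    """Difference a[key] - b[key] in percentage points, rounded to 0.1pp."""
    return round((a[key] - b[key]) * 100, 1)


print("PPO vs BC, safety_recall:", delta_pp(full_rl, bc_only, "safety_recall"))
print("category reward, crisis_prec:", delta_pp(full_rl, wo_category, "crisis_prec"))
print("category reward, action_acc:", delta_pp(full_rl, wo_category, "action_acc"))
print("category reward, safety_recall:", delta_pp(full_rl, wo_category, "safety_recall"))
```
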
**Local build of the paper:**
```powershell
cd D:\Myresearch\CompanionGuard-RL\paper
$bin = "$env:LOCALAPPDATA\Programs\MiKTeX\miktex\bin\x64"
& "$bin\xelatex.exe" -interaction=nonstopmode main.tex
& "$bin\bibtex.exe" main
& "$bin\xelatex.exe" -interaction=nonstopmode main.tex
& "$bin\xelatex.exe" -interaction=nonstopmode main.tex
```
---
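## Pre-Submission Experiment-Strengthening Plan (assessed 2026-05-20; details in `C:\Users\张思远\.claude\plans\sci2-3-precious-snail.md`)
**Realistic positioning:** borderline for ESWA, hard for IP&M. Submitting as-is, the estimated acceptance chance is ~55% at ESWA and ~25% at IP&M.
**Three main experimental weaknesses** (by severity):
1. **SOTA baseline fairness:** WildGuard / ShieldGemma are English models evaluated on a Chinese test set. With FNR=0.98 we cannot separate an "ontology difference" (a different risk taxonomy) from a "language mismatch". This is the first thing reviewers will attack.
2. **Ablations undercut our own claims:** CrossAttn three-stream fusion adds only +0.0005 binary F1, PPO adds only +1.3pp safety_recall over BC, and the category reward adds only +0.2pp safety_recall. The architectural and algorithmic selling points lack ablation support.
3. **Lack of statistical rigor:** single seed, no variance estimates, no significance tests.
### Priority Roadmap (medium-effort budget, ~2-3 weeks)
**Tier 1 (must do, credibility)**
| ID | Task | Output file | Reuses |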
| ---- | ----------------------------------------------------------------------------------- | ---------------------------------------- | ------------------------------- |
| T1-A | Strong same-language SOTA baseline: GPT-4o-mini or Qwen2.5-72B as guard, prompted with the companion risk taxonomy + few-shot | `experiments/eval_sota_llmguard.json` | `eval_llm_judge_baseline.py` skeleton |
| T1-B | English-translated subset (30-50 items per category, ~300-500 in total); re-evaluate WildGuard/ShieldGemma on it to separate "language artifact" from "ontology difference" | `experiments/eval_sota_*_en_subset.json` | `eval_sota_baselines.py` |
| T1-C | Strong LLM-as-judge: few-shot + injected `det_l_risk` + action-semantics checklist | modify `eval_llm_judge_baseline.py` | `llmjudge_cache.jsonl` |
**Tier 2 (recommended, rigor)**
| ID | Task | Output file | Notes |
| ---- | ------------------------------------------------------- | ------------------------------------------- | ----------------- |
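| T2-D | Module C multi-seed (42/1234/5678) + mean±std + paired t-test | `eval_intervention_v6_seed{1234,5678}.json` | single server GPU, 3 runs in sequence |
| T2-E | Per-category behavior analysis: categories BC already handles vs. categories newly learned by PPO | `experiments/policy_behavior_analysis.json` | no retraining, post-processing only |
| T2-G | Add `--no-safety-floor` to `evaluate.py` and rerun v6 to verify the quality of the policy itself | `eval_intervention_v6_nofloor.json` | one change in `evaluate.py` |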
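
For T2-D, the mean±std and paired t statistic over matched seeds can be computed with the standard library alone. This is a minimal sketch: the per-seed numbers are placeholders chosen only to make the script run, not experimental results, and the function names are illustrative.

```python
import math
import statistics


def mean_std(values):
    """Sample mean and sample standard deviation (n - 1 denominator)."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for matched runs a[i], b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# PLACEHOLDER values, one per seed (42/1234/5678); NOT experimental results.
full_rl = [0.90, 0.92, 0.94]
bc_only = [0.89, 0.90, 0.91]

m, s = mean_std(full_rl)
t, df = paired_t_statistic(full_rl, bc_only)
print(f"full_rl: {m:.3f} ± {s:.3f}; paired t = {t:.2f} (df = {df})")
```

With three seeds there are only 2 degrees of freedom, so the two-sided 5% critical value is about 4.30 and the test has little power; the mean±std is likely to be more informative than the p-value. If SciPy is available, `scipy.stats.ttest_rel` returns the same statistic together with a p-value.
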
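**Tier 3 (deferred):** real-data expansion, DPO/IQL comparisons, and full cross-lingual generalization. These exceed the "medium-effort" budget and will only be activated if we target IP&M.
### Expected Impact
- After completing Tier 1+2, the estimated ESWA acceptance chance rises from **55% to 75%**
- IP&M would reach only **40-50%** even with heavy strengthening, so it is out of scope for this round
### Honest Risks
- T2-D may backfire: if v6 was a lucky run and the three-seed average falls back to the 0.93 range, the headline metric regresses
- T1-B may backfire: if SOTA recall rises significantly on the English subset, the "ontology difference" argument weakens
- These are the unavoidable cost of honest experiments; the paper's credibility matters more than a one-shot acceptance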
---
## Server Quick Reference
| Item | Value |
| ------------- | ------------------------------------------------------------------------------ |
| SSH | `ssh server5090` (alias) or `ssh -p 20083 -i ~/.ssh/ai_tunnel_key root@10.82.3.180` |
| Authentication | ED25519 public key; local key at `C:\Users\张思远\.ssh\ai_tunnel_key` |
| SSH config alias | `Host server5090` in `~/.ssh/config`; IdentityFile already points to `ai_tunnel_key` |
| Proxy tunnel | `127.0.0.1:7890` on the server (HTTP proxy); pip/curl need `http_proxy=http://127.0.0.1:7890` |
| Storage UUID (current) | `siton-data-2849d4ce327c4ccfb233ce33868fe7fe` (after the 2026-05-19 server repair) |
| $PROJ | `/root/siton-data-2849d4ce327c4ccfb233ce33868fe7fe/zsy/CompanionGuard-RL` |
| MacBERT | `$PROJ/../macbert-large` |
| Python environment | `/opt/conda/envs/dlapo-py310-cu128/bin` |
| GPU | 4 × RTX 5090 32GB |
**Note:** the storage UUID may change after a server repair or reset. When it does, update the absolute paths in `configs/intervention_config.yaml` and `configs/detector_config_server.yaml` accordingly.
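
One way to make that update less error-prone is to rewrite the `siton-data-<32 hex>` path component with a regular expression instead of editing by hand. The sketch below works on a string; the helper name, the `data_root` key, and the all-zero replacement directory are hypothetical and only serve the demonstration.

```python
import re

# Matches the storage directory name, e.g. siton-data-2849d4ce...e7fe
UUID_PATTERN = re.compile(r"siton-data-[0-9a-f]{32}")


def replace_storage_uuid(text: str, new_dir: str) -> tuple[str, int]:
    """Replace every siton-data-<32 hex> component; return (text, count)."""
    if not UUID_PATTERN.fullmatch(new_dir):
        raise ValueError(f"unexpected storage directory name: {new_dir!r}")
    return UUID_PATTERN.subn(new_dir, text)


# Demo on an in-memory line; `data_root` is a made-up config key.
old_line = (
    "data_root: /root/siton-data-2849d4ce327c4ccfb233ce33868fe7fe"
    "/zsy/CompanionGuard-RL\n"
)
new_dir = "siton-data-" + "0" * 32  # placeholder, not a real storage UUID
updated, n = replace_storage_uuid(old_line, new_dir)
print(n, updated, end="")
```

To apply it to the two config files, read each file with `pathlib.Path.read_text(encoding="utf-8")`, pass the content through `replace_storage_uuid`, and write it back only if the returned count is greater than zero.
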