- Module C: BC+PPO training v5/v6 done; eval results in experiments/eval_intervention_v{5,6}.json
- Reward: v5 label-aligned constrained reward (code/src/rl/reward.py)
- Ablations: Module B (history_r, response_only, full) + Module C (wo_category_reward)
- SOTA baselines: WildGuard and ShieldGemma2b eval scripts and results
- Paper: update sections 05–08 (Module B/C description, experiments table, discussion)
- Docs: add record.md (change log), update state.md and exp.md; retire change.md
- Tools: add html-to-ppt utilities and run_shieldgemma2b.sh
- Configs: add ablation YAML configs for Module B and C
- Cleanup: remove stale reference/ PNG screenshots
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
236 lines
13 KiB
Markdown
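# CompanionGuard-RL: Project Status

**Last updated: 2026-05-20 (P2 kickoff: the pre-submission experiment-strengthening assessment is done; the items are still to be implemented one by one)**

> Debugging history → `record.md` | Lessons-learned log → `exp.md` | Detailed submission assessment → `C:\Users\张思远\.claude\plans\sci2-3-precious-snail.md`
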
---

## Module Status Overview

| Module | Status | Key metrics |
| --- | --- | --- |
| Dataset CompanionRisk-Bench v4 | ✅ Done | 9,896 samples, 14 labels, train/dev/test = 6926/1484/1486 |
| Module B detector v4 | ✅ Done | binary_f1=**0.9995**, FNR=0.00%, level_weighted_f1=0.559 |
| Module B generalization check | ✅ Done | human subset binary_f1=0.9848, no sign of overfitting |
| Module C v3 (historical baseline) | ✅ Done | safety_recall=1.0, action_accuracy=0.575, crisis_precision=0.421 |
| Module C v5 (trained) | ⚠️ Partially meets targets | safety_recall=**0.833** ❌ (regression), action_accuracy=**0.717** ✅; caused by the WARN loophole in the reward |
| Module C v6 (final result) | ✅ Meets targets | safety_recall=**0.953** ✅, action_accuracy=**0.706** ✅, safety_ux_fscore=0.976 |
| Paper writing | ✅ Done | P0+P1 all complete; no `\todo{}` left in the paper (the IRB statement is handled separately per journal requirements) |

---

## Module B: Final Results (v4, frozen)

| Metric | Value |
| --- | --- |
| binary_f1 | **0.9995** |
| high_risk_recall | **1.0000** |
| false_negative_rate | **0.0000** |
| level_macro_f1 | 0.5496 |
| level_weighted_f1 | **0.5585** |
| fine_macro_f1 (all 14) | 0.4633 |
| fine_macro_f1 (public 10) | **0.484** |

Paper strategy: report binary_f1 + level_weighted_f1 + fine_macro_f1 (public) as the headline metrics; Module B will not be iterated further.

---
## Module C: Current Baseline v3 (eval_intervention_v3.json)

| Method | safety_recall (L3/L4) | over_refusal | action_accuracy | crisis_precision | safety_ux_fscore |
| --- | --- | --- | --- | --- | --- |
| Rule-based (l≥3→REJECT) | 0.908 | 0.000 | n/a | n/a | 0.952 |
| Threshold Baseline | 0.908 | 0.000 | n/a | 0.624 | 0.952 |
| **Ours (RL v2)** | **1.000** | **0.004** | **0.575** | 0.421 | **0.998** |

**Per-level Action Distribution (v3):**

```
Method: Ours (RL v2)
Level         n    PASS   WARN   RWRT   REJT   CRISIS
L0_Safe      237   0.987  0.008  0.004  0.000  0.000
L1_Mild      280   0.729  0.011  0.229  0.000  0.032   ← L1 over-intervention (limitation)
L2_Moderate  317   0.000  0.000  0.902  0.000  0.098
L3_High      456   0.000  0.000  0.871  0.000  0.129
L4_Critical  196   0.000  0.000  0.633  0.000  0.367   ← CRISIS under-used (limitation)
```

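
As a reading aid for the distribution above, the sketch below recomputes a pooled L3/L4 safety recall from the per-level rates. It rests on an assumption that this file does not state: an L3/L4 sample counts as safely handled when the chosen action is anything other than PASS or WARN. The function name and the data layout are illustrative only. Under that assumption the v3 rows reproduce the 1.000 reported in the table.

```python
# Rows copied from the v3 per-level distribution: (n, PASS rate, WARN rate).
# ASSUMPTION (not confirmed by this file): on L3/L4, PASS and WARN count as
# misses, while REWRITE / REJECT / CRISIS count as safe handling.
V3_HIGH_RISK_ROWS = {
    "L3_High": (456, 0.000, 0.000),
    "L4_Critical": (196, 0.000, 0.000),
}


def pooled_safety_recall(rows):
    """Sample-weighted share of high-risk samples that were not PASSed or WARNed."""
    total = sum(n for n, _, _ in rows.values())
    handled = sum(n * (1.0 - p_pass - p_warn) for n, p_pass, p_warn in rows.values())
    return handled / total


print(f"{pooled_safety_recall(V3_HIGH_RISK_ROWS):.3f}")  # prints 1.000
```
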
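**Root causes:**

1. The reward conflicts semantically with a_recommend (the ideal actions of the matrix-style reward disagree with the annotated action distribution)
2. The training reward used the detector-predicted c_primary (it should use the GT c_primary)
3. The REJECT action collapsed completely to 0%, while CRISIS is over-generalized and overused
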
---

## Module C: v5 Results (eval_intervention_v5.json, 2026-05-19)

| Method | safety_recall | over_refusal | action_accuracy | crisis_precision |
| --- | --- | --- | --- | --- |
| Rule-based | 0.908 | 0.000 | n/a | n/a |
| Threshold | 0.908 | 0.000 | n/a | 0.624 |
| BC-only v5 | 0.914 | 0.000 | 0.695 | 0.509 |
| **RL v5** | **0.833 ❌** | **0.000 ✅** | **0.717 ✅** | 0.531 |

**Anomaly**: safety_recall regressed from 1.000 in v3 to 0.833 (below the rule baseline). Root cause: the reward did not penalize WARN on L3/L4, so PPO amplified the annotation noise. See record.md for details.

---
## Module C v6: Final Results (✅ Done)

**Key changes**: `code/src/rl/reward.py` adds `WARN_HIGH_PENALTY = -3.0` (a penalty for choosing WARN on L3/L4), and `evaluate.py` applies a safety floor at inference time (WARN on L3/L4 → REWRITE). Results file: `experiments/eval_intervention_v6.json`.

| Metric | Minimum acceptable | v6 actual | Status |
| --- | --- | --- | --- |
| safety_recall | ≥ 0.95 | **0.953** | ✅ |
| over_refusal | ≤ 0.02 | **0.000** | ✅ |
| action_accuracy | ≥ 0.68 | **0.706** | ✅ |
| crisis_precision | ≥ 0.50 | **0.571** | ✅ |
| L3 WARN rate | ≤ 0.05 | **0.059** | ⚠️ Slightly over (explained in the discussion) |
| L4 WARN rate | ≤ 0.02 | **0.005** | ✅ |
| safety_ux_fscore | n/a | **0.976** | n/a |

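
The two v6 changes listed under **Key changes** above can be sketched as follows. This is a minimal illustration, not the contents of `reward.py` or `evaluate.py`. The `Action` enum, the function names, the integer level encoding (3 = L3_High, 4 = L4_Critical) and the choice to model the penalty as an additive reward term are all assumptions. Only the constant `WARN_HIGH_PENALTY = -3.0` and the WARN → REWRITE rule come from this file. Keeping the floor in a separate function makes it easy to switch off when evaluating the raw policy.

```python
from enum import IntEnum


class Action(IntEnum):
    """Hypothetical action encoding; the real project may number these differently."""
    PASS = 0
    WARN = 1
    REWRITE = 2
    REJECT = 3
    CRISIS = 4


WARN_HIGH_PENALTY = -3.0              # value quoted in this file
HIGH_RISK_LEVELS = frozenset({3, 4})  # ASSUMPTION: integer codes for L3_High / L4_Critical


def warn_penalty(level: int, action: Action) -> float:
    """Extra reward term: negative only when WARN is chosen on an L3/L4 sample."""
    if level in HIGH_RISK_LEVELS and action == Action.WARN:
        return WARN_HIGH_PENALTY
    return 0.0


def apply_safety_floor(level: int, action: Action) -> Action:
    """Inference-time floor: upgrade WARN to REWRITE on L3/L4, leave all else unchanged."""
    if level in HIGH_RISK_LEVELS and action == Action.WARN:
        return Action.REWRITE
    return action
```
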
**BC-only (ablation baseline)**: safety_recall=0.940, action_accuracy=0.696, crisis_precision=0.509, ux_fscore=0.969.

**The paper uses this result.** The safety floor makes this a constrained intervention policy, and the paper's discussion section states that candidly.

---
## Paper Writing Status

**Target journals:** SCI Q2/Q3, IP&M / ESWA
**Current progress:** all sections complete, no `\todo{}` left (2026-05-20)

| Section | File | Status |
| --- | --- | --- |
| Abstract | `sections/00_abstract.tex` | ✅ Complete |
| Introduction | `sections/01_intro.tex` | ✅ Complete |
| Related Work | `sections/02_related.tex` | ✅ Complete |
| Taxonomy | `sections/03_taxonomy.tex` | ✅ Complete |
| Dataset | `sections/04_dataset.tex` | ✅ Complete |
| Module B | `sections/05_moduleB.tex` | ✅ Ablation table filled (Response-only/History+R/Full) |
| Module C | `sections/06_moduleC.tex` | ✅ Ablation table filled (BC-only/w/o Category/Full RL) |
| Experiments | `sections/07_experiments.tex` | ✅ RQ1/RQ2 + LLM-as-judge analysis all done |
| Discussion | `sections/08_discussion.tex` | ✅ v6 numbers updated; IRB statement handled separately per target-journal requirements |
| Conclusion | `sections/09_conclusion.tex` | ✅ Complete |

**Only open item:** the IRB/ethics statement paragraph in `08_discussion.tex` (currently a placeholder), to be written once the target journal is confirmed.

---
## Ablation Results (2026-05-20, all complete)

### Module B input-signal ablation

| Variant | Binary F1 | FNR | Level-W F1 | Fine-Macro F1 | Results file |
| --- | --- | --- | --- | --- | --- |
| Response-only | 0.9990 | 0.000 | 0.5828 | 0.5025 | `eval_abl_b_response_only.json` |
| History+Response | 0.9995 | 0.000 | 0.5837 | 0.4667 | `eval_abl_b_history_r.json` |
| **Full P+H+R (Ours)** | **0.9995** | **0.000** | 0.5585 | 0.4633 | `eval_abl_b_full.json` |

Key findings: FNR=0 holds for every variant; context contributes a marginal +0.0005 to binary_f1; the level/fine differences are small (Level-W F1 within 0.025 and Fine-Macro F1 within 0.039 across variants) and are taken to be within training variance.

### Module C reward-function ablation

| Variant | SafetyRecall | OverRefusal | ActionAcc | CrisisPrec | UX F-score | Results file |
| --- | --- | --- | --- | --- | --- | --- |
| BC-only | 0.940 | 0.000 | 0.697 | 0.509 | 0.969 | `eval_intervention_v6.json` |
| w/o Category Reward | 0.951 | 0.000 | 0.712 | 0.486 | 0.975 | `eval_abl_c_wo_category_reward.json` |
| **Full RL (Ours)** | **0.953** | **0.000** | 0.706 | **0.571** | **0.976** | `eval_intervention_v6.json` |

Key findings: PPO raises safety_recall by +1.3pp over BC-only; the category reward raises CrisisPrecision by +8.5pp at the cost of -0.6pp ActionAcc (a safety-first trade-off).

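
The percentage-point figures in the Module C findings above can be rechecked directly from the table rows. The snippet below is only a worked arithmetic check; the dictionary keys and the helper name are illustrative and are not the key names used in the result JSON files.

```python
# Values copied from the Module C reward-function ablation table above.
ABLATION = {
    "bc_only":            {"safety_recall": 0.940, "action_acc": 0.697, "crisis_prec": 0.509},
    "wo_category_reward": {"safety_recall": 0.951, "action_acc": 0.712, "crisis_prec": 0.486},
    "full_rl":            {"safety_recall": 0.953, "action_acc": 0.706, "crisis_prec": 0.571},
}


def delta_pp(metric: str, variant: str, reference: str) -> float:
    """Difference (variant - reference) in percentage points, rounded to 0.1pp."""
    diff = ABLATION[variant][metric] - ABLATION[reference][metric]
    return round(diff * 100, 1)


print(delta_pp("safety_recall", "full_rl", "bc_only"))           # 1.3  (PPO vs BC-only)
print(delta_pp("crisis_prec", "full_rl", "wo_category_reward"))  # 8.5  (category reward)
print(delta_pp("action_acc", "full_rl", "wo_category_reward"))   # -0.6 (its ActionAcc cost)
```
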
**Local build:**

```powershell
# Build the paper with MiKTeX: xelatex, then bibtex, then xelatex twice to resolve references
cd D:\Myresearch\CompanionGuard-RL\paper
$bin = "$env:LOCALAPPDATA\Programs\MiKTeX\miktex\bin\x64"
& "$bin\xelatex.exe" -interaction=nonstopmode main.tex
& "$bin\bibtex.exe" main
& "$bin\xelatex.exe" -interaction=nonstopmode main.tex
& "$bin\xelatex.exe" -interaction=nonstopmode main.tex
```

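---

## Pre-submission Experiment-Strengthening Plan (assessed 2026-05-20; detailed file: `C:\Users\张思远\.claude\plans\sci2-3-precious-snail.md`)

**Realistic positioning**: borderline for ESWA, hard for IP&M. Submitted as-is, the estimated acceptance chance is ~55% at ESWA and ~25% at IP&M.

**Three main experimental weaknesses** (by severity):

1. **Fairness of the SOTA baselines**: WildGuard / ShieldGemma are English models evaluated on a Chinese test set, so FNR=0.98 cannot separate "ontology difference" from "language mismatch". This is the first thing reviewers will attack.
2. **Ablations undercut the claims**: CrossAttn three-stream fusion adds +0.0005 binary F1; PPO beats BC by only +1.3pp safety_recall; the category reward adds +0.2pp. The architecture/algorithm selling points lack ablation support.
3. **Missing statistical rigor**: single seed, no variance, no significance tests
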
### Priority Roadmap (medium-effort budget, ~2-3 weeks)

**Tier 1 (must do, credibility)**

| ID | Task | Output file | Reuses |
| --- | --- | --- | --- |
| T1-A | Same-language strong SOTA baseline (GPT-4o-mini or Qwen2.5-72B as guard, with a companion-risk-taxonomy prompt + few-shot) | `experiments/eval_sota_llmguard.json` | `eval_llm_judge_baseline.py` skeleton |
| T1-B | English-translated subset (30-50 per category, ~300-500 in total), re-evaluated by WildGuard/ShieldGemma to separate "language artifact" from "ontology difference" | `experiments/eval_sota_*_en_subset.json` | `eval_sota_baselines.py` |
| T1-C | Strong LLM-as-judge: few-shot + injected det_l_risk + action-semantics checklist | Rework `eval_llm_judge_baseline.py` | `llmjudge_cache.jsonl` |

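
For T1-B, the per-category subset could be drawn with a fixed seed so that the translated set is reproducible. The following is a minimal sketch under stated assumptions: each sample is a dict carrying a `category` field, and 30 per category with seed 42 are placeholder defaults. None of these names come from `eval_sota_baselines.py`.

```python
import random
from collections import defaultdict


def sample_per_category(samples, per_category=30, seed=42, key="category"):
    """Draw up to `per_category` items from each category, reproducibly.

    ASSUMPTION: every sample is a dict with a `key` field naming its category.
    """
    by_cat = defaultdict(list)
    for sample in samples:
        by_cat[sample[key]].append(sample)

    rng = random.Random(seed)          # private RNG, leaves global state untouched
    subset = []
    for cat in sorted(by_cat):         # sorted so the draw order is stable
        items = by_cat[cat]
        k = min(per_category, len(items))  # small categories contribute all they have
        subset.extend(rng.sample(items, k))
    return subset
```
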
**Tier 2 (recommended, rigor)**

| ID | Task | Output file | Notes |
| --- | --- | --- | --- |
| T2-D | Module C with multiple seeds (42/1234/5678) + mean±std + paired t-test | `eval_intervention_v6_seed{1234,5678}.json` | Single server GPU, 3 runs in series |
| T2-E | Per-category behavior analysis: categories BC already handles vs categories newly learned by PPO | `experiments/policy_behavior_analysis.json` | No retraining, post-processing only |
| T2-G | Add `--no-safety-floor` to `evaluate.py` and rerun v6 to verify the quality of the policy itself | `eval_intervention_v6_nofloor.json` | One change in evaluate.py |

**Tier 3 (deferred)**: real-data expansion, DPO/IQL comparison, full cross-lingual generalization. These exceed the "medium-effort" budget and are only activated when aiming for IP&M.

### Expected Impact

- After Tier 1+2, the estimated ESWA acceptance chance goes from **55% → 75%**
- IP&M only reaches **40-50%** even with heavy strengthening, so it is out of scope for this round

### Honest Risks

- T2-D may backfire: if v6 was a lucky run, the three-seed mean falls back to the 0.93 range → the headline metric regresses
- T1-B may backfire: if the SOTA models' recall rises markedly on the English version → the "ontology difference" argument weakens
- These are the unavoidable price of honest experiments; paper credibility matters more than a one-off acceptance

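
For T2-D, the mean±std and the paired t statistic can be computed with the standard library alone. The sketch below assumes the pairing is per seed, for example Full RL against BC-only trained with the same seed; that pairing and both function names are illustrative. With three seeds there are 2 degrees of freedom, so a two-sided test at the 5% level needs |t| above roughly 4.30. If SciPy is installed, `scipy.stats.ttest_rel` returns the p-value directly.

```python
import math
import statistics


def mean_std(values):
    """Mean and sample standard deviation (n-1 denominator) across seeds."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for per-seed metrics a vs b.

    ASSUMPTION: a[i] and b[i] come from runs that share the same seed.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length sequences with at least 2 seeds")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```
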
---

## Server Quick Reference

| Item | Value |
| --- | --- |
| SSH | `ssh server5090` (alias) or `ssh -p 20083 -i ~/.ssh/ai_tunnel_key root@10.82.3.180` |
| Authentication | ED25519 public key; local key at `C:\Users\张思远\.ssh\ai_tunnel_key` |
| SSH config alias | `~/.ssh/config` → `Host server5090`; IdentityFile already points to ai_tunnel_key |
| Proxy tunnel | `127.0.0.1:7890` on the server (HTTP proxy); pip/curl need `http_proxy=http://127.0.0.1:7890` |
| Storage UUID (current) | `siton-data-2849d4ce327c4ccfb233ce33868fe7fe` (after the 2026-05-19 server repair) |
| $PROJ | `/root/siton-data-2849d4ce327c4ccfb233ce33868fe7fe/zsy/CompanionGuard-RL` |
| MacBERT | `$PROJ/../macbert-large` |
| Python environment | `/opt/conda/envs/dlapo-py310-cu128/bin` |
| GPU | 4 × RTX 5090 32GB |

**Note**: the storage UUID may change after a server repair or reset; when it does, update the absolute paths in `configs/intervention_config.yaml` and `configs/detector_config_server.yaml` to match.