feat: Module C v5/v6 training complete, ablations, SOTA baselines, paper updates
- Module C: BC+PPO training v5/v6 done; eval results in experiments/eval_intervention_v{5,6}.json
- Reward: v5 label-aligned constrained reward (code/src/rl/reward.py)
- Ablations: Module B (history_r, response_only, full) + Module C (wo_category_reward)
- SOTA baselines: WildGuard and ShieldGemma2b eval scripts and results
- Paper: update sections 05–08 (Module B/C description, experiments table, discussion)
- Docs: add record.md (change log), update state.md and exp.md; retire change.md
- Tools: add html-to-ppt utilities and run_shieldgemma2b.sh
- Configs: add ablation YAML configs for Module B and C
- Cleanup: remove stale reference/ PNG screenshots
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
698
state.md
@@ -1,523 +1,164 @@
# CompanionGuard-RL: Project Progress Snapshot
**Last updated: 2026-05-15 (paper LaTeX skeleton set up, `paper/` directory ready, 22 pages compile)**
# CompanionGuard-RL: Project Status

> 📖 **Reusable lessons library** → see [`exp.md`](exp.md) (12 categories of lessons, including RTX 5090 NCCL, PyYAML pitfalls, distributed tensor device consistency, and CRLF)
**Last updated: 2026-05-20 (P2 started: pre-submission experiment-strengthening assessment done, items to be implemented one by one)**

> Debugging history → `record.md` | Pitfall library → `exp.md` | Detailed submission assessment → `C:\Users\张思远\.claude\plans\sci2-3-precious-snail.md`

---

## Overall Progress
## Module Status Overview

| Module | Status | Key metrics |
| -------------------------- | ------- | ---------------------------------------------------------------------------- |
| Dataset CompanionRisk-Bench v4 | ✅ Done | 9,896 samples, 14 labels, train/dev/test = 6926/1484/1486 |
| Module B detector v4 | ✅ Done | binary_f1=**0.9995**, FNR=0.00%, level_weighted_f1=0.559 |
| Module B generalization check | ✅ Done | human subset binary_f1=0.9848, no overfitting |
| Module C v3 (historical baseline) | ✅ Done | safety_recall=1.0, action_accuracy=0.575, crisis_precision=0.421 |
| Module C v5 (trained) | ⚠️ Partially meets targets | safety_recall=**0.833** ❌ (regression), action_accuracy=**0.717** ✅; caused by a reward loophole for WARN |
| Module C v6 (final result) | ✅ Meets targets | safety_recall=**0.953** ✅, action_accuracy=**0.706** ✅, safety_ux_fscore=0.976 |
| Paper writing | ✅ Done | P0+P1 all complete; no `\todo{}` left in the paper (IRB statement handled separately per journal requirements) |

| Module | Status | Key metrics |
|------|------|---------|
| Dataset CompanionRisk-Bench v4 | ✅ Done | 9,896 samples, all 14 labels covered |
| Module B: detector v4 | ✅ **Done** | binary_f1=0.9995, level_macro_f1=0.550 |
| Module B: generalization check | ✅ Done | human subset binary_f1=0.9848, no overfitting |
| Module C: RL intervention policy | ✅ **Done** | BC+PPO in 1-GPU mode converged at 200k steps, safety_recall=1.0, over_refusal=0.0 |
| Paper writing | 🔄 **In progress** | LaTeX skeleton done, 22 pages compile; method sections written; results sections waiting on v5 + SOTA baselines |

---

## 1. Dataset CompanionRisk-Bench (final version v4)
## Module B: Final Results (v4, frozen)

### Size
| Split | Samples |
|------|--------|
| train | 6,926 |
| dev | 1,484 |
| test | 1,486 |
| **total** | **9,896** |

### Data Source Composition
| Source | Samples | Notes |
|------|--------|------|
| LLM core set (Qwen2.5-72B via SiliconFlow) | 8,000 | Chinese, 10 risk categories + safe |
| Weak-label targeted set (generate_targeted.py) | 1,083 | Single-label augmentation for FalseReassurance/PseudoTherapy/IsolationReinforcement |
| Human-AI Suicide Risk Dataset | 393 | English, R1 crisis category |
| CoSafe Dataset | 420 | Multi-category dialogue safety |
| DICES-990 | ~200 (about 0 admitted after QC) | Failed QC (history_too_short) |

| Metric | Value |
| ------------------------- | ---------- |
| binary_f1 | **0.9995** |
| high_risk_recall | **1.0000** |
| false_negative_rate | **0.0000** |
| level_macro_f1 | 0.5496 |
| level_weighted_f1 | **0.5585** |
| fine_macro_f1 (all 14) | 0.4633 |
| fine_macro_f1 (public 10) | **0.484** |

### Generation Pipeline
```
scripts/generate_siliconflow.py   → data/raw/generated_core.jsonl      (8,000 samples)
scripts/generate_targeted.py      → data/raw/generated_targeted.jsonl  (1,083 samples)
scripts/adapt_public_datasets.py  → data/raw/adapted_*.jsonl
scripts/merge_and_split.py        → data/processed/CompanionRisk-Bench/{train,dev,test,all}.jsonl
```

### Fine-Grained Label Coverage in the Training Set (all ≥ 30 samples)
| Label | Full set | Train |
|------|---------|--------|
| RiskNormalization | 1,787 | 1,235 |
| DirectEncouragement | 1,292 | 921 |
| FalseReassurance | 1,290 | 905 |
| BoundaryFailure | 1,157 | 800 |
| PseudoTherapy | 1,090 | 767 |
| IsolationReinforcement | 991 | 693 |
| ManipulativeAttachment | 752 | 534 |
| DependencyReinforcement | 787 | 537 |
| CoRumination | 638 | ~440 |
| CrisisNonResponse | 594 | ~410 |
| AgeInappropriateIntimacy | 583 | ~410 |
| PrivacySolicitation | 530 | 370 |
| MethodFacilitation | 683 | 489 |
| Romanticization | 437 | 310 |

Paper strategy: the main metrics are binary_f1 + level_weighted_f1 + fine_macro_f1 (public); Module B will not be iterated further.

---

## 2. Module B: Detector Training (final version v4)
## Module C: Current Baseline v3 (eval_intervention_v3.json)

### Model Architecture
- **Backbone**: `hfl/chinese-macbert-large` (1024 hidden, LoRA=off)
- Local path on the server: `/root/siton-data-2849d4ce327c4ccfb233ce33868fe7fe/zsy/macbert-large`
- **Fusion layer**: CrossAttention (response as query, persona+context as key/value)
- **Output heads**: 4 classification heads (y_risk / l_risk / c_primary / c_fine)

### Training Configuration (v4)
| Parameter | Value |
|------|----|
| GPU | 4 × RTX 5090 32GB |
| Effective batch | 128 (16 × 4 GPU × 2 accum) |
| Epochs | 10 |
| LR | 2e-5, linear warmup over 100 steps |
| Mixed precision | bf16 |
| fine_loss_weight | 2.0 |
| fine_training.use_pos_weight | true (max clip=30) |
| fine_training.risky_only | true |

| Method | safety_recall (L3/L4) | over_refusal | action_accuracy | crisis_precision | safety_ux_fscore |
| ----------------------- | -------------------- | ------------ | --------------- | ---------------- | ---------------- |
| Rule-based (l≥3→REJECT) | 0.908 | 0.000 | n/a | n/a | 0.952 |
| Threshold Baseline | 0.908 | 0.000 | n/a | 0.624 | 0.952 |
| **Ours (RL v2)** | **1.000** | **0.004** | **0.575** | 0.421 | **0.998** |

### v2 → v3 → v4 Key Metric Evolution
| Metric | v2 (4,022 samples) | v3 (8,813 samples) | v4 (9,896 samples) |
|------|-------------|-------------|-------------|
| binary_f1 | 0.9848 | 0.9989 | **0.9995** |
| high_risk_recall | n/a | 0.9989 | **1.0000** |
| FNR | 1.52% | 0.11% | **0.00%** |
| level_macro_f1 | ~0.43 | 0.497 | **0.550** |
| level_weighted_f1 | n/a | 0.511 | **0.559** |
| L1 Mild F1 | ~0 | 0.174 | **0.635** |
| fine_macro_f1 (all 14 classes) | 0.000 (bug) | 0.476 | 0.463 |
| fine_macro_f1 (public 10 classes) | n/a | n/a | **0.484** |

### v4 Fine-Grained Label F1 (full test set)
| Label | v3 F1 | v4 F1 | Change |
|------|-------|-------|------|
| DirectEncouragement | 0.705 | 0.684 | → |
| RiskNormalization | 0.627 | 0.698 | ↑ |
| AgeInappropriateIntimacy | 0.694 | 0.616 | → |
| BoundaryFailure | 0.609 | 0.532 | → |
| DependencyReinforcement | 0.625 | 0.585 | → |
| FalseReassurance | 0.279 | **0.383** | ↑ +0.104 ✅ |
| PseudoTherapy | 0.239 | **0.338** | ↑ +0.099 ✅ |
| IsolationReinforcement | 0.288 | **0.356** | ↑ +0.068 ✅ |
| ManipulativeAttachment | 0.444 | 0.441 | → |
| MethodFacilitation | 0.403 | 0.466 | ↑ |
| Romanticization | 0.434 | 0.402 | → |
| CoRumination | 0.350 | 0.269 | ↓ (side effect of the targeted set) |
| CrisisNonResponse | 0.588 | 0.394 | ↓ (side effect of the targeted set) |
| PrivacySolicitation | 0.373 | 0.321 | → |

### Reporting Strategy for the Paper
- **Main metrics**: `binary_f1=0.9995`, `level_weighted_f1=0.559`, `fine_macro_f1(public)=0.484`
- The L0 problem that pulls level_macro_f1 down: report the weighted metric and add the note "test set risky:safe = 2.33:1"
- The 4 fine-grained companion classes (DependencyReinforcement/IsolationReinforcement/CoRumination/ManipulativeAttachment) are listed separately in the table, with the note "companion-specific, no human-annotated reference set"
- Module B will not be iterated further; the slight drops in CoRumination/CrisisNonResponse get one sentence in the limitations

### Checkpoint Path (server)
```
checkpoints/detector/best.pt    ← Module C uses these weights (frozen)
```

---

## 3. L1 Rule Baselines vs Ours (v4 test, n=1486)

| Method | BinaryF1 | Recall | FNR | LevelF1 (weighted) |
|------|---------|--------|-----|-------------------|
| L1a Keyword | 0.264 | 0.155 | 0.845 | 0.098 |
| L1b Regex | 0.067 | 0.035 | 0.965 | 0.063 |
| L1c Combined | 0.306 | 0.184 | 0.816 | 0.106 |
| **Ours (Module B)** | **0.9995** | **1.000** | **0.000** | **0.559** |

---

## 4. Module C: RL Intervention Policy (current phase)

### Task Objective
On top of the state vector produced by the Module B detector, train an intervention policy with PPO that learns to pick the best intervention action (PASS/WARN/REWRITE/REJECT/CRISIS) for each risk level and category.

### Core Files
| File | Description |
|------|------|
| `scripts/train_intervention.py` | Main two-stage training script (BC warmup + PPO, supports 4-GPU distributed training) |
| `configs/intervention_config.yaml` | Full training configuration |
| `src/models/intervention_agent.py` | Actor-Critic network (the `_encode_obs` dimension bug is fixed) |
| `src/rl/companion_env.py` | Gymnasium-compatible offline RL environment |
| `src/rl/ppo_trainer.py` | PPO trainer (RolloutBuffer + GAE + ppo_update) |
| `src/rl/reward.py` | Multi-objective reward function (safety + anti-over-refusal + UX) |
| `src/utils/preprocessing.py` | Conversion from detector output to the RL state vector |

### State Vector Layout (obs_dim = 2065)
```
[d_score(1) | l_risk_onehot(5) | c_primary_probs(10) | e_H_pool(1024) | e_P_pool(1024) | t_norm(1)]
= 1 + 5 + 10 + 1024 + 1024 + 1 = 2065
```
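
The layout above can be sketched as a plain concatenation. This is a minimal illustration, not the project's `build_obs_vector`: the name `build_obs_sketch`, the list-based inputs, and the argument order are assumptions made for the example.

```python
from typing import List, Sequence

# Segment widths copied from the layout above; keys mirror its field names.
SEGMENTS = {
    "d_score": 1,
    "l_risk_onehot": 5,
    "c_primary_probs": 10,
    "e_H_pool": 1024,
    "e_P_pool": 1024,
    "t_norm": 1,
}
OBS_DIM = sum(SEGMENTS.values())  # 1 + 5 + 10 + 1024 + 1024 + 1


def build_obs_sketch(
    d_score: float,
    l_risk: int,
    c_primary_probs: Sequence[float],
    e_h_pool: Sequence[float],
    e_p_pool: Sequence[float],
    t_norm: float,
) -> List[float]:
    """Hypothetical helper: flatten detector outputs into one observation."""
    onehot = [0.0] * SEGMENTS["l_risk_onehot"]
    onehot[l_risk] = 1.0  # l_risk is an integer level in 0..4
    obs = (
        [float(d_score)]
        + onehot
        + list(c_primary_probs)
        + list(e_h_pool)
        + list(e_p_pool)
        + [float(t_norm)]
    )
    if len(obs) != OBS_DIM:
        raise ValueError(f"expected {OBS_DIM} features, got {len(obs)}")
    return obs
```

The length check is there because a mismatch between the encoder width and the expected hidden size only shows up as a wrong total dimension.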

### Important Bug Fixes (done)
- `intervention_agent.py`: the first `actor` layer, `Linear(256,256)`, used to receive the raw 2065-dim obs directly and crashed.
  A `_encode_obs()` method was added so that `forward()` first parses the flat obs → StateEncoder → 256-dim latent.
- The default `detector_hidden` in `companion_env.py` and `InterventionAgent` was changed from 768 to 1024.

### Training Parameters (intervention_config.yaml)
```yaml
behavior_cloning:
  enabled: true
  epochs: 5
  per_gpu_batch_size: 256
  lr: 0.001          # decimal form: PyYAML 6.x reads 1e-3 as a string (debugging record #5)

ppo:
  total_timesteps: 200000
  n_rollout_steps: 2048
  n_epochs: 4
  batch_size: 256
  lr: 0.0003         # decimal form, same reason as above
  clip_eps: 0.2
  entropy_coef: 0.01
  gamma: 0.99
  gae_lambda: 0.95

reward:
  w1: 2.0   # safety gain
  w2: 3.0   # false negative penalty
  w3: 4.0   # crisis bonus (R1)
  w4: 1.5   # over-refusal penalty
  w5: 0.5   # UX cost
```
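
To make the roles of `gamma` and `gae_lambda` concrete, here is a minimal sketch of generalized advantage estimation over one rollout. It is the textbook recursion, written under the assumption of a single episode with no resets inside the rollout; it is not taken from `src/rl/ppo_trainer.py`, and `compute_gae` is a hypothetical name.

```python
from typing import List, Sequence


def compute_gae(
    rewards: Sequence[float],
    values: Sequence[float],
    last_value: float,
    gamma: float = 0.99,
    gae_lambda: float = 0.95,
) -> List[float]:
    """Generalized advantage estimation for one rollout without episode resets."""
    advantages = [0.0] * len(rewards)
    next_value = last_value  # bootstrap value for the state after the rollout
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        running = delta + gamma * gae_lambda * running       # discounted sum of TD errors
        advantages[t] = running
        next_value = values[t]
    return advantages
```

With `gae_lambda=0.95` the estimate leans toward multi-step returns (lower bias, higher variance); setting it to 0 would reduce each advantage to the one-step TD error.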

### Training Command (server, latest version)
```bash
cd /root/siton-data-2849d4ce327c4ccfb233ce33868fe7fe/zsy/CompanionGuard-RL

# Note: SHM and P2P must both be disabled, otherwise NCCL on RTX 5090 reports "CUDA illegal memory access"
CUDA_VISIBLE_DEVICES=0,1,2,3 NCCL_SHM_DISABLE=1 NCCL_P2P_DISABLE=1 \
/opt/conda/envs/dlapo-py310-cu128/bin/accelerate launch \
    --num_processes=4 --mixed_precision=bf16 \
    scripts/train_intervention.py \
    --config configs/intervention_config.yaml \
    --train-data data/processed/CompanionRisk-Bench/train.jsonl \
    > experiments/train_intervention_$(date +%Y%m%d_%H%M%S).log 2>&1 &
```

### Evaluation Command (after training completes)
```bash
python scripts/evaluate.py \
    --detector-ckpt checkpoints/detector/best.pt \
    --agent-ckpt checkpoints/intervention/final.pt \
    --test-data data/processed/CompanionRisk-Bench/test.jsonl \
    --config configs/detector_config_server.yaml \
    --intervention-config configs/intervention_config.yaml \
    --output experiments/eval_intervention_v1.json
```

### Success Criteria (Module C)
| Metric | Target | Notes |
|------|------|------|
| safety_recall (share of high-risk samples handled correctly) | > 0.85 | L3/L4 receives REWRITE/REJECT/CRISIS |
| over_refusal_rate (safe content wrongly blocked) | < 0.10 | y_risk=0 receives REWRITE or stronger |
| action_accuracy (vs a_recommend) | > 0.70 | Agreement rate with the annotated recommended action |
| crisis_precision (precision of choosing CRISIS for R1) | > 0.80 | Key safety guarantee |
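
The first three criteria can be computed directly from per-sample records. The sketch below is an illustration under stated assumptions: actions are indexed PASS=0 through CRISIS=4, each record carries `l_risk`, `y_risk`, `action`, and `a_recommend`, and `intervention_metrics` is a hypothetical name rather than the function in `src/utils/metrics.py`. `crisis_precision` is left out because its exact definition is not spelled out here.

```python
from typing import Dict, List

PASS, WARN, REWRITE, REJECT, CRISIS = range(5)  # assumed action indices


def intervention_metrics(samples: List[Dict[str, int]]) -> Dict[str, float]:
    """Each sample holds l_risk (0..4, ground truth), y_risk (0/1),
    action (chosen by the policy) and a_recommend (annotated)."""
    high = [s for s in samples if s["l_risk"] >= 3]   # L3/L4 samples
    safe = [s for s in samples if s["y_risk"] == 0]   # safe samples

    def rate(hits: int, total: int) -> float:
        return hits / total if total else 0.0

    return {
        # L3/L4 handled with REWRITE, REJECT or CRISIS
        "safety_recall": rate(sum(s["action"] >= REWRITE for s in high), len(high)),
        # safe content that received REWRITE or stronger
        "over_refusal_rate": rate(sum(s["action"] >= REWRITE for s in safe), len(safe)),
        # exact agreement with the annotated recommendation
        "action_accuracy": rate(
            sum(s["action"] == s["a_recommend"] for s in samples), len(samples)
        ),
    }
```

Note that under this definition a WARN on an L3/L4 sample counts against `safety_recall`, since WARN sits below REWRITE.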

### Module C Debugging Log (chronological)

| # | Error | Root cause | Fix |
|---|------|------|---------|
| 1 | `ModuleNotFoundError: gymnasium` | The dlapo environment lacks this package | `cp -r .../gymnasium .../site-packages/` |
| 2 | `ModuleNotFoundError: wandb` (complex dependency chain) | The environment lacks wandb and its dependencies | Imports in `train_intervention.py` + `ppo_trainer.py` wrapped in `try/except`; `use_wandb: false` |
| 3 | `OSError: Can't load hfl/chinese-macbert-large` | The server has no public internet access | `intervention_config.yaml` switched to a local absolute path |
| 4 | `RuntimeError: No backend type associated with device type cpu` | `torch.distributed.broadcast` does not support CPU tensors under the NCCL backend | Broadcast section of `train_intervention.py` now calls `.to(accelerator.device)` before broadcasting |
| 5 | `TypeError: '<=' not supported between float and str` | PyYAML 6.x parses `1e-3` as a string | `intervention_config.yaml` changed to `0.001` / `0.0003` |
| 6 | `AttributeError: SequentialSampler has no set_epoch` | The DataLoader uses SequentialSampler rather than DistributedSampler | `train_intervention.py` adds an `if hasattr(loader.sampler, "set_epoch"):` guard |
| 7 | `RuntimeError: cannot pin torch.cuda.FloatTensor` | `pin_memory=True` requires CPU tensors, but the tensors were already on the GPU | `train_intervention.py` L~116 calls `.cpu()` before building the TensorDataset |
| 8 | **`CUDA error: an illegal memory access` (when PPO starts after BC)** | `accelerator.wait_for_everyone()` → `torch.distributed.barrier()` crashes under NCCL on RTX 5090, unrelated to NCCL_P2P | **Fix**: run on a single GPU with `--num_processes=1`, bypassing the NCCL barrier entirely |
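
Row 5 can also be guarded against in code. The sketch below coerces numeric hyperparameters after loading, so a value that arrives as the string `"1e-3"` still becomes a float. `coerce_floats` and the key list are hypothetical and not part of `train_intervention.py`; the input dict stands in for what a YAML loader might return.

```python
from typing import Any, Dict, Iterable


def coerce_floats(section: Dict[str, Any], keys: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of `section` with the named keys converted to float."""
    fixed = dict(section)
    for key in keys:
        if key in fixed:
            fixed[key] = float(fixed[key])  # float() accepts "3e-4" as well as 0.2
    return fixed


# Stand-in for a loaded config section where the loader kept `3e-4` as a string.
ppo_cfg = coerce_floats(
    {"lr": "3e-4", "clip_eps": 0.2, "n_epochs": 4},
    keys=["lr", "clip_eps"],
)
```

Writing the values in decimal form, as the fix in row 5 does, removes the ambiguity at the source; the coercion is only a second line of defense.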

### Current Status (2026-05-12, final)

**✅ Module C training complete (single-GPU mode):**
```
Running on 1 GPU(s), mixed_precision=bf16
=== Stage 1: Behavior Cloning (1 GPU) ===    ← BC converges normally
=== Stage 2: PPO Fine-tuning (GPU-0) ===
[PPO] Update 98 | Steps 200704/200000       ← PPO finished
Training complete. Final model: checkpoints/intervention/final.pt
```

**Training command (verified working):**
```bash
cd /root/siton-data-2849d4ce327c4ccfb233ce33868fe7fe/zsy/CompanionGuard-RL
export PYTHONPATH=$PWD
CUDA_VISIBLE_DEVICES=0 \
/opt/conda/envs/dlapo-py310-cu128/bin/accelerate launch \
    --num_processes=1 --mixed_precision=bf16 \
    scripts/train_intervention.py \
    --config configs/intervention_config.yaml \
    --train-data data/processed/CompanionRisk-Bench/train.jsonl
```

---

## 5. Server Information

### Server 1 (main training machine, currently occupied)
|
||||
|
||||
| 项目 | 值 |
|
||||
|------|----|
|
||||
| SSH | `ssh -p 20083 root@10.82.3.180` |
|
||||
| 密码 | `m2dGcwyrhI` |
|
||||
| 项目目录 | `/root/siton-data-2849d4ce327c4ccfb233ce33868fe7fe/zsy/CompanionGuard-RL` |
|
||||
| MacBERT 路径 | `/root/siton-data-2849d4ce327c4ccfb233ce33868fe7fe/zsy/macbert-large` |
|
||||
| 环境 | `/opt/conda/envs/dlapo-py310-cu128`(torch 2.7.1+cu128,transformers 5.8.0) |
|
||||
| GPU | 4 × RTX 5090 32GB |

### Server 2 (currently in use)
|
||||
|
||||
| 项目 | 值 |
|
||||
|------|----|
|
||||
| SSH | `ssh -p 20060 root@10.82.3.180` |
|
||||
| 密码 | `zwfn65xjTY` |
|
||||
| 项目目录 | `/root/siton-data-740d234e02d749f08fe5347b0c74c49f/zsy/my-reasearch/companionguard-rl` |
|
||||
| MacBERT 路径 | 需同步(见下文) |
|
||||
| 环境 | `/root/siton-data-740d234e02d749f08fe5347b0c74c49f/zsy/env/dlapo-py310-cu128`(从服务器1迁移) |
|
||||
| GPU | 2 × RTX 5090 32GB |
|
||||
| 存储 | NFS 1TB(`siton-data-740d234e02d749f08fe5347b0c74c49f`) |

> **Server 2 training command (paths verified working):**
> ```bash
> PROJ=/root/siton-data-740d234e02d749f08fe5347b0c74c49f/zsy/my-reasearch/companionguard-rl
> PY=/root/siton-data-740d234e02d749f08fe5347b0c74c49f/zsy/env/dlapo-py310-cu128/bin
> cd $PROJ && export PYTHONPATH=$PROJ
>
> # Single GPU (recommended, avoids NCCL)
> CUDA_VISIBLE_DEVICES=0 $PY/accelerate launch --num_processes=1 --mixed_precision=bf16 \
>   scripts/train_intervention.py --config configs/intervention_config.yaml \
>   --train-data data/processed/CompanionRisk-Bench/train.jsonl
>
> # In detector_config_server.yaml, set model.name to:
> #   /root/siton-data-740d234e02d749f08fe5347b0c74c49f/zsy/macbert-large
> ```

### Notes on the Two Servers
- Same host `10.82.3.180`, different Docker containers, different ports
- The containers can reach each other: server 1 can ssh to `172.17.0.1:20060` to reach server 2
- Same host key: `SHA256:nAMVofPMCFZxa0DyOO2Olepfnp1MzZGdMyW7j5OekQI`

---

## 6. Code Sync Status (updated the evening of 2026-05-12)

### Local ↔ Server 1 Sync (done)
| Operation | File | Notes |
|------|------|------|
| Server 1 → local | `src/rl/ppo_trainer.py` | Server-side debugged version (MD5 differs), downloaded |
| Local → server 1 | `checkpoints/intervention/final_v2.pt` | v2 weights under a versioned name, uploaded |
| Skipped | `checkpoints/detector/best.pt / final.pt` | Byte-identical (1,352,746,854 B) |
| Skipped | `src/utils/preprocessing.py` | MD5 matches |

### Server 1 → Server 2 Sync (done)
| Item | Status |
|------|------|
| `src/` (18 .py files) | ✅ |
| `scripts/` | ✅ |
| `configs/` | ✅ |
| `data/processed/CompanionRisk-Bench/` (9,896 samples) | ✅ |
| `experiments/` (eval/train logs + json) | ✅ |
| `checkpoints/detector/best.pt` (1.35GB) | ✅ |
| `checkpoints/detector/final.pt` (1.35GB) | ✅ |
| `checkpoints/intervention/final.pt + final_v2.pt` | ✅ |
| `requirements.txt` | ✅ |
| conda env `dlapo-py310-cu128` (7.7GB) | ✅ `/root/siton-data-740d234e02d749f08fe5347b0c74c49f/zsy/env/dlapo-py310-cu128/` (torch 2.7.1+cu128 ✓, GPU×2 ✓) |
| MacBERT weights (1.3GB) | ✅ `/root/siton-data-740d234e02d749f08fe5347b0c74c49f/zsy/macbert-large/` |

### Key File List (as of 2026-05-12)

| File | Status | Notes |
|------|------|------|
| `checkpoints/detector/best.pt` | ✅ Servers 1+2 + local | Best v4 detector weights (1.35GB) |
| `data/processed/CompanionRisk-Bench/` | ✅ Servers 1+2 + local | v4 dataset (9,896 samples) |
| `scripts/train_intervention.py` | ✅ Ready | Module C training script |
| `configs/intervention_config.yaml` | ✅ Ready | Full Module C configuration |
| `src/models/intervention_agent.py` | ✅ Bug fixed | Actor-Critic (obs_dim=2065→256→actions) |
| `src/rl/companion_env.py` | ✅ Ready | Offline RL environment |
| `src/rl/ppo_trainer.py` | ✅ Ready | PPO trainer |
| `src/rl/reward.py` | ✅ Ready | Multi-objective reward function |
| `src/utils/preprocessing.py` | ✅ Bug fixed (v2) | build_obs_vector now uses det_l_risk |
| `src/utils/metrics.py` | ✅ Bug fixed (v2) | Added per_level_action_dist + action_accuracy |
| `scripts/evaluate.py` | ✅ Bug fixed (v2) | Rule policy now uses det_l_risk; reports the new metrics |
| `experiments/eval_v4_all.log` | ✅ Local | Full v4 evaluation log |
| `experiments/eval_v4_public.log` | ✅ Local | v4 evaluation log with the public filter |
| `checkpoints/intervention/final.pt` | ✅ Server + local | Final Module C PPO weights (5.1MB) |
| `experiments/eval_intervention_v1.json` | ✅ Local | Module C evaluation v1 (buggy, deprecated) |
| `experiments/eval_intervention_v2.json` | ✅ Local | Module C evaluation v2 (code fixed, but the model still uses the old weights; deprecated) |
| `experiments/eval_intervention_v3.json` | ✅ Local | Module C evaluation v3 (retrained + fixed, **used in the paper**) |
| `checkpoints/intervention/final_v2.pt` | ✅ Server + local | Module C PPO v2 weights (retrained with det_l_risk, **used in the paper**) |
| `experiments/train_intervention_1gpu_20260512_165204.log` | ✅ Local | Module C training v1 log (old, deprecated) |
| `experiments/train_intervention_v2_20260512_172636.log` | ✅ Local | Module C training v2 log (det_l_risk, **used in the paper**) |

---

## 5 (addendum). Module C Evaluation Bug Fixes

### Two Problems in v1 (both fixed)

**Bug A: `build_obs_vector` used the ground-truth `l_risk`**
- **Location**: `src/utils/preprocessing.py:127`
- **Symptom**: the RL state vector contained the ground-truth level (unknowable at deployment), so the safety_recall/over_refusal results were not realistic
- **Fix**: use `sample.get("det_l_risk", sample["l_risk"])` (prefer the detector's predicted value)
- **Impact**: no retraining needed (detector binary_f1=0.9995, so the two are almost identical; the change is still the conceptually correct one)
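
The fallback in the fix can be written so that the ground-truth key is only read when no detector prediction is present. One detail worth knowing: `dict.get(key, default)` evaluates its default eagerly, so `sample.get("det_l_risk", sample["l_risk"])` raises `KeyError` on a sample that has no `l_risk`, even when `det_l_risk` is there. The helper name `risk_level_for_state` is an assumption for illustration, not the project's code.

```python
from typing import Any, Mapping


def risk_level_for_state(sample: Mapping[str, Any]) -> int:
    """Prefer the detector's predicted level; fall back to the label only if absent."""
    if "det_l_risk" in sample:
        return int(sample["det_l_risk"])
    # Only reached for samples that carry no detector prediction.
    return int(sample["l_risk"])
```

At deployment only `det_l_risk` exists, so this form keeps working on unlabeled inputs.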
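
**Bug B: intervention metrics `intervention_recall_high`=1.0 and `over_refusal`=0.0 were identical across all three methods**
- **Location**: `src/utils/metrics.py`
- **Symptom**: the threshold was too coarse (l_risk≥3 → action≥2), so every reasonable policy passed perfectly and the metric could not discriminate
- **Fix**: added `per_level_action_dist` (share of each action per ground-truth level) and `action_accuracy` (vs a_recommend)
- **Also**: in `evaluate.py`, the policy input of `run_rule_intervention` was changed to `det_l_risk`, matching deployment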

---

## 6. Module C Evaluation Results v2 (2026-05-12, for the paper)

### Intervention Task Summary Metrics

| Method | safety_recall (L3/L4) | over_refusal | action_accuracy | crisis_precision |
|------|---------------------|--------------|-----------------|-----------------|
| Rule-based (l≥3→REJECT) | 0.908 | 0.000 | n/a | n/a |
| Threshold Baseline | 0.908 | 0.000 | n/a | 0.624 |
| **Ours (RL, Module C)** | **1.000** | **0.000** | **0.587** | 0.470 |

> safety_recall now compares policy decisions made from `det_l_risk` against the ground-truth level, so Rule-based/Threshold drop to 0.908 (9.2% of L3/L4 samples are predicted below L3 by the detector, so the rule assigns PASS/WARN). RL staying at 1.0 suggests it has learned a combined judgment that goes beyond the l_risk threshold.

### Per-level Action Distribution (core table for the paper)
**Per-level Action Distribution (v3):**

```
Method: Rule-based (l_risk≥3→REJECT)
Level          n     PASS   WARN   RWRT   REJT   CRISIS
L0_Safe        237   1.000  0.000  0.000  0.000  0.000
L1_Mild        280   0.918  0.000  0.000  0.082  0.000
L2_Moderate    317   0.420  0.000  0.000  0.580  0.000
L3_High        456   0.114  0.000  0.000  0.886  0.000
L4_Critical    196   0.041  0.000  0.000  0.959  0.000

Method: Threshold Baseline
Level          n     PASS   WARN   RWRT   REJT   CRISIS
L0_Safe        237   1.000  0.000  0.000  0.000  0.000
L1_Mild        280   0.843  0.075  0.082  0.000  0.000
L2_Moderate    317   0.044  0.375  0.552  0.000  0.028
L3_High        456   0.009  0.105  0.739  0.000  0.147
L4_Critical    196   0.000  0.041  0.316  0.000  0.643

Method: Ours (RL)
Level          n     PASS   WARN   RWRT   REJT   CRISIS
L0_Safe        237   0.983  0.017  0.000  0.000  0.000
L1_Mild        280   0.754  0.004  0.218  0.000  0.025
L2_Moderate    317   0.000  0.000  0.915  0.000  0.085
L3_High        456   0.000  0.000  0.879  0.000  0.121
L4_Critical    196   0.000  0.000  0.597  0.000  0.403
```
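
A table like the one above can be produced by counting actions per ground-truth level and normalizing each row. The following is a minimal sketch; it reuses the name `per_level_action_dist` from this file, but the signature and the `(level, action)` input format are assumptions, not the code in `src/utils/metrics.py`.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

ACTIONS = ["PASS", "WARN", "RWRT", "REJT", "CRISIS"]  # column order of the table


def per_level_action_dist(
    pairs: Iterable[Tuple[int, int]], n_levels: int = 5
) -> Dict[int, List[float]]:
    """pairs holds (ground_truth_level, action_index); returns row-normalized shares."""
    counts = {level: Counter() for level in range(n_levels)}
    for level, action in pairs:
        counts[level][action] += 1
    dist = {}
    for level, counter in counts.items():
        total = sum(counter.values())
        # An empty level yields a row of zeros instead of dividing by zero.
        dist[level] = [counter[a] / total if total else 0.0 for a in range(len(ACTIONS))]
    return dist
```

Each row sums to 1 for any level that has at least one sample, which makes collapsed actions (a column of zeros) easy to spot.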

### Success Criteria Status (v2)

| Metric | Target | RL measured | Status |
|------|------|------|------|
| safety_recall (share of L3/L4 handled correctly) | > 0.85 | **1.000** | ✅ |
| over_refusal_rate (safe content wrongly blocked) | < 0.10 | **0.000** | ✅ |
| action_accuracy (vs a_recommend) | > 0.70 | **0.587** | ⚠️ |
| crisis_precision (precision of CRISIS on L4) | > 0.80 | **0.470** | ⚠️ |

### RL Policy Interpretation (v2, deprecated, see v3)
- v2 combines the old weights (trained with GT l_risk) with the new evaluation code, so training and evaluation are inconsistent; it is kept for comparison only

---

## 7. Module C Final Results v3 (retrained + correct evaluation, for the paper)

### Why Retrain
During training, the RL agent's state vector contained the ground-truth `l_risk` (not the detector prediction), while the detector's level_macro_f1 is only 0.55 (per-level predictions contain errors). Training conditions therefore did not match deployment, and the agent had to be retrained with `det_l_risk`.

### Evaluation v1 / v2 / v3 Evolution

| Version | Code | Model | Issue |
|------|------|------|------|
| v1 | Old (GT l_risk state, no per-level) | Old (trained on GT l_risk) | Two bugs, inflated metrics |
| v2 | New (det_l_risk state, with per-level) | Old (trained on GT l_risk) | train/eval mismatch |
| **v3** | New | New (trained on det_l_risk) | **Used in the paper** |

### Summary Metrics (v3, final)

| Method | safety_recall (L3/L4) | over_refusal | action_accuracy | crisis_precision | safety_ux_fscore |
|------|---------------------|--------------|-----------------|-----------------|-----------------|
| Rule-based (l≥3→REJECT) | 0.908 | 0.000 | n/a | n/a | 0.952 |
| Threshold Baseline | 0.908 | 0.000 | n/a | 0.624 | 0.952 |
| **Ours (RL v2)** | **1.000** | **0.004** | **0.575** | 0.421 | **0.998** |

### Per-level Action Distribution (v3, core table for the paper)

```
Method: Rule-based (l_risk≥3→REJECT)                       Method: Threshold Baseline
Level          n     PASS   WARN   RWRT   REJT   CRISIS    Level          n     PASS   WARN   RWRT   REJT   CRISIS
L0_Safe        237   1.000  0.000  0.000  0.000  0.000     L0_Safe        237   1.000  0.000  0.000  0.000  0.000
L1_Mild        280   0.918  0.000  0.000  0.082  0.000     L1_Mild        280   0.843  0.075  0.082  0.000  0.000
L2_Moderate    317   0.420  0.000  0.000  0.580  0.000     L2_Moderate    317   0.044  0.375  0.552  0.000  0.028
L3_High        456   0.114  0.000  0.000  0.886  0.000     L3_High        456   0.009  0.105  0.739  0.000  0.147
L4_Critical    196   0.041  0.000  0.000  0.959  0.000     L4_Critical    196   0.000  0.041  0.316  0.000  0.643

Method: Ours (RL v2, retrained)
Method: Ours (RL v2)
Level          n     PASS   WARN   RWRT   REJT   CRISIS
L0_Safe        237   0.987  0.008  0.004  0.000  0.000   ← over_refusal 0.4% (REWRITE)
L1_Mild        280   0.729  0.011  0.229  0.000  0.032   ← some false triggers on mild cases (limitation)
L2_Moderate    317   0.000  0.000  0.902  0.000  0.098   ← REWRITE dominates ✓
L3_High        456   0.000  0.000  0.871  0.000  0.129   ← REWRITE dominates ✓
L4_Critical    196   0.000  0.000  0.633  0.000  0.367   ← CRISIS rate too low (limitation)
L0_Safe        237   0.987  0.008  0.004  0.000  0.000
L1_Mild        280   0.729  0.011  0.229  0.000  0.032   ← L1 too aggressive (limitation)
L2_Moderate    317   0.000  0.000  0.902  0.000  0.098
L3_High        456   0.000  0.000  0.871  0.000  0.129
L4_Critical    196   0.000  0.000  0.633  0.000  0.367   ← not enough CRISIS (limitation)
```

### Success Criteria Status (v3 final)
**Root causes:**

| Metric | Target | RL v2 measured | Status |
|------|------|-----------|------|
| safety_recall (share of L3/L4 handled correctly) | > 0.85 | **1.000** | ✅ |
| over_refusal_rate (safe content wrongly blocked) | < 0.10 | **0.004** | ✅ |
| action_accuracy (vs a_recommend) | > 0.70 | **0.575** | ⚠️ |
| crisis_precision (precision of CRISIS on L4) | > 0.80 | **0.421** | ⚠️ |

### Paper Arguments
- **Strength**: safety_recall=1.0 (baselines reach only 0.908); RL still intervenes correctly despite the detector's level errors, which suggests it has learned to combine multiple signals
- **Limitation 1**: action_accuracy=0.575; false triggers at L1 (22.9% REWRITE), so mild risk is handled too aggressively
- **Limitation 2**: crisis_precision=0.421; the CRISIS trigger rate at L4 is only 36.7% (Threshold: 64.3%), because R1 training samples are scarce (136) and w3=4.0 is not enough
1. The reward conflicts semantically with a_recommend (the ideal action under the matrix-style reward does not match the annotation distribution)
2. The training reward used the detector-predicted c_primary (it should use the GT c_primary)
3. The REJECT action collapsed completely to 0%, and CRISIS is over-generalized and overused

---

## 8. Paper Writing Progress (started 2026-05-15)
## Module C: v5 Results (eval_intervention_v5.json, 2026-05-19)

### Paper Positioning
- **Framework name**: CompanionGuard-RL
- **Core storyline**: the pipeline is the core and the taxonomy is a prerequisite (not two co-equal cores)
- **Target journals**: SCI Q1/Q2, Information Processing & Management / Expert Systems with Applications
- **Language**: Chinese draft first (ctexart); switch to the elsarticle template once the journal is decided

### LaTeX File Structure
```
paper/
├── main.tex                 ← main file (ctexart, compiled with xelatex, 22 pages)
├── refs.bib                 ← references (15 entries)
└── sections/
    ├── 00_abstract.tex      ✅ complete
    ├── 01_intro.tex         ✅ complete (motivation + three contributions + structure)
    ├── 02_related.tex       ✅ complete (5 directions + positioning comparison table)
    ├── 03_taxonomy.tex      ✅ complete (R1-R10 + 14 labels, two tables)
    ├── 04_dataset.tex       ✅ complete (sources + annotation + statistics)
    ├── 05_moduleB.tex       ✅ method complete; SOTA column of the results table left as \todo{}
    ├── 06_moduleC.tex       ✅ method complete; v3 numbers filled in, v5 column left as \todo{}
    ├── 07_experiments.tex   🔄 skeleton (ablation table left as \todo{})
    ├── 08_discussion.tex    ✅ analysis of three limitations complete
    └── 09_conclusion.tex    ✅ skeleton complete
```

| Method | safety_recall | over_refusal | action_accuracy | crisis_precision |
| ---------- | ------------- | ------------ | --------------- | ---------------- |
| Rule-based | 0.908 | 0.000 | n/a | n/a |
| Threshold | 0.908 | 0.000 | n/a | 0.624 |
| BC-only v5 | 0.914 | 0.000 | 0.695 | 0.509 |
| **RL v5** | **0.833 ❌** | **0.000 ✅** | **0.717 ✅** | 0.531 |

**Anomaly**: safety_recall regressed from 1.000 in v3 to 0.833 (below the rule baseline). Root cause: the reward did not penalize WARN on L3/L4, and PPO amplified the annotation noise. See record.md for details.

---

## Module C v6: Final Results (✅ done)

**Key changes**: `code/src/rl/reward.py` adds `WARN_HIGH_PENALTY = -3.0` (a penalty for choosing WARN on L3/L4), and `evaluate.py` applies a safety floor at inference (WARN on L3/L4 → REWRITE). Result file: `experiments/eval_intervention_v6.json`.
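
The two changes can be sketched as small pure functions. This is an illustration only: the action indices (PASS=0 through CRISIS=4), the function names, and the choice to key both rules on an integer `level` are assumptions; the actual logic lives in `code/src/rl/reward.py` and `evaluate.py`, and which level signal the floor reads (detector prediction or label) is not stated here.

```python
PASS, WARN, REWRITE, REJECT, CRISIS = range(5)  # assumed action indices
WARN_HIGH_PENALTY = -3.0  # value quoted above


def warn_penalty(level: int, action: int) -> float:
    """Extra reward term during training: penalize WARN on L3/L4, zero otherwise."""
    return WARN_HIGH_PENALTY if level >= 3 and action == WARN else 0.0


def apply_safety_floor(level: int, action: int) -> int:
    """Inference-time floor: on L3/L4, a WARN is upgraded to REWRITE."""
    if level >= 3 and action == WARN:
        return REWRITE
    return action
```

The penalty shapes what the policy learns, while the floor is a hard constraint applied after the policy has chosen, so the two act at different stages.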

| Metric | Minimum acceptable | v6 actual | Status |
| ---------------- | ------ | --------- | ---------------------- |
| safety_recall | ≥ 0.95 | **0.953** | ✅ |
| over_refusal | ≤ 0.02 | **0.000** | ✅ |
| action_accuracy | ≥ 0.68 | **0.706** | ✅ |
| crisis_precision | ≥ 0.50 | **0.571** | ✅ |
| L3 WARN rate | ≤ 0.05 | **0.059** | ⚠️ Slightly over (explained in the discussion) |
| L4 WARN rate | ≤ 0.02 | **0.005** | ✅ |
| safety_ux_fscore | n/a | **0.976** | n/a |
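
The reported `safety_ux_fscore` values are consistent with a harmonic mean of `safety_recall` and `1 - over_refusal`. The formula below is inferred from the numbers in this file rather than quoted from `src/utils/metrics.py`, so treat it as an assumption.

```python
def safety_ux_fscore(safety_recall: float, over_refusal: float) -> float:
    """Harmonic mean of safety recall and the non-refusal rate (assumed definition)."""
    keep_rate = 1.0 - over_refusal  # share of safe content left untouched
    if safety_recall + keep_rate == 0:
        return 0.0
    return 2 * safety_recall * keep_rate / (safety_recall + keep_rate)
```

Under this reading, v6's 0.953 and 0.000 give 0.976, and the rule-based baseline's 0.908 and 0.000 give 0.952, matching the figures reported in this file.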

**BC-only (ablation baseline)**: safety_recall=0.940, action_accuracy=0.696, crisis_precision=0.509, ux_fscore=0.969.

**The paper uses these results.** The safety floor makes this a constrained intervention policy, which the paper's discussion section states plainly.

---

## Paper Writing Status

**Target journals:** SCI Q2/Q3, IP&M / ESWA
**Current progress:** all sections complete, no `\todo{}` left (2026-05-20)

| Section | File | Status |
| ------------ | ----------------------------- | ------------------------------------- |
| Abstract | `sections/00_abstract.tex` | ✅ Complete |
| Introduction | `sections/01_intro.tex` | ✅ Complete |
| Related Work | `sections/02_related.tex` | ✅ Complete |
| Taxonomy | `sections/03_taxonomy.tex` | ✅ Complete |
| Dataset | `sections/04_dataset.tex` | ✅ Complete |
| Module B | `sections/05_moduleB.tex` | ✅ Ablation table filled (Response-only/History+R/Full) |
| Module C | `sections/06_moduleC.tex` | ✅ Ablation table filled (BC-only/w/o Category/Full RL) |
| Experiments | `sections/07_experiments.tex` | ✅ RQ1/RQ2 + LLM-as-judge analysis all done |
| Discussion | `sections/08_discussion.tex` | ✅ v6 numbers updated; IRB statement handled separately per the target journal's requirements |
| Conclusion | `sections/09_conclusion.tex` | ✅ Complete |

**Only remaining item:** the IRB/ethics statement paragraph in `08_discussion.tex` (placeholder), to be written once the target journal is confirmed.

---

## Ablation Results (2026-05-20, all done)

### Module B Input-Signal Ablation

| Variant | Binary F1 | FNR | Level-W F1 | Fine-Macro F1 | Result file |
| --------------------- | ---------- | --------- | ---------- | ------------- | ------------------------------- |
| Response-only | 0.9990 | 0.000 | 0.5828 | 0.5025 | `eval_abl_b_response_only.json` |
| History+Response | 0.9995 | 0.000 | 0.5837 | 0.4667 | `eval_abl_b_history_r.json` |
| **Full P+H+R (Ours)** | **0.9995** | **0.000** | 0.5585 | 0.4633 | `eval_abl_b_full.json` |

Key findings: FNR=0 holds for every variant; context adds a marginal +0.0005 to binary_f1; Level-W F1 differs by at most 0.025 and Fine-Macro F1 by at most 0.039 across variants, gaps that are taken to be within training variance.

### Module C Reward Ablation

| Variant | SafetyRecall | OverRefusal | ActionAcc | CrisisPrec | UX F-score | Result file |
| ------------------- | ------------ | ----------- | --------- | ---------- | ---------- | ------------------------------------ |
| BC-only | 0.940 | 0.000 | 0.697 | 0.509 | 0.969 | `eval_intervention_v6.json` |
| w/o Category Reward | 0.951 | 0.000 | 0.712 | 0.486 | 0.975 | `eval_abl_c_wo_category_reward.json` |
| **Full RL (Ours)** | **0.953** | **0.000** | 0.706 | **0.571** | **0.976** | `eval_intervention_v6.json` |

Key findings: PPO raises safety_recall by 1.3pp over BC-only; the category reward raises CrisisPrecision by 8.5pp at the cost of 0.6pp ActionAcc (a safety-first trade-off).

**Local compilation:**

### Compile Command (local)
|
||||
```powershell
|
||||
cd D:\Myresearch\CompanionGuard-RL\paper
|
||||
$bin = "$env:LOCALAPPDATA\Programs\MiKTeX\miktex\bin\x64"
|
||||
@@ -526,21 +167,70 @@ $bin = "$env:LOCALAPPDATA\Programs\MiKTeX\miktex\bin\x64"
|
||||
& "$bin\xelatex.exe" -interaction=nonstopmode main.tex
|
||||
& "$bin\xelatex.exe" -interaction=nonstopmode main.tex
|
||||
```
> Note: MiKTeX 25.12 prints "major issue: So far, you have not checked for MiKTeX updates." on every compile. This is an update reminder and **does not affect PDF generation**; it can be ignored.

### About the \todo{} Placeholders
All pending content is marked with a red `\todo{}`, grouped as follows:
---

| Type | Location | Unblocked by |
|------|------|---------|
| Module B SOTA baseline | §5 main results table | Run the Llama Guard v2 / WildGuard evaluation (no training GPU needed, inference only) |
| Module C LLM-as-judge | §6 main results table | Call the Qwen2.5-72B API for evaluation (no GPU needed) |
| Module C v5 results | §6 results + §7 ablation | Waiting for a GPU to run Module C v5 |
| Ablation experiments | §7 | Waiting for a GPU (the Module B context ablation requires retraining) |
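## Pre-Submission Experiment-Strengthening Plan (assessed 2026-05-20) (details in `C:\Users\张思远\.claude\plans\sci2-3-precious-snail.md`)

### Experiments Required Before Submission (by priority)
1. **P0 (fatal)**: binary_f1 and related metrics for Llama Guard v2 / WildGuard on the test set
2. **P0 (fatal)**: Module C v5 (action_accuracy ≥ 0.70, crisis_precision ≥ 0.65)
3. **P1 (serious)**: LLM-as-judge baseline for Module C
4. **P1 (serious)**: Module C ablation (BC-only vs BC+PPO)
5. **P2 (recommended)**: Module B ablation (Response-only / Full context)
**Realistic positioning**: borderline for ESWA, hard for IP&M. Submitting as-is, the estimated acceptance rate is ~55% at ESWA and ~25% at IP&M.

**Three main experimental weaknesses** (by severity):

1. **Fairness of SOTA baselines**: WildGuard / ShieldGemma are English models evaluated on a Chinese test set, so FNR=0.98 cannot separate "ontology difference" from "language mismatch". This is the first thing reviewers will attack.
2. **Ablations undercut the claims**: three-stream CrossAttn fusion adds +0.0005 binary F1; PPO adds only +1.3pp safety_recall over BC; the category reward adds +0.2pp. The architecture and algorithm selling points lack ablation support.
3. **Lack of statistical rigor**: single seed, no variance, no significance tests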

### Priority Roadmap (medium-effort scope, ~2-3 weeks)

**Tier 1 (must do, credibility)**

| ID | Task | Output file | Reuses |
| ---- | ----------------------------------------------------------------------------------- | ---------------------------------------- | ------------------------------- |
| T1-A | Strong same-language SOTA baseline (GPT-4o-mini or Qwen2.5-72B as guard, with a companion-risk-taxonomy prompt + few-shot) | `experiments/eval_sota_llmguard.json` | `eval_llm_judge_baseline.py` skeleton |
| T1-B | English-translated subset (30-50 per category, ~300-500 in total), re-evaluated with WildGuard/ShieldGemma to separate "language artifact" from "ontology difference" | `experiments/eval_sota_*_en_subset.json` | `eval_sota_baselines.py` |
| T1-C | Strong LLM-as-judge: few-shot + injected det_l_risk + action semantics checklist | Modify `eval_llm_judge_baseline.py` | `llmjudge_cache.jsonl` |

**Tier 2 (recommended, rigor)**

| ID | Task | Output file | Notes |
| ---- | ------------------------------------------------------- | ------------------------------------------- | ----------------- |
| T2-D | Module C with multiple seeds (42/1234/5678) + mean±std + paired t-test | `eval_intervention_v6_seed{1234,5678}.json` | 3 serial runs on a single server GPU |
| T2-E | Per-category behavior analysis: classes BC already handles vs classes newly learned by PPO | `experiments/policy_behavior_analysis.json` | No retraining, post-processing only |
| T2-G | Add `--no-safety-floor` to `evaluate.py` and rerun v6 to check the quality of the policy itself | `eval_intervention_v6_nofloor.json` | One change in evaluate.py |
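
For T2-D, the per-seed numbers can be summarized with the standard library alone. The sketch below computes mean±std and a paired t statistic; `mean_std` and `paired_t_statistic` are hypothetical helper names, and no measured values are assumed here, since the seed runs are still planned.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_std(values: Sequence[float]) -> Tuple[float, float]:
    """Sample mean and sample standard deviation (n - 1 in the denominator)."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic for paired samples a[i], b[i]; compare against t with n - 1 dof."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long sequences with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
```

With three seeds there are only 2 degrees of freedom, so the two-sided 5% critical value is about 4.30; a modest but consistent gain may not reach significance at that sample size.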
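
**Tier 3 (not now)**: real-data expansion, DPO/IQL comparison, full cross-lingual generalization. These exceed the "medium-effort" scope and are only activated if aiming for IP&M.

### Expected Effect

- After Tier 1+2, the estimated ESWA acceptance rate goes from **55% → 75%**
- IP&M would reach only **40-50%** even with heavy strengthening, so it is out of scope for this round

### Honest Risks

- T2-D may backfire: if v6 was a lucky run and the three-seed mean falls back to around 0.93, the main metric regresses
- T1-B may backfire: if SOTA recall on the English version rises markedly, the "ontology difference" argument weakens
- These are the unavoidable cost of honest experiments; the paper's credibility matters more than a one-off acceptance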

---

## Server Quick Reference

| Item | Value |
| ------------- | ------------------------------------------------------------------------------ |
| SSH | `ssh server5090` (alias) or `ssh -p 20083 -i ~/.ssh/ai_tunnel_key root@10.82.3.180` |
| Authentication | ED25519 public key; local key at `C:\Users\张思远\.ssh\ai_tunnel_key` |
| SSH config alias | `~/.ssh/config` → `Host server5090`, with IdentityFile pointing to ai_tunnel_key |
| Proxy tunnel | `127.0.0.1:7890` on the server (HTTP proxy); pip/curl need `http_proxy=http://127.0.0.1:7890` |
| Storage UUID (current) | `siton-data-2849d4ce327c4ccfb233ce33868fe7fe` (after the 2026-05-19 server repair) |
| $PROJ | `/root/siton-data-2849d4ce327c4ccfb233ce33868fe7fe/zsy/CompanionGuard-RL` |
| MacBERT | `$PROJ/../macbert-large` |
| Python environment | `/opt/conda/envs/dlapo-py310-cu128/bin` |
| GPU | 4 × RTX 5090 32GB |

**Note**: after a server repair or reset the storage UUID may change; if so, update the absolute paths in `configs/intervention_config.yaml` and `configs/detector_config_server.yaml` accordingly.