# CompanionGuard-RL Change Log and Next-Stage Plan

**Last updated: 2026-05-12**

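## Research Assessment for This Round

Module C remains the core innovation of this project and must not be demoted to an auxiliary experiment. If the target is an SCI Q2/Q3 journal, the paper has to move from "detecting high-risk replies" to "selecting an appropriate intervention action based on risk semantics", that is, from safety detection to adaptive intervention decision.

The current results do not mean the direction has failed; they mean Module C's action policy is not yet calibrated. Module B is already strong enough to serve as the upstream detector, so the next stage should concentrate on turning Module C into a publishable decision module.
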
## Location of the Latest Results

Latest test results:

```text
code/CompanionGuard-RL/experiments/eval_intervention_v4.json
```

Important confirmations:

- `eval_intervention_v4.json` is identical in content to `eval_intervention_v3.json`.
- v4 is not a retrained result that reflects the latest local reward-matrix change in `src/rl/reward.py`.
- The local `src/rl/reward.py` was switched to a matrix-style reward after 2026-05-12 21:30 to address REJECT collapse, low CRISIS precision, and L4 undertriage, but no retraining has been run and no new evaluation results have been generated yet.

## Summary of Current Results

### Module B Detector

Module B has reached a level that is usable for the current stage of the paper:

| Metric | Current result |
|--------|----------------|
| binary_f1 | 0.9995 |
| high_risk_recall | 1.0000 |
| false_negative_rate | 0.0000 |
| level_macro_f1 | 0.5496 |
| level_weighted_f1 | 0.5585 |
| fine_macro_f1 | 0.4633 |

Conclusion: the detector can enter Module C as a frozen upstream detector. Spending the bulk of the remaining time on further Module B fine-tuning is not recommended.

### Module C Intervention Policy

Current v4 results:

| Metric | Current result | Assessment |
|--------|----------------|------------|
| safety_recall(L3/L4) | 1.0000 | Safety coverage is very good |
| over_refusal_rate(L0) | 0.0042 | Strong interventions on safe samples are rare |
| action_accuracy | 0.5754 | Insufficient; below the 0.70 target |
| crisis_precision | 0.4211 | Insufficient; CRISIS is not triggered precisely enough |
| safety_ux_fscore | 0.9979 | Metric is too coarse to discriminate |

Problems exposed by the per-level action distribution:

| Level | Current RL behavior | Problem |
|-------|---------------------|---------|
| L0 Safe | 98.7% PASS, 0.4% REWRITE | Largely acceptable |
| L1 Mild | 72.9% PASS, 22.9% REWRITE, 3.2% CRISIS | Mild risk is handled too aggressively |
| L2 Moderate | 90.2% REWRITE, 9.8% CRISIS | Too heavy-handed for moderate risk |
| L3 High | 87.1% REWRITE, 12.9% CRISIS | No REJECT at all |
| L4 Critical | 63.3% REWRITE, 36.7% CRISIS | Too little CRISIS; many critical cases are only rewritten |

Key issues:

- The RL policy has learned "do not miss high-risk cases" but has not learned "the action type must fit".
- The `REJECT` action has collapsed to 0%, so the action space is not fully used.
- `CRISIS` is applied to some non-L4 samples, which lowers precision.
- `intervention_recall_high` and `safety_ux_fscore` are too lenient and mask the action-calibration problem.

## Root-Cause Diagnosis

### 1. The current reward conflicts with the semantics of the annotated actions

The distribution of `a_recommend` in the test set is:

| Level | Dominant annotated actions |
|-------|----------------------------|
| L0 | 100% PASS |
| L1 | 99.3% PASS |
| L2 | 93.4% WARN |
| L3 | 74.3% REWRITE, 17.5% REJECT, 8.1% CRISIS |
| L4 | 55.6% REJECT, 44.4% CRISIS |

The ideal actions under the latest reward matrix, however, are closer to:

```text
L0 -> PASS
L1 -> WARN
L2 -> REWRITE
L3 -> REJECT
L4 -> CRISIS
```

This design can fix the shortage of REJECT/CRISIS, but it will substantially lower `action_accuracy` because it is inconsistent with the dataset's existing definition of `a_recommend`.

The next stage cannot simply "increase the CRISIS reward". The action ontology has to be unified first: which scenarios call for WARN, REWRITE, REJECT, or CRISIS.

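
The size of this conflict can be estimated directly from the two listings above. The sketch below is a rough worked calculation, not project code: it asks how often a policy that always follows the fixed level-to-action mapping could agree with `a_recommend` at each level. The names `A_RECOMMEND`, `FIXED_MAPPING`, and `agreement_ceiling` are hypothetical, and where the table does not report an action's share, the code only assumes that share cannot exceed the mass left over by the reported actions.

```python
# Share of each action in `a_recommend`, per level (percent), copied from the table.
# Only the shares the table reports are listed; unreported actions are omitted.
A_RECOMMEND = {
    "L0": {"PASS": 100.0},
    "L1": {"PASS": 99.3},
    "L2": {"WARN": 93.4},
    "L3": {"REWRITE": 74.3, "REJECT": 17.5, "CRISIS": 8.1},
    "L4": {"REJECT": 55.6, "CRISIS": 44.4},
}

# Ideal action per level under the matrix-style reward.
FIXED_MAPPING = {"L0": "PASS", "L1": "WARN", "L2": "REWRITE", "L3": "REJECT", "L4": "CRISIS"}


def agreement_ceiling(level):
    """Upper bound (percent) on exact agreement with a_recommend at this level
    for a policy that always picks FIXED_MAPPING[level]."""
    shares = A_RECOMMEND[level]
    action = FIXED_MAPPING[level]
    if action in shares:
        return shares[action]
    # Share not reported: it is at most the mass not taken by reported actions.
    return round(100.0 - sum(shares.values()), 1)


for level in A_RECOMMEND:
    print(level, FIXED_MAPPING[level], agreement_ceiling(level))
```

Under these assumptions the ceiling is at most 0.7% on L1, 6.6% on L2, 17.5% on L3, and 44.4% on L4, which is why aligning the reward with `a_recommend` matters before any weight tuning.
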
### 2. Category signals in the training reward should use ground truth

`CompanionEnv.step()` currently passes `sample.get("c_primary_idx", 0)` into the reward. That field comes from the detector's prediction, not from the ground-truth `c_primary`. The training reward should use the ground-truth category while the state input keeps using the detector prediction, which is what matches the offline RL training setup:

- observation: detector outputs that are visible at deployment time
- reward: annotated ground truth that is available at training time

Otherwise, category-specific rewards such as R1/CRISIS and R6/R7/REJECT get diluted by detector category errors.

### 3. Existing evaluation metrics cannot demonstrate adaptive intervention

The current primary metric `safety_recall(L3/L4)` only requires action >= REWRITE, so REWRITE, REJECT, and CRISIS all count as correct. That is meaningful for safety coverage, but it cannot show that the policy is able to choose among actions.

The next stage must add:

- `action_accuracy` vs `a_recommend`
- `exact_action_accuracy_by_level`
- `R1_high_critical_crisis_recall`
- `crisis_precision_l4_or_r1`
- `reject_rate_for_R6_R7_high`
- `strong_intervention_rate_on_L1`
- `per_category_action_distribution`
- `BC-only vs PPO` ablation

## Next-Stage Goal: Module C v5

### Overall goal

Move Module C from "every high-risk case gets an intervention" to an adaptive intervention policy that "selects differentiated actions based on risk level and risk category".

In the paper, Module C should stand as a main contribution (Contribution 3 in the list below):

```text
Contribution 1: CompanionRisk-Bench and taxonomy for AI companion risks.
Contribution 2: Context-aware detector for companion-specific risk recognition.
Contribution 3: Adaptive intervention policy that maps detected risk states to graded safety actions.
```

## v5 Technical Roadmap

### Step 1: Freeze Module B and define Module C's action semantics

The following action ontology is recommended:

| Condition | Target action |
|-----------|---------------|
| L0 Safe | PASS |
| L1 Mild | PASS, with occasional WARN |
| L2 Moderate | Mostly WARN, with occasional REWRITE |
| L3 High | Mostly REWRITE; REJECT/CRISIS when the category is severe |
| L4 Critical + R1 | CRISIS |
| L4 Critical + R6/R7 | REJECT |
| L4 Critical + R5/R8/R9 | REJECT or CRISIS, decided by the risk content |
| L4 Critical + R2/R3/R4/R10 | REWRITE or REJECT; avoid blanket CRISIS |

Core principles:

- CRISIS should not be equivalent to "all L4". It should mainly be used for R1 self-harm/suicide crises and explicit urgent help-seeking.
- REJECT should not disappear. It fits scenarios that cannot be safely rewritten, such as violence, intimacy involving minors, privacy solicitation, and dangerous operations.
- REWRITE is the most valuable action in the companion setting. It fits repairable replies such as dependency reinforcement, isolation reinforcement, misguided reassurance, and pseudo-therapy.

### Step 2: Rewrite the reward as a label-aligned constrained reward

The new reward should not hard-code the level-to-action mapping as `L1->WARN, L2->REWRITE, L3->REJECT, L4->CRISIS`. It should take `a_recommend` as the primary target and then add safety constraints and category constraints.

Suggested reward structure:

```text
reward =
    exact_action_alignment
    + safety_constraint
    + category_specific_bonus
    - false_negative_penalty
    - over_intervention_penalty
    - crisis_misuse_penalty
    - reject_misuse_penalty
```

Suggested weights:

| Term | Suggested value | Notes |
|------|-----------------|-------|
| exact_action_alignment | +3.0 | Matching `a_recommend` should be the main reward |
| adjacent_action_partial | +1.0 | Small reward for a reasonable adjacent action, e.g. REWRITE/REJECT on L3 |
| PASS on L3/L4 | -5.0 | Missing a high-risk intervention must be penalized heavily |
| PASS on L2 | -2.0 | Light penalty for leaving moderate risk unhandled |
| strong intervention on L0 | -5.0 | Heavy penalty when safe content gets REWRITE/REJECT/CRISIS |
| CRISIS on non-R1 and non-L4 | -3.0 | Discourages indiscriminate use of CRISIS |
| CRISIS on R1 L3/L4 | +3.0 | Raises crisis-intervention recall |
| REJECT on R6/R7 L3/L4 | +2.0 | Strengthens refusal where no safe reply exists |
| REWRITE on R3/R4/R10 L2/L3 | +1.5 | Highlights companion-specific adaptive rewriting |

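
The weight table can be read as a single scoring function. The following is a minimal sketch under stated assumptions, not the contents of `src/rl/reward.py`: the function name `compute_reward_v5`, its signature, the string encoding of actions and categories, and the reading of "adjacent" as "one step apart in the PASS < WARN < REWRITE < REJECT < CRISIS ordering" are all illustrative. The table gives no weight for `reject_misuse_penalty`, so that term is left out rather than guessed.

```python
ACTIONS = ["PASS", "WARN", "REWRITE", "REJECT", "CRISIS"]  # ordered by intervention strength
STRONG = {"REWRITE", "REJECT", "CRISIS"}


def compute_reward_v5(action, a_recommend, level, category):
    """Hypothetical label-aligned constrained reward.

    action / a_recommend: one of ACTIONS; level: int 0..4; category: e.g. "R1".
    """
    reward = 0.0

    # Alignment with the annotated action is the main signal.
    if action == a_recommend:
        reward += 3.0
    elif abs(ACTIONS.index(action) - ACTIONS.index(a_recommend)) == 1:
        reward += 1.0  # assumed meaning of "adjacent": one step apart in ACTIONS

    # Safety constraints: missed interventions and over-intervention.
    if action == "PASS" and level >= 3:
        reward -= 5.0
    elif action == "PASS" and level == 2:
        reward -= 2.0
    if action in STRONG and level == 0:
        reward -= 5.0

    # Category-specific terms.
    if action == "CRISIS":
        if category == "R1" and level >= 3:
            reward += 3.0
        elif category != "R1" and level != 4:
            reward -= 3.0
    if action == "REJECT" and category in {"R6", "R7"} and level >= 3:
        reward += 2.0
    if action == "REWRITE" and category in {"R3", "R4", "R10"} and level in (2, 3):
        reward += 1.5

    return reward
```

Keeping every term as an explicit branch makes each row of the table individually testable, which matters once the weights start being tuned.
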
### Step 3: Fix the category signal in the training environment

Modify:

```text
src/rl/companion_env.py
```

In the training reward, add a conversion from the ground-truth `c_primary` to an index:

```python
# Assumption: PRIMARY_CATEGORY_LIST is exported by the same taxonomy module
# as category_to_index; adjust the import if it lives elsewhere.
from src.utils.taxonomy import PRIMARY_CATEGORY_LIST, category_to_index

# Prefer the annotated category for the reward. Fall back to the detector's
# predicted index only when the label is missing or not in the taxonomy.
gt_category = sample.get("c_primary", "None")
if gt_category in PRIMARY_CATEGORY_LIST:
    reward_category_idx = category_to_index(gt_category)
else:
    reward_category_idx = int(sample.get("c_primary_idx", 0))
```

Then pass `reward_category_idx` to `compute_reward()`.

### Step 4: Add BC-only and PPO v5 comparisons

Three types of policy need to be added or retained:

| Policy | Purpose |
|--------|---------|
| Rule/Threshold | Rule-based baseline |
| BC-only | Shows the ceiling or stability that supervised action learning can reach |
| BC + PPO v5 | Shows the safety and category-specific action gains from reward optimization |

BC-only matters. If PPO v5 does not clearly outperform BC-only, the paper's narrative can be reframed as "supervised warm-up with constrained RL fine-tuning" rather than insisting that PPO is the sole contribution.

### Step 5: Extend the evaluation metrics

Modify:

```text
src/utils/metrics.py
scripts/evaluate.py
```

New metrics and targets:

| Metric | Target |
|--------|--------|
| action_accuracy | >= 0.70 |
| exact_action_accuracy_L4 | >= 0.65 |
| R1_high_critical_crisis_recall | >= 0.80 |
| crisis_precision | >= 0.65, ideally >= 0.80 |
| reject_rate_R6_R7_high | >= 0.60 |
| strong_intervention_rate_L1 | <= 0.05 |
| safety_recall_L3_L4 | >= 0.95 |
| over_refusal_L0 | <= 0.02 |

These metrics support the "adaptive" claim better than `safety_ux_fscore` alone.

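
Most of the metrics in the table are conditional rates of the form "among samples matching a condition, how often was a given kind of action taken". The sketch below illustrates that shape only; it is not the code in `src/utils/metrics.py`. The record fields (`level`, `category`, `action`, `a_recommend`), the helper `rate`, and the reading of "high" as L3/L4 and of "strong intervention" as REWRITE/REJECT/CRISIS are assumptions.

```python
STRONG = {"REWRITE", "REJECT", "CRISIS"}


def rate(records, condition, hit):
    """Fraction of records matching `condition` that also match `hit`.
    Returns None when no record matches the condition."""
    pool = [r for r in records if condition(r)]
    if not pool:
        return None
    return sum(1 for r in pool if hit(r)) / len(pool)


def intervention_metrics(records):
    """Category-aware intervention metrics over a list of per-sample dicts."""
    return {
        "action_accuracy": rate(
            records, lambda r: True, lambda r: r["action"] == r["a_recommend"]),
        "R1_high_critical_crisis_recall": rate(
            records, lambda r: r["category"] == "R1" and r["level"] >= 3,
            lambda r: r["action"] == "CRISIS"),
        "reject_rate_R6_R7_high": rate(
            records, lambda r: r["category"] in {"R6", "R7"} and r["level"] >= 3,
            lambda r: r["action"] == "REJECT"),
        "strong_intervention_rate_L1": rate(
            records, lambda r: r["level"] == 1, lambda r: r["action"] in STRONG),
        "safety_recall_L3_L4": rate(
            records, lambda r: r["level"] >= 3, lambda r: r["action"] in STRONG),
        "over_refusal_L0": rate(
            records, lambda r: r["level"] == 0, lambda r: r["action"] in STRONG),
    }
```

Returning `None` for an empty pool keeps "no such samples in this split" distinguishable from a true rate of zero.
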
### Step 6: Retrain and produce v5

Suggested output files:

```text
checkpoints/intervention/final_v5.pt
experiments/train_intervention_v5_YYYYMMDD_HHMMSS.log
experiments/eval_intervention_v5.json
```

Suggested training command:

```bash
cd /root/siton-data-2849d4ce327c4ccfb233ce33868fe7fe/zsy/CompanionGuard-RL
export PYTHONPATH="$PWD"
CUDA_VISIBLE_DEVICES=0 \
/opt/conda/envs/dlapo-py310-cu128/bin/accelerate launch \
    --num_processes=1 --mixed_precision=bf16 \
    scripts/train_intervention.py \
    --config configs/intervention_config.yaml \
    --train-data data/processed/CompanionRisk-Bench/train.jsonl \
    > experiments/train_intervention_v5_$(date +%Y%m%d_%H%M%S).log 2>&1
```

Evaluation command:

```bash
python scripts/evaluate.py \
    --detector-ckpt checkpoints/detector/best.pt \
    --agent-ckpt checkpoints/intervention/final.pt \
    --test-data data/processed/CompanionRisk-Bench/test.jsonl \
    --config configs/detector_config_server.yaml \
    --intervention-config configs/intervention_config.yaml \
    --output experiments/eval_intervention_v5.json
```

Once finished, save a copy of `final.pt` as:

```bash
cp checkpoints/intervention/final.pt checkpoints/intervention/final_v5.pt
```

## v5 Success Criteria

### Criteria for use as the paper's main result

Meeting most of the following conditions is enough for v5 to serve as the main result:

| Metric | Minimum acceptable | Ideal |
|--------|--------------------|-------|
| safety_recall_L3_L4 | >= 0.95 | >= 0.98 |
| over_refusal_L0 | <= 0.02 | <= 0.01 |
| action_accuracy | >= 0.70 | >= 0.75 |
| crisis_precision | >= 0.65 | >= 0.80 |
| R1_high_critical_crisis_recall | >= 0.80 | >= 0.90 |
| strong_intervention_rate_L1 | <= 0.05 | <= 0.03 |
| REJECT usage | Non-zero and concentrated on R6/R7/L4 | Reasonable distribution across categories |

### If v5 falls short

Do not keep tuning PPO blindly. Switch to the fallback route:

1. Use BC-only as the main policy and PPO as an ablation.
2. Introduce a constrained decoding policy: after the model outputs action logits, apply a rule mask that forbids clearly unreasonable actions.
3. Describe Module C as a hybrid adaptive policy: learned policy + safety constraints.
4. Shift the headline metric from `crisis_precision` to category-aware intervention quality.

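
Item 2 of the fallback route, masking action logits with rules, could look roughly like the sketch below. It is illustrative only: the function names and the three rules in `allowed_actions` are assumptions chosen to mirror the constraints discussed in this plan (no strong intervention on L0, at least REWRITE on L3/L4, CRISIS reserved for R1 or L4), not an agreed rule set.

```python
import math

ACTIONS = ["PASS", "WARN", "REWRITE", "REJECT", "CRISIS"]


def allowed_actions(level, category):
    """Hypothetical rule mask: the set of actions the rules still permit."""
    allowed = set(ACTIONS)
    if level == 0:
        allowed -= {"REWRITE", "REJECT", "CRISIS"}  # no strong intervention on safe content
    if level >= 3:
        allowed -= {"PASS", "WARN"}  # high/critical risk needs at least REWRITE
    if category != "R1" and level < 4:
        allowed -= {"CRISIS"}  # CRISIS reserved for R1 or L4
    return allowed


def masked_argmax(logits, level, category):
    """Pick the highest-scoring action among those the rule mask permits.

    logits: one score per entry of ACTIONS, in the same order.
    """
    allowed = allowed_actions(level, category)
    masked = [score if name in allowed else -math.inf
              for name, score in zip(ACTIONS, logits)]
    return ACTIONS[masked.index(max(masked))]
```

For example, `masked_argmax([5.0, 1.0, 0.5, 0.2, 0.1], 4, "R6")` falls through to `REWRITE` because PASS and WARN are masked out at L4, even though the raw logits prefer PASS.
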
## Suggestions for Writing the Paper

The Module C narrative should avoid simply saying "RL beats rules". A stronger framing is:

```text
Existing safety systems usually stop at risk classification.
CompanionGuard-RL further learns a graded intervention policy that maps contextual risk states to differentiated actions, including pass-through, warning, rewriting, rejection, and crisis escalation.
```

Suggested experiment tables:

1. Detection comparison: L1 rules vs Module B.
2. Intervention summary: Rule, Threshold, BC-only, PPO v5.
3. Per-level action distribution.
4. Per-category action distribution for R1/R3/R4/R6/R7/R10.
5. Ablation: without category-specific reward, without alignment reward, without PPO.

## Additional Risks Found in the Second Review (2026-05-12)

### Risk 1: `action_accuracy` may become circular reasoning

Much of `a_recommend` comes from generation scripts and rule mappings rather than fully independent human expert annotation. If the v5 reward is built mainly on `a_recommend` and `action_accuracy` is then used to show the policy is good, reviewers may object that the training objective and the evaluation metric share the same source.

Mitigation:

- `action_accuracy` can stay, but not as the sole primary metric.
- Safety/category metrics must be reported alongside it: R1 crisis recall, R6/R7 reject rate, L1 strong intervention rate, and per-category action distribution.
- Sample 50-100 Module C predictions for human review as an intervention quality case audit.

### Risk 2: The justification for using PPO on a single-step MDP may be questioned

The current `CompanionEnv` is a single-step MDP: every sample terminates after one step. Strictly speaking, this is closer to a contextual bandit / reward-regularized policy learning than to typical multi-step RL. If the paper insists on emphasizing PPO, SCI reviewers may ask why a cost-sensitive classifier or a supervised policy network was not used instead.

Mitigation:

- Avoid overstating "long-horizon sequential decision-making" in the paper; describe Module C as a reward-optimized adaptive intervention policy.
- Add BC-only, cost-sensitive classifier, or rule-masked classifier baselines to the experiments.
- If time permits, extend to multi-turn intervention simulation later; for v5, make the single-step policy solid first.

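### Risk 3: BC-only may already be enough, leaving little visible gain from PPO

The current plan mentions BC-only but does not yet explicitly save a BC-only checkpoint. If PPO v5 merely perturbs the actions BC has already learned, it may not be possible to show that the RL part is necessary.

Mitigation:

- The training script should save `checkpoints/intervention/bc_only_v5.pt` when BC finishes.
- The evaluation table must include both `BC-only` and `BC+PPO v5`.
- The success criterion for PPO should be: no significant drop in `action_accuracy`, together with an improvement in safety/category metrics such as R1 crisis recall or R6/R7 reject rate.
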
### Risk 4: The definition of `crisis_precision` must match the action semantics

In the current `metrics.py`, `crisis_precision` counts only L4 as a correct CRISIS. If the v5 action semantics also allow R1 L3 to trigger CRISIS, the old `crisis_precision` will count a reasonable R1 L3 CRISIS as an error, so the metric would contradict the paper's definition.

Mitigation:

- Keep the old metric and rename it to `crisis_precision_l4`.
- Add `crisis_appropriateness = CRISIS on (L4 or R1 with L3/L4)`.
- Add `R1_high_critical_crisis_recall` to demonstrate crisis-response capability on its own.

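
The difference between the two definitions is easiest to see side by side. The sketch below is an assumption-laden illustration rather than the code in `metrics.py`: the record fields `level`, `category`, and `action` are hypothetical, and both functions are computed over the samples where the policy chose CRISIS.

```python
def _crisis_records(records):
    """All samples where the policy chose CRISIS."""
    return [r for r in records if r["action"] == "CRISIS"]


def crisis_precision_l4(records):
    """Old definition: a CRISIS action is correct only on L4 samples."""
    crisis = _crisis_records(records)
    if not crisis:
        return None
    return sum(1 for r in crisis if r["level"] == 4) / len(crisis)


def crisis_appropriateness(records):
    """New definition: CRISIS is appropriate on L4, or on R1 at L3/L4."""
    crisis = _crisis_records(records)
    if not crisis:
        return None
    appropriate = sum(
        1 for r in crisis
        if r["level"] == 4 or (r["category"] == "R1" and r["level"] >= 3)
    )
    return appropriate / len(crisis)
```

On a toy batch with CRISIS issued for one L4 R6 sample, one L3 R1 sample, and one L2 R3 sample, the old metric scores 1/3 while the new one scores 2/3; the gap is exactly the R1 L3 case that the v5 semantics treat as legitimate.
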
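### Risk 5: Training states use the detector's train-set predictions and may carry overfitting artifacts

Module C's training observations come from the frozen detector's predictions on the train set, and the detector itself was trained on that same set. The resulting `det_l_risk` and category probs may therefore be cleaner than in real deployment, which makes the Module C training environment overly optimistic.

Mitigation:

- Short term: state clearly in the paper that Module C is trained on frozen detector outputs and evaluated on the held-out test set.
- Medium term: add detector noise augmentation, for example by randomly perturbing the level one-hot or the category probs, to make the policy more robust.
- Most robust: build Module C's training states from out-of-fold detector predictions. This requires retraining several additional detectors and is not a priority right now.
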
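
The medium-term option, detector noise augmentation, might be sketched as follows. Everything here is hypothetical: the function name, the `flip_prob` and `noise_scale` defaults, and the particular noise model (shift the level by one step, add Gaussian jitter to the category probabilities and renormalize) are illustrative choices, not values or methods taken from the project.

```python
import random


def perturb_observation(level_onehot, category_probs, rng,
                        flip_prob=0.1, noise_scale=0.05):
    """Hypothetical augmentation of detector outputs used as RL observations.

    With probability `flip_prob` the level moves one step up or down (clamped
    to the valid range); category probabilities get Gaussian jitter and are
    renormalized so they still sum to 1.
    """
    n_levels = len(level_onehot)
    level = level_onehot.index(1)
    if rng.random() < flip_prob:
        level = min(n_levels - 1, max(0, level + rng.choice([-1, 1])))
    noisy_level = [1 if i == level else 0 for i in range(n_levels)]

    jittered = [max(0.0, p + rng.gauss(0.0, noise_scale)) for p in category_probs]
    total = sum(jittered)
    if total == 0.0:
        # Degenerate case: keep the original distribution.
        noisy_probs = list(category_probs)
    else:
        noisy_probs = [p / total for p in jittered]
    return noisy_level, noisy_probs


# Usage: pass a seeded generator so augmented runs are reproducible.
rng = random.Random(0)
noisy_level, noisy_probs = perturb_observation([0, 0, 1, 0, 0], [0.7, 0.2, 0.1], rng)
```

Passing the random generator in explicitly keeps the augmentation deterministic per seed, which helps when comparing BC-only and PPO runs.
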
### Risk 6: Overwriting checkpoints will corrupt result tracking

The current training script always saves to `checkpoints/intervention/final.pt`. If v5 is retrained directly, the old v3/v4 weights may be overwritten and the tables can no longer be reproduced.

Mitigation:

- Copy the current weights before training:

```bash
cp checkpoints/intervention/final.pt checkpoints/intervention/final_v4_before_v5.pt
```

- Save after BC:

```text
checkpoints/intervention/bc_only_v5.pt
```

- Save after PPO:

```text
checkpoints/intervention/final_v5.pt
```

### Risk 7: `wandb` and the config may cause training to hang

The current local `configs/intervention_config.yaml` sets `use_wandb: true`, and `scripts/train_intervention.py` contains a direct `import wandb`. In a restricted server environment, a missing wandb install, a missing login, or an unavailable network can easily make training fail or hang.

Mitigation:

- Pin `use_wandb: false` in the v5 config.
- Or add this to the launch command:

```bash
export WANDB_MODE=disabled
```

- Ideally, wrap `import wandb` in try/except so offline training still runs.

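
The try/except suggestion could take roughly this shape. It is a sketch, not the contents of `scripts/train_intervention.py`: the helper `log_metrics` and its return value are hypothetical, and the only wandb calls used are `wandb.log` and the module-level `wandb.run` attribute, which is `None` when no run has been initialized.

```python
try:
    import wandb  # optional dependency; training should still run without it
except ImportError:
    wandb = None


def log_metrics(metrics, use_wandb):
    """Send metrics to wandb when it is enabled, importable, and a run is active;
    otherwise print them. Returns the name of the sink that was used."""
    if use_wandb and wandb is not None and wandb.run is not None:
        wandb.log(metrics)
        return "wandb"
    print(metrics)
    return "stdout"
```

Guarding on both the import and an active run means a config that still has `use_wandb: true` degrades to console logging instead of crashing on an offline server.
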
### Risk 8: Without minimal unit tests, reward changes can silently break the metrics

The project currently has no `tests/` directory. v5 will change the reward, the env, and the metrics; without minimal tests it is easy to end up with "training runs, but the metrics no longer mean what they should".

Mitigation:

- Add `tests/test_reward_v5.py`, covering L0/L1/L2/L3/L4 and the R1/R6/R7 category rewards.
- Add `tests/test_intervention_metrics.py`, covering crisis appropriateness, R1 recall, reject rate, and strong intervention on L1.
- Get these small tests passing locally before launching remote training.

## Immediate Action Checklist

- [ ] Rewrite `src/rl/reward.py` as a label-aligned constrained reward.
- [ ] Modify `src/rl/companion_env.py` so the reward uses the ground-truth `c_primary`.
- [ ] Modify `src/utils/metrics.py` to add category-aware intervention metrics.
- [ ] Modify `scripts/evaluate.py` to output the new metrics and the BC-only comparison.
- [ ] Back up the current v4 weights so v5 does not overwrite the old results.
- [ ] Save `bc_only_v5.pt` when BC finishes.
- [ ] Disable wandb or switch it to offline mode.
- [ ] Add minimal unit tests for the reward and the metrics.
- [ ] Train Module C v5.
- [ ] Generate `experiments/eval_intervention_v5.json`.
- [ ] Update `2026-05-12-state.md` or create `2026-05-13-state.md`.
- [ ] Decide the paper's main table and limitation wording based on the v5 results.