feat: Module C v5/v6 training complete, ablations, SOTA baselines, paper updates
- Module C: BC+PPO training v5/v6 done; eval results in experiments/eval_intervention_v{5,6}.json
- Reward: v5 label-aligned constrained reward (code/src/rl/reward.py)
- Ablations: Module B (history_r, response_only, full) + Module C (wo_category_reward)
- SOTA baselines: WildGuard and ShieldGemma2b eval scripts and results
- Paper: update sections 05–08 (Module B/C description, experiments table, discussion)
- Docs: add record.md (change log), update state.md and exp.md; retire change.md
- Tools: add html-to-ppt utilities and run_shieldgemma2b.sh
- Configs: add ablation YAML configs for Module B and C
- Cleanup: remove stale reference/ PNG screenshots
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
86
exp.md
@@ -18,6 +18,7 @@
10. [Shell script cross-platform issues (CRLF)](#10-shell-脚本跨平台问题crlf)
11. [Python module path (PYTHONPATH)](#11-python-模块路径pythonpath)
12. [Graceful handling of optional dependencies (wandb, etc.)](#12-可选依赖的优雅处理wandb-等)
13. [HuggingFace `hf download` hangs on large files, and the curl workaround](#13-huggingface-hf-download-大文件卡死与-curl-替代方案)

---

@@ -422,6 +423,32 @@ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

**Option 3: add `__init__.py` to the project root (not recommended; it pollutes the namespace)**

### nohup + non-interactive SSH pitfall (verified 2026-05-20)

**Symptom:**
```
ModuleNotFoundError: No module named 'src'
```
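The error reproduces when the command is sent from a local script or Claude Code as `ssh server5090 'nohup accelerate launch ...'`. Logging in with `ssh + attach` and running the same command by hand in an interactive shell does not raise it.

**Root cause:**
`accelerate launch` starts multiple worker processes through `torch.distributed.run`. Each worker is a separate Python interpreter created via `fork`/`spawn`, so it **does not inherit the parent process's `sys.path`, and in this setup it did not see the parent shell's `export` either**. Under non-interactive SSH the `~/.bashrc` on the remote side is not necessarily sourced, so `PYTHONPATH` is never set.

**Fix: `cd` and set `PYTHONPATH` in the same command**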
```bash
# ✅ Correct: cd and PYTHONPATH are part of the same command, so the accelerate workers inherit them
# ($PROJ sits inside double quotes, so the local shell expands it before ssh runs)
ssh server5090 "cd $PROJ && PYTHONPATH=$PROJ NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 NCCL_SHM_DISABLE=1 \
    nohup /opt/conda/envs/dlapo-py310-cu128/bin/accelerate launch \
    --num_processes=4 --mixed_precision=bf16 \
    scripts/train_detector.py --config configs/xxx.yaml \
    > experiments/train.log 2>&1 &"

# ❌ Wrong: in the non-interactive session the export did not reach the accelerate workers
ssh server5090 "export PYTHONPATH=$PROJ; nohup accelerate launch ..."
```
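
To illustrate why the path has to travel through the environment, the sketch below starts a fresh interpreter, much as a launcher would, and checks whether a directory ends up on the child's `sys.path`. It is only an illustration under stated assumptions: `child_sees_path` and the temporary directory standing in for the project root are hypothetical names, and the code does not call `accelerate` or `torch.distributed.run`.

```python
import os
import subprocess
import sys
import tempfile


def child_sees_path(path, env):
    """Return True if a fresh interpreter started with `env` has `path` on sys.path."""
    code = (
        "import os, sys; "
        "target = os.path.realpath(sys.argv[1]); "
        "print(any(os.path.realpath(p) == target for p in sys.path))"
    )
    result = subprocess.run(
        [sys.executable, "-c", code, path],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip() == "True"


# Hypothetical stand-in for the project root.
proj = tempfile.mkdtemp()

# A change made only inside the parent interpreter.
sys.path.insert(0, proj)

# Start from the current environment, but make sure PYTHONPATH is unset.
base_env = {k: v for k, v in os.environ.items() if k != "PYTHONPATH"}

seen_via_sys_path = child_sees_path(proj, base_env)
seen_via_env = child_sees_path(proj, {**base_env, "PYTHONPATH": proj})
print(seen_via_sys_path, seen_via_env)
```

The first check mirrors a `sys.path.insert` done only in the launching process; the second mirrors prefixing the launch command with `PYTHONPATH=$PROJ`.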

**Note**: `NCCL_IB_DISABLE=1` must be set as well (InfiniBand compatibility issue on the RTX 5090), not only `NCCL_P2P_DISABLE` and `NCCL_SHM_DISABLE`.

---

## 12. Graceful handling of optional dependencies (wandb, etc.)
@@ -463,6 +490,65 @@ use_wandb: true

---

## 13. HuggingFace `hf download` hangs on large files, and the curl workaround

### Symptom A: a stale lock makes new processes wait forever
```
Still waiting to acquire lock on .../wildguard/.cache/huggingface/download/model-00001-of-00002.safetensors.lock (elapsed: 600.0 seconds)
```
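`hf download` writes a per-file download lock to `<local-dir>/.cache/huggingface/download/*.lock`. If a previous download process crashed or was killed, the lock file is left behind, and a new process waits on it indefinitely without downloading any data.

**Fix:**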
```bash
# 1. Find and kill all leftover hf processes (this server has no pgrep, so use ps aux)
ps aux | grep 'hf download' | grep -v grep | awk '{print $2}' | xargs kill 2>/dev/null

# 2. Delete the stale locks (removes only .cache; finished files already in place are untouched)
rm -rf <local-dir>/.cache
```
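
⚠️ **Note**: `hf download` first writes each file to `.cache/.../download/*.incomplete` and moves it to its final path only after the download completes. Before deleting `.cache`, run `ls -lh .cache/huggingface/download/*.incomplete` to check for large files that are nearly complete; deleting them discards the progress already downloaded.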
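
The check described in the note can be scripted. The sketch below lists lock files and partial downloads under the cache layout named above, and offers a more conservative cleanup that removes only the `*.lock` files. The helper names `inspect_download_cache` and `remove_locks` are hypothetical. Whether `hf download` then resumes from a surviving `.incomplete` file is an assumption that was not verified here.

```python
from pathlib import Path


def inspect_download_cache(local_dir):
    """List lock files and partial downloads under <local_dir>/.cache/huggingface/download."""
    download_dir = Path(local_dir) / ".cache" / "huggingface" / "download"
    if not download_dir.is_dir():
        return [], []
    locks = sorted(download_dir.rglob("*.lock"))
    # Largest partial files first, since those are the ones worth keeping.
    partials = sorted(
        ((p, p.stat().st_size) for p in download_dir.rglob("*.incomplete")),
        key=lambda item: item[1],
        reverse=True,
    )
    return locks, partials


def remove_locks(local_dir):
    """Delete only the *.lock files, leaving *.incomplete data in place."""
    locks, _ = inspect_download_cache(local_dir)
    for lock in locks:
        lock.unlink()
    return len(locks)
```

A typical use would be to print the `partials` list first, then decide between `remove_locks(local_dir)` and deleting `.cache` outright.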

### Symptom B: large-file connections hang (wget / hf download)
```
Connecting to huggingface.co (huggingface.co)|173.252.108.3|:443...
```
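The transfer stalls at the connecting stage, or the `.incomplete` file stays at 8KB for a long time without growing.

**Root cause**: `wget -e use_proxy=yes` and the transport underneath `hf download` are unstable for large-file connections that go through an HTTPS proxy. Small files (<1MB) get through; large shards (>1GB) hang.

### Solution: download large files directly with `curl --proxy`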

```bash
# Supports resuming (-C -) and follows CDN redirects (-L)
curl -L \
    --proxy http://127.0.0.1:7890 \
    -H "Authorization: Bearer <HF_TOKEN>" \
    -C - \
    "https://huggingface.co/<org>/<repo>/resolve/main/<filename>" \
    -o /path/to/<filename>

# Run in the background
nohup curl -L --proxy http://127.0.0.1:7890 \
    -H "Authorization: Bearer <HF_TOKEN>" -C - \
    "https://huggingface.co/allenai/wildguard/resolve/main/model-00001-of-00002.safetensors" \
    -o /path/to/wildguard/model-00001-of-00002.safetensors \
    > /tmp/curl_dl.log 2>&1 &
```

### Use `hf download` for small files, `curl` for large files

| File type | Recommended method |
|---------|---------|
| tokenizer/config, etc. (<10MB) | `hf download`, in batch |
| model shard (>1GB) | `curl -L --proxy ... -C -`, one file at a time |

### HF file URL pattern
```
https://huggingface.co/<org>/<repo>/resolve/main/<filename>
```
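
When several shards need fetching, the URL pattern and the curl flags above can be assembled programmatically. This is a minimal sketch, not part of the project: `hf_resolve_url` and `curl_argv` are hypothetical helpers, the `revision` parameter is an assumption (the notes only show `main`), and the default proxy address is simply the one used in the examples above.

```python
from urllib.parse import quote


def hf_resolve_url(repo_id, filename, revision="main"):
    """Build a file URL following the <org>/<repo>/resolve/<revision>/<filename> pattern."""
    return (
        f"https://huggingface.co/{repo_id}/resolve/"
        f"{quote(revision, safe='')}/{quote(filename)}"
    )


def curl_argv(url, dest, proxy="http://127.0.0.1:7890", token=None):
    """Return the curl argument list: follow redirects (-L), use the proxy, resume (-C -)."""
    argv = ["curl", "-L", "--proxy", proxy, "-C", "-"]
    if token:
        argv += ["-H", f"Authorization: Bearer {token}"]
    argv += [url, "-o", dest]
    return argv


url = hf_resolve_url("allenai/wildguard", "model-00001-of-00002.safetensors")
cmd = curl_argv(url, "/path/to/wildguard/model-00001-of-00002.safetensors")
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; building a list rather than a shell string avoids quoting problems with the token header.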

---

## Appendix: Quick reference for this project's servers

| Item | Value |