# CompanionGuard-RL: Reusable Lessons Learned

**Created: 2026-05-12**
**Source: real pitfalls recorded while training and debugging Module B and Module C**

---

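## Contents

1. [RTX 5090 / NCCL Communication Issues](#1-rtx-5090--nccl-communication-issues)
2. [Multi-GPU Distributed Training with HuggingFace Accelerate](#2-multi-gpu-distributed-training-with-huggingface-accelerate)
3. [PyYAML Config File Pitfalls](#3-pyyaml-config-file-pitfalls)
4. [Server File Transfer (No rsync Available)](#4-server-file-transfer-no-rsync-available)
5. [SSH Connections and Persistent Sessions](#5-ssh-connections-and-persistent-sessions)
6. [Python Dependencies and Missing Packages](#6-python-dependencies-and-missing-packages)
7. [Tensor Device Consistency in Distributed Training](#7-tensor-device-consistency-in-distributed-training)
8. [DataLoader Compatibility with Distributed Training](#8-dataloader-compatibility-with-distributed-training)
9. [Loading Models on an Offline Server](#9-loading-models-on-an-offline-server)
10. [Cross-Platform Shell Script Issues (CRLF)](#10-cross-platform-shell-script-issues-crlf)
11. [Python Module Path (PYTHONPATH)](#11-python-module-path-pythonpath)
12. [Handling Optional Dependencies Gracefully (wandb etc.)](#12-handling-optional-dependencies-gracefully-wandb-etc)
13. [HuggingFace `hf download` Stalls on Large Files and the curl Workaround](#13-huggingface-hf-download-stalls-on-large-files-and-the-curl-workaround)

---
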
## 1. RTX 5090 / NCCL Communication Issues

### Symptom

```
[rank0]: CUDA error: an illegal memory access was encountered
```

During multi-GPU training the job crashes abruptly at some stage (for example when moving from BC warmup into PPO, or after switching datasets). The same run on a single GPU does not crash.

### Root cause

The P2P topology of the RTX 5090 (a consumer card without NVLink, so P2P runs over PCIe) appears to be incompatible with NCCL's default shared-memory (SHM) and P2P direct transports. The result is out-of-bounds memory access across GPUs.

### Fix

```bash
# Disable both SHM and P2P so that NCCL falls back to socket communication
export NCCL_SHM_DISABLE=1
export NCCL_P2P_DISABLE=1
```

**Set them in front of `accelerate launch` (recommended form):**

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 NCCL_SHM_DISABLE=1 NCCL_P2P_DISABLE=1 \
  accelerate launch --num_processes=4 --mixed_precision=bf16 \
  scripts/train_xxx.py ...
```
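When the launch is driven from a Python wrapper instead of being typed by hand, the same flags can be assembled as an environment mapping. The sketch below is only an illustration: the helper name `nccl_env` and its parameters are hypothetical, and only the `NCCL_*` variable names come from this section.

```python
import os


def nccl_env(disable_shm=True, disable_p2p=True, debug=False, base=None):
    """Return a copy of the environment with the NCCL workaround flags set.

    `nccl_env` is a hypothetical helper; only the variable names are real.
    """
    env = dict(os.environ if base is None else base)
    if disable_shm:
        env["NCCL_SHM_DISABLE"] = "1"
    if disable_p2p:
        env["NCCL_P2P_DISABLE"] = "1"
    if debug:
        # Verbose NCCL logging, useful when a crash still has to be located.
        env["NCCL_DEBUG"] = "INFO"
    return env


env = nccl_env(base={"CUDA_VISIBLE_DEVICES": "0,1,2,3"})
print(sorted(env))
```

The returned mapping could then be passed as `env=` to `subprocess.Popen` or `subprocess.run` when starting the launcher.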
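
### Troubleshooting order

1. Add `NCCL_SHM_DISABLE=1` first. If the crash persists:
2. Add `NCCL_P2P_DISABLE=1` as well. This usually resolves it.
3. If problems remain, set `NCCL_DEBUG=INFO` to see which collective operation fails.

### Performance impact

With P2P disabled, inter-GPU traffic is staged through the host instead of going directly between devices, so bandwidth drops somewhat. For training at the batch_size=256 scale the slowdown is no more than 10%.

---
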
## 2. Multi-GPU Distributed Training with HuggingFace Accelerate

### The `accelerate` path problem

When the server has several conda environments, typing a bare `accelerate` may pick up the binary from the wrong environment, or fail with `command not found`.

**Correct approach: use the full path inside the conda environment**

```bash
# Locate the right binary
find /opt/conda/envs -name "accelerate" -type f 2>/dev/null

# Launch with the full path
/opt/conda/envs/dlapo-py310-cu128/bin/accelerate launch ...
```

### Setting PYTHONPATH

Under `accelerate launch`, the worker process of each rank typically has the script's own directory, not the project root, at the front of `sys.path`. Unless `PYTHONPATH` contains the project root, importing the custom `src/` package fails with `ModuleNotFoundError`.

```bash
PYTHONPATH=/path/to/project accelerate launch ...
```

### Recommended full launch command template

```bash
cd /path/to/project
# Define the log path first so that the final echo can report it
LOG=experiments/train_$(date +%Y%m%d_%H%M%S).log
PYTHONPATH=$(pwd) \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NCCL_SHM_DISABLE=1 \
NCCL_P2P_DISABLE=1 \
/opt/conda/envs/<env>/bin/accelerate launch \
  --num_processes=4 \
  --mixed_precision=bf16 \
  scripts/train_xxx.py \
  --config configs/xxx.yaml \
  > "$LOG" 2>&1 &
echo "PID: $! LOG: $LOG"
```

---

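## 3. PyYAML Config File Pitfalls

### Symptom

```
TypeError: '<=' not supported between instances of 'float' and 'str'
```

The config clearly contains a number, yet PyYAML loads it as a string.

### Root cause

**PyYAML loads scientific notation written without a decimal point (such as `1e-3` or `3e-4`) as a string, not a float.**

PyYAML resolves floats by the YAML 1.1 rules, which require a dot in the mantissa and an explicit sign in the exponent. The problem was hit here with PyYAML 6.x, but it follows from the resolver rules and not from a particular release, so avoid the dot-less form regardless of version.

### Fix

Rewrite every value in scientific notation as a plain decimal (a form such as `1.0e-3`, with a dot and a signed exponent, is also loaded as a float):

```yaml
# ❌ Loaded as strings
lr: 1e-3
lr: 3e-4

# ✅ Correct
lr: 0.001
lr: 0.0003
```

### Quick check

```python
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

print(type(cfg["lr"]))  # expect <class 'float'>; <class 'str'> means the value was misparsed
```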
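To catch this across a whole config instead of a single key, a small walker can flag every string value that `float()` would accept. This is a sketch under assumptions: `find_numeric_strings` is a hypothetical helper, and it works on an already-loaded dictionary, so it says nothing about how the project's own config loader behaves.

```python
def find_numeric_strings(node, path=""):
    """Yield (path, value) for string values that look like numbers.

    Hypothetical helper: walks nested dicts/lists as returned by a YAML loader.
    """
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else str(key)
            yield from find_numeric_strings(value, child)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_numeric_strings(value, f"{path}[{index}]")
    elif isinstance(node, str):
        try:
            float(node)
        except ValueError:
            return  # an ordinary string, nothing to report
        yield path, node


# Example config as it would look after `lr: 1e-3` was loaded as a string.
cfg = {"optim": {"lr": "1e-3", "betas": [0.9, "0.999"]}, "name": "ppo"}
for where, raw in find_numeric_strings(cfg):
    print(f"{where} = {raw!r} is a string, expected a number")
```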

---

## 4. Server File Transfer (No rsync Available)

### Background

- The local machine runs Windows; the target is a Linux GPU server
- Local WSL has no `rsync`, and PowerShell has no native rsync
- There are many files, so a plain `scp -r` is slow and awkward for incremental syncs

### Recommended approach: pack with tar, then scp a single file

**Pack locally (PowerShell):**

```powershell
# Pack the project code (excluding datasets, checkpoints and caches)
tar -czf sync_v4.tar.gz `
  -C "D:\Myresearch\CompanionGuard-RL\code\CompanionGuard-RL" `
  --exclude=".git" --exclude="__pycache__" `
  --exclude="checkpoints" --exclude="experiments" `
  src scripts configs requirements.txt

# Upload through sshpass inside WSL (PowerShell continues lines with a backtick, not a backslash)
wsl -d Ubuntu-24.04 -- sshpass -p 'PASSWORD' scp -P PORT `
  /mnt/d/Myresearch/CompanionGuard-RL/sync_v4.tar.gz `
  root@HOST:/remote/path/
```
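Where Python is available locally, the same packing step can be reproduced with the standard library's `tarfile` module. The following is a minimal sketch, not the project's actual tooling: `make_sync_archive`, `_skip_excluded` and `EXCLUDED_NAMES` are hypothetical names, and the exclusion list simply mirrors the `--exclude` flags shown above.

```python
import tarfile
from pathlib import Path

# Mirrors the --exclude flags of the tar command (assumed list).
EXCLUDED_NAMES = {".git", "__pycache__", "checkpoints", "experiments"}


def _skip_excluded(info):
    """tarfile filter: drop any member whose path contains an excluded name."""
    if EXCLUDED_NAMES.intersection(Path(info.name).parts):
        return None  # excluded; excluded directories are not descended into
    return info


def make_sync_archive(project_root, members, out_path):
    """Pack `members` (paths relative to project_root) into a gzip tarball."""
    root = Path(project_root)
    with tarfile.open(out_path, "w:gz") as tar:
        for name in members:
            # arcname keeps archive paths relative, like `tar -C <root> <name>`.
            tar.add(root / name, arcname=name, filter=_skip_excluded)
    return out_path


# Example call (paths are placeholders):
# make_sync_archive("D:/path/to/project", ["src", "scripts", "configs"], "sync_v4.tar.gz")
```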
**Extract on the server (overwrites existing files):**

```bash
cd /remote/project/dir
tar -xzf ../sync_v4.tar.gz --strip-components=0
```

### Converting a Windows path to a WSL path

```
D:\Myresearch\... → /mnt/d/Myresearch/...
```
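The mapping is mechanical, so it can be scripted when paths are generated programmatically. The function below is an illustrative sketch (`windows_to_wsl_path` is a hypothetical name) that assumes the default `/mnt/<drive>` mount point and handles drive-letter paths only. Inside WSL, the `wslpath` tool should perform the same conversion.

```python
from pathlib import PureWindowsPath


def windows_to_wsl_path(win_path):
    """Convert a drive-letter Windows path to its /mnt/<drive>/... WSL form."""
    p = PureWindowsPath(win_path)
    if len(p.drive) != 2 or p.drive[1] != ":":
        raise ValueError(f"expected a drive-letter path, got {win_path!r}")
    drive = p.drive[0].lower()
    # p.parts[0] is the anchor (e.g. 'D:\\'); the rest are path components.
    return "/".join(["/mnt", drive, *p.parts[1:]])


print(windows_to_wsl_path(r"D:\Myresearch\CompanionGuard-RL\sync_v4.tar.gz"))
```

UNC paths such as `\\host\share` are rejected with a `ValueError` because they have no `/mnt/<drive>` equivalent under this assumption.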

### Using sshpass in WSL

```bash
# Install
sudo apt-get install sshpass

# Pass the password directly as an argument (protect the password when this goes into a script)
sshpass -p 'PASSWORD' ssh -p PORT user@host 'command'
sshpass -p 'PASSWORD' scp -P PORT local_file user@host:/remote/path/
```

---

## 5. SSH Connections and Persistent Sessions

### nohup vs tmux

| Method | Pros | Cons |
|------|------|------|
| `nohup ... &` | Simple | Under non-interactive SSH, a nohup process sometimes still receives SIGHUP and exits when the connection drops; there is no way to re-attach and watch the output |
| `tmux` | Persistent session, attach/detach at will, output viewable at any time | tmux must be installed on the server |

**tmux is recommended:**

```bash
# Create a new session and start training
tmux new-session -d -s train 'PYTHONPATH=... accelerate launch ...'

# List all sessions
tmux ls

# Re-attach to watch the output
tmux attach -t train

# Run a command inside the session (without attaching)
tmux send-keys -t train 'tail -f experiments/latest.log' Enter
```

### SSH connection closed although ping works (kex_exchange_identification)

Symptom: the TCP port is open and ping succeeds, but SSH is closed before the handshake:

```
kex_exchange_identification: Connection closed by remote host
```

Possible causes and remedies:

1. **sshd crashed or is restarting** → run `systemctl restart sshd` from the web console (VNC)
2. **MaxStartups limit** → temporarily raise `MaxStartups 10:30:60` in sshd_config
3. **fail2ban banned the IP** → `fail2ban-client status sshd`, then `fail2ban-client set sshd unbanip <IP>`

---

## 6. Python Dependencies and Missing Packages

### Installing packages when the server has no network

**Method 1: copy from an existing conda environment**

```bash
# Find the package in other environments
find /opt/conda/envs -name "gymnasium" -type d 2>/dev/null

# Copy it straight into the target environment (only this package directory, not its dependencies)
cp -r /opt/conda/envs/other-env/lib/python3.10/site-packages/gymnasium \
  /opt/conda/envs/target-env/lib/python3.10/site-packages/
```

**Method 2: download wheels locally, transfer with scp, install offline**

```powershell
# Download locally (PowerShell; lines continue with a backtick)
pip download -d D:\wheels --platform linux_x86_64 --python-version 310 `
  --only-binary=:all: gymnasium
# After copying the wheels to the server with scp, run on the server:
pip install --no-index --find-links=/path/to/wheels gymnasium
```

### Checking that a package is usable

```bash
python -c "import gymnasium; print(gymnasium.__version__)"
python -c "import torch; print(torch.cuda.device_count())"
```

---

## 7. Tensor Device Consistency in Distributed Training

### Symptom

```
RuntimeError: No backend type associated with device type cpu
```

A CPU tensor was passed to a collective operation such as `torch.distributed.broadcast()`.

### Root cause

**The NCCL backend only supports CUDA tensors.** Every tensor taking part in `broadcast/all_reduce/gather` must live on the GPU.

### Fix pattern

```python
import torch
import torch.distributed as dist

dev = accelerator.device  # CUDA device of the current rank

# Broadcast the row count; only rank 0 is assumed to hold `data` at this point
n_rows = data.shape[0] if accelerator.is_main_process else 0
size_tensor = torch.tensor([n_rows], dtype=torch.long, device=dev)
dist.broadcast(size_tensor, src=0)
n = size_tensor.item()

# Broadcast the data; `data_dim` is assumed to be known on every rank
if accelerator.is_main_process:
    data = data.to(dev)
else:
    data = torch.zeros(n, data_dim, device=dev)  # the receive buffer must be on the GPU

dist.broadcast(data, src=0)
# Call .cpu() afterwards if the result is needed on the CPU
```

### Key principles

- Collective communication (broadcast/all_reduce/scatter) → **must use CUDA tensors**
- DataLoader input → **CPU tensors** (unless `pin_memory=False`)
- After computing on the GPU, call `.cpu()` explicitly before putting results into a CPU DataLoader

---

## 8. DataLoader Compatibility with Distributed Training

### The pin_memory pitfall

```
RuntimeError: cannot pin torch.cuda.FloatTensor
```

`DataLoader(pin_memory=True)` requires the data to be **CPU tensors**; passing tensors that are already on the GPU raises this error.

**Fix: move tensors to the CPU before building the TensorDataset**

```python
# ❌ Crashes if obs_tensor is on the GPU
dataset = TensorDataset(obs_tensor, action_tensor)
loader = DataLoader(dataset, pin_memory=True)

# ✅ Call .cpu() first
dataset = TensorDataset(obs_tensor.cpu(), action_tensor.cpu())
loader = DataLoader(dataset, pin_memory=True)
```

### Guarding set_epoch

```
AttributeError: 'SequentialSampler' object has no attribute 'set_epoch'
```

Only `DistributedSampler` provides `set_epoch`; `SequentialSampler` does not.

**Fix: add a hasattr guard**

```python
# ❌ Unconditional call
loader.sampler.set_epoch(epoch)

# ✅ Safe form
if hasattr(loader.sampler, "set_epoch"):
    loader.sampler.set_epoch(epoch)
```
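The guard can be wrapped in a small helper so that every training loop applies it the same way. The sketch below is illustrative only: `set_epoch_if_supported` is a hypothetical name, and the three stand-in classes replace real PyTorch objects so that the snippet runs without `torch`.

```python
def set_epoch_if_supported(loader, epoch):
    """Call sampler.set_epoch(epoch) when the sampler provides it.

    Returns True if the epoch was set, False otherwise.
    """
    sampler = getattr(loader, "sampler", None)
    set_epoch = getattr(sampler, "set_epoch", None)
    if callable(set_epoch):
        set_epoch(epoch)
        return True
    return False


# Stand-ins for a DataLoader and its samplers (assumptions, not torch classes).
class FakeDistributedSampler:
    def __init__(self):
        self.epoch = 0

    def set_epoch(self, epoch):
        self.epoch = epoch


class FakeSequentialSampler:
    pass


class FakeLoader:
    def __init__(self, sampler):
        self.sampler = sampler


print(set_epoch_if_supported(FakeLoader(FakeDistributedSampler()), 3))
print(set_epoch_if_supported(FakeLoader(FakeSequentialSampler()), 3))
```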

---

## 9. Loading Models on an Offline Server

### Symptom

```
OSError: Can't load tokenizer for 'hfl/chinese-macbert-large'.
```

The server cannot reach HuggingFace, so the online download fails.

### Fix

**Method 1: download locally, then scp**

```powershell
# Download locally
python -c "
from huggingface_hub import snapshot_download
snapshot_download('hfl/chinese-macbert-large', local_dir='D:/models/macbert-large')
"
# Upload to the server
scp -P PORT -r D:\models\macbert-large root@HOST:/remote/models/macbert-large
```

**Method 2: use a mirror in mainland China (if the server can reach it)**

```bash
HF_ENDPOINT=https://hf-mirror.com \
  python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('hfl/chinese-macbert-large')"
```

**Update the config file:**

```yaml
# Replace the HuggingFace model id with a local absolute path
model_name: "/root/path/to/macbert-large"
```

---

## 10. Cross-Platform Shell Script Issues (CRLF)

### Symptom

```
/bin/bash^M: bad interpreter: No such file or directory
```

Alternatively, the script exits immediately after starting without printing any error.

### Root cause

`.sh` files edited or saved on Windows use CRLF (`\r\n`) line endings. Linux only recognizes LF (`\n`), so the `^M` (that is, `\r`) is treated as part of the command.

### Fixes

**Force LF when writing the file from PowerShell:**

```powershell
$content = @'
#!/bin/bash
cd /project/dir
ACCEL=/path/to/accelerate
nohup $ACCEL launch ... > log.txt 2>&1 &
echo "PID: $!"
'@
# Key points: strip \r with Replace, and write UTF-8 without a BOM
[System.IO.File]::WriteAllText(
    "D:\path\to\script.sh",
    $content.Replace("`r`n", "`n"),
    [System.Text.UTF8Encoding]::new($false)
)
```

**Repair after the fact (on the Linux server):**

```bash
sed -i 's/\r$//' script.sh
# or
dos2unix script.sh
```

**Verify:**

```bash
file script.sh  # the output should not mention "with CRLF line terminators"
```
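The same detection and repair can be done from Python, for example as a pre-upload step in a sync script. This is a minimal sketch with hypothetical helper names (`has_crlf`, `to_lf`, `normalize_script`). It works on bytes so that the file encoding is never touched.

```python
from pathlib import Path


def has_crlf(data: bytes) -> bool:
    """Return True if the byte string contains any CRLF line ending."""
    return b"\r\n" in data


def to_lf(data: bytes) -> bytes:
    """Convert CRLF line endings to LF, leaving lone LF untouched."""
    return data.replace(b"\r\n", b"\n")


def normalize_script(path) -> bool:
    """Rewrite `path` with LF endings; return True if the file was changed."""
    p = Path(path)
    original = p.read_bytes()
    if not has_crlf(original):
        return False
    p.write_bytes(to_lf(original))
    return True


print(to_lf(b"#!/bin/bash\r\necho hi\r\n"))
```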

---

## 11. Python Module Path (PYTHONPATH)

### Symptom

```
ModuleNotFoundError: No module named 'src'
```

The project layout has `src/models/`, yet `from src.models import ...` in a script cannot find it.

### Root cause

Python puts the directory of the launched script (for example `scripts/`) on `sys.path`, not the project root, and the working directory of processes started by `accelerate launch` / `torchrun` is not necessarily the project root either. As a result `sys.path` does not contain the project root.

### Fixes

**Option 1: set PYTHONPATH at launch (recommended)**

```bash
PYTHONPATH=/root/path/to/project accelerate launch scripts/train.py
```

**Option 2: add the path dynamically at the top of the script**

```python
import os
import sys

# Two levels up from scripts/<this file> is the project root
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
```

**Option 3: add `__init__.py` to the project root (not recommended, it pollutes the namespace)**

### The nohup + non-interactive SSH trap (verified 2026-05-20)

**Symptom:**

```
ModuleNotFoundError: No module named 'src'
```

It reproduces when a local script or Claude Code sends `ssh server5090 'nohup accelerate launch ...'`. Logging in with `ssh` plus attach and running the same command by hand in an interactive shell does not fail.

**Root cause:**

`accelerate launch` starts several worker processes through `torch.distributed.run`. Each worker is a fresh Python interpreter, so it **does not inherit the `sys.path` of its parent process**; the project root can only reach it through the environment (`PYTHONPATH`) and the working directory. In a non-interactive SSH session `~/.bashrc` is not necessarily sourced, so `PYTHONPATH` was never set.

**Fix: `cd` and set `PYTHONPATH` in the same command**

```bash
# ✅ Correct: cd and PYTHONPATH on the same command line, so the accelerate workers inherit both
ssh server5090 "cd $PROJ && PYTHONPATH=$PROJ NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 NCCL_SHM_DISABLE=1 \
  nohup /opt/conda/envs/dlapo-py310-cu128/bin/accelerate launch \
  --num_processes=4 --mixed_precision=bf16 \
  scripts/train_detector.py --config configs/xxx.yaml \
  > experiments/train.log 2>&1 &"

# ❌ Failed in testing (note that this form also lacks the cd and uses a bare `accelerate`)
ssh server5090 "export PYTHONPATH=$PROJ; nohup accelerate launch ..."
```
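When the command is generated by a local script, quoting mistakes are easy to make. The sketch below builds the same one-line remote command with `shlex.quote`. It is an illustration under assumptions: `build_remote_launch` and its parameters are hypothetical, and the flag set simply mirrors the working command above.

```python
import shlex


def build_remote_launch(project_dir, accelerate_bin, script, config, log_path,
                        num_processes=4):
    """Build one shell command string: cd, env assignments, then nohup launch."""
    env = {
        "PYTHONPATH": project_dir,
        "NCCL_P2P_DISABLE": "1",
        "NCCL_IB_DISABLE": "1",
        "NCCL_SHM_DISABLE": "1",
    }
    env_part = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
    launch = " ".join(
        shlex.quote(part)
        for part in [
            accelerate_bin, "launch",
            f"--num_processes={num_processes}", "--mixed_precision=bf16",
            script, "--config", config,
        ]
    )
    # The env assignments prefix nohup, so the launcher and its workers inherit them.
    return (
        f"cd {shlex.quote(project_dir)} && {env_part} "
        f"nohup {launch} > {shlex.quote(log_path)} 2>&1 &"
    )


cmd = build_remote_launch(
    "/path/to/project",
    "/opt/conda/envs/dlapo-py310-cu128/bin/accelerate",
    "scripts/train_detector.py",
    "configs/xxx.yaml",
    "experiments/train.log",
)
print(cmd)
# The string could then be sent with: subprocess.run(["ssh", "server5090", cmd])
```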

**Note**: `NCCL_IB_DISABLE=1` must be added as well (an InfiniBand compatibility issue on the RTX 5090), not only `NCCL_P2P_DISABLE` and `NCCL_SHM_DISABLE`.

---

## 12. Handling Optional Dependencies Gracefully (wandb etc.)

### Background

`wandb` has a complex dependency tree (`sentry-sdk`, `setproctitle`, and so on) and is hard to install in restricted environments.

### Recommended pattern: try/except import plus a feature switch

**Import section:**

```python
try:
    import wandb
    WANDB_AVAILABLE = True
except ImportError:
    wandb = None
    WANDB_AVAILABLE = False
```

**Usage section:**

```python
if use_wandb and WANDB_AVAILABLE:
    wandb.log({"loss": loss})
elif use_wandb and not WANDB_AVAILABLE:
    if step == 0:  # warn only once
        print("[WARN] wandb not available, skipping logging")
```
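If the same check is needed in several training scripts, the pattern can be folded into one small wrapper. The following is a sketch, not the project's code: `optional_import` and `MetricLogger` are hypothetical names, and the wrapper only assumes that the backend module exposes a `log(dict)` function, as `wandb` does.

```python
import importlib


def optional_import(name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


class MetricLogger:
    """Forward metrics to an optional backend; warn once if it is missing."""

    def __init__(self, use_wandb, module_name="wandb"):
        self._wanted = bool(use_wandb)
        # Only attempt the import when logging was actually requested.
        self.backend = optional_import(module_name) if self._wanted else None
        self._warned = False

    def log(self, metrics):
        if not self._wanted:
            return
        if self.backend is not None:
            self.backend.log(metrics)
        elif not self._warned:
            print("[WARN] wandb not available, skipping logging")
            self._warned = True


logger = MetricLogger(use_wandb=False)
logger.log({"loss": 0.5})  # no-op because logging is switched off
```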

**Config file:**

```yaml
# Production / restricted environment
use_wandb: false

# Development environment
use_wandb: true
```

With this pattern training runs normally even when wandb is not installed, instead of crashing on a single `import wandb` line.

---

## 13. HuggingFace `hf download` Stalls on Large Files and the curl Workaround

### Symptom A: a stale lock makes new processes wait forever

```
Still waiting to acquire lock on .../wildguard/.cache/huggingface/download/model-00001-of-00002.safetensors.lock (elapsed: 600.0 seconds)
```

`hf download` writes a per-file download lock to `<local-dir>/.cache/huggingface/download/*.lock`. If the previous download process crashed or was killed, the lock file is left behind, and a new process keeps waiting without downloading any data.

**Fix:**

```bash
# 1. Find and kill every leftover hf process (this server has no pgrep, so use ps aux)
ps aux | grep 'hf download' | grep -v grep | awk '{print $2}' | xargs kill 2>/dev/null

# 2. Delete the stale locks (this only removes .cache and leaves finished files untouched)
rm -rf <local-dir>/.cache
```
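Before running step 2, it can help to see how much partial data the cache still holds. The sketch below lists `*.incomplete` files with their sizes. `list_incomplete` is a hypothetical helper, and the directory layout is taken from the lock path shown in the symptom.

```python
from pathlib import Path


def list_incomplete(local_dir):
    """Return (path, size_in_bytes) for each partial download, largest first."""
    download_dir = Path(local_dir) / ".cache" / "huggingface" / "download"
    if not download_dir.is_dir():
        return []
    found = [(p, p.stat().st_size) for p in download_dir.rglob("*.incomplete")]
    return sorted(found, key=lambda item: item[1], reverse=True)


# "." stands for <local-dir>; replace it with the real download directory.
for path, size in list_incomplete("."):
    print(f"{size / 2**20:8.1f} MiB  {path.name}")
```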
⚠️ **Note**: `hf download` first writes each file to `.cache/.../download/*.incomplete` and only moves it to its final path once it is complete. Before deleting `.cache`, run `ls -lh .cache/huggingface/download/*.incomplete` to check for large files that are nearly finished. If there are any, deleting `.cache` discards that progress.

### Symptom B: connections for large files hang (wget / hf download)

```
Connecting to huggingface.co (huggingface.co)|173.252.108.3|:443...
```

The download stays stuck at connecting, or the `.incomplete` file sits at 8KB for a long time without growing.

**Root cause**: connections made by `wget -e use_proxy=yes` and by `hf download` are unstable for large files going through the HTTPS proxy. Small files (<1MB) get through, while large shards (>1GB) hang.

### Fix: download large files directly with `curl --proxy`

```bash
# Supports resuming (-C -) and follows CDN redirects (-L)
curl -L \
  --proxy http://127.0.0.1:7890 \
  -H "Authorization: Bearer <HF_TOKEN>" \
  -C - \
  "https://huggingface.co/<org>/<repo>/resolve/main/<filename>" \
  -o /path/to/<filename>

# Run in the background
nohup curl -L --proxy http://127.0.0.1:7890 \
  -H "Authorization: Bearer <HF_TOKEN>" -C - \
  "https://huggingface.co/allenai/wildguard/resolve/main/model-00001-of-00002.safetensors" \
  -o /path/to/wildguard/model-00001-of-00002.safetensors \
  > /tmp/curl_dl.log 2>&1 &
```

### Use `hf download` for small files and `curl` for large ones

| File type | Recommended method |
|---------|---------|
| tokenizer/config etc. (<10MB) | batch download with `hf download` |
| model shard (>1GB) | one at a time with `curl -L --proxy ... -C -` |

### HF file URL pattern

```
https://huggingface.co/<org>/<repo>/resolve/main/<filename>
```
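When several shards have to be fetched one by one, the URLs can be generated from this pattern. The helper below is an illustrative sketch (`hf_resolve_url` is a hypothetical name). It assumes a model repository and the default `main` revision, and it percent-encodes path components as a precaution.

```python
from urllib.parse import quote


def hf_resolve_url(repo_id, filename, revision="main",
                   endpoint="https://huggingface.co"):
    """Build the direct-download URL for one file in a model repository."""
    # Keep '/' in repo_id and filename so "<org>/<repo>" and nested paths stay intact.
    repo_part = quote(repo_id, safe="/")
    file_part = quote(filename, safe="/")
    rev_part = quote(revision, safe="")
    return f"{endpoint}/{repo_part}/resolve/{rev_part}/{file_part}"


for index in (1, 2):
    shard = f"model-{index:05d}-of-00002.safetensors"
    print(hf_resolve_url("allenai/wildguard", shard))
```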

---

## Appendix: Server Quick Reference for This Project

| Item | Value |
|------|-----|
| SSH | `ssh -p 22657 root@connected.svt.net.cn` |
| Backup SSH | `ssh -p 20083 root@10.82.3.180` |
| Password | `yx123456` |
| conda environment | `dlapo-py310-cu128` |
| accelerate path | `/opt/conda/envs/dlapo-py310-cu128/bin/accelerate` |
| Project directory | `/root/siton-data-2849d4ce327c4ccfb233ce33868fe7fe/zsy/CompanionGuard-RL` |
| MacBERT local path | `/root/siton-data-2849d4ce327c4ccfb233ce33868fe7fe/zsy/macbert-large` |