CompanionGuard-RL/paper/sections/00_abstract.tex
zhangsiyuan 804ebd2f77 feat: add paper/ LaTeX draft, English data scripts, update progress docs
- paper/: 22-page LaTeX framework (7/10 sections complete, compiles cleanly)
  main.tex + 10 section files + refs.bib + compiled PDF (329KB)
- code/scripts/: three English dataset generation & merging scripts
  generate_english.py / generate_english_targeted.py / merge_v5.py
- CLAUDE.md: update paper writing status, add paper/ file map entry
- state.md: add section 8 paper writing progress (2026-05-15)
- .gitignore: add LaTeX build artifact exclusion rules

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 11:19:39 +08:00

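% Abstract
The rapid adoption of emotional-companion AI platforms such as Xingye (星野) and Character.AI has introduced distinctive safety challenges.
Existing guard models detect only generic harmful content and systematically miss the relational risks
specific to emotional-companion scenarios (dependency reinforcement, isolation reinforcement, failure to respond to crises, and others);
more critically, existing solutions stop at detection and provide no mechanism for deciding on interventions tailored to different risk situations.
This paper proposes \textbf{CompanionGuard-RL}, the first framework to model companion-AI safety as a
unified ``detection + adaptive intervention'' pipeline.
The framework consists of two modules connected in series:
(1) Module B, a context-aware risk detector based on MacBERT-Large and a cross-attention mechanism,
which on our self-built evaluation set CompanionRisk-Bench (9,896 samples covering 10 top-level risk categories and 14 fine-grained labels)
achieves binary F1 = 0.9995 and a false negative rate (FNR) of 0.0\%;
(2) Module C, an adaptive intervention policy trained with a behavior-cloning warm-up followed by PPO reinforcement learning,
which, on safety recall (safety\_recall = 1.0) and on the combined safety-experience score (UX F-score = 0.998),
significantly outperforms the rule-based baseline (0.908/0.952).
Ablation experiments demonstrate the necessity of both cross-attention context fusion and RL policy optimization.
The CompanionRisk-Bench dataset and the framework code will be publicly released
to advance research on the safety of emotional-companion AI.
\vspace{0.5em}
\noindent\textbf{Keywords:} emotional-companion AI; safety detection; reinforcement learning; risk intervention; content safety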
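
\vspace{0.5em}
\noindent\textbf{Note on the detection metrics.} As a reading aid, the Module B figures can be related through the standard confusion-matrix definitions. This is only a sketch under two assumptions that the text does not state: that binary F1 is computed on the risk-positive class, and that the reported FNR of 0.0\% is treated as exactly zero. The symbols $TP$, $FP$, $FN$, $P$ (precision), and $R$ (recall) are introduced here for illustration.
\[
\mathrm{FNR} = \frac{FN}{TP + FN}, \qquad
R = 1 - \mathrm{FNR}, \qquad
P = \frac{TP}{TP + FP}, \qquad
F_1 = \frac{2PR}{P + R}.
\]
With $R = 1$ this reduces to $F_1 = 2P/(1 + P)$, so $F_1 = 0.9995$ would correspond to a precision of $P = F_1/(2 - F_1) \approx 0.9990$. Under these assumptions, the residual error would consist of false positives rather than missed risks.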