feat: add paper/ LaTeX draft, English data scripts, update progress docs

- paper/: 22-page LaTeX framework (7/10 sections complete, compiles cleanly)
  main.tex + 10 section files + refs.bib + compiled PDF (329KB)
- code/scripts/: three English dataset generation & merging scripts
  generate_english.py / generate_english_targeted.py / merge_v5.py
- CLAUDE.md: update paper writing status, add paper/ file map entry
- state.md: add section 8 paper writing progress (2026-05-15)
- .gitignore: add LaTeX build artifact exclusion rules

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 11:19:39 +08:00
parent b50cf395ab
commit 804ebd2f77
19 changed files with 3047 additions and 3 deletions
--- a/paper/sections/02_related.tex
+++ b/paper/sections/02_related.tex
@@ -0,0 +1,85 @@
+% ============================================================
+\section{Related Work}
+\label{sec:related}
+% ============================================================
+
+\subsection{Safety Evaluation of AI Companion Platforms}
+
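+Wei et al.~\cite{wei2025ai} constructed the first safety benchmark for
+AI character platforms (Character.AI, Xingye, etc.) and analyzed how well
+these platforms guard against general harmful content (violence, sexual
+content, and inducement of self-harm). Its taxonomy, however, focuses on
+explicit harmful content and does not cover relational risks (such as
+dependency reinforcement and isolation from real life), and its evaluation
+protocol addresses detection only, not intervention strategies.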
+
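+Juneja and Lomidze~\cite{juneja2025persona} analyzed the safety behavior
+of AI (support/refuse/redirect) in persona-driven multi-turn dialogue
+and verified that persona settings significantly affect AI safety responses.
+Their framework, however, does not model the intervention strategy as an
+optimizable decision problem.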
+
+\subsection{Safety of Mental Health AI}
+
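+VERA-MH~\cite{bentley2025vera} targets mental health chatbots (not companion AI)
+and evaluates the reliability of LLM responses from a clinical safety perspective.
+It differs from this paper in focus: VERA-MH is concerned with the accuracy of
+clinical information on the user side, whereas this paper is concerned with
+relational risks on the AI output side, in particular implicit risk behaviors
+that arise only in multi-turn intimate-relationship contexts.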
+
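+The CLPsych series~\cite{zirikly2019clpsych}, MentalLLaMA~\cite{yang2023mentallama},
+SHINES~\cite{ghosh2025shines}, and related studies
+take user-posted social media text as their object and detect the users' own psychological risk.
+The detection target of this paper is instead risk behavior on the \textit{AI output side}:
+whether AI responses amplify, induce, or normalize the user's dangerous state.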
+
+\subsection{General-Purpose LLM Safety Detection}
+
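+Llama Guard~\cite{inan2023llama} and Llama Guard 3~\cite{dubey2024llama3}
+are fine-tuned LLM classifiers that perform safety detection against general
+hazard taxonomies (Llama Guard 3 adopts the taxonomy defined by MLCommons).
+WildGuard~\cite{han2024wildguard} extends this line of work with jailbreak attack detection.
+Aegis 2.0~\cite{ghosh2025aegis} provides a finer-grained hazard taxonomy (14 categories)
+and releases a relatively large annotated dataset.
+The OpenAI Moderation API~\cite{openai2022moderation} offers general content moderation as a black-box service.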
+
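+All of these models are designed for general-purpose LLM safety. Their safety
+taxonomies contain no labels for companion-specific relational risks,
+and they provide only detection verdicts, with no intervention decision mechanism.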
+
+\subsection{Safety Evaluation Benchmarks}
+
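+SALAD-Bench~\cite{li2024saladbench} and HarmBench~\cite{mazeika2024harmbench}
+provide large-scale safety evaluation frameworks for general-purpose LLMs,
+covering scenarios such as jailbreak attacks and harmful content generation.
+They differ from this paper in scope: these benchmarks target general-purpose LLMs
+and evaluate responses to harmful-content requests in single-turn or few-turn settings,
+whereas this paper addresses cumulative relational risk in multi-turn intimate interaction.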
+
+\subsection{RL for NLP Safety}
+
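+Reinforcement learning has been widely applied to dialogue system optimization\citeneeded,
+and RLHF (reinforcement learning from human feedback)~\cite{ouyang2022instructgpt}
+is used to align the safety preferences of large language models.
+Module C of this paper models intervention action selection as an offline RL problem
+whose multi-objective reward combines safety recall, an over-refusal penalty, and a user experience cost.
+Its objective complements rather than overlaps with that of RLHF:
+RLHF optimizes the quality of AI generation, whereas this paper optimizes the intervention decisions of the safety guard layer.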
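+
+% NOTE: illustrative sketch only. The symbols y_t, a_pass, c(.) and the
+% weights lambda_* are assumptions introduced here for exposition; they are
+% not taken from the Module C definition.
+As a minimal sketch of how such a multi-objective reward could be scalarized
+(an assumed form, not necessarily the one used in Module C), let $s_t$ be the
+dialogue state, $a_t$ the chosen intervention action, $a_{\mathrm{pass}}$ a
+hypothetical ``no intervention'' action, $y_t \in \{0,1\}$ a hypothetical
+ground-truth risk label, and $c(a_t) \ge 0$ a user experience cost:
+\[
+r(s_t, a_t) =
+  \lambda_{\mathrm{rec}} \, \mathbf{1}[\, y_t = 1,\ a_t \neq a_{\mathrm{pass}} \,]
+  - \lambda_{\mathrm{ref}} \, \mathbf{1}[\, y_t = 0,\ a_t \neq a_{\mathrm{pass}} \,]
+  - \lambda_{\mathrm{ux}} \, c(a_t),
+\]
+where the first term rewards intervening on risky turns (safety recall),
+the second penalizes intervening on benign turns (over-refusal), and the
+nonnegative weights $\lambda_{\mathrm{rec}}, \lambda_{\mathrm{ref}}, \lambda_{\mathrm{ux}}$
+set the trade-off among the three objectives.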
+
+\subsection{Positioning Relative to Prior Work}
+
+\begin{table}[ht]
+\centering
+\caption{Comparison of this work with representative related work}
+\label{tab:related_compare}
+\resizebox{\textwidth}{!}{%
+\begin{tabular}{lccccl}
+\toprule
+Work & Companion setting & Relational risk & Intervention decision & Chinese & Notes \\
+\midrule
+Wei et al.~\cite{wei2025ai} & \checkmark & $\times$ & $\times$ & Partial & Platform-level safety benchmark \\
+Juneja \& Lomidze~\cite{juneja2025persona} & \checkmark & Partial & $\times$ & $\times$ & Behavior analysis, no optimization \\
+VERA-MH~\cite{bentley2025vera} & $\times$ & $\times$ & $\times$ & $\times$ & Mental health chatbots \\
+Llama Guard~\cite{inan2023llama} & $\times$ & $\times$ & $\times$ & $\times$ & General content safety \\
+WildGuard~\cite{han2024wildguard} & $\times$ & $\times$ & $\times$ & $\times$ & General content safety \\
+\textbf{This work (CompanionGuard-RL)} & \checkmark & \checkmark & \checkmark & \checkmark & Unified detection and intervention framework \\
+\bottomrule
+\end{tabular}
+}
+\end{table}