% ============================================================
\section{Experiments}
\label{sec:experiments}
% ============================================================

\subsection{Experimental Setup}

\subsubsection{Evaluation Set}

All experiments are run on the CompanionRisk-Bench test set ($n = 1{,}486$).
To assess generalization, Module B is additionally evaluated on the non-homogeneous subset
(393 real human--AI conversations), and those results are reported separately.

\subsubsection{Evaluation Metrics}

\textbf{Detection task (Module B)}:
\begin{itemize}
    \item Binary F1 (F1 for the binary risky / non-risky classification)
    \item High-risk Recall (recall on high-risk samples, $y_{\text{risk}}=1$)
    \item False Negative Rate (FNR) (miss rate)
    \item Level Weighted F1 (weighted F1 over the 5 risk levels)
    \item Fine Macro F1 (macro-averaged F1 over the 14 fine-grained labels)
\end{itemize}
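
For reference, the detection metrics listed above can be written with the usual confusion-matrix counts. The symbols $\mathrm{TP}$, $\mathrm{FP}$, $\mathrm{FN}$ (counts for the risky class), $K$ (number of classes), $n_k$ (support of class $k$), and $n$ (total number of samples) are notation introduced here for illustration. The definitions are the standard ones and are assumed, not confirmed, to match the evaluation scripts:
\begin{align*}
P &= \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}, &
R &= \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}, \\
\mathrm{FNR} &= \frac{\mathrm{FN}}{\mathrm{TP}+\mathrm{FN}} = 1 - R, &
\mathrm{F1} &= \frac{2PR}{P+R}, \\
\mathrm{F1}_{\text{macro}} &= \frac{1}{K}\sum_{k=1}^{K}\mathrm{F1}_k, &
\mathrm{F1}_{\text{weighted}} &= \sum_{k=1}^{K}\frac{n_k}{n}\,\mathrm{F1}_k .
\end{align*}
Under this reading, Fine Macro F1 uses $K=14$ and Level Weighted F1 uses $K=5$. If FNR is computed on the $y_{\text{risk}}=1$ class, it equals one minus High-risk Recall.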

\textbf{Intervention task (Module C)}:
\begin{itemize}
    \item Safety Recall (fraction of high-risk L3/L4 samples that are correctly intervened on)
    \item Over-refusal Rate (fraction of safe L0 samples that receive an intervention of REWRITE or stronger)
    \item Action Accuracy (agreement rate with the annotated recommended action $a_{\text{recommend}}$)
    \item Crisis Precision (fraction of CRISIS actions that are issued on L4 samples)
    \item Safety-UX F-score (a score derived from the harmonic mean of safety recall and over-refusal rate)
\end{itemize}
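
The intervention metrics can be sketched in the same way. Let $l_i \in \{0,\dots,4\}$ be the risk level of sample $i$ (assuming the five levels are L0 to L4), $a_i$ the action chosen by the policy, and $a_i^{\text{rec}}$ the annotated action $a_{\text{recommend}}$. The indicator $c_i$ (whether the action on sample $i$ counts as a correct intervention) and the severity ordering $\succeq$ over actions are hypothetical notation, since the exact criteria are not spelled out in this section:
\begin{align*}
\mathrm{SR} &= \frac{\lvert\{i : l_i \geq 3,\ c_i = 1\}\rvert}{\lvert\{i : l_i \geq 3\}\rvert}, &
\mathrm{ORR} &= \frac{\lvert\{i : l_i = 0,\ a_i \succeq \text{REWRITE}\}\rvert}{\lvert\{i : l_i = 0\}\rvert}, \\
\mathrm{Acc} &= \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\bigl[a_i = a_i^{\text{rec}}\bigr], &
\mathrm{CP} &= \frac{\lvert\{i : a_i = \text{CRISIS},\ l_i = 4\}\rvert}{\lvert\{i : a_i = \text{CRISIS}\}\rvert}.
\end{align*}
One plausible form of the Safety-UX F-score, assumed here for illustration rather than taken from the implementation, is the harmonic mean of safety recall and the complement of the over-refusal rate:
\[
F_{\text{Safety-UX}} = \frac{2\,\mathrm{SR}\,(1-\mathrm{ORR})}{\mathrm{SR} + (1-\mathrm{ORR})}.
\]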

\subsubsection{Baselines}

\textbf{Detection baselines}:
L1a (keyword matching), L1b (regex lexicon), and L1c (their combination);
\todo{L2: Llama Guard v2, WildGuard, OpenAI Moderation (to be run)}

\textbf{Intervention baselines}:
Rule-based (REJECT if $l_{\text{risk}} \geq 3$, PASS otherwise),
Threshold Baseline (maps the risk score to an action via thresholds),
\todo{LLM-as-judge (direct judgment by Qwen2.5-72B; to be run)}
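
Written as a policy, the rule-based baseline is a two-way split on the risk level. The name $\pi_{\text{rule}}$ is introduced here for illustration:
\[
\pi_{\text{rule}}(l_{\text{risk}}) =
\begin{cases}
\text{REJECT}, & l_{\text{risk}} \geq 3, \\
\text{PASS}, & \text{otherwise}.
\end{cases}
\]
The Threshold Baseline can be read as the analogous map on a continuous risk score $s$: with hypothetical thresholds $\tau_1 < \dots < \tau_m$ and actions ordered by severity $a^{(0)}, \dots, a^{(m)}$ (neither is specified in this section), it returns $a^{(k)}$ with $k = \lvert\{j : s \geq \tau_j\}\rvert$.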
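
\subsection{RQ1: Detection Performance}

Detailed results are given in Tables~\ref{tab:moduleB_main} and~\ref{tab:per_category_recall} in Section~\ref{sec:moduleB}.

Module B substantially outperforms the baselines on all metrics.
Notably, general-purpose guard models (\todo{Llama Guard v2, WildGuard})
are expected to show recall on companion-specific risk categories (e.g., R3 emotional manipulation and R4 real-world isolation)
that is markedly lower than their overall recall,
which would underscore the need for the CompanionRisk Taxonomy.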

\subsection{RQ2: Comparison of Intervention Policies}

\todo{The main results for this section will be filled in once Module C v5 is complete.}

Key findings (based on v3 results):
the RL policy outperforms both baseline policies on safety\_recall (1.0 vs.\ 0.908) and
UX F-score (0.998 vs.\ 0.952),
which supports the advantage of a learnable intervention policy over fixed rules.

\subsection{RQ3: Ablation Study}

\todo{Ablation tables to be added. Expected to include:
(1) Module B: Response-only / History+R / Persona+R / Full;
(2) Module C: BC-only / RL w/o category reward / Full RL.}