CompanionGuard-RL

wangyu/CompanionGuard-RL

Commit Graph

Author	SHA1	Message	Date

wangyu

ae1b85ca39

feat: SOTA baseline v2 with zh→en translation + companion-adapted prompts

- eval_sota_baselines_v2.py: optimized eval for WildGuard & ShieldGemma-2B
  * ChineseTranslator: Helsinki-NLP/opus-mt-zh-en (local, no API)
  * ShieldGemma: +4 companion-specific safety policies (crisis non-response,
    dependency reinforcement, isolation reinforcement, minor intimacy)
  * WildGuard: companion context injected into prompt + extended keyword parsing
  * Default threshold lowered 0.5 → 0.3 for better recall
  * Translation cache saved to experiments/translation_cache.json (reusable)
- tools/run_sota_v2.sh: one-command runner for both models on server
- paper/05_moduleB.tex: add †-adapted rows to SOTA table + updated discussion
  explaining root causes (language barrier + taxonomy gap) and adaptation results
- paper/07_experiments.tex: update baseline description to include v2 adapted variants

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-20 15:20:54 +08:00
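
Commit ae1b85ca39 mentions a reusable translation cache saved to experiments/translation_cache.json. The repository's actual ChineseTranslator is not shown here, so the following is only a minimal sketch of how a JSON-backed cache around a zh→en translation function might look. The class name `CachedTranslator`, its methods, and the `translate_fn` parameter are hypothetical and chosen for illustration.

```python
import json
from pathlib import Path
from typing import Callable, Dict


class CachedTranslator:
    """Hypothetical wrapper: memoize a zh->en translate function in a JSON file."""

    def __init__(self, translate_fn: Callable[[str], str], cache_path: str) -> None:
        # translate_fn is an assumption: any callable mapping source text to English.
        self.translate_fn = translate_fn
        self.cache_path = Path(cache_path)
        self.cache: Dict[str, str] = {}
        # Reuse translations from an earlier run if the cache file already exists.
        if self.cache_path.exists():
            self.cache = json.loads(self.cache_path.read_text(encoding="utf-8"))

    def translate(self, text: str) -> str:
        # Call the underlying translator only for text not seen before.
        if text not in self.cache:
            self.cache[text] = self.translate_fn(text)
        return self.cache[text]

    def save(self) -> None:
        # ensure_ascii=False keeps the Chinese keys human-readable in the file.
        self.cache_path.parent.mkdir(parents=True, exist_ok=True)
        self.cache_path.write_text(
            json.dumps(self.cache, ensure_ascii=False, indent=2), encoding="utf-8"
        )
```

In a real run, `translate_fn` would presumably wrap the Helsinki-NLP/opus-mt-zh-en model named in the commit. Calling `save()` once at the end of an evaluation would let a later run on the same data skip the translation step for cached inputs.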

zhangsiyuan

52ba43f08d

feat: Module C v5/v6 training complete, ablations, SOTA baselines, paper updates

- Module C: BC+PPO training v5/v6 done; eval results in experiments/eval_intervention_v{5,6}.json
- Reward: v5 label-aligned constrained reward (code/src/rl/reward.py)
- Ablations: Module B (history_r, response_only, full) + Module C (wo_category_reward)
- SOTA baselines: WildGuard and ShieldGemma2b eval scripts and results
- Paper: update sections 05–08 (Module B/C description, experiments table, discussion)
- Docs: add record.md (change log), update state.md and exp.md; retire change.md
- Tools: add html-to-ppt utilities and run_shieldgemma2b.sh
- Configs: add ablation YAML configs for Module B and C
- Cleanup: remove stale reference/ PNG screenshots

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-20 14:24:09 +08:00

2 Commits