CompanionGuard-RL

Files

wangyu 4a0e71fb23 refactor: complete full implementation replacing all placeholder/mock content

Detection module (Module B):
- detector.py: expose separate e_P_pool and e_H_pool for RL state;
  fix compute_loss to skip primary head when c_primary="None"
- dataset.py: handle c_primary="None" safely; add validate_and_normalize
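
Illustration of the compute_loss fix above: a minimal sketch, not the repository's detector.py. It assumes the primary-head loss is a mean cross-entropy over only the samples whose c_primary is not the string "None". The function names, the probability-list inputs and the label_to_idx mapping are hypothetical.

```python
import math


def cross_entropy(probs, target_idx):
    """Negative log-likelihood of the target class for one sample."""
    return -math.log(probs[target_idx])


def primary_head_loss(batch_probs, c_primary, label_to_idx):
    """Mean cross-entropy over samples that have a primary category.

    Samples labelled "None" (assumed here to mean "no primary category")
    are skipped, so they contribute nothing to the primary head.
    """
    losses = [
        cross_entropy(probs, label_to_idx[label])
        for probs, label in zip(batch_probs, c_primary)
        if label != "None"
    ]
    if not losses:
        # Whole batch has no primary label: return a zero loss
        # rather than dividing by zero.
        return 0.0
    return sum(losses) / len(losses)
```

In this sketch an all-"None" batch returns 0.0 instead of raising, which is one way to keep such batches from breaking a training step.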

Data pipeline:
- data_generator.py: 30+ category-specific personas (3+ per R1-R10 + 5 safe);
  systematic category→fine-label mapping; safe sample generation (25%);
  per-category risk level distribution; max_retries logic
- llm_judge.py: incremental file writing; rate limiting; retry logic;
  annotate_from_file convenience method; consistency validation
- annotate_data.py: stratified split by y_risk; dataset statistics report
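
Illustration of the stratified split by y_risk mentioned above: a minimal sketch, not the code in annotate_data.py. It assumes each sample is a dict with a y_risk field. The 80/10/10 ratios, the seed and the function name are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(samples, key="y_risk", ratios=(0.8, 0.1, 0.1), seed=42):
    """Split samples into train/val/test while keeping the label mix per split."""
    rng = random.Random(seed)

    # Group samples by their risk label.
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[key]].append(sample)

    train, val, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_val = int(len(group) * ratios[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        # Whatever is left after rounding down goes to the test split.
        test += group[n_train + n_val:]
    return train, val, test
```

Splitting each label group separately keeps rare risk levels represented in every split, which a plain random split does not guarantee.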

RL module (Module C):
- ppo_trainer.py: fix Gymnasium API (reset→(obs,info), step→5-tuple);
  fix action type passed to env.step; proper buffer reset and size tracking
- companion_env.py: use shared build_obs_vector; add BatchCompanionEnv with
  auto-reset; correct Gymnasium interface
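
Illustration of the Gymnasium-style interface named above (reset returns (obs, info), step returns a 5-tuple): a minimal sketch, not the repository's companion_env.py or ppo_trainer.py. ToyEnv and rollout are hypothetical names, and the toy class only mimics the Gymnasium signatures without importing the library. Casting the action to a plain int is an assumption about what "fix action type" refers to.

```python
class ToyEnv:
    """Toy environment that follows the Gymnasium reset/step signatures."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self, seed=None, options=None):
        self.t = 0
        obs = [0.0]
        info = {}
        return obs, info  # reset returns (obs, info), not obs alone

    def step(self, action):
        self.t += 1
        obs = [float(self.t)]
        reward = 1.0 if action == 1 else 0.0
        terminated = False                  # no task-defined end state here
        truncated = self.t >= self.horizon  # time limit reached
        return obs, reward, terminated, truncated, {}


def rollout(env, policy, max_steps=100):
    """Run one episode and return (total reward, number of steps)."""
    obs, info = env.reset(seed=0)
    total, steps = 0.0, 0
    for _ in range(max_steps):
        action = int(policy(obs))  # hand the env a plain int
        obs, reward, terminated, truncated, info = env.step(action)
        total += reward
        steps += 1
        if terminated or truncated:  # either flag ends the episode
            break
    return total, steps
```

Keeping terminated and truncated separate lets a trainer distinguish a genuine end state from a time-limit cutoff when it bootstraps value estimates.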

Shared utilities (new files):
- src/utils/preprocessing.py: preprocess_samples_with_detector using separate
  e_P_pool/e_H_pool; build_obs_vector; build_bc_tensors for BC warm-up
- src/utils/baselines.py: KeywordDetector (L1a), RegexDetector (L1b),
  CombinedRuleDetector (L1c), rule_based_intervention, threshold_intervention,
  LLMJudgePolicy for full baseline comparison

Scripts:
- train_intervention.py: use preprocessing module; separate e_H/e_P pools
- evaluate.py: proper module imports (no circular scripts import);
  full multi-baseline comparison; save all results to JSON
- generate_data.py: API key check; safe_ratio + max_retries CLI args

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-09 17:50:17 +08:00

File                   Last commit                                                                     Date
annotate_data.py       refactor: complete full implementation replacing all placeholder/mock content  2026-05-09 17:50:17 +08:00
evaluate.py            refactor: complete full implementation replacing all placeholder/mock content  2026-05-09 17:50:17 +08:00
generate_data.py       refactor: complete full implementation replacing all placeholder/mock content  2026-05-09 17:50:17 +08:00
train_detector.py      feat: initial CompanionGuard-RL framework                                       2026-05-09 17:21:11 +08:00
train_intervention.py  refactor: complete full implementation replacing all placeholder/mock content  2026-05-09 17:50:17 +08:00