# CompanionGuard-RL/configs/intervention_config.yaml

detector:
  checkpoint: "checkpoints/detector/best.pt"
  model_name: "hfl/chinese-macbert-large"
  hidden_size: 1024

agent:
  state_hidden: 256
  dropout: 0.1

reward:
  w1: 2.0   # safety gain for correct intervention (REWRITE/REJECT/CRISIS on risky)
  w2: 3.0   # false negative penalty (PASS on high-risk)
  w3: 4.0   # crisis bonus for R1 (self-harm/suicide)
  w4: 1.5   # over-refusal penalty (intervention on safe content)
  w5: 0.5   # UX cost per REJECT/CRISIS action
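  # Worked example (a sketch only: the reward formula is not defined in this
  # file, so the additive form below is an assumption inferred from the
  # comments above, not the project's confirmed implementation):
  #   r = w1*[intervened on risky] - w2*[PASS on high-risk] + w3*[CRISIS on R1]
  #       - w4*[intervened on safe] - w5*[action is REJECT or CRISIS]
  # Under that assumption, with the weights above:
  #   CRISIS on R1 content:   2.0 + 4.0 - 0.5 =  5.5
  #   REWRITE on risky:       2.0             =  2.0
  #   PASS on high-risk:     -3.0             = -3.0
  #   REJECT on safe:        -1.5 - 0.5       = -2.0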

# Stage 1: Behavior cloning warm-up runs on all 4 GPUs
behavior_cloning:
  enabled: true
  epochs: 5
  per_gpu_batch_size: 256   # BC is lightweight MLP training; large batch is fine
  lr: 1e-3
  mixed_precision: "bf16"
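  # Effective batch (a sketch, assuming 4 DDP processes and no gradient
  # accumulation, which this file does not configure):
  #   256 per GPU * 4 GPUs = 1024 samples per optimizer step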

# Stage 2: PPO runs on GPU-0 only (inherently sequential env-agent loop)
ppo:
  total_timesteps: 200000
  n_rollout_steps: 2048
  n_epochs: 4
  batch_size: 256           # PPO mini-batch; large since obs vectors are small
  lr: 3e-4
  clip_eps: 0.2
  entropy_coef: 0.01
  value_coef: 0.5
  max_grad_norm: 0.5
  gamma: 0.99
  gae_lambda: 0.95
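  # Derived quantities (a sketch, assuming a single environment and that each
  # rollout of n_rollout_steps is split into mini-batches of batch_size):
  #   rollouts            ~ 200000 / 2048 = 97.7, i.e. 97 full rollouts
  #   mini-batches/epoch  = 2048 / 256    = 8
  #   updates/rollout     = 8 * n_epochs  = 8 * 4 = 32
  #   total updates       ~ 97 * 32       = 3104
  #                         (98 * 32 = 3136 if a final partial rollout also runs)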

environment:
  max_turns: 20

# Preprocessing: detector inference distributed across 4 GPUs
preprocessing:
  per_gpu_batch_size: 64    # inference batch for converting dataset → RL states

logging:
  project:   "CompanionGuard-RL"
  run_name:  "intervention-ppo-4gpu"
  use_wandb: true

output:
  checkpoint_dir: "checkpoints/intervention"
  save_interval:  10000
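  # If save_interval counts environment timesteps (an assumption; the unit is
  # not stated in this file), a full PPO run of 200000 timesteps writes about
  # 200000 / 10000 = 20 checkpoints.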

# Commit history (from the blame view)
#
# 2026-05-09 17:21:11 +08:00
# feat: initial CompanionGuard-RL framework
#   Two-module pipeline for AI companion safety:
#   - Module B: context-aware risk detector with CrossAttention fusion
#   - Module C: PPO-based adaptive intervention policy
#   Includes CompanionRisk Taxonomy (10 primary + 14 fine-grained labels),
#   dataset generation/annotation pipeline, training scripts, and eval suite.
#   Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
#
# 2026-05-09 17:56:13 +08:00
# feat: multi-GPU support for 4x RTX 5090 (PCIe DDP, BF16)
#   Hardware analysis: 4x RTX 5090 32GB without NVLink is fully sufficient.
#   PCIe 5.0 all-reduce overhead <1% of step time for MacBERT-large (340M params).
#   BF16 mixed precision gives ~2x throughput vs FP32 on 5090.
#   Module B (Detector): full 4-GPU DDP via Accelerate:
#   - DistributedSampler with per-epoch shuffling (correct DDP data split)
#   - BF16 autocast via accelerator.mixed_precision
#   - Gradient accumulation handled by accelerator.accumulate()
#   - Only rank-0 saves checkpoints and logs to wandb
#   - accelerator.gather_for_metrics() for correct multi-GPU validation
#   - per_gpu_batch_size=32, effective_batch = 32×4 = 128
#   Module C (Intervention): hybrid parallel strategy:
#   - Stage 1 (BC warm-up): all 4 GPUs via Accelerate DDP;
#     TensorDataset broadcast from rank-0 to all processes
#   - Stage 2 (PPO): GPU-0 only, since the env-agent loop is inherently sequential
#   - Detector preprocessing: distributed across all 4 GPUs via shard split
#     + all_gather_object to collect results on rank-0
#   Configs updated:
#   - detector_config.yaml: per_gpu_batch_size=32, gradient_accumulation_steps=1,
#     mixed_precision=bf16, num_workers=4
#   - intervention_config.yaml: BC per_gpu_batch_size=256, PPO batch_size=256
#   Launch scripts added:
#   - scripts/run_detector.sh: single command, 4-GPU detector training
#   - scripts/run_intervention.sh: single command, hybrid BC+PPO training
#   - scripts/run_full_pipeline.sh: end-to-end pipeline steps 1-5
#   Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>