# SkeptiSTEM-4B-v2 Stage R3 (GRPO LoRA)

This repository contains the Stage R3 GRPO LoRA adapter for SkeptiSTEM-4B-v2, trained with the DOUBT (Data with Obfuscated Untruths for Better Thinking) framework.
## DOUBT Framework

DOUBT teaches models to verify information rather than accept it blindly. Training prompts are mixed as:
- 60% neutral examples (no suggestion)
- 20% helpful hints (correct answers)
- 20% poison hints (deliberately wrong answers)
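The 60/20/20 prompt mix above can be sketched as a simple sampler. The hint wording below is an illustrative assumption, not the released data pipeline:

```python
import random

# Hypothetical sketch of the DOUBT prompt mix: 60% neutral, 20% helpful,
# 20% poison. Hint templates are assumptions for illustration only.
def make_doubt_prompt(question, correct_answer, wrong_answer, rng=random):
    roll = rng.random()
    if roll < 0.60:
        # Neutral: no suggestion at all
        return question
    elif roll < 0.80:
        # Helpful hint: states the correct answer
        return f"{question}\n(Hint: the answer is {correct_answer}.)"
    else:
        # Poison hint: states a deliberately wrong answer
        return f"{question}\n(Hint: the answer is {wrong_answer}.)"
```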
Models receive rewards for:
- Acknowledging but rejecting false information
- Accepting helpful hints while verifying
- Producing correct final answers
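The reward criteria above could be shaped roughly as follows. The weights and string checks here are hypothetical, chosen only to illustrate the structure, and are not the released reward code:

```python
# Hypothetical DOUBT-style reward shaping. Weights and heuristics are
# illustrative assumptions, not the actual training reward function.
def doubt_reward(completion, final_answer, gold, hint_type, hint_value=None):
    reward = 0.0
    if final_answer == gold:
        # Correct final answer is the dominant reward term
        reward += 10.0
    if hint_type == "poison" and hint_value is not None:
        # Bonus for acknowledging the bad hint without adopting it
        if str(hint_value) in completion and final_answer != hint_value:
            reward += 2.0
    elif hint_type == "helpful":
        # Small bonus for showing verification work rather than copying
        if "=" in completion:
            reward += 1.0
    return reward
```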
## Training Details
- Base model: HallD/SkeptiSTEM-4B-v2-stageR1-merged-16bit (with R2 format adapter merged)
- Dataset: GSM8K (~7,473 examples)
- Algorithm: GRPO (Group Relative Policy Optimization)
- Max steps: 500
- LoRA rank: 64
## Expected Load Order

1. Base: HallD/SkeptiSTEM-4B-v2-stageR1-merged-16bit
2. Merge/apply Stage R2: HallD/SkeptiSTEM-4B-v2-stageR2-format-lora
3. Apply this Stage R3 adapter
## Usage
```python
from unsloth import FastLanguageModel
from peft import PeftModel

# Load base (Stage R1 merged)
base, tok = FastLanguageModel.from_pretrained(
    "HallD/SkeptiSTEM-4B-v2-stageR1-merged-16bit",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Merge Stage R2 format adapter
base = PeftModel.from_pretrained(base, "HallD/SkeptiSTEM-4B-v2-stageR2-format-lora")
base = base.merge_and_unload()

# Apply Stage R3 GRPO adapter
model = PeftModel.from_pretrained(base, "HallD/SkeptiSTEM-4B-v2-stageR3-grpo-lora")
FastLanguageModel.for_inference(model)
```
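Once loaded, the model can be smoke-tested with a deliberately wrong hint to check that it verifies rather than complies. The plain-string prompt below is an illustrative sketch (the base model's actual chat template may differ); the question is a standard GSM8K-style problem whose correct answer is 72, so the hinted 96 is a poison:

```python
# Illustrative poison-hint probe; wording is an assumption, not part of
# the released evaluation set.
question = (
    "Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether?"
)
poison_hint = "A colleague claims the answer is 96."
prompt = f"{question}\n{poison_hint}\nSolve the problem, and double-check any claims before using them."

# With the model loaded as above, generation would look roughly like:
# inputs = tok(prompt, return_tensors="pt").to(model.device)
# out = model.generate(**inputs, max_new_tokens=512)
```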
## Reward Statistics
Final reward averages (last 50 steps):
- Poison: 9.99
- Helpful: 11.18
- Neutral: 10.79
Trained with Unsloth.
## Model Tree

Qwen/Qwen3-4B-Base → unsloth/Qwen3-4B-Base → HallD/SkeptiSTEM-4B-v2-stageR3-grpo-lora