SkeptiSTEM-4B-v2 Stage R3 (GRPO LoRA)

This is the Stage R3 GRPO LoRA adapter, trained with the DOUBT (Data with Obfuscated Untruths for Better Thinking) framework.

DOUBT Framework

DOUBT teaches models to verify information rather than accept it blindly, using a training prompt mix of:

  • 60% neutral examples (no suggestion)
  • 20% helpful hints (correct answers)
  • 20% poison hints (deliberately wrong answers)
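
The 60/20/20 mixture above can be sketched as follows; this is a minimal illustration, not the actual data pipeline, and the function name and hint wording are assumptions:

```python
import random

def make_doubt_prompt(question, correct_answer, wrong_answer, rng=random):
    """Attach a hint to a question per the 60/20/20 DOUBT mixture."""
    r = rng.random()
    if r < 0.60:        # neutral: no suggestion
        return question
    elif r < 0.80:      # helpful hint: the correct answer
        return f"{question}\nHint: the answer might be {correct_answer}."
    else:               # poison hint: a deliberately wrong answer
        return f"{question}\nHint: the answer might be {wrong_answer}."
```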

Models receive rewards for:

  • Acknowledging but rejecting false information
  • Accepting helpful hints after verifying them
  • Producing correct final answers
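
A toy reward along these lines can be written in a few lines; the weights and the `####` answer parsing are assumptions for illustration, not the actual reward used in training:

```python
def doubt_reward(completion, gold_answer, hint_type, hint_value=None):
    """Toy DOUBT-style reward: correct final answers score highest, and a
    bonus is given for acknowledging a poison hint without adopting it."""
    final = completion.split("####")[-1].strip()  # GSM8K-style "#### 42"
    reward = 10.0 if final == str(gold_answer) else 0.0
    if hint_type == "poison" and hint_value is not None:
        mentioned = str(hint_value) in completion
        adopted = final == str(hint_value)
        if mentioned and not adopted:
            reward += 1.0   # acknowledged but rejected the false hint
    return reward
```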

Training Details

  • Base model: HallD/SkeptiSTEM-4B-v2-stageR1-merged-16bit (with R2 format adapter merged)
  • Dataset: GSM8K (~7,473 examples)
  • Algorithm: GRPO (Group Relative Policy Optimization)
  • Max steps: 500
  • LoRA rank: 64
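
GRPO's core idea, group-relative advantages, can be sketched in a few lines: rewards for a group of sampled completions are normalized against the group's own mean and standard deviation, so no learned value model is needed. A minimal sketch, not the training code:

```python
def grpo_advantages(group_rewards):
    """Normalize rewards within one group of sampled completions:
    advantage = (reward - group mean) / group std."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    std = (sum((r - mean) ** 2 for r in group_rewards) / n) ** 0.5
    if std == 0.0:
        std = 1.0   # identical rewards: all advantages are zero
    return [(r - mean) / std for r in group_rewards]
```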

Expected Load Order

  1. Base: HallD/SkeptiSTEM-4B-v2-stageR1-merged-16bit
  2. Merge/apply Stage R2: HallD/SkeptiSTEM-4B-v2-stageR2-format-lora
  3. Apply this Stage R3 adapter

Usage

from unsloth import FastLanguageModel
from peft import PeftModel

# Load base (note: merging adapters into 4-bit quantized weights is lossy;
# set load_in_4bit=False if you need an exact merge)
base, tok = FastLanguageModel.from_pretrained(
    "HallD/SkeptiSTEM-4B-v2-stageR1-merged-16bit",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Merge R2 format
base = PeftModel.from_pretrained(base, "HallD/SkeptiSTEM-4B-v2-stageR2-format-lora")
base = base.merge_and_unload()

# Apply R3 GRPO
model = PeftModel.from_pretrained(base, "HallD/SkeptiSTEM-4B-v2-stageR3-grpo-lora")

FastLanguageModel.for_inference(model)

Reward Statistics

Final reward averages (last 50 steps):

  • Poison: 9.99
  • Helpful: 11.18
  • Neutral: 10.79

Trained with Unsloth.

Model tree for HallD/SkeptiSTEM-4B-v2-stageR3-grpo-lora

  • Base model: Qwen/Qwen3-4B-Base
  • This model is an adapter on that base (one of 3 adapters in the tree).