# SkeptiSTEM-4B-v2 Stage R3 (GRPO LoRA)

This repository contains the Stage R3 GRPO LoRA adapter for SkeptiSTEM-4B-v2, trained with the DOUBT (Data with Obfuscated Untruths for Better Thinking) framework.
## DOUBT Framework

DOUBT teaches models to verify information rather than accept it blindly. Training prompts are mixed as:
- 60% neutral examples (no suggestion)
- 20% helpful hints (correct answers)
- 20% poison hints (deliberately wrong answers)
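The 60/20/20 prompt mix above can be sketched as a simple sampler. The hint wording below is an illustrative assumption, not the released data pipeline:

```python
import random

# Hypothetical sketch of the DOUBT prompt mix: 60% neutral, 20% helpful,
# 20% poison. Hint templates are assumptions for illustration only.
def make_doubt_prompt(question, correct_answer, wrong_answer, rng=random):
    roll = rng.random()
    if roll < 0.60:
        # Neutral: no suggestion at all
        return question
    elif roll < 0.80:
        # Helpful hint: states the correct answer
        return f"{question}\n(Hint: the answer is {correct_answer}.)"
    else:
        # Poison hint: states a deliberately wrong answer
        return f"{question}\n(Hint: the answer is {wrong_answer}.)"
```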
Models receive rewards for:
- Acknowledging but rejecting false information
- Accepting helpful hints while verifying
- Producing correct final answers
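The reward criteria above could be shaped roughly as follows. The weights and string checks here are hypothetical, chosen only to illustrate the structure, and are not the released reward code:

```python
# Hypothetical DOUBT-style reward shaping. Weights and heuristics are
# illustrative assumptions, not the actual training reward function.
def doubt_reward(completion, final_answer, gold, hint_type, hint_value=None):
    reward = 0.0
    if final_answer == gold:
        # Correct final answer is the dominant reward term
        reward += 10.0
    if hint_type == "poison" and hint_value is not None:
        # Bonus for acknowledging the bad hint without adopting it
        if str(hint_value) in completion and final_answer != hint_value:
            reward += 2.0
    elif hint_type == "helpful":
        # Small bonus for showing verification work rather than copying
        if "=" in completion:
            reward += 1.0
    return reward
```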
## Training Details
- Base model: HallD/SkeptiSTEM-4B-v2-stageR1-merged-16bit (with R2 format adapter merged)
- Dataset: GSM8K (~7,473 examples)
- Algorithm: GRPO (Group Relative Policy Optimization)
- Max steps: 500
- LoRA rank: 64
## Expected Load Order

1. Base: HallD/SkeptiSTEM-4B-v2-stageR1-merged-16bit
2. Merge/apply Stage R2: HallD/SkeptiSTEM-4B-v2-stageR2-format-lora
3. Apply this Stage R3 adapter
## Usage
```python
from unsloth import FastLanguageModel
from peft import PeftModel

# Load base (Stage R1 merged)
base, tok = FastLanguageModel.from_pretrained(
    "HallD/SkeptiSTEM-4B-v2-stageR1-merged-16bit",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Merge Stage R2 format adapter
base = PeftModel.from_pretrained(base, "HallD/SkeptiSTEM-4B-v2-stageR2-format-lora")
base = base.merge_and_unload()

# Apply Stage R3 GRPO adapter
model = PeftModel.from_pretrained(base, "HallD/SkeptiSTEM-4B-v2-stageR3-grpo-lora")
FastLanguageModel.for_inference(model)
```
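Once loaded, the model can be smoke-tested with a deliberately wrong hint to check that it verifies rather than complies. The plain-string prompt below is an illustrative sketch (the base model's actual chat template may differ); the question is a standard GSM8K-style problem whose correct answer is 72, so the hinted 96 is a poison:

```python
# Illustrative poison-hint probe; wording is an assumption, not part of
# the released evaluation set.
question = (
    "Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether?"
)
poison_hint = "A colleague claims the answer is 96."
prompt = f"{question}\n{poison_hint}\nSolve the problem, and double-check any claims before using them."

# With the model loaded as above, generation would look roughly like:
# inputs = tok(prompt, return_tensors="pt").to(model.device)
# out = model.generate(**inputs, max_new_tokens=512)
```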
## Reward Statistics
Final reward averages (last 50 steps):
- Poison: 9.99
- Helpful: 11.18
- Neutral: 10.79
Trained with Unsloth.
## Model Tree

Qwen/Qwen3-4B-Base → unsloth/Qwen3-4B-Base → HallD/SkeptiSTEM-4B-v2-stageR3-grpo-lora