arxiv:2603.01571

Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models

Published on Mar 2
· Submitted by
Qiyuan Zhang
on Mar 4

Abstract

Generative Reward Models can be improved by structuring Chain-of-Thought reasoning into breadth and depth components and optimizing them through supervised fine-tuning and reinforcement learning with verifiable rewards.

AI-generated summary

Recent advancements in Generative Reward Models (GRMs) have demonstrated that scaling the length of Chain-of-Thought (CoT) reasoning considerably enhances the reliability of evaluation. However, current works predominantly rely on unstructured length scaling, ignoring the divergent efficacy of different reasoning mechanisms: Breadth-CoT (B-CoT, i.e., multi-dimensional principle coverage) and Depth-CoT (D-CoT, i.e., substantive judgment soundness). To address this, we introduce Mix-GRM, a framework that reconfigures raw rationales into structured B-CoT and D-CoT through a modular synthesis pipeline, subsequently employing Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) to internalize and optimize these mechanisms. Comprehensive experiments demonstrate that Mix-GRM establishes a new state-of-the-art across five benchmarks, surpassing leading open-source RMs by an average of 8.2%. Our results reveal a clear divergence in reasoning: B-CoT benefits subjective preference tasks, whereas D-CoT excels in objective correctness tasks. Consequently, misaligning the reasoning mechanism with the task directly degrades performance. Furthermore, we demonstrate that RLVR acts as a switching amplifier, inducing an emergent polarization where the model spontaneously allocates its reasoning style to match task demands. The synthesized data and models are released on Hugging Face (https://huggingface.co/collections/DonJoey/mix-grm), and the code is released on GitHub (https://github.com/Don-Joey/Mix-GRM).

Community

Paper submitter

🚀 Is making CoT longer really the silver bullet for Reward Models?

As long CoT dominates the LLM landscape, the standard approach to improving Generative Reward Models (LLM-as-a-Judge) has been straightforward: just force the model to generate longer reasoning traces. But does one size fit all?

In our new paper, "Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models," we show that when it comes to evaluation, structure matters just as much as length.

🔥 The Core Problem:
Real-world evaluation is fundamentally divided:

  1. Subjective Preference (e.g., Chat): Requires Breadth (B-CoT)—evaluating multiple dimensions like tone, format, and helpfulness simultaneously.

  2. Objective Correctness (e.g., Math/Code): Requires Depth (D-CoT)—rigorous, step-by-step deductive verification.

Forcing a model to "think longer" on a subjective chat task often just accumulates noise, while using broad aspects on a math problem misses critical logical flaws.
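
To make the breadth/depth split concrete, here is a minimal Python sketch of what the two rationale shapes could look like. The class names (`BreadthCoT`, `DepthCoT`), their fields, and the `TASK_TO_MECHANISM` mapping are illustrative assumptions based on the description above, not the actual data format used in Mix-GRM.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BreadthCoT:
    """Hypothetical B-CoT: one short judgment per evaluation principle."""
    judgments: Dict[str, str] = field(default_factory=dict)  # principle -> note

    def render(self) -> str:
        return "\n".join(f"[{p}] {note}" for p, note in self.judgments.items())


@dataclass
class DepthCoT:
    """Hypothetical D-CoT: an ordered chain of verification steps."""
    steps: List[str] = field(default_factory=list)

    def render(self) -> str:
        return "\n".join(f"Step {i}: {s}" for i, s in enumerate(self.steps, 1))


# Assumed mapping, following the claim above:
# breadth for subjective preference, depth for objective correctness.
TASK_TO_MECHANISM = {"preference": "breadth", "correctness": "depth"}


def is_aligned(task_type: str, mechanism: str) -> bool:
    """True when the reasoning mechanism matches what the task type calls for."""
    return TASK_TO_MECHANISM[task_type] == mechanism
```

In this toy framing, a chat comparison would carry a `BreadthCoT` with entries such as tone, format, and helpfulness, while a math comparison would carry a `DepthCoT` whose steps re-derive the answer one at a time.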

💡 Enter Mix-GRM & Key Discoveries:

  1. 🧠 Synergizing Structures: We designed a framework that equips the GRM with both Breadth (B-CoT) and Depth (D-CoT) reasoning capabilities.

  2. ⚡ "Emergent Polarization": We trained the model using Reinforcement Learning (RLVR) relying exclusively on final verdict supervision, with zero explicit routing labels. Amazingly, the model's structural alignment surged to 95%. It autonomously learned to polarize its reasoning, dynamically selecting Breadth for Preference and Depth for Correctness.

  3. 📉 Highly Compute-Efficient: Unlike length-scaling baselines (like Self-Consistency) that burn massive amounts of tokens, Mix-GRM achieves superior performance while keeping token consumption within the same order of magnitude as standard single-pass reasoning.
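
The verdict-only training signal and the alignment measurement mentioned above can be sketched roughly as follows. This is a toy illustration under stated assumptions: the function names, the verdict strings, and the definition of structural alignment as "fraction of samples whose reasoning style matches the task type" are ours, not taken from the paper or its code.

```python
from typing import Iterable, Tuple

# Assumed mapping: which reasoning style each task type calls for.
EXPECTED_STYLE = {"preference": "breadth", "correctness": "depth"}


def verdict_reward(predicted: str, gold: str) -> float:
    """Verifiable reward: 1.0 if the final verdict matches the label, else 0.0.

    Note that the reasoning style is not an input here, so the reward
    carries no explicit routing signal.
    """
    return 1.0 if predicted.strip().upper() == gold.strip().upper() else 0.0


def structural_alignment(samples: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (task_type, style) pairs where the style matches the task.

    Used only for measurement after training, never inside the reward.
    """
    samples = list(samples)
    if not samples:
        return 0.0
    hits = sum(1 for task, style in samples if EXPECTED_STYLE[task] == style)
    return hits / len(samples)
```

The point of keeping the two functions separate is that the style label only ever appears in the measurement, so any alignment that shows up has to be learned from the verdict reward alone.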
