What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models Paper • 2601.06165 • Published Jan 7 • 16
KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context Paper • 2604.13058 • Published Mar 18 • 1
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs Paper • 2605.09063 • Published 9 days ago • 76
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs Paper • 2605.09063 • Published 9 days ago • 76
Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math Paper • 2602.06291 • Published Feb 6 • 24
Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math Paper • 2602.06291 • Published Feb 6 • 24
Pushing on Multilingual Reasoning Models with Language-Mixed Chain-of-Thought Paper • 2510.04230 • Published Oct 5, 2025 • 27
When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research Paper • 2505.11855 • Published May 17, 2025 • 10
When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research Paper • 2505.11855 • Published May 17, 2025 • 10
Won: Establishing Best Practices for Korean Financial NLP Paper • 2503.17963 • Published Mar 23, 2025
Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning Paper • 2502.17407 • Published Feb 24, 2025 • 25
Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning Paper • 2502.17407 • Published Feb 24, 2025 • 25
LLM-as-a-Judge & Reward Model: What They Can and Cannot Do Paper • 2409.11239 • Published Sep 17, 2024 • 3
Understand, Solve and Translate: Bridging the Multilingual Mathematical Reasoning Gap Paper • 2501.02448 • Published Jan 5, 2025
FINNUMBER/FINCH_TRAIN_ALL_3600_per400_NEW_Rationale_E16 Text Generation • 6B • Updated Feb 28, 2024 • 1
FINNUMBER/FINCH_TRAIN_ALL_3600_per400_NEW_Rationale_E12 Text Generation • 6B • Updated Feb 28, 2024 • 3