EvasionBench: Detecting Evasive Answers in Financial Q&A via Multi-Model Consensus and LLM-as-Judge
Abstract
EvasionBench is a large-scale benchmark for detecting evasive responses in earnings calls. Its multi-model annotation framework leverages disagreement between frontier language models to identify challenging examples, yielding a compact, highly accurate model at significantly reduced inference cost.
Detecting evasive answers in earnings calls is critical for financial transparency, yet progress is hindered by the lack of large-scale benchmarks. We introduce EvasionBench, comprising 30,000 training samples and 1,000 human-annotated test samples (Cohen's Kappa 0.835) across three evasion levels. Our key contribution is a multi-model annotation framework built on a core insight: disagreement between frontier LLMs signals the hard examples most valuable for training. We mine boundary cases where two strong annotators conflict and use a judge model to resolve the labels. This approach outperforms single-model distillation by 2.4 percent, with judge-resolved samples improving generalization despite higher training loss (0.421 vs. 0.393), evidence that disagreement mining acts as implicit regularization. Our trained model, Eva-4B (4B parameters), achieves 81.3 percent accuracy, outperforming its base model by 25 percentage points and approaching frontier LLM performance at a fraction of the inference cost.
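The annotation protocol lends itself to a compact illustration. The sketch below shows disagreement mining as described in the abstract: two annotator models label each Q&A pair, and a judge is consulted only when they conflict. The label names, the `mine_label` function, and the stub annotators are our own illustrative assumptions, not the authors' released pipeline.

```python
# Minimal sketch of disagreement mining, under assumed label names and
# interfaces. Real annotators/judge would wrap frontier-LLM API calls.
from typing import Callable

LABELS = ("direct", "partially_evasive", "evasive")  # three evasion levels

def mine_label(
    question: str,
    answer: str,
    annotator_a: Callable[[str, str], str],  # strong annotator model A
    annotator_b: Callable[[str, str], str],  # strong annotator model B
    judge: Callable[[str, str], str],        # resolves boundary cases
) -> tuple[str, bool]:
    """Return (label, is_hard). Disagreement marks a hard boundary case."""
    a = annotator_a(question, answer)
    b = annotator_b(question, answer)
    if a == b:
        return a, False                       # consensus: easy sample
    return judge(question, answer), True      # judge-resolved hard sample

# Toy usage with stub annotators standing in for real LLM calls:
if __name__ == "__main__":
    q = "What is your margin guidance for Q3?"
    ans = "We remain focused on long-term execution."
    label, hard = mine_label(
        q, ans,
        annotator_a=lambda q, a: "evasive",
        annotator_b=lambda q, a: "partially_evasive",
        judge=lambda q, a: "evasive",
    )
    print(label, hard)  # -> evasive True
```

Keeping only the judge-resolved conflicts as "hard" samples is what the paper credits for the implicit-regularization effect: consensus cases are cheap, while disagreements concentrate the labeling budget on the decision boundary.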
Community
Thanks for featuring our work! EvasionBench aims to bridge the gap in financial transparency. We've released the Eva-4B model and the 1k human-annotated test set.
Paper: https://arxiv.org/abs/2601.09142
Model: https://huggingface.co/FutureMa/Eva-4B
Demo: https://huggingface.co/spaces/FutureMa/financial-evasion-detection
Feel free to ask any questions!
I'm sharing our latest work on detecting evasive answers in earnings calls.
Key Highlights:
- EvasionBench: A large-scale benchmark (30k training / 1k human test).
- Disagreement Mining: A novel annotation framework in which disagreement between LLM annotators identifies high-value training samples.
- Eva-4B: A lightweight 4B-parameter model that achieves 81.3% accuracy, outperforming many closed-source frontier models.
We have open-sourced the model and the demo; a quick usage sketch follows below. Happy to answer any questions about the labeling protocol or the financial NLP aspects!
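For readers who want to try the released checkpoint, here is a minimal usage sketch. It assumes Eva-4B is a causal LM with a chat template on the Hub; the prompt wording and label phrasing are illustrative, not the authors' official inference format.

```python
# Hedged usage sketch for the released FutureMa/Eva-4B checkpoint,
# assuming a standard transformers chat-capable text-generation model.
from transformers import pipeline

generator = pipeline("text-generation", model="FutureMa/Eva-4B")

messages = [{
    "role": "user",
    "content": (
        "Classify the answer's evasiveness as direct, partially evasive, "
        "or evasive.\n"
        "Question: What is your margin guidance for Q3?\n"
        "Answer: We remain focused on long-term execution."
    ),
}]

out = generator(messages, max_new_tokens=32)
# With chat input, the pipeline returns the conversation including the
# assistant turn; the last message holds the model's classification.
print(out[0]["generated_text"][-1]["content"])
```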
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Hard Negative Sample-Augmented DPO Post-Training for Small Language Models (2025)
- Prompt-Based Clarity Evaluation and Topic Detection in Political Question Answering (2026)
- EdgeJury: Cross-Reviewed Small-Model Ensembles for Truthful Question Answering on Serverless Edge Inference (2025)
- FISCAL: Financial Synthetic Claim-document Augmented Learning for Efficient Fact-Checking (2025)
- Targeting Misalignment: A Conflict-Aware Framework for Reward-Model-based LLM Alignment (2025)
- Do Large Language Models Know What They Don't Know? Kalshibench: A New Benchmark for Evaluating Epistemic Calibration via Prediction Markets (2025)
- All That Glisters Is Not Gold: A Benchmark for Reference-Free Counterfactual Financial Misinformation Detection (2026)