PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems Paper • 2606.22388 • Published 4 days ago • 85 • 3
SenTSR-Bench: Thinking with Injected Knowledge for Time-Series Reasoning Paper • 2602.19455 • Published Feb 23 • 1
Enterprise Agents and Benchmarks Collection Enterprise agent ecosystem featuring AssetOpsBench (industrial) and ITBench (SRE, FinOps, CISO), CUGA to accelerate AI Automation • 21 items • Updated about 23 hours ago • 17
Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows Paper • 2605.24219 • Published 30 days ago • 9
Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents Paper • 2606.12674 • Published 15 days ago • 5
Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents Paper • 2606.19704 • Published 7 days ago • 39
Reply: Appreciate the nice writeup. Can we add a) Leaderboard, b) Benchmark: https://github.com/IBM/AssetOpsBench
Article: Beyond LLMs: Why Scalable Enterprise AI Adoption Depends on Agent Logic • ibm-research • 23 days ago • 88
Article: ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM • ibm-research • 28 days ago • 17
Results and Retrospective Analysis of the CODS 2025 AssetOpsBench Challenge Paper • 2605.08518 • Published May 8 • 11 • 2