Continual Harness: Online Adaptation for Self-Improving Foundation Agents • Paper • 2605.09998 • Published 13 days ago • 17
MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Models • Paper • 2603.28590 • Published Mar 30 • 22