ProEdit: Inversion-based Editing From Prompts Done Right Paper • 2512.22118 • Published 9 days ago • 17
LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation Paper • 2512.23576 • Published 6 days ago • 63
Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models Paper • 2512.20557 • Published 12 days ago • 49
SmolVLA: Efficient Vision-Language-Action Model trained on Lerobot Community Data Article • Published Jun 3, 2025 • 305
Metric and Relative Monocular Depth Estimation: An Overview. Fine-Tuning Depth Anything V2 👐 📚 Article • Published Jul 10, 2024 • 91
Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents Paper • 2507.04009 • Published Jul 5, 2025 • 51
StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation Paper • 2508.08248 • Published Aug 11, 2025 • 27
LightLab: Controlling Light Sources in Images with Diffusion Models Paper • 2505.09608 • Published May 14, 2025 • 36
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning Paper • 2504.06958 • Published Apr 9, 2025 • 13
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale Paper • 2504.16030 • Published Apr 22, 2025 • 36
Vidi: Large Multimodal Models for Video Understanding and Editing Paper • 2504.15681 • Published Apr 22, 2025 • 14