oguzhanercan's Collection: Image-Video MultiModal Understanding
• Apollo: An Exploration of Video Understanding in Large Multimodal Models (arXiv:2412.10360, 147 upvotes)
• SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization (arXiv:2501.01245, 5 upvotes)
• VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM (arXiv:2501.00599, 46 upvotes)
• Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks (arXiv:2501.08326, 34 upvotes)
• Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding (arXiv:2501.07783, 8 upvotes)
• Qwen2.5-VL Technical Report (arXiv:2502.13923, 214 upvotes)
• MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning (arXiv:2503.07365, 61 upvotes)
• GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding (arXiv:2503.10596, 18 upvotes)
• Large-scale Pre-training for Grounded Video Caption Generation (arXiv:2503.10781, 16 upvotes)
• ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement (arXiv:2504.01934, 22 upvotes)
• LiveVQA: Live Visual Knowledge Seeking (arXiv:2504.05288, 15 upvotes)
• arXiv:2504.07491 (137 upvotes)
• OmniCaptioner: One Captioner to Rule Them All (arXiv:2504.07089, 20 upvotes)
• Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting (arXiv:2504.05541, 15 upvotes)
• VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning (arXiv:2504.06958, 13 upvotes)
• VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning (arXiv:2504.08837, 43 upvotes)
• InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models (arXiv:2504.10479, 306 upvotes)
• TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning (arXiv:2504.09641, 16 upvotes)
• Multimodal Long Video Modeling Based on Temporal Dynamic Context (arXiv:2504.10443, 3 upvotes)
• Describe Anything: Detailed Localized Image and Video Captioning (arXiv:2504.16072, 64 upvotes)
• Seed1.5-VL Technical Report (arXiv:2505.07062, 155 upvotes)
• Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging (arXiv:2505.05464, 11 upvotes)
• Aya Vision: Advancing the Frontier of Multilingual Multimodality (arXiv:2505.08751, 13 upvotes)
• MMaDA: Multimodal Large Diffusion Language Models (arXiv:2505.15809, 98 upvotes)
• LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning (arXiv:2505.16933, 34 upvotes)
• LaViDa: A Large Diffusion Language Model for Multimodal Understanding (arXiv:2505.16839, 13 upvotes)
• Ming-Omni: A Unified Multimodal Model for Perception and Generation (arXiv:2506.09344, 31 upvotes)
• Is Extending Modality The Right Path Towards Omni-Modality? (arXiv:2506.01872, 24 upvotes)
• Hidden in Plain Sight: VLMs Overlook Their Visual Representations (arXiv:2506.08008, 7 upvotes)
• V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning (arXiv:2506.09985, 31 upvotes)
• Kwai Keye-VL Technical Report (arXiv:2507.01949, 131 upvotes)
• Towards Multimodal Understanding via Stable Diffusion as a Task-Aware Feature Extractor (arXiv:2507.07106, 2 upvotes)
• ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts (arXiv:2507.20939, 57 upvotes)
• VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning (arXiv:2507.22607, 47 upvotes)
• On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models (arXiv:2510.09008, 16 upvotes)
• StreamingVLM: Real-Time Understanding for Infinite Video Streams (arXiv:2510.09608, 51 upvotes)
• OneFlow: Concurrent Mixed-Modal and Interleaved Generation with Edit Flows (arXiv:2510.03506, 15 upvotes)
• V-Thinker: Interactive Thinking with Images (arXiv:2511.04460, 97 upvotes)
• Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers (arXiv:2511.01617, 3 upvotes)
• VideoSSR: Video Self-Supervised Reinforcement Learning (arXiv:2511.06281, 25 upvotes)
• Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum (arXiv:2510.27571, 19 upvotes)
• VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models (arXiv:2511.02712, 5 upvotes)