Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization
Abstract
Being-H0.5 is a Vision-Language-Action model that enables robust cross-embodiment generalization through human-centric learning and a Mixture-of-Transformers architecture with specialized embodiment handling.
We introduce Being-H0.5, a foundational Vision-Language-Action (VLA) model designed for robust cross-embodiment generalization across diverse robotic platforms. Whereas existing VLAs often struggle with morphological heterogeneity and data scarcity, we propose a human-centric learning paradigm that treats human interaction traces as a universal "mother tongue" for physical interaction. To support this, we present UniHand-2.0, the largest embodied pre-training recipe to date, comprising over 35,000 hours of multimodal data across 30 distinct robotic embodiments. Our approach introduces a Unified Action Space that maps heterogeneous robot controls into semantically aligned slots, enabling low-resource robots to bootstrap skills from human data and high-resource platforms. Built upon this human-centric foundation, we design a unified sequential modeling and multi-task pre-training paradigm to bridge human demonstrations and robotic execution. Architecturally, Being-H0.5 adopts a Mixture-of-Transformers design featuring a novel Mixture-of-Flow (MoF) framework to decouple shared motor primitives from specialized embodiment-specific experts. Finally, to keep cross-embodiment policies stable in the real world, we introduce Manifold-Preserving Gating for robustness under sensory shift and Universal Async Chunking to make chunked control uniform across embodiments with differing latency and control profiles. We empirically demonstrate that Being-H0.5 achieves state-of-the-art results on simulated benchmarks such as LIBERO (98.9%) and RoboCasa (53.9%), while also exhibiting strong cross-embodiment capabilities on five robotic platforms.
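The abstract only names the Unified Action Space; as a non-authoritative illustration of the underlying idea (mapping heterogeneous robot controls into semantically aligned slots so that human and robot data share one action format), here is a minimal Python sketch. The slot layout, dimensions, and names such as `SLOT_LAYOUT` and `to_unified` are assumptions made for illustration, not the paper's actual design.

```python
# Hypothetical sketch of the "Unified Action Space" idea: each embodiment's
# native action vector is scattered into a fixed set of semantically aligned
# slots, with a mask recording which slots that embodiment actually controls.
# Slot layout, dimensions, and names are illustrative, not the paper's scheme.
import numpy as np

# Shared slot layout (indices chosen only for illustration).
SLOT_LAYOUT = {
    "right_eef_pose": slice(0, 7),    # xyz position + quaternion orientation
    "right_gripper":  slice(7, 8),
    "right_hand":     slice(8, 30),   # up to 22 finger DoFs (human or dexterous hand)
    "left_eef_pose":  slice(30, 37),
    "left_gripper":   slice(37, 38),
    "left_hand":      slice(38, 60),
    "base_velocity":  slice(60, 63),  # vx, vy, yaw rate for mobile platforms
}
UNIFIED_DIM = 63

def to_unified(native_action, mapping):
    """Scatter a robot's native action into the shared slots.

    `mapping` pairs slot names with slices of the native vector, e.g.
    {"right_eef_pose": slice(0, 7), "right_gripper": slice(7, 8)}.
    Returns the unified action vector and a boolean mask of filled dims.
    """
    unified = np.zeros(UNIFIED_DIM, dtype=np.float32)
    mask = np.zeros(UNIFIED_DIM, dtype=bool)
    for slot_name, native_slice in mapping.items():
        slot = SLOT_LAYOUT[slot_name]
        n = native_slice.stop - native_slice.start
        unified[slot.start : slot.start + n] = native_action[native_slice]
        mask[slot.start : slot.start + n] = True
    return unified, mask

# A parallel-gripper arm fills only the right-arm slots; a bimanual
# dexterous robot or a human hand trace would also fill hand/left slots.
arm_action = np.random.randn(8).astype(np.float32)   # 7-DoF eef pose + gripper
arm_map = {"right_eef_pose": slice(0, 7), "right_gripper": slice(7, 8)}
unified_action, slot_mask = to_unified(arm_action, arm_map)
```

Under this view, a human hand trace and a dexterous robot hand write into the same slots, which is what would let low-resource robots borrow supervision from human data and high-resource platforms.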
Community
We scale human-centric robot learning with Being-H0.5 toward cross-embodiment generalization. Building on over 35,000 hours of data, we unify human hand motion and diverse robot embodiments through a Unified Action Space, and train on all heterogeneous supervision with unified sequence modeling (a schematic sketch of this packing follows the links below). The result is a single foundation model that can perceive, describe, and act within one framework, enabling robust cross-embodiment generalization and real-world deployment across diverse robots and tasks.
Blog: https://research.beingbeyond.com/being-h05
arXiv: https://arxiv.org/pdf/2601.12993
GitHub: https://github.com/BeingBeyond/Being-H
HuggingFace: https://huggingface.co/collections/BeingBeyond/being-h05
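As a rough, hypothetical illustration of the unified sequence modeling mentioned above (training on heterogeneous human and robot supervision as one token stream), the sketch below packs episodes from different embodiments into a single sequence. The `Episode` fields, modality tags, and `pack_episode` function are invented for this sketch and are not the authors' released pipeline.

```python
# Hypothetical sketch of "unified sequence modeling": human-hand and robot
# episodes from different embodiments are serialized into one token stream
# (embodiment tag, text, vision placeholders, action tokens) so a single
# model can train on all of them with next-token prediction.
from dataclasses import dataclass

@dataclass
class Episode:
    embodiment: str      # e.g. "human_hand", "franka", "g1_humanoid"
    instruction: str     # language goal
    frames: list         # placeholder for visual observations
    actions: list        # per-step action vectors (e.g. in the unified space sketched above)

def pack_episode(ep: Episode) -> list:
    """Serialize one episode into a flat token sequence with modality tags."""
    seq = ["<bos>", f"<emb:{ep.embodiment}>", "<text>", *ep.instruction.split(), "</text>"]
    for t, (_frame, action) in enumerate(zip(ep.frames, ep.actions)):
        seq += ["<img>", f"<frame:{t}>", "</img>"]              # stand-in for vision tokens
        seq += ["<act>", *[f"<a:{x:+.2f}>" for x in action], "</act>"]
    seq.append("<eos>")
    return seq

# Human demonstrations and robot trajectories pass through the same packer,
# so one model consumes both as ordinary training sequences.
human_ep = Episode("human_hand", "pick up the cup", frames=[None], actions=[[0.10, -0.20]])
robot_ep = Episode("franka", "pick up the cup", frames=[None], actions=[[0.05, 0.00]])
batch = [pack_episode(human_ep), pack_episode(robot_ep)]
```

Tokenization, padding, and masking are omitted; the point is only that human and robot data become interchangeable sequences under one framework.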
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos (2026)
- Unified Embodied VLM Reasoning with Robotic Action via Autoregressive Discretized Pre-training (2025)
- See Once, Then Act: Vision-Language-Action Model with Task Learning from One-Shot Video Demonstrations (2025)
- InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation (2026)
- Unifying Perception and Action: A Hybrid-Modality Pipeline with Implicit Visual Chain-of-Thought for Robotic Action Generation (2025)
- Embodied Robot Manipulation in the Era of Foundation Models: Planning and Learning Perspectives (2025)
- An Anatomy of Vision-Language-Action Models: From Modules to Milestones and Challenges (2025)
arXiv explained breakdown of this paper 👉 https://arxivexplained.com/papers/being-h05-scaling-human-centric-robot-learning-for-cross-embodiment-generalization
Models citing this paper: 4
Datasets citing this paper: 9
Spaces citing this paper: 0