OpenSubject: Leveraging Video-Derived Identity and Diversity Priors for Subject-driven Image Generation and Manipulation
Abstract
OpenSubject is a large-scale, video-derived dataset for subject-driven image generation and manipulation, built with a four-stage pipeline that preserves identity fidelity and handles complex multi-subject scenes.
Despite promising progress in subject-driven image generation, current models often deviate from the reference identities and struggle in complex scenes with multiple subjects. To address these challenges, we introduce OpenSubject, a large-scale video-derived corpus with 2.5M samples and 4.35M images for subject-driven generation and manipulation. The dataset is built with a four-stage pipeline that exploits cross-frame identity priors. (i) Video Curation. We apply resolution and aesthetic filtering to obtain high-quality clips. (ii) Cross-Frame Subject Mining and Pairing. We use vision-language model (VLM)-based category consensus, local grounding, and diversity-aware pairing to select image pairs. (iii) Identity-Preserving Reference Image Synthesis. We introduce segmentation-map-guided outpainting to synthesize input images for subject-driven generation and box-guided inpainting to synthesize input images for subject-driven manipulation, together with geometry-aware augmentations and irregular boundary erosion. (iv) Verification and Captioning. We use a VLM to validate the synthesized samples, re-synthesize failed ones via stage (iii), and then construct short and long captions. In addition, we introduce a benchmark covering subject-driven generation and manipulation, and evaluate identity fidelity, prompt adherence, manipulation consistency, and background consistency with a VLM judge. Extensive experiments show that training with OpenSubject improves both generation and manipulation performance, particularly in complex scenes.
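Stage (iii) pairs geometry-aware augmentations with irregular boundary erosion so that synthesized reference images do not expose clean cut-out edges around the pasted subject. As a concrete illustration, the sketch below roughens a binary subject mask with a randomized erosion; the function name, kernel range, and `keep_prob` heuristic are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch of stage (iii)'s irregular boundary erosion, assuming the
# subject mask is a binary uint8 array (0 or 255). All parameter choices
# here are assumptions made for illustration.
import cv2
import numpy as np

def irregular_boundary_erosion(mask, max_kernel=15, keep_prob=0.5, rng=None):
    """Erode a binary subject mask so its boundary becomes irregular.

    A plain morphological erosion removes a uniform band around the subject,
    which a generator can learn as a cut-out shortcut. Randomizing the kernel
    size and stochastically keeping boundary pixels breaks that regularity.
    """
    rng = rng or np.random.default_rng()
    k = int(rng.integers(3, max_kernel + 1)) | 1  # random odd kernel size
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (k, k))
    eroded = cv2.erode(mask, kernel)

    # Pixels in the eroded-away band survive independently with prob. keep_prob,
    # leaving a ragged edge instead of a smooth offset contour.
    band = cv2.subtract(mask, eroded)
    keep = (rng.random(band.shape) < keep_prob).astype(mask.dtype)
    return cv2.bitwise_or(eroded, band * keep)

# Hypothetical usage: roughen a mask before compositing a reference image.
mask = np.zeros((256, 256), np.uint8)
cv2.circle(mask, (128, 128), 80, 255, -1)
rough = irregular_boundary_erosion(mask)
```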
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition (2025)
- iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation (2025)
- Kaleido: Open-Sourced Multi-Subject Reference Video Generation Model (2025)
- PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards (2025)
- LayerComposer: Multi-Human Personalized Generation via Layered Canvas (2025)
- ContextAnyone: Context-Aware Diffusion for Character-Consistent Text-to-Video Generation (2025)
- WithAnyone: Towards Controllable and ID Consistent Image Generation (2025)