DART-GUI-7B
Model Description
DART-GUI-7B is a vision-language model fine-tuned from UI-TARS-1.5-7B for GUI (graphical user interface) understanding and interaction tasks. It uses the Qwen2.5-VL architecture and targets multimodal GUI applications such as interface understanding and GUI agent tasks.
Model Details
- Base Model: UI-TARS-1.5-7B (ByteDance-Seed/UI-TARS-1.5-7B)
- Architecture: Qwen2.5-VL (Qwen2_5_VLForConditionalGeneration)
- Parameters: 7B
- Developed by: BIGAI (Beijing Institute for General Artificial Intelligence) and DataCanvas (九章云极)
- Model Type: Vision-Language Model
Model Capabilities
- Image understanding and description
- GUI understanding
- Multimodal dialogue
- Visual question answering
- GUI agent tasks
Usage
Installation
pip install transformers torch accelerate pillow
Loading the Model
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import torch
# Load model and processor
processor = AutoProcessor.from_pretrained("PengxiangLi/dart-gui-7b")
model = AutoModelForVision2Seq.from_pretrained(
    "PengxiangLi/dart-gui-7b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Prepare input
image = Image.open("your_image.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this interface"},
        ],
    }
]
# Process input and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# The processor returns a single batch of tensors, so bind it to one name
inputs = processor(
    text=[text],
    images=[image],
    padding=True,
    return_tensors="pt",
).to(model.device)
# Generate response
generated_ids = model.generate(**inputs, max_new_tokens=512)
# Drop the prompt tokens so that only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
print(output_text[0])
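Screenshots can be large, and in Qwen2.5-VL-style processors the number of visual tokens grows with image area, so it may help to cap the resolution when loading the processor. The following is a minimal sketch, not part of the original card. It assumes the processor accepts the `min_pixels` and `max_pixels` keyword arguments documented for Qwen2.5-VL, and that one visual token covers a 28x28 pixel area (14-pixel patches merged 2x2). The names `pixel_budget` and `load_processor` and the default token bounds are hypothetical and chosen only for illustration.

```python
# Assumption: one visual token corresponds to a 28x28 pixel area
# (14-pixel patches merged 2x2, as in Qwen2.5-VL).
TOKEN_SIDE = 28


def pixel_budget(min_tokens: int, max_tokens: int) -> tuple[int, int]:
    """Convert a visual-token range into (min_pixels, max_pixels)."""
    if not 0 < min_tokens <= max_tokens:
        raise ValueError("expected 0 < min_tokens <= max_tokens")
    area = TOKEN_SIDE * TOKEN_SIDE
    return min_tokens * area, max_tokens * area


def load_processor(repo_id: str, min_tokens: int = 256, max_tokens: int = 1280):
    """Load the processor with a bounded image resolution (hypothetical helper)."""
    # Imported lazily so pixel_budget can be used without transformers installed.
    from transformers import AutoProcessor

    min_pixels, max_pixels = pixel_budget(min_tokens, max_tokens)
    return AutoProcessor.from_pretrained(
        repo_id, min_pixels=min_pixels, max_pixels=max_pixels
    )
```

Under these assumptions, `load_processor("PengxiangLi/dart-gui-7b")` would resize each image so that it costs roughly between 256 and 1280 visual tokens. A lower upper bound saves memory but may blur small interface elements, so the bounds are worth tuning per use case.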
Technical Details
Model Architecture
- Model Type: Qwen2.5-VL
- Hidden Size: 3584
- Number of Attention Heads: 28
- Number of Key-Value Heads: 4
- Number of Hidden Layers: 28
- Intermediate Size: 18944
- Max Position Embeddings: 128000
- Vocabulary Size: 152064
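The figures above imply grouped-query attention: 28 query heads share 4 key-value heads. As a rough illustration of what that means for inference memory, the sketch below derives the head dimension and an estimate of the KV-cache size per token. It assumes the head dimension equals hidden size divided by the number of attention heads and that the cache is stored in bfloat16 (2 bytes per value); `DecoderConfig` and `kv_cache_bytes_per_token` are hypothetical names, and real memory use also depends on the runtime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderConfig:
    """Decoder figures copied from the list above."""
    hidden_size: int = 3584
    num_attention_heads: int = 28
    num_key_value_heads: int = 4
    num_hidden_layers: int = 28

    @property
    def head_dim(self) -> int:
        # Assumption: head_dim = hidden_size / num_attention_heads
        return self.hidden_size // self.num_attention_heads


def kv_cache_bytes_per_token(cfg: DecoderConfig, bytes_per_value: int = 2) -> int:
    """Estimate KV-cache bytes for one token across all layers."""
    # Factor 2 accounts for storing both a key and a value vector.
    return (
        2
        * cfg.num_hidden_layers
        * cfg.num_key_value_heads
        * cfg.head_dim
        * bytes_per_value
    )


cfg = DecoderConfig()
print(cfg.head_dim)                   # 128
print(kv_cache_bytes_per_token(cfg))  # 57344 bytes, i.e. 56 KiB per token
```

With one key-value head per query head (28 instead of 4), the same estimate would be 7 times larger, which is the practical motivation for sharing key-value heads in long GUI trajectories.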
Vision Encoder
- Depth: 32 layers
- Hidden Size: 1280
- Output Hidden Size: 3584
- Number of Attention Heads: 16
- Patch Size: 14x14
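The patch size determines how many visual tokens a screenshot costs. The sketch below gives a rough estimate, assuming the Qwen2.5-VL convention that 2x2 neighbouring patches are merged into one token passed to the language model and that image sides are rounded to the nearest multiple of 28 pixels. It ignores any minimum or maximum pixel limits that the processor applies, so the result is an approximation. `visual_token_count` is a hypothetical helper.

```python
PATCH_SIZE = 14  # from the vision encoder list above
MERGE_SIZE = 2   # assumption: 2x2 patch merging, as in Qwen2.5-VL


def round_to_multiple(value: int, factor: int) -> int:
    """Round to the nearest multiple of factor, but never below one factor."""
    return max(factor, round(value / factor) * factor)


def visual_token_count(width: int, height: int) -> int:
    """Approximate the number of visual tokens for an image of this size."""
    factor = PATCH_SIZE * MERGE_SIZE  # 28 pixels per token side
    resized_w = round_to_multiple(width, factor)
    resized_h = round_to_multiple(height, factor)
    return (resized_w // factor) * (resized_h // factor)


print(visual_token_count(1920, 1080))  # 2691 (resized to 1932 x 1092)
```

A full-HD screenshot therefore costs on the order of a few thousand tokens under these assumptions, which matters when several screenshots are kept in a multi-turn context.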
Training Information
- Base Model: UI-TARS-1.5-7B (ByteDance-Seed/UI-TARS-1.5-7B)
- Training Data: GUI-related datasets
- Training Approach: Multi-turn reinforcement learning for GUI agents via decoupled training and adaptive data curation
Limitations and Considerations
- This model is fine-tuned from UI-TARS-1.5-7B and inherits the characteristics of the base model
- The model is primarily designed for GUI understanding and interaction tasks
- Users should ensure compliance with relevant usage agreements and legal regulations before use
Citation
If you use this model in your research, please cite:
@article{li2025efficient,
  title={Efficient Multi-turn {RL} for {GUI} Agents via Decoupled Training and Adaptive Data Curation},
  author={Li, Pengxiang and Hu, Zechen and Shang, Zirui and Wu, Jingrong and Liu, Yang and Liu, Hui and Gao, Zhi and Shi, Chenrui and Zhang, Bofei and Zhang, Zihao and others},
  journal={arXiv preprint arXiv:2509.23866},
  year={2025}
}
Acknowledgments
This model was jointly developed by BIGAI (Beijing Institute for General Artificial Intelligence) and DataCanvas (九章云极).
License
This model is licensed under Apache 2.0.
Contact
For questions or suggestions, please submit an Issue through the Hugging Face repository.
Model tree for PengxiangLi/dart-gui-7b
Base model
ByteDance-Seed/UI-TARS-1.5-7B