DART-GUI-7B

Model Description

DART-GUI-7B is a vision-language model fine-tuned from UI-TARS-1.5-7B for GUI (graphical user interface) understanding and interaction tasks. It uses the Qwen2.5-VL architecture and takes screenshots or other images together with text prompts as input.

Model Details

Base Model: UI-TARS-1.5-7B (ByteDance-Seed/UI-TARS-1.5-7B)
Architecture: Qwen2.5-VL (Qwen2_5_VLForConditionalGeneration)
Parameters: 7B
Developed by: BIGAI (Beijing Institute for General Artificial Intelligence) and DataCanvas (九章云极)
Model Type: Vision-Language Model

Model Capabilities

Image understanding and description
GUI understanding
Multimodal dialogue
Visual question answering
GUI agent tasks

Usage

Installation

pip install transformers torch accelerate pillow

Loading the Model

from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import torch

# Load model and processor
processor = AutoProcessor.from_pretrained("PengxiangLi/dart-gui-7b")
model = AutoModelForVision2Seq.from_pretrained(
    "PengxiangLi/dart-gui-7b",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Prepare input (convert to RGB so screenshots with an alpha channel are handled)
image = Image.open("your_image.jpg").convert("RGB")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this interface"}
        ]
    }
]

# Build the prompt text from the chat template
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Tokenize the text and preprocess the image in one call;
# the processor returns a single batch object, not a tuple
inputs = processor(
    text=[text],
    images=[image],
    videos=None,
    padding=True,
    return_tensors="pt"
).to(model.device)

# Generate response
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)
print(output_text[0])
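
The loading code above requests torch.bfloat16 weights. As a rough guide to what that implies for memory, the following minimal sketch estimates weight storage from the parameter count alone. It assumes the nominal 7B figure from Model Details; the helper name weight_memory_gib is hypothetical, and activations, the KV cache, and framework overhead are not counted.

```python
# Rough weight-memory estimate; ignores activations, KV cache, and overhead.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gib(num_params, dtype):
    """Return the approximate size of the weights in GiB for a given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    nominal_params = 7e9  # assumption: nominal "7B" figure from the model card
    for dtype in ("float32", "bfloat16"):
        print(f"{dtype}: {weight_memory_gib(nominal_params, dtype):.1f} GiB")
```

Under these assumptions bfloat16 needs about half the weight memory of float32, which is the main reason to pass torch_dtype explicitly when loading.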

Technical Details

Model Architecture

Model Type: Qwen2.5-VL
Hidden Size: 3584
Number of Attention Heads: 28
Number of Key-Value Heads: 4
Number of Hidden Layers: 28
Intermediate Size: 18944
Max Position Embeddings: 128000
Vocabulary Size: 152064
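
The figures above describe grouped-query attention: 28 query heads share 4 key-value heads. The sketch below works through the arithmetic. It assumes the per-head dimension equals hidden size divided by the number of attention heads and that the KV cache is stored in 2-byte values (for example bfloat16); neither is stated in this card, and the function names are illustrative only.

```python
# Values copied from the Model Architecture list.
HIDDEN_SIZE = 3584
NUM_ATTENTION_HEADS = 28
NUM_KEY_VALUE_HEADS = 4
NUM_HIDDEN_LAYERS = 28


def head_dim(hidden_size, num_heads):
    """Per-head dimension, assuming hidden_size splits evenly across heads."""
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    return hidden_size // num_heads


def queries_per_kv_head(num_heads, num_kv_heads):
    """How many query heads share each key-value head."""
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be divisible by num_kv_heads")
    return num_heads // num_kv_heads


def kv_cache_bytes_per_token(num_layers, num_kv_heads, dim, bytes_per_value=2):
    """KV-cache bytes per token: one key and one value vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * dim * bytes_per_value


if __name__ == "__main__":
    dim = head_dim(HIDDEN_SIZE, NUM_ATTENTION_HEADS)
    group = queries_per_kv_head(NUM_ATTENTION_HEADS, NUM_KEY_VALUE_HEADS)
    gqa = kv_cache_bytes_per_token(NUM_HIDDEN_LAYERS, NUM_KEY_VALUE_HEADS, dim)
    mha = kv_cache_bytes_per_token(NUM_HIDDEN_LAYERS, NUM_ATTENTION_HEADS, dim)
    print(f"head_dim={dim}, query heads per KV head={group}")
    print(f"KV cache per token: {gqa // 1024} KiB (vs {mha // 1024} KiB with one KV head per query head)")
```

Under these assumptions each head is 128-dimensional, 7 query heads share each key-value head, and the cache costs 56 KiB per token instead of 392 KiB, which matters for long multi-turn GUI sessions with many screenshots.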

Vision Encoder

Depth: 32 layers
Hidden Size: 1280
Output Hidden Size: 3584
Number of Attention Heads: 16
Patch Size: 14x14
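
To get a feel for how screenshot resolution translates into sequence length, the sketch below counts visual tokens from the 14x14 patch size. It assumes that 2x2 neighboring patches are merged into one token before reaching the language model, which is the usual Qwen2.5-VL setting but is not stated in this card, and that the image has already been resized so both sides are multiples of 28. The actual processor also enforces minimum and maximum pixel budgets, which this sketch ignores.

```python
PATCH_SIZE = 14          # from the Vision Encoder list
SPATIAL_MERGE_SIZE = 2   # assumption: 2x2 patch merging, not stated in this card


def visual_token_count(height, width,
                       patch_size=PATCH_SIZE, merge_size=SPATIAL_MERGE_SIZE):
    """Approximate number of visual tokens for an already-resized image."""
    unit = patch_size * merge_size
    if height % unit != 0 or width % unit != 0:
        raise ValueError(f"height and width must be multiples of {unit}")
    patches = (height // patch_size) * (width // patch_size)
    return patches // (merge_size ** 2)


if __name__ == "__main__":
    # A 1288x728 screenshot: 92 x 52 patches, merged 2x2
    print(visual_token_count(728, 1288))  # 1196
```

A full-resolution desktop screenshot can therefore consume a large share of the context window, so downscaling before inference may be worth considering when multiple screenshots are kept in the dialogue history.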

Training Information

Base Model: UI-TARS-1.5-7B
Fine-tuning Method: Multi-turn reinforcement learning for GUI agents, using decoupled training and adaptive data curation (see Citation)
Training Data: GUI-related datasets

Limitations and Considerations

This model is fine-tuned from UI-TARS-1.5-7B and inherits the characteristics of that base model, including its limitations
The model is primarily designed for GUI understanding and interaction tasks
Users should ensure compliance with relevant usage agreements and legal regulations before use

Citation

If you use this model in your research, please cite:

@article{li2025efficient,
  title={Efficient multi-turn rl for gui agents via decoupled training and adaptive data curation},
  author={Li, Pengxiang and Hu, Zechen and Shang, Zirui and Wu, Jingrong and Liu, Yang and Liu, Hui and Gao, Zhi and Shi, Chenrui and Zhang, Bofei and Zhang, Zihao and others},
  journal={arXiv preprint arXiv:2509.23866},
  year={2025}
}

Acknowledgments

This model was jointly developed by BIGAI (Beijing Institute for General Artificial Intelligence) and DataCanvas (九章云极).

License

This model is licensed under Apache 2.0.

Contact

For questions or suggestions, please submit an Issue through the Hugging Face repository.