DART-GUI-7B

Model Description

DART-GUI-7B is a vision-language model fine-tuned from UITARS-7B for GUI (Graphical User Interface) understanding and interaction tasks. It uses the Qwen2.5-VL architecture and accepts screenshots or other images together with text prompts.

Model Details

  • Base Model: UITARS-7B
  • Architecture: Qwen2.5-VL (Qwen2_5_VLForConditionalGeneration)
  • Parameters: 7B
  • Developed by: BIGAI (Beijing Institute for General Artificial Intelligence) and DataCanvas (九章云极)
  • Model Type: Vision-Language Model
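
For hardware sizing, the weight memory can be estimated from the parameter count. The snippet below is a back-of-the-envelope sketch, assuming the nominal 7B figure listed above. `weight_memory_gb` is a hypothetical helper name, and the estimate ignores activations, the KV cache, and framework overhead.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate weight memory in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


NOMINAL_PARAMS = 7e9  # assumption: the nominal "7B" figure from this card

# bfloat16 stores each parameter in 2 bytes, float32 in 4 bytes
print(weight_memory_gb(NOMINAL_PARAMS, 2))  # 14.0
print(weight_memory_gb(NOMINAL_PARAMS, 4))  # 28.0
```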

Model Capabilities

  • Image understanding and description
  • GUI understanding
  • Multimodal dialogue
  • Visual question answering
  • GUI agent tasks

Usage

Installation

pip install transformers torch accelerate pillow

Loading the Model

from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import torch

# Load model and processor (repository ID as shown on the model page)
processor = AutoProcessor.from_pretrained("PengxiangLi/dart-gui-7b")
model = AutoModelForVision2Seq.from_pretrained(
    "PengxiangLi/dart-gui-7b",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Prepare input
image = Image.open("your_image.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this interface"}
        ]
    }
]

# Process input
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# The processor returns a single batch of tensors, not a tuple
inputs = processor(
    text=[text],
    images=[image],
    videos=None,
    padding=True,
    return_tensors="pt"
).to(model.device)

# Generate response
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)
print(output_text[0])
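
For multi-turn dialogue, the same `messages` structure can be extended with the model's previous reply before asking a follow-up question. The helper below is a minimal sketch and not part of the model's API. `add_turn` is a hypothetical name, and the sketch assumes the chat template accepts assistant turns in the same `{"type": "text", ...}` content format used for user turns.

```python
def add_turn(messages, assistant_reply, follow_up):
    """Return a new message list with the reply and a follow-up question appended.

    Hypothetical helper for illustration; the input list is not modified.
    """
    return messages + [
        {"role": "assistant", "content": [{"type": "text", "text": assistant_reply}]},
        {"role": "user", "content": [{"type": "text", "text": follow_up}]},
    ]


# Text-only stand-in for the first turn (a real first turn would include the image)
history = [
    {"role": "user", "content": [{"type": "text", "text": "Describe this interface"}]}
]
history = add_turn(history, "A login form with two fields.", "Where is the submit button?")
print([m["role"] for m in history])  # ['user', 'assistant', 'user']
```

The extended list can then be passed to `processor.apply_chat_template` as in the example above. An image from an earlier turn still has to be supplied to the processor alongside the text.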

Technical Details

Model Architecture

  • Model Type: Qwen2.5-VL
  • Hidden Size: 3584
  • Number of Attention Heads: 28
  • Number of Key-Value Heads: 4
  • Number of Hidden Layers: 28
  • Intermediate Size: 18944
  • Max Position Embeddings: 128000
  • Vocabulary Size: 152064
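
The head counts above imply grouped-query attention, where several query heads share one key-value head. The arithmetic below is a rough sketch using only the listed values. It assumes the head dimension equals hidden size divided by the number of attention heads, and `kv_cache_bytes` is a hypothetical helper for estimating cache size, not a library function.

```python
# Values copied from the Model Architecture list above
hidden_size = 3584
num_attention_heads = 28
num_key_value_heads = 4
num_hidden_layers = 28

# Assumption: head dimension is hidden_size / num_attention_heads
head_dim = hidden_size // num_attention_heads
# Query heads that share one key-value head under grouped-query attention
queries_per_kv_head = num_attention_heads // num_key_value_heads


def kv_cache_bytes(num_tokens, bytes_per_value=2):
    """Estimate KV-cache size; bytes_per_value=2 corresponds to bfloat16."""
    # The factor 2 accounts for storing both keys and values in every layer
    per_token = 2 * num_hidden_layers * num_key_value_heads * head_dim * bytes_per_value
    return num_tokens * per_token


print(head_dim)             # 128
print(queries_per_kv_head)  # 7
print(kv_cache_bytes(1))    # 57344 bytes per token
```

Under these assumptions, each token of context costs 56 KiB of cache in bfloat16, which matters for long GUI trajectories with many screenshots.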

Vision Encoder

  • Depth: 32 layers
  • Hidden Size: 1280
  • Output Hidden Size: 3584
  • Number of Attention Heads: 16
  • Patch Size: 14x14
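
The patch size determines how many visual tokens a screenshot produces. The estimate below is a simplified sketch, not the processor's actual resizing logic. It assumes that adjacent 2x2 patches are merged into one token (a merge size not stated in this card) and that each image side is rounded to a multiple of 28 pixels. It ignores any minimum or maximum pixel limits the processor may apply. `estimate_visual_tokens` is a hypothetical helper name.

```python
PATCH_SIZE = 14  # from the Vision Encoder list above
MERGE_SIZE = 2   # assumption: 2x2 patch merging, not stated in this card


def estimate_visual_tokens(width, height):
    """Estimate the visual token count for an image of the given pixel size."""
    unit = PATCH_SIZE * MERGE_SIZE  # 28 pixels per merged-token side
    # Round each side to the nearest multiple of the unit, with at least one unit
    w = max(unit, round(width / unit) * unit)
    h = max(unit, round(height / unit) * unit)
    patches = (w // PATCH_SIZE) * (h // PATCH_SIZE)
    return patches // (MERGE_SIZE ** 2)


print(estimate_visual_tokens(1920, 1080))  # 2691
```

Under these assumptions a 1920x1080 screenshot is rounded to 1932x1092 and yields 69 x 39 = 2691 tokens. The real count depends on the processor configuration.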

Training Information

  • Base Model: UITARS-7B
  • Fine-tuning Method: Multi-turn reinforcement learning for GUI agents, using decoupled training and adaptive data curation (see Citation)
  • Training Data: GUI-related datasets

Limitations and Considerations

  • This model is fine-tuned from UITARS-7B and inherits the base model's characteristics, including its limitations
  • The model is primarily designed for GUI understanding and interaction tasks
  • Users should ensure compliance with relevant usage agreements and legal regulations before use

Citation

If you use this model in your research, please cite:

@article{li2025efficient,
  title={Efficient Multi-turn {RL} for {GUI} Agents via Decoupled Training and Adaptive Data Curation},
  author={Li, Pengxiang and Hu, Zechen and Shang, Zirui and Wu, Jingrong and Liu, Yang and Liu, Hui and Gao, Zhi and Shi, Chenrui and Zhang, Bofei and Zhang, Zihao and others},
  journal={arXiv preprint arXiv:2509.23866},
  year={2025}
}

Acknowledgments

This model was jointly developed by BIGAI (Beijing Institute for General Artificial Intelligence) and DataCanvas (九章云极).

License

This model is licensed under Apache 2.0.

Contact

For questions or suggestions, please submit an Issue through the Hugging Face repository.
