RoboFAC: A Comprehensive Framework for Robotic Failure Analysis and Correction
Paper: arXiv:2505.12224
RoboFAC-7B is a large-scale vision-language model specifically finetuned for robotic failure understanding and correction. It takes in visual observations of robot executions (usually video frames) and outputs detailed answers to questions that analyze, diagnose, and propose corrections for robotic manipulation failures.
The model is intended to serve as an external critic within robotic systems: given visual observations of an execution, it analyzes what went wrong, diagnoses the cause, and proposes corrections. It can be integrated into manipulation pipelines that use its feedback to adjust or replan actions.
```python
from transformers import AutoProcessor, AutoModelForVision2Seq

# Load the model and its processor from the Hugging Face Hub
model = AutoModelForVision2Seq.from_pretrained(
    "MINT-SJTU/RoboFAC-7B", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("MINT-SJTU/RoboFAC-7B")

# Example usage with image frames and a question
frames = [...]  # list of PIL.Image frames from the robot execution video
inputs = processor(
    images=frames, text="Why did the robot fail?", return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(outputs, skip_special_tokens=True))
```
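Since the model consumes a handful of frames rather than a full video, executions are typically downsampled before being passed to the processor. The helper below is a hypothetical sketch (not part of the RoboFAC release) of one simple way to pick evenly spaced frames from a longer sequence:

```python
def sample_frames(frames, n):
    """Pick n evenly spaced items from a sequence of frames.

    `frames` can be any sequence (e.g. a list of PIL images decoded
    from the execution video). If the sequence is already short
    enough, it is returned unchanged.
    """
    if n <= 0:
        return []
    if len(frames) <= n or n == 1:
        return list(frames[:n])
    # Spread n indices evenly from the first frame to the last
    step = (len(frames) - 1) / (n - 1)
    return [frames[round(i * step)] for i in range(n)]
```

The result can be passed directly as the `images` argument of the processor call shown above.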
BibTeX:
@misc{lu2025robofaccomprehensiveframeworkrobotic,
title={RoboFAC: A Comprehensive Framework for Robotic Failure Analysis and Correction},
author={Weifeng Lu and Minghao Ye and Zewei Ye and Ruihan Tao and Shuo Yang and Bo Zhao},
year={2025},
eprint={2505.12224},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2505.12224}
}
Base model: Qwen/Qwen2.5-VL-7B-Instruct