ChartCap: Mitigating Hallucination of Dense Chart Captioning

This repository contains the model presented in the paper ChartCap: Mitigating Hallucination of Dense Chart Captioning.

Project Page: https://junyoung-00.github.io/ChartCap/
Code: https://github.com/junyoung-00/ChartCap

Model Description

Phi-3.5-vision-instruct-ChartCap is microsoft/Phi-3.5-vision-instruct fine-tuned on the ChartCap dataset.

The model aims to generate dense captions for charts that accurately describe the structural elements and key insights discernible from the chart, while avoiding extraneous or hallucinated information.

Required Packages

flash_attn==2.5.8
numpy==1.24.4
Pillow==10.3.0
Requests==2.31.0
torch==2.3.0
torchvision==0.18.0
transformers==4.43.0
accelerate==0.30.0
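
Before loading the model, it can help to confirm that the environment matches the pins above. The following is a minimal sketch, not part of the original repository: the helper name find_mismatches is hypothetical, and it assumes that a local build suffix such as "+cu121" on an installed version can be ignored when comparing.

```python
from importlib import metadata

# Pins copied from the package list above.
PINNED = {
    "flash_attn": "2.5.8",
    "numpy": "1.24.4",
    "Pillow": "10.3.0",
    "requests": "2.31.0",
    "torch": "2.3.0",
    "torchvision": "0.18.0",
    "transformers": "4.43.0",
    "accelerate": "0.30.0",
}


def find_mismatches(pins):
    """Return {package: (expected, installed_or_None)} for every pin that is not met.

    Hypothetical helper for illustration only.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            # Assumption: drop a local build suffix such as "+cu121" before comparing.
            installed = metadata.version(name).split("+")[0]
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

An empty result means every pinned package is installed at the listed version.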

How to Use

from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import requests
import torch

model_id = "junyoung-00/Phi-3.5-vision-instruct-ChartCap" 

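# As with the base model, Phi-3.5-vision relies on custom modeling and processing code,
# so trust_remote_code=True is expected to be required.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)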

# Load an example chart image (URL or local path)
image_url = "https://your-server.com/example_chart.png"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

# Define the prompt for dense chart captioning
prompt = "Please provide a detailed caption for the chart."
messages = [
    # Phi-3.5-vision numbers its image placeholders: <|image_1|>, <|image_2|>, ...
    {"role": "user", "content": f"<|image_1|>\n{prompt}"}
]

# Render the chat template to a string, then let the processor tokenize the text
# and prepare the image features in a single call.
prompt_text = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt_text, [image], return_tensors="pt").to(model.device)


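# Generate response
generated_ids = model.generate(**inputs, max_new_tokens=512)

# Drop the prompt tokens so that only the newly generated caption is decoded
generated_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response.strip())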
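
The example above mentions a URL or a local path but only fetches from a URL. The sketch below shows one way to accept both. The helper names is_url and load_image are hypothetical and not part of the model's API, and the 30-second timeout is an arbitrary assumption.

```python
from urllib.parse import urlparse


def is_url(source: str) -> bool:
    """True when source looks like an http(s) URL rather than a filesystem path."""
    return urlparse(source).scheme in ("http", "https")


def load_image(source: str, timeout: float = 30.0):
    """Load a chart image from a URL or a local path as an RGB PIL image.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so that is_url stays usable without Pillow or requests.
    from io import BytesIO

    import requests
    from PIL import Image

    if is_url(source):
        response = requests.get(source, timeout=timeout)
        # Fail loudly on 4xx/5xx instead of trying to decode an error page as an image.
        response.raise_for_status()
        return Image.open(BytesIO(response.content)).convert("RGB")
    return Image.open(source).convert("RGB")
```

With this helper, the image line in the example could become image = load_image("example_chart.png") for a local file, or load_image(image_url) for a remote one.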

Citation

If you find this model or the associated research helpful, please cite:

@inproceedings{lim2025chartcap,
  title = {ChartCap: Mitigating Hallucination of Dense Chart Captioning},
  author = {Junyoung Lim and Jaewoo Ahn and Gunhee Kim},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year = {2025}
}

Model size: 4B params (Safetensors, tensor type F16)

Paper: ChartCap: Mitigating Hallucination of Dense Chart Captioning (arXiv 2508.03164, published Aug 5, 2025)