---
language: en
license: mit
tags:
- clip
- vision-language
- image-text
- zero-shot
- retrieval
pipeline_tag: zero-shot-image-classification
---
# LongCLIP: Unlocking the Long-Text Capability of CLIP
[![Paper](https://img.shields.io/badge/arXiv-2403.15378-b31b1b)](https://arxiv.org/abs/2403.15378)
[![Conference](https://img.shields.io/badge/ECCV-2024-blue)](https://eccv2024.ecva.net/)
[![GitHub](https://img.shields.io/badge/GitHub-longclip--transformers-black)](https://github.com/creative-graphic-design/longclip-transformers)
## Model Description
LongCLIP is an enhanced version of OpenAI's CLIP that extends the maximum input text length from **77 to 248 tokens**, enabling better understanding of detailed, long-form text descriptions. This model maintains CLIP's zero-shot capabilities while significantly improving performance on long-caption retrieval tasks.
### Key Features
- 🔥 **Extended Context Length**: 248 tokens (3.2× the original 77-token CLIP limit)
- 🔥 **Strong Performance**: +20% R@5 on long-caption retrieval, +6% on standard retrieval
- 🔥 **Plug-and-Play**: Drop-in replacement for CLIP in existing workflows
- 🔥 **Two Model Sizes**: Base (LongCLIP-B) and Large (LongCLIP-L)
### Model Variants
| Model | Text Encoder | Vision Encoder | Params | Projection Dim |
| -------------- | --------------- | ---------------- | ------ | -------------- |
| **LongCLIP-B** | 12 layers, 512d | 12 layers, 768d | ~150M | 512 |
| **LongCLIP-L** | 12 layers, 768d | 24 layers, 1024d | ~430M | 768 |
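If you want to double-check these numbers programmatically, here is a minimal sketch; it assumes the remote LongCLIP configuration mirrors the standard `CLIPConfig` fields (`projection_dim`, `text_config`, `vision_config`), which custom code is not guaranteed to do.

```python
from transformers import AutoConfig

# Inspect the LongCLIP-B configuration (field names assume a CLIP-style config)
config = AutoConfig.from_pretrained(
    "creative-graphic-design/LongCLIP-B", trust_remote_code=True
)
print(config.projection_dim)                        # expected: 512 for LongCLIP-B
print(config.text_config.num_hidden_layers)         # expected: 12
print(config.vision_config.hidden_size)             # expected: 768
print(config.text_config.max_position_embeddings)   # expected: 248
```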
## Uses
### Direct Use
LongCLIP can be used for:
- **Zero-shot image classification** with detailed text descriptions (see the sketch after this list)
- **Image-text retrieval** with long, descriptive captions
- **Text-to-image generation** (e.g., Stable Diffusion XL integration)
- **Visual question answering** with complex queries
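As a minimal sketch of the first use case, the example below pairs class names with long-form descriptions and picks the best match for an image. The class names, descriptions, and image path are illustrative placeholders, and the pattern mirrors the Quick Start further down.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("creative-graphic-design/LongCLIP-B", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("creative-graphic-design/LongCLIP-B", trust_remote_code=True)

# Class names paired with detailed, long-form descriptions (placeholders)
classes = {
    "dog": "A golden retriever lying on a green lawn in bright daylight, tongue out, next to a red ball.",
    "cat": "A gray tabby cat curled up asleep on a dark blue sofa cushion beside a knitted blanket.",
}

image = Image.open("your_image.jpg")  # replace with your own image
inputs = processor(
    text=list(classes.values()),
    images=image,
    return_tensors="pt",
    max_length=248,
    padding="max_length",
)

with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)

predicted = list(classes.keys())[probs.argmax(dim=-1).item()]
print("Predicted class:", predicted)
```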
### Downstream Use
LongCLIP serves as a backbone for:
- Vision-language models requiring long text understanding
- Multimodal retrieval systems (a minimal retrieval sketch follows this list)
- Content-based image search engines
- Automated image captioning evaluation
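As an example of the retrieval use case, the sketch below embeds a small set of placeholder captions once, then ranks them against a query image (a hypothetical `query.jpg`) by cosine similarity. In a real system the caption embeddings would be precomputed and stored in an index.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("creative-graphic-design/LongCLIP-B", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("creative-graphic-design/LongCLIP-B", trust_remote_code=True)

# A small corpus of long captions (placeholders for illustration)
captions = [
    "A crowded farmers market on a rainy morning, with vendors selling vegetables under striped awnings.",
    "An empty beach at sunset, waves rolling onto pale sand while seagulls circle a wooden pier.",
    "A cluttered home office with two monitors, a mechanical keyboard, and sticky notes on the wall.",
]

# Embed and normalize the captions
text_inputs = processor(text=captions, return_tensors="pt", max_length=248, padding="max_length")
with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Embed the query image and rank captions by cosine similarity
image_inputs = processor(images=Image.open("query.jpg"), return_tensors="pt")
with torch.no_grad():
    image_emb = model.get_image_features(**image_inputs)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

scores = (image_emb @ text_emb.T).squeeze(0)
for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx]:.3f}  {captions[idx]}")
```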
## How to Use
### Installation
```bash
pip install "transformers[torch,torch-vision]"
```
### Quick Start
```python
from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

# Load model and processor
model = AutoModel.from_pretrained(
    "creative-graphic-design/LongCLIP-B",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "creative-graphic-design/LongCLIP-B",
    trust_remote_code=True,
)

# Prepare inputs
image = Image.open("your_image.jpg")
texts = [
    "A man is crossing the street with a red car parked nearby.",
    "A man is driving a car in an urban scene.",
]
inputs = processor(
    text=texts,
    images=image,
    return_tensors="pt",
    max_length=248,
    padding="max_length",
)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=-1)
print("Probabilities:", probs)
```
### Advanced Usage: Feature Extraction
```python
# Extract text and image features separately
text_inputs = processor(text=texts, return_tensors="pt", max_length=248, padding="max_length")
image_inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    text_features = model.get_text_features(**text_inputs)
    image_features = model.get_image_features(**image_inputs)

# Normalize to unit length and compute cosine similarity, as the original CLIP does
# (optionally scale by the model's learned temperature, e.g. model.logit_scale.exp(),
# before the softmax to match CLIP's internal logits)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)

logits = image_features @ text_features.T
probs = logits.softmax(dim=-1)
```
### Comparison with Original CLIP
```python
# Original CLIP: max 77 tokens
clip_text = "A cat"
# LongCLIP: up to 248 tokens
longclip_text = "A fluffy orange tabby cat with green eyes is sitting on a wooden table near a window, with sunlight streaming through the curtains in the background, creating a warm and cozy atmosphere in a modern living room."
# LongCLIP can handle both short and long texts effectively!
```
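To make the comparison concrete, the sketch below counts the tokens each caption produces so you can see how it compares with the 77-token CLIP limit and the 248-token LongCLIP limit. It assumes the LongCLIP processor exposes a CLIP-style `tokenizer` attribute, as the standard `CLIPProcessor` does.

```python
# Count tokens per caption (assumes processor.tokenizer is a CLIP-style tokenizer)
for name, text in [("short", clip_text), ("long", longclip_text)]:
    n_tokens = len(processor.tokenizer(text).input_ids)
    print(f"{name}: {n_tokens} tokens (CLIP limit: 77, LongCLIP limit: 248)")
```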
## Citation
If you use LongCLIP in your research, please cite:
```bibtex
@inproceedings{zhang2024longclip,
title={Long-CLIP: Unlocking the Long-Text Capability of CLIP},
author={Zhang, Beichen and Zhang, Pan and Dong, Xiaoyi and Zang, Yuhang and Wang, Jiaqi},
booktitle={European Conference on Computer Vision (ECCV)},
year={2024}
}
```
## License
This model is released under the MIT License, consistent with the original CLIP model.
## Acknowledgments
- **OpenAI CLIP**: Foundation model and architecture
- **Original Authors**: Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Jiaqi Wang
## Model Card Contact
For questions and feedback, please open an issue on the [GitHub repository](https://github.com/creative-graphic-design/longclip-transformers).