---
language: en
license: mit
tags:
- clip
- vision-language
- image-text
- zero-shot
- retrieval
pipeline_tag: zero-shot-image-classification
---

# LongCLIP: Unlocking the Long-Text Capability of CLIP

[Paper](https://arxiv.org/abs/2403.15378) · [ECCV 2024](https://eccv2024.ecva.net/) · [Code](https://github.com/creative-graphic-design/longclip-transformers)

## Model Description

LongCLIP is an enhanced version of OpenAI's CLIP that extends the maximum input text length from **77 to 248 tokens**, enabling better understanding of detailed, long-form text descriptions. The model maintains CLIP's zero-shot capabilities while significantly improving performance on long-caption retrieval tasks.

### Key Features

- 🔥 **Extended Context Length**: 248 tokens (about 3.2× the 77-token limit of the original CLIP)
- 🔥 **Strong Performance**: +20% R@5 on long-caption retrieval, +6% on standard retrieval
- 🔥 **Plug-and-Play**: Drop-in replacement for CLIP in existing workflows
- 🔥 **Two Model Sizes**: Base (LongCLIP-B) and Large (LongCLIP-L)

### Model Variants

| Model          | Text Encoder    | Vision Encoder   | Params | Projection Dim |
| -------------- | --------------- | ---------------- | ------ | -------------- |
| **LongCLIP-B** | 12 layers, 512d | 12 layers, 768d  | ~150M  | 512            |
| **LongCLIP-L** | 12 layers, 768d | 24 layers, 1024d | ~430M  | 768            |
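
Both checkpoints load the same way through `AutoModel`. The sketch below is a quick sanity check against the table above: it loads LongCLIP-B (the repository id used throughout this card) and prints its parameter count, projection dimension, and maximum text length. It assumes the remote code exposes a CLIP-style config (`projection_dim`, `text_config.max_position_embeddings`); the LongCLIP-L repository id in the comment is an assumption based on the same naming pattern.

```python
from transformers import AutoModel

# Load the Base checkpoint; the Large variant is assumed to follow the same
# naming pattern (e.g. "creative-graphic-design/LongCLIP-L").
model = AutoModel.from_pretrained(
    "creative-graphic-design/LongCLIP-B",
    trust_remote_code=True,
)

# Cross-check the table above (assumes a CLIP-style config).
n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {n_params / 1e6:.0f}M")
print(f"Projection dim: {model.config.projection_dim}")
print(f"Max text length: {model.config.text_config.max_position_embeddings}")
```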

## Uses

### Direct Use

LongCLIP can be used for:

- **Zero-shot image classification** with detailed text descriptions
- **Image-text retrieval** with long, descriptive captions (see the retrieval sketch after this list)
- **Text-to-image generation** (e.g., Stable Diffusion XL integration)
- **Visual question answering** with complex queries
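
For the retrieval use case, the sketch below ranks a small image gallery against one long caption. It reuses the repository id from the Quick Start section; the image paths are placeholders for your own files, and it assumes the model returns CLIP-style `logits_per_text` (the counterpart of the `logits_per_image` used in the Quick Start below).

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("creative-graphic-design/LongCLIP-B", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("creative-graphic-design/LongCLIP-B", trust_remote_code=True)

# Placeholder gallery: replace with your own image files.
image_paths = ["img_0.jpg", "img_1.jpg", "img_2.jpg"]
images = [Image.open(p) for p in image_paths]

# One long, detailed caption as the query (up to 248 tokens).
query = (
    "A crowded farmers market on a rainy afternoon, with vendors under striped awnings "
    "selling heirloom tomatoes and fresh bread, while a dog in a yellow raincoat waits "
    "next to a bicycle leaning against a lamp post."
)

inputs = processor(
    text=[query],
    images=images,
    return_tensors="pt",
    max_length=248,
    padding="max_length",
)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text has shape (num_texts, num_images); rank the gallery for the query.
ranking = outputs.logits_per_text[0].argsort(descending=True)
print("Gallery ranked by similarity:", [image_paths[i] for i in ranking.tolist()])
```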

### Downstream Use

LongCLIP serves as a backbone for:

- Vision-language models requiring long-text understanding
- Multimodal retrieval systems
- Content-based image search engines (a minimal indexing sketch follows this list)
- Automated image captioning evaluation
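
As a rough sketch of the search-engine use case, the snippet below precomputes L2-normalized image embeddings as a small in-memory index and queries it with a long caption via cosine similarity. It reuses the `model` and `processor` loaded in the Quick Start; the file names are placeholders, and a production system would typically replace the plain tensor index with an approximate-nearest-neighbour library.

```python
import torch
from PIL import Image

# Offline: embed and L2-normalize a gallery of images (placeholder paths).
gallery_paths = ["photo_0.jpg", "photo_1.jpg", "photo_2.jpg"]
gallery = [Image.open(p) for p in gallery_paths]

with torch.no_grad():
    image_inputs = processor(images=gallery, return_tensors="pt")
    index = model.get_image_features(**image_inputs)
    index = index / index.norm(dim=-1, keepdim=True)

# Online: embed a long query caption and retrieve the closest images.
query = (
    "A quiet reading nook with a worn leather armchair, a stack of hardcover novels "
    "on a side table, and afternoon light falling across a patterned rug."
)
with torch.no_grad():
    text_inputs = processor(text=[query], return_tensors="pt", max_length=248, padding="max_length")
    query_emb = model.get_text_features(**text_inputs)
    query_emb = query_emb / query_emb.norm(dim=-1, keepdim=True)

# Cosine similarities between the query and every indexed image.
scores = (query_emb @ index.T).squeeze(0)
top = torch.topk(scores, k=min(3, len(gallery_paths)))
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{gallery_paths[idx]}: {score:.3f}")
```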

## How to Use

### Installation

```bash
pip install "transformers[torch,torch-vision]"
```

### Quick Start

```python
from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

# Load model and processor
model = AutoModel.from_pretrained(
    "creative-graphic-design/LongCLIP-B",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "creative-graphic-design/LongCLIP-B",
    trust_remote_code=True,
)

# Prepare inputs
image = Image.open("your_image.jpg")
texts = [
    "A man is crossing the street with a red car parked nearby.",
    "A man is driving a car in an urban scene.",
]

inputs = processor(
    text=texts,
    images=image,
    return_tensors="pt",
    max_length=248,
    padding="max_length",
)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image
    probs = logits_per_image.softmax(dim=-1)

print("Probabilities:", probs)
```

### Advanced Usage: Feature Extraction

```python
# Extract text and image features separately (reuses the model, processor,
# texts, and image from the Quick Start above)
text_inputs = processor(text=texts, return_tensors="pt", max_length=248, padding="max_length")
image_inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    text_features = model.get_text_features(**text_inputs)
    image_features = model.get_image_features(**image_inputs)

# Normalize the features, then compute cosine similarity scaled by the learned
# temperature, as original CLIP does (assumes the model exposes CLIP's `logit_scale`)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
logits = model.logit_scale.exp() * image_features @ text_features.T
probs = logits.softmax(dim=-1)
```

### Comparison with Original CLIP

```python
# Original CLIP: max 77 tokens
clip_text = "A cat"

# LongCLIP: up to 248 tokens
longclip_text = "A fluffy orange tabby cat with green eyes is sitting on a wooden table near a window, with sunlight streaming through the curtains in the background, creating a warm and cozy atmosphere in a modern living room."

# LongCLIP can handle both short and long texts effectively!
```
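
To see how much of the 248-token budget a caption actually uses, the processor's tokenizer can report token counts directly. This minimal sketch reuses `clip_text` and `longclip_text` from the block above together with the `processor` from the Quick Start; counts include any special tokens the tokenizer adds.

```python
# Token counts for the short and long captions above.
n_short = len(processor.tokenizer(clip_text)["input_ids"])
n_long = len(processor.tokenizer(longclip_text)["input_ids"])
print(f"Short caption: {n_short} tokens; long caption: {n_long} of 248 available tokens.")
```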

## Citation

If you use LongCLIP in your research, please cite:

```bibtex
@inproceedings{zhang2024longclip,
  title={Long-CLIP: Unlocking the Long-Text Capability of CLIP},
  author={Zhang, Beichen and Zhang, Pan and Dong, Xiaoyi and Zang, Yuhang and Wang, Jiaqi},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024}
}
```

## License

This model is released under the MIT License, consistent with the original CLIP model.

## Acknowledgments

- **OpenAI CLIP**: Foundation model and architecture
- **Original Authors**: Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Jiaqi Wang

## Model Card Contact

For questions and feedback, please open an issue on the [GitHub repository](https://github.com/creative-graphic-design/longclip-transformers).