---
language: en
license: mit
tags:
- clip
- vision-language
- image-text
- zero-shot
- retrieval
pipeline_tag: zero-shot-image-classification
---

# LongCLIP: Unlocking the Long-Text Capability of CLIP

[![Paper](https://img.shields.io/badge/arXiv-2403.15378-b31b1b)](https://arxiv.org/abs/2403.15378)
[![Conference](https://img.shields.io/badge/ECCV-2024-blue)](https://eccv2024.ecva.net/)
[![GitHub](https://img.shields.io/badge/GitHub-creative--graphic--design%2Flongclip--transformers-black)](https://github.com/creative-graphic-design/longclip-transformers)

## Model Description

LongCLIP is an enhanced version of OpenAI's CLIP that extends the maximum input text length from **77 to 248 tokens**, enabling better understanding of detailed, long-form text descriptions. The model maintains CLIP's zero-shot capabilities while significantly improving performance on long-caption retrieval tasks.

### Key Features

- 🔥 **Extended Context Length**: 248 tokens (3.2× longer than original CLIP)
- 🔥 **Strong Performance**: +20% R@5 on long-caption retrieval, +6% on standard retrieval
- 🔥 **Plug-and-Play**: Drop-in replacement for CLIP in existing workflows
- 🔥 **Two Model Sizes**: Base (LongCLIP-B) and Large (LongCLIP-L)

### Model Variants

| Model          | Text Encoder    | Vision Encoder   | Params | Projection Dim |
| -------------- | --------------- | ---------------- | ------ | -------------- |
| **LongCLIP-B** | 12 layers, 512d | 12 layers, 768d  | ~150M  | 512            |
| **LongCLIP-L** | 12 layers, 768d | 24 layers, 1024d | ~430M  | 768            |

## Uses

### Direct Use

LongCLIP can be used for:

- **Zero-shot image classification** with detailed text descriptions
- **Image-text retrieval** with long, descriptive captions (see the retrieval example below)
- **Text-to-image generation** (e.g., Stable Diffusion XL integration)
- **Visual question answering** with complex queries

### Downstream Use

LongCLIP serves as a backbone for:

- Vision-language models requiring long text understanding
- Multimodal retrieval systems
- Content-based image search engines
- Automated image captioning evaluation

## How to Use

### Installation

```bash
pip install "transformers[torch,torch-vision]"
```

### Quick Start

```python
from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

# Load model and processor
model = AutoModel.from_pretrained(
    "creative-graphic-design/LongCLIP-B", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "creative-graphic-design/LongCLIP-B", trust_remote_code=True
)

# Prepare inputs
image = Image.open("your_image.jpg")
texts = [
    "A man is crossing the street with a red car parked nearby.",
    "A man is driving a car in an urban scene.",
]
inputs = processor(
    text=texts,
    images=image,
    return_tensors="pt",
    max_length=248,
    padding="max_length",
)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=-1)
print("Probabilities:", probs)
```
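### Image-Text Retrieval with a Long Caption

The sketch below ranks several images against a single detailed caption. It reuses the `model` and `processor` loaded in the Quick Start; the image paths and the caption are placeholders, and it assumes the remote-code model follows the standard `CLIPModel` output convention (`logits_per_text`), so treat it as an illustrative sketch rather than a verified recipe.

```python
from PIL import Image
import torch

# Placeholder image paths -- replace with your own files
image_paths = ["street.jpg", "park.jpg", "beach.jpg"]
images = [Image.open(path) for path in image_paths]

# A detailed caption; LongCLIP accepts up to 248 tokens
long_caption = (
    "A man in a blue jacket is crossing a rainy street at dusk while a red car "
    "waits at the crosswalk, neon shop signs reflect in the puddles, and a "
    "cyclist passes by on the far sidewalk."
)

inputs = processor(
    text=[long_caption],
    images=images,
    return_tensors="pt",
    max_length=248,
    padding="max_length",
)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text has shape (num_texts, num_images); higher means a better match
scores = outputs.logits_per_text.softmax(dim=-1)
best_index = scores.argmax(dim=-1).item()
print(f"Best-matching image: {image_paths[best_index]}")
print("Scores:", scores.tolist())
```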
### Advanced Usage: Feature Extraction

```python
# Extract text and image features separately
text_inputs = processor(
    text=texts, return_tensors="pt", max_length=248, padding="max_length"
)
image_inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    text_features = model.get_text_features(**text_inputs)
    image_features = model.get_image_features(**image_inputs)

# L2-normalize the features before computing cosine similarity,
# as original CLIP does internally
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)

# Scale by 100 (approximately CLIP's learned logit scale) and take the softmax
logits = 100.0 * (image_features @ text_features.T)
probs = logits.softmax(dim=-1)
```

### Comparison with Original CLIP

```python
# Original CLIP: max 77 tokens
clip_text = "A cat"

# LongCLIP: up to 248 tokens
longclip_text = (
    "A fluffy orange tabby cat with green eyes is sitting on a wooden table "
    "near a window, with sunlight streaming through the curtains in the "
    "background, creating a warm and cozy atmosphere in a modern living room."
)

# LongCLIP can handle both short and long texts effectively!
```

## Citation

If you use LongCLIP in your research, please cite:

```bibtex
@inproceedings{zhang2024longclip,
  title={Long-CLIP: Unlocking the Long-Text Capability of CLIP},
  author={Zhang, Beichen and Zhang, Pan and Dong, Xiaoyi and Zang, Yuhang and Wang, Jiaqi},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024}
}
```

## License

This model is released under the MIT License, consistent with the original CLIP model.

## Acknowledgments

- **OpenAI CLIP**: Foundation model and architecture
- **Original Authors**: Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Jiaqi Wang

## Model Card Contact

For questions and feedback, please open an issue on the [GitHub repository](https://github.com/creative-graphic-design/longclip-transformers).