facebook
/

PE-Core-L14-336

Zero-Shot Image Classification

PerceptionEncoder

jz2023 committed on Apr 17, 2025

Commit

2a90b90

verified ·

1 Parent(s): b05eac0

Update README.md

README.md +34 -16

@@ -1,7 +1,6 @@
 ---
 license: apache-2.0
 ---
 # Model Details
 Perception Encoder (PE) is a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. It was introduced in "[Perception Encoder: The best visual embeddings
@@ -14,14 +13,32 @@ are not at the output of the network](https://ai.meta.com/research/publications/
 <img src="https://huggingface.co/facebook/PE-Core-G14-448/resolve/main/docs/pe_image1.png" style="width: 100%; margin: 0 auto; display: block;" />
-| Scale | Tower | Params | Width | Depth | MLP | Heads | CLIP Dim | Resolution | Patch Size | Text Context Length |
-| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
-| **B** | Vision | 0.09B | 768 | 12 | 3072 | 12 | 1024 | 224 | 16 | 32 |
-|  | Text | 0.31B | 1024 | 24 | 4096 | 16 | 1024 | 224 | 16 | 32 |
-| **L** | Vision | 0.32B | 1024 | 24 | 4096 | 16 | 1024 | 336 | 14 | 32 |
-|  | Text | 0.31B | 1024 | 24 | 4096 | 16 | 1024 | 336 | 14 | 32 |
-| **G** | Vision | 1.88B | 1536 | 50 | 8960 | 16 | 1280 | 448 | 14 | 72 |
-|  | Text | 0.47B | 1280 | 24 | 5120 | 20 | 1280 | 448 | 14 | 72 |
 # How to use
@@ -31,8 +48,8 @@ We provide the pretraining code in https://github.com/facebookresearch/perceptio
 ```shell
 git clone https://github.com/facebookresearch/perception_models.git
 cd perception_models
-conda create --name occhi-env python=3.12
-conda activate occhi-env
 # Install PyTorch
 pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 xformers --index-url https://download.pytorch.org/whl/cu124
 # We use torchcodec for decoding videos into PyTorch tensors
@@ -40,14 +57,15 @@ conda install ffmpeg -c conda-forge
 pip install torchcodec==0.1 --index-url=https://download.pytorch.org/whl/cu124
 pip install -e .
 ```
-## Image and Textg Feature extraction with a Trained Model :robot:
 ```python
 import torch
-from occhi.vision_encoder.factory import create_model_and_transforms, get_tokenizer
 from PIL import Image
-model_name = 'PEv1-L14-336'
-pretrained='PATH_TO_PE_Core_L14_336'
 model, _, preprocess = create_model_and_transforms(
     model_name,
@@ -84,4 +102,4 @@ If you find our code useful for your research, please consider citing:
 	    journal={arXiv:xxx.xxxxx},
 	    year={2025}
 	}

 ---
 license: apache-2.0
 ---
 # Model Details
 Perception Encoder (PE) is a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. It was introduced in "[Perception Encoder: The best visual embeddings
 <img src="https://huggingface.co/facebook/PE-Core-G14-448/resolve/main/docs/pe_image1.png" style="width: 100%; margin: 0 auto; display: block;" />
+#### Model Configurations
+PE core currently comes in three sizes. PE core G is the main checkpoint, with L and B models distilled from it.
+| Scale | Tower  | Params | Width | Depth | MLP  | Heads | CLIP Dim |  Resolution / Context Len |
+|:-----:|:------:|:------:|:-----:|:-----:|:----:|:-----:|:--------:|:-------------------------:|
+| **B/16** | Vision | 0.09B  | 768   | 12    | 3072 | 12    | 1024  | 224px                     |
+|       | Text   | 0.31B  | 1024  | 24    | 4096 | 16    | 1024     | 32 tokens                 |
+| **L/14** | Vision | 0.32B  | 1024  | 24    | 4096 | 16    | 1024  | 336px                     |
+|       | Text   | 0.31B  | 1024  | 24    | 4096 | 16    | 1024     | 32 tokens                 |
+| **G/14** | Vision | 1.88B  | 1536  | 50    | 8960 | 16    | 1280  | 448px                     |
+|       | Text   | 0.47B  | 1280  | 24    | 5120 | 20    | 1280     | 72 tokens                 |
+All PE core models use an attention pooling block with 8 heads on top of the vision tower. The L and B models _additionally_ have a class token for global aggregation. See the paper for more details.
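As a rough illustration of what the configurations above imply for the vision tower, the sketch below computes the number of tokens each model processes per image. The resolutions and patch sizes come from the table (the `/16` and `/14` suffixes are the patch sizes). It assumes the class token of the L and B models counts as exactly one extra token in the sequence, which the card does not state explicitly. The function name `sequence_length` is hypothetical and not part of the repository.

```python
# Vision-tower settings read from the configuration table:
# (input resolution in pixels, patch size in pixels, has class token)
CONFIGS = {
    "B/16": (224, 16, True),
    "L/14": (336, 14, True),
    "G/14": (448, 14, False),  # G relies on attention pooling only
}


def sequence_length(resolution: int, patch_size: int, has_class_token: bool) -> int:
    """Return the number of tokens the vision tower sees for one image.

    Assumption: a class token, when present, adds exactly one token.
    """
    if resolution % patch_size != 0:
        raise ValueError("resolution must be divisible by patch size")
    patches_per_side = resolution // patch_size
    return patches_per_side * patches_per_side + (1 if has_class_token else 0)


for name, config in CONFIGS.items():
    print(name, sequence_length(*config))
```

Under these assumptions the B/16, L/14, and G/14 towers process 197, 577, and 1024 tokens per image respectively.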
+#### Model Performance
+PE core obtains extremely strong results across the board on zero-shot image classification and retrieval _as well as_ zero-shot video classification and retrieval. We present a sample of its performance across those domains below.
+| Model | IN-1k | IN-v2 | IN-A | ObjectNet | COCO-T2I | Kinetics-400 | VTT-T2I | Checkpoint |
+|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
+| **B/16 @224** | 78.4 | 71.7 | 62.4 |  71.9 | 50.9 | 65.6 | 47.6 | [PE-Core-B16-224](https://huggingface.co/facebook/PE-Core-B16-224) |
+| **L/14 @336** | 83.5 | 77.9 | 89.0 | 84.7 | 57.1 | 73.4 | 50.3  | [PE-Core-L14-336](https://huggingface.co/facebook/PE-Core-L14-336) |
+| **G/14 @448** | 85.4 | 80.2 | 92.6 | 88.2 | 58.1 | 76.9 | 51.2  | [PE-Core-G14-448](https://huggingface.co/facebook/PE-Core-G14-448) |
+PE core performs particularly well on the _hard_ benchmarks such as ObjectNet and ImageNet-A.
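For readers unfamiliar with how zero-shot classification scores are typically derived from a vision-language encoder, here is a minimal pure-Python sketch of the usual CLIP-style recipe: L2-normalize the image and text embeddings, take scaled cosine similarities, and apply a softmax. The toy 3-dimensional vectors are placeholders rather than real model outputs, the scale of 100 is a common convention assumed here, and `zero_shot_probs` is a hypothetical helper, not a function from the repository.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def zero_shot_probs(image_embedding, text_embeddings, scale=100.0):
    """Softmax over scaled cosine similarities between one image and N prompts."""
    image = l2_normalize(image_embedding)
    logits = []
    for text_embedding in text_embeddings:
        text = l2_normalize(text_embedding)
        logits.append(scale * sum(a * b for a, b in zip(image, text)))
    # Subtract the max logit for numerical stability before exponentiating.
    max_logit = max(logits)
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Placeholder embeddings (NOT produced by PE core), one per class prompt.
prompts = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
}
image = [0.9, 0.1, 0.0]

probs = zero_shot_probs(image, list(prompts.values()))
best = max(zip(prompts, probs), key=lambda pair: pair[1])[0]
print(best)
```

The predicted class is simply the prompt with the highest probability. With real embeddings, the same arithmetic would run on the model's image and text features instead of these toy vectors.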
 # How to use
 ```shell
 git clone https://github.com/facebookresearch/perception_models.git
 cd perception_models
+conda create --name perception_models python=3.12
+conda activate perception_models
 # Install PyTorch
 pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 xformers --index-url https://download.pytorch.org/whl/cu124
 # We use torchcodec for decoding videos into PyTorch tensors
 pip install torchcodec==0.1 --index-url=https://download.pytorch.org/whl/cu124
 pip install -e .
 ```
+This will install an editable version of the repo, allowing you to make changes to the code without needing to reinstall the package every time.
+## Image and Text Feature Extraction with a Trained Model
 ```python
 import torch
+from core.vision_encoder.factory import create_model_and_transforms, get_tokenizer
 from PIL import Image
+model_name = 'PEv1-L14_336'
+pretrained = 'PATH_TO_PE_Core_L14_336'
 model, _, preprocess = create_model_and_transforms(
     model_name,
 	    journal={arXiv:xxx.xxxxx},
 	    year={2025}
 	}