jz2023 committed on
Commit 2a90b90 · verified · 1 Parent(s): b05eac0

Update README.md

Files changed (1): README.md (+34 -16)
README.md CHANGED
@@ -1,7 +1,6 @@
 ---
 license: apache-2.0
 ---
-
 # Model Details
 
 Perception Encoder (PE) is a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. It was introduced in "[Perception Encoder: The best visual embeddings
@@ -14,14 +13,32 @@ are not at the output of the network](https://ai.meta.com/research/publications/
 <img src="https://huggingface.co/facebook/PE-Core-G14-448/resolve/main/docs/pe_image1.png" style="width: 100%; margin: 0 auto; display: block;" />
 
 
-| Scale | Tower | Params | Width | Depth | MLP | Heads | CLIP Dim | Resolution | Patch Size | Text Context Length |
-| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
-| **B** | Vision | 0.09B | 768 | 12 | 3072 | 12 | 1024 | 224 | 16 | 32 |
-| | Text | 0.31B | 1024 | 24 | 4096 | 16 | 1024 | 224 | 16 | 32 |
-| **L** | Vision | 0.32B | 1024 | 24 | 4096 | 16 | 1024 | 336 | 14 | 32 |
-| | Text | 0.31B | 1024 | 24 | 4096 | 16 | 1024 | 336 | 14 | 32 |
-| **G** | Vision | 1.88B | 1536 | 50 | 8960 | 16 | 1280 | 448 | 14 | 72 |
-| | Text | 0.47B | 1280 | 24 | 5120 | 20 | 1280 | 448 | 14 | 72 |
 
 
 # How to use
@@ -31,8 +48,8 @@ We provide the pretraining code in https://github.com/facebookresearch/perceptio
 ```shell
 git clone https://github.com/facebookresearch/perception_models.git
 cd perception_models
-conda create --name occhi-env python=3.12
-conda activate occhi-env
 # Install PyTorch
 pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 xformers --index-url https://download.pytorch.org/whl/cu124
 # We use torchcodec for decoding videos into PyTorch tensors
@@ -40,14 +57,15 @@ conda install ffmpeg -c conda-forge
 pip install torchcodec==0.1 --index-url=https://download.pytorch.org/whl/cu124
 pip install -e .
 ```
-## Image and Textg Feature extraction with a Trained Model :robot:
 
 ```python
 import torch
-from occhi.vision_encoder.factory import create_model_and_transforms, get_tokenizer
 from PIL import Image
 
-model_name = 'PEv1-L14-336'
-pretrained='PATH_TO_PE_Core_L14_336'
 
 model, _, preprocess = create_model_and_transforms(
     model_name,
@@ -84,4 +102,4 @@ If you find our code useful for your research, please consider citing:
 journal={arXiv:xxx.xxxxx},
 year={2025}
 }
-

 ---
 license: apache-2.0
 ---
 # Model Details
 
 Perception Encoder (PE) is a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. It was introduced in "[Perception Encoder: The best visual embeddings
 
 <img src="https://huggingface.co/facebook/PE-Core-G14-448/resolve/main/docs/pe_image1.png" style="width: 100%; margin: 0 auto; display: block;" />
 
 
+ #### Model Configurations
+ PE core currently comes in 3 sizes. PE core G is the main checkpoint, with L and B models distilled from it.
+
+ | Scale | Tower | Params | Width | Depth | MLP | Heads | CLIP Dim | Resolution / Context Len |
+ |:-----:|:------:|:------:|:-----:|:-----:|:----:|:-----:|:--------:|:-------------------------:|
+ | **B/16** | Vision | 0.09B | 768 | 12 | 3072 | 12 | 1024 | 224px |
+ | | Text | 0.31B | 1024 | 24 | 4096 | 16 | 1024 | 32 tokens |
+ | **L/14** | Vision | 0.32B | 1024 | 24 | 4096 | 16 | 1024 | 336px |
+ | | Text | 0.31B | 1024 | 24 | 4096 | 16 | 1024 | 32 tokens |
+ | **G/14** | Vision | 1.88B | 1536 | 50 | 8960 | 16 | 1280 | 448px |
+ | | Text | 0.47B | 1280 | 24 | 5120 | 20 | 1280 | 72 tokens |
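The distillation of the L and B models from G could look roughly like the following; this is a generic embedding-distillation sketch, not the paper's actual recipe, and `teacher_emb` / `student_emb` are illustrative names:

```python
import torch
import torch.nn.functional as F

# Illustrative embedding distillation (an assumption -- see the paper for the
# real recipe): train a smaller student to reproduce the frozen teacher's
# normalized CLIP embeddings.
torch.manual_seed(0)
teacher_emb = F.normalize(torch.randn(8, 1280), dim=-1)  # frozen G-scale outputs
student_emb = torch.randn(8, 1280, requires_grad=True)   # student outputs, after a
                                                         # projection to teacher dim

# Loss: 1 - mean cosine similarity between student and teacher embeddings.
loss = 1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1).mean()
loss.backward()
```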
+
+ All PE core models use an attention pooling block with 8 heads on top of the vision tower. The L and B models _additionally_ have a class token for global aggregation. See the paper for more details.
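A minimal sketch of such an attention-pooling head (illustrative only, not the repository's implementation; the class name and shapes are assumptions):

```python
import torch
import torch.nn as nn

class AttnPool(nn.Module):
    """A single learned query attends over the patch tokens (8 heads, as above)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, dim) -> pooled: (batch, dim)
        q = self.query.expand(tokens.shape[0], -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)
        return pooled.squeeze(1)

pool = AttnPool(dim=1024)          # L-scale vision width from the table
x = torch.randn(2, 576, 1024)      # 2 images, 24x24 = 576 patch tokens at 336px / patch 14
print(pool(x).shape)               # torch.Size([2, 1024])
```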
+
+
+ #### Model Performance
+ PE core obtains extremely strong results across the board on zero-shot image classification and retrieval _as well as_ zero-shot video classification and retrieval. We present a sample of its performance across those domains below.
+
+ | Model | IN-1k | IN-v2 | IN-A | ObjectNet | COCO-T2I | Kinetics-400 | VTT-T2I | Checkpoint |
+ |:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
+ | **B/16 @224** | 78.4 | 71.7 | 62.4 | 71.9 | 50.9 | 65.6 | 47.6 | [PE-Core-B16-224](https://huggingface.co/facebook/PE-Core-B16-224) |
+ | **L/14 @336** | 83.5 | 77.9 | 89.0 | 84.7 | 57.1 | 73.4 | 50.3 | [PE-Core-L14-336](https://huggingface.co/facebook/PE-Core-L14-336) |
+ | **G/14 @448** | 85.4 | 80.2 | 92.6 | 88.2 | 58.1 | 76.9 | 51.2 | [PE-Core-G14-448](https://huggingface.co/facebook/PE-Core-G14-448) |
+
+ PE core performs particularly well on the _hard_ benchmarks such as ObjectNet and ImageNet-A.
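Zero-shot classification with a dual encoder like this reduces to cosine similarity between normalized image and text embeddings; a self-contained sketch with random tensors standing in for PE outputs:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-ins for PE outputs: 4 images and 3 class-prompt embeddings, CLIP dim 1024.
image_features = F.normalize(torch.randn(4, 1024), dim=-1)
text_features = F.normalize(torch.randn(3, 1024), dim=-1)

# Scaled cosine-similarity logits; softmax over classes gives zero-shot probabilities.
logits = 100.0 * image_features @ text_features.T     # (4, 3)
probs = logits.softmax(dim=-1)
predictions = probs.argmax(dim=-1)                    # one class index per image
```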
 
 
 # How to use
 
 ```shell
 git clone https://github.com/facebookresearch/perception_models.git
 cd perception_models
+ conda create --name perception_models python=3.12
+ conda activate perception_models
 # Install PyTorch
 pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 xformers --index-url https://download.pytorch.org/whl/cu124
 # We use torchcodec for decoding videos into PyTorch tensors
 
 pip install torchcodec==0.1 --index-url=https://download.pytorch.org/whl/cu124
 pip install -e .
 ```
+ This will install an editable version of the repo, allowing you to make changes to the code without needing to reinstall the package every time.
+ ## Image and Text Feature extraction with a Trained Model
 ```python
 import torch
+ from core.vision_encoder.factory import create_model_and_transforms, get_tokenizer
 from PIL import Image
 
+ model_name = 'PEv1-L14_336'
+ pretrained = 'PATH_TO_PE_Core_L14_336'
 
 model, _, preprocess = create_model_and_transforms(
     model_name,
 
 journal={arXiv:xxx.xxxxx},
 year={2025}
 }
+