LEAF Embed BEIR

A text embedding model trained with LEAF (Lightweight Embedding Alignment Framework) distillation, targeting competitive retrieval performance on the BEIR benchmark.

Model Description

This model was created by distilling knowledge from Snowflake/snowflake-arctic-embed-m-v1.5 (teacher) into a smaller, more efficient student architecture.

Architecture

  • Encoder: 8-layer BERT, hidden size 512
  • Attention heads: 8
  • Output dimension: 768
  • Parameters: ~65M (vs. 109M for the teacher)
  • Pooling: mean pooling
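
The 512-dimensional encoder paired with a 768-dimensional output suggests the student projects its pooled output into the teacher's embedding space. Below is a minimal sketch of such a configuration, assuming a stock transformers BertConfig plus a linear projection; the published checkpoint's exact architecture may differ.

import torch
from transformers import BertConfig, BertModel

# Student encoder: 8 layers, hidden size 512, 8 attention heads.
config = BertConfig(
    num_hidden_layers=8,
    hidden_size=512,
    num_attention_heads=8,
    intermediate_size=2048,  # assumed 4 * hidden_size
)
encoder = BertModel(config)

# Hypothetical projection aligning the 512-d pooled output with the
# teacher's 768-d embedding space.
projection = torch.nn.Linear(512, 768)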

Training

  • Method: LEAF Distillation (L2 loss on normalized embeddings)
  • Teacher: Snowflake/snowflake-arctic-embed-m-v1.5
  • Hardware: NVIDIA B200 GPU on Modal.com
  • Training Data: 5M samples from BEIR, MS MARCO, Wikipedia
  • Epochs: 3
  • Final Teacher-Student Similarity: 77.2% (see the sketch below)
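
The similarity figure is presumably the mean cosine similarity between teacher and student embeddings on held-out text. A minimal sketch of how such a number could be computed follows; the sentences are placeholders, and this is not the actual evaluation script.

import torch
from sentence_transformers import SentenceTransformer

teacher = SentenceTransformer("Snowflake/snowflake-arctic-embed-m-v1.5")
student = SentenceTransformer("wolfnuker/leaf-embed-beir")

# Placeholder held-out sentences.
sentences = ["held-out sentence one", "held-out sentence two"]
t = torch.nn.functional.normalize(teacher.encode(sentences, convert_to_tensor=True), dim=1)
s = torch.nn.functional.normalize(student.encode(sentences, convert_to_tensor=True), dim=1)

# Mean cosine similarity between matching teacher/student embeddings.
mean_cosine = (t * s).sum(dim=1).mean()
print(f"teacher-student similarity: {mean_cosine:.3f}")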

Usage

With Transformers

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("wolfnuker/leaf-embed-beir")
model = AutoModel.from_pretrained("wolfnuker/leaf-embed-beir")

# Average the token embeddings over the sequence, ignoring padded positions.
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Example usage
sentences = ["This is an example sentence", "Each sentence is converted to a vector"]
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoded)
    embeddings = mean_pooling(outputs, encoded["attention_mask"])
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

print(embeddings.shape)  # [2, 768]
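
Because the embeddings are L2-normalized, a plain dot product gives the cosine similarity used for retrieval scoring. Continuing from the snippet above:

# Cosine similarity between the two example sentences; the embeddings are
# already unit-length, so a dot product is sufficient.
similarity = embeddings[0] @ embeddings[1]
print(f"cosine similarity: {similarity.item():.3f}")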

With Sentence-Transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("wolfnuker/leaf-embed-beir")
embeddings = model.encode(["This is an example sentence", "Each sentence is converted"])

Evaluation Results

BEIR Benchmark

  • NFCorpus: NDCG@10 = 0.0896
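
The NFCorpus score can be checked with the beir evaluation harness. The sketch below follows the standard BEIR quickstart pattern; the dataset URL and API come from the beir repository's documentation, not from this model's training code.

# Sketch: evaluate the model on BEIR's NFCorpus with exact dense retrieval.
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

retriever = EvaluateRetrieval(
    DRES(models.SentenceBERT("wolfnuker/leaf-embed-beir"), batch_size=64),
    score_function="cos_sim",
)
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)  # dictionary of NDCG scores, including NDCG@10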

Note: This is an initial baseline model. Performance will improve with:

  • More training data and epochs
  • IE-specific contrastive training (entity masking, relation pairs)
  • Hyperparameter tuning

Training Details

Hyperparameters

  • Learning rate: 2e-5 → 2e-8 (cosine decay)
  • Batch size: 320 effective (64 per step × 5 gradient accumulation steps)
  • Warmup ratio: 10%
  • Mixed precision: FP16
  • Max sequence length: 256
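
One way to realize this schedule (linear warmup into a cosine decay toward the 2e-8 floor) with stock PyTorch schedulers is sketched below; the optimizer choice, step counts, and the stand-in student module are assumptions, not taken from the training code.

import torch

student = torch.nn.Linear(512, 768)  # stand-in for the student encoder being trained

# Placeholder values; the real step count depends on dataset size and epochs.
total_steps = 10_000
warmup_steps = int(0.10 * total_steps)

optimizer = torch.optim.AdamW(student.parameters(), lr=2e-5)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[
        torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=warmup_steps),
        torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps, eta_min=2e-8),
    ],
    milestones=[warmup_steps],
)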

Loss Function

LEAF uses L2 loss on normalized embeddings:

L = MSE(normalize(student_emb), normalize(teacher_emb))
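
In code, this amounts to mean squared error between L2-normalized embedding vectors; a minimal PyTorch sketch (the function name is illustrative):

import torch
import torch.nn.functional as F

def leaf_distillation_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    # Normalize both embeddings to unit length, then take the mean squared error.
    student_norm = F.normalize(student_emb, p=2, dim=-1)
    teacher_norm = F.normalize(teacher_emb, p=2, dim=-1)
    return F.mse_loss(student_norm, teacher_norm)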

Limitations

  • Trained primarily on English text
  • Initial baseline; further tuning is recommended for production use
  • Optimized for retrieval; may need adaptation for other tasks

Citation

If you use this model, please cite:

@misc{leaf-embed-beir,
  author = {RankSaga},
  title = {LEAF Embed BEIR: Text Embeddings via Distillation},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/wolfnuker/leaf-embed-beir}
}

Acknowledgments

License

Apache 2.0
