# BART (From Scratch): Spanish to Portuguese (ES to PT)
This repository contains an encoder–decoder Transformer (BART-style) trained from scratch for Spanish to Portuguese translation using the Helsinki-NLP/Tatoeba dataset (es-pt).
## Model details
- Task: Machine Translation (ES to PT)
- Architecture: BART-style encoder–decoder Transformer (trained from scratch)
- Tokenizer: Subword BPE (32k vocab)
- Max sequence length: 256 (source) / 256 (target)
- Decoding used for evaluation: Beam search (beam=5)
## Architecture summary
| Component | Value |
|---|---|
| vocab size | 32,000 |
| d_model | 512 |
| encoder layers | 6 |
| decoder layers | 6 |
| attention heads | 8 |
| FFN dim | 2048 |
| dropout | 0.1 |
| parameters | ~61.6M |
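The reported parameter count can be roughly sanity-checked from the table above. The sketch below assumes tied input/output token embeddings and standard BART-style layers with biases, and omits positional embeddings and final layer norms (details the card does not state), so it lands slightly below the reported ~61.6M:

```python
# Back-of-the-envelope parameter count for the configuration above.
V, d, ff, n_enc, n_dec = 32_000, 512, 2048, 6, 6

embeddings = V * d                      # shared token embeddings (tied with LM head)
attn = 4 * (d * d + d)                  # Q, K, V, and output projections, with biases
ffn = (d * ff + ff) + (ff * d + d)      # two linear layers of the feed-forward block
ln = 2 * d                              # one LayerNorm (weight + bias)

enc_layer = attn + ffn + 2 * ln         # self-attention + FFN, 2 LayerNorms
dec_layer = 2 * attn + ffn + 3 * ln     # adds cross-attention, 3 LayerNorms

total = embeddings + n_enc * enc_layer + n_dec * dec_layer
print(f"{total / 1e6:.1f}M parameters")  # ~60.5M, close to the reported ~61.6M
```

The remaining ~1M parameters would come from the omitted pieces (positional embeddings, extra layer norms, and any untied head bias), depending on the exact implementation.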
## Dataset

- Dataset: Helsinki-NLP/tatoeba
- Config / language pair: es-pt
- Splits used: official train, validation, test
- Train/Val/Test sizes: 63,716 / 1,998 / 1,999
- Leakage prevention: tokenizer trained on the training split only; duplicate (src, tgt) pairs removed via hashing.

Dataset link: https://huggingface.co/datasets/Helsinki-NLP/tatoeba
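The hash-based deduplication mentioned above could look like the following minimal sketch (the card does not specify the exact procedure; the hash function and keying are assumptions):

```python
import hashlib

def dedup_pairs(pairs):
    """Drop exact duplicate (src, tgt) pairs, keeping the first occurrence."""
    seen, unique = set(), []
    for src, tgt in pairs:
        # Hash the pair jointly so (a, b) and (a, c) are kept as distinct examples.
        key = hashlib.sha256(f"{src}\t{tgt}".encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append((src, tgt))
    return unique

pairs = [
    ("Hola.", "Olá."),
    ("Hola.", "Olá."),          # exact duplicate, dropped
    ("Gracias.", "Obrigado."),
]
print(dedup_pairs(pairs))       # two unique pairs remain
```

Hashing the concatenated pair (rather than source alone) removes only exact duplicates while preserving legitimate one-to-many translations.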
## Evaluation
Metric: chrF, computed on model generations (beam search, beam=5)
| Split | chrF |
|---|---|
| Validation | 70.6691 |
| Test | 70.4862 |
Note: chrF is based on character n-gram overlap, which makes it a suitable adequacy metric for closely related Romance language pairs such as ES↔PT.
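To make the metric concrete, here is a simplified sentence-level chrF (character n-grams up to order 6, beta=2, whitespace stripped, no smoothing). This is an illustrative sketch, not the exact scorer used to produce the table; in practice a library implementation such as sacrebleu's CHRF is used:

```python
from collections import Counter

def char_ngrams(text, n):
    # Character n-grams with whitespace removed, as in chrF's default setup.
    s = text.replace(" ", "")
    return Counter(s[i : i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level chrF in [0, 100]; recall is weighted by beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r) * 100

print(chrf("As pessoas dizem que estou louco.",
           "As pessoas dizem que estou louco."))  # identical strings score 100.0
```

A chrF around 70, as in the table above, indicates substantial character-level overlap between hypotheses and references.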
## How to use

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "liansheng06/bart-tatoeba-es-pt"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = "Las personas dicen que estoy loco."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
outputs = model.generate(**inputs, num_beams=5, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Demo
Try the interactive Gradio demo here:
https://huggingface.co/spaces/liansheng06/ATA-Assignment2