BART (From Scratch): Spanish to Portuguese (ES to PT)

This repository contains an encoder–decoder Transformer (BART-style) trained from scratch for Spanish-to-Portuguese translation on the Helsinki-NLP/Tatoeba dataset (es-pt).

Model details

  • Task: Machine Translation (ES to PT)
  • Architecture: BART-style encoder–decoder Transformer (trained from scratch)
  • Tokenizer: Subword BPE (32k vocab)
  • Max sequence length: 256 (source) / 256 (target)
  • Decoding used for evaluation: Beam search (beam=5)

Architecture summary

| Component | Value |
| --- | --- |
| vocab size | 32,000 |
| d_model | 512 |
| encoder layers | 6 |
| decoder layers | 6 |
| attention heads | 8 |
| FFN dim | 2048 |
| dropout | 0.1 |
| parameters | ~61.6M |

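As a sanity check, the ~61.6M figure can be approximated from the table above with a back-of-the-envelope count. This is a sketch, not the training code: it assumes tied input/output embeddings, learned positional embeddings, and standard BART layer shapes, so the exact total will differ slightly depending on implementation details.

```python
# Rough parameter count from the architecture table above.
# Assumes tied token embeddings and standard BART-style layers;
# extra LayerNorms / positional offsets shift the exact total.
d, ffn, vocab, n_pos = 512, 2048, 32_000, 256
enc_layers, dec_layers = 6, 6

attn = 4 * (d * d + d)               # q, k, v, out projections (+ biases)
ffn_block = 2 * d * ffn + ffn + d    # two linear layers (+ biases)
ln = 2 * d                           # LayerNorm weight + bias

enc_layer = attn + ffn_block + 2 * ln        # self-attn + FFN
dec_layer = 2 * attn + ffn_block + 3 * ln    # self-attn + cross-attn + FFN

total = (vocab * d                   # shared token embeddings
         + n_pos * d                 # learned positional embeddings
         + enc_layers * enc_layer
         + dec_layers * dec_layer)

print(f"~{total / 1e6:.1f}M parameters")  # close to the reported ~61.6M
```

The small remaining gap versus the reported count comes from implementation-specific parameters (e.g. additional LayerNorms and positional-embedding padding) not modeled here.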
Dataset

  • Dataset: Helsinki-NLP/tatoeba
  • Config / language pair: es-pt
  • Splits used: official train, validation, test
  • Train/Val/Test sizes: 63,716 / 1,998 / 1,999
  • Leakage prevention: tokenizer trained on training split only; duplicate (src,tgt) pairs removed via hashing.

Dataset link: https://huggingface.co/datasets/Helsinki-NLP/tatoeba
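The hash-based deduplication of (src, tgt) pairs mentioned above can be sketched as follows. This is a minimal illustration; the actual preprocessing script may differ.

```python
import hashlib

def dedup_pairs(pairs):
    """Remove duplicate (src, tgt) pairs, keeping the first occurrence."""
    seen = set()
    unique = []
    for src, tgt in pairs:
        # Hash the pair; the \t separator keeps "ab"+"c" distinct from "a"+"bc"
        key = hashlib.sha256(f"{src}\t{tgt}".encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append((src, tgt))
    return unique

pairs = [
    ("Hola.", "Olá."),
    ("Hola.", "Olá."),        # exact duplicate, dropped
    ("Gracias.", "Obrigado."),
]
print(dedup_pairs(pairs))  # two unique pairs remain
```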

Evaluation

Metric: chrF (generation-based)

| Split | chrF |
| --- | --- |
| Validation | 70.6691 |
| Test | 70.4862 |

Note: chrF is computed over character n-grams, which makes it robust to morphological variation and well suited to closely related Romance language pairs such as ES↔PT.
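To make the metric concrete, here is a simplified chrF sketch in pure Python. The evaluation above was presumably produced with a standard implementation (e.g. sacrebleu), which differs in detail — in particular, it macro-averages F-scores per n-gram order — so treat this only as an illustration of the character n-gram idea.

```python
from collections import Counter

def char_ngrams(text, n):
    # Multiset of character n-grams of the string
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hyp, ref, max_n=6, beta=2.0):
    """Simplified chrF: F-beta over averaged char n-gram precision/recall."""
    precs, recs = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if not h or not r:
            continue
        overlap = sum((h & r).values())  # clipped n-gram matches
        precs.append(overlap / sum(h.values()))
        recs.append(overlap / sum(r.values()))
    if not precs:
        return 0.0
    p, r = sum(precs) / len(precs), sum(recs) / len(recs)
    if p + r == 0:
        return 0.0
    # beta=2 weights recall twice as heavily as precision
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)

print(chrf("olá mundo", "olá mundo"))  # identical strings score 100.0
```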

How to use

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "liansheng06/bart-tatoeba-es-pt"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Spanish source sentence
text = "Las personas dicen que estoy loco."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

# Beam search decoding, matching the evaluation setup (beam=5)
outputs = model.generate(**inputs, num_beams=5, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Demo

Try the interactive Gradio demo here:
https://huggingface.co/spaces/liansheng06/ATA-Assignment2
