iitb-en-indic-without-punct
This model is a fine-tuned version of ai4bharat/indictrans2-en-indic-dist-200M designed to improve punctuation robustness in English-to-Marathi machine translation.
It was introduced in the paper Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation.
Model Description
Traditional machine translation systems often struggle when punctuation is missing or ambiguous in the source text. This checkpoint corresponds to Approach 2 in the associated paper: the IndicTrans2 model was fine-tuned directly on the IITB-ENG-MAR dataset with all punctuation removed from the English source side. Training on unpunctuated input lets the model learn, implicitly, the context needed to resolve semantic and structural ambiguities when punctuation is absent.
- Paper: Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation
- Repository: Viram_Marathi (GitHub)
- Demo: Punctuation Robust Translation Demo
- Base Model: ai4bharat/indictrans2-en-indic-dist-200M
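The source-side punctuation removal described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing script: the card does not state which characters were treated as punctuation, so the sketch assumes the ASCII set in Python's `string.punctuation` and deletes each character outright (so "Let's" becomes "Lets"). The helper name `strip_punctuation` is hypothetical.

```python
import string

# Assumption: "punctuation" means the ASCII characters in string.punctuation.
# The exact set used to build the training data is not given in the card.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text: str) -> str:
    """Delete ASCII punctuation and collapse the whitespace left behind."""
    no_punct = text.translate(_PUNCT_TABLE)
    return " ".join(no_punct.split())


print(strip_punctuation("Let's eat, Grandma!"))  # Lets eat Grandma
```

Only the English side would be processed this way; the Marathi targets keep their punctuation, so the model still learns to produce punctuated output.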
Intended Uses & Limitations
This model is intended for translating English text into Marathi, particularly in scenarios where the source English text might lack proper punctuation or contain punctuation-induced ambiguities.
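A hedged inference sketch follows. It assumes this checkpoint loads the same way as the base IndicTrans2 models: through `transformers` with `trust_remote_code=True`, with `IndicProcessor` from the separate IndicTransToolkit package handling pre- and post-processing. It also assumes the repository ships its own tokenizer files; if it does not, loading the tokenizer from the base model may be needed. The function name `translate` and the decoding settings (`num_beams`, `max_length`) are illustrative choices, not values reported by the authors.

```python
MODEL_ID = "thenlpresearcher/iitb-en-indic-without-punct"
SRC_LANG = "eng_Latn"  # language tag for English
TGT_LANG = "mar_Deva"  # language tag for Marathi


def translate(sentences, model_id=MODEL_ID, max_length=256, num_beams=5):
    """Translate a list of English sentences into Marathi (sketch)."""
    # Heavy dependencies are imported lazily so the module loads without them.
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
    from IndicTransToolkit.processor import IndicProcessor  # assumed toolkit

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForSeq2SeqLM.from_pretrained(
        model_id, trust_remote_code=True
    ).to(device)
    model.eval()

    # Add language tags and normalise the input text.
    processor = IndicProcessor(inference=True)
    batch = processor.preprocess_batch(
        sentences, src_lang=SRC_LANG, tgt_lang=TGT_LANG
    )
    inputs = tokenizer(
        batch, truncation=True, padding="longest", return_tensors="pt"
    ).to(device)

    with torch.no_grad():
        generated = model.generate(
            **inputs, max_length=max_length, num_beams=num_beams
        )

    decoded = tokenizer.batch_decode(
        generated, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )
    # Undo the preprocessing on the Marathi side.
    return processor.postprocess_batch(decoded, lang=TGT_LANG)
```

With the dependencies installed, a call such as `translate(["lets eat grandma"])` would return a list containing one Marathi string per input sentence.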
Training and Evaluation Data
The model was fine-tuned on a modified version of the IIT Bombay English-Marathi parallel corpus in which punctuation was removed from the English source sentences.
It was evaluated on the Virām (formerly PEM) diagnostic benchmark, which consists of 54 manually curated, punctuation-ambiguous instances designed to assess how well MT systems preserve meaning when punctuation is varied or missing.
Training Procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 8
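For readers who want to reproduce a similar run, the hyperparameters above map onto Hugging Face `Seq2SeqTrainingArguments` keyword arguments roughly as shown below. This is a sketch under assumptions: the card does not confirm that `Seq2SeqTrainer` was used, the listed batch sizes are taken to be per-device values, and the `output_dir` default and the helper name `build_training_args` are hypothetical. Settings not listed in the card, such as warmup or gradient accumulation, are left at library defaults.

```python
# Hyperparameters from the card, keyed by TrainingArguments parameter names.
TRAINING_KWARGS = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,  # assumed per-device
    "per_device_eval_batch_size": 16,   # assumed per-device
    "seed": 42,
    "optim": "adamw_torch",
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 8,
}


def build_training_args(output_dir="./en-mr-without-punct"):
    """Build training arguments from the card's values (output_dir is hypothetical)."""
    # Imported lazily so the dictionary can be inspected without transformers.
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```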
Evaluation Results
| Metric | Value |
|---|---|
| BLEU | 10.1304 |
| chrF++ | 32.6831 |
| COMET | 0.5427 |
| Loss | 0.3722 |
Framework versions
- Transformers 4.53.2
- PyTorch 2.4.0a0+f70bd71a48.nv24.06
- Datasets 2.21.0
- Tokenizers 0.21.4
Citation
If you use this model or the Virām benchmark, please cite:
@article{shejole2025assessing,
title={Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation},
author={Shejole, Kaustubh and others},
journal={arXiv preprint arXiv:2601.09725},
year={2025}
}