|
|
--- |
|
|
library_name: transformers |
|
|
license: apache-2.0 |
|
|
base_model: google/pegasus-xsum |
|
|
datasets: |
|
|
- eilamc14/wikilarge-clean |
|
|
language: |
|
|
- en |
|
|
tags: |
|
|
- pegasus |
|
|
- text-simplification |
|
|
- WikiLarge |
|
|
model-index: |
|
|
- name: pegasus-xsum-text-simplification |
|
|
results: |
|
|
- task: |
|
|
type: text2text-generation |
|
|
name: Text Simplification |
|
|
dataset: |
|
|
name: ASSET |
|
|
type: facebook/asset |
|
|
url: https://huggingface.co/datasets/facebook/asset |
|
|
split: test |
|
|
metrics: |
|
|
- type: SARI |
|
|
value: 33.80 |
|
|
- type: FKGL |
|
|
value: 9.23 |
|
|
- type: BERTScore |
|
|
value: 87.54 |
|
|
- type: LENS |
|
|
value: 62.46 |
|
|
- type: Identical ratio |
|
|
value: 0.29 |
|
|
- type: Identical ratio (ci) |
|
|
value: 0.29 |
|
|
|
|
|
- task: |
|
|
type: text2text-generation |
|
|
name: Text Simplification |
|
|
dataset: |
|
|
name: MEDEASI |
|
|
type: cbasu/Med-EASi |
|
|
url: https://huggingface.co/datasets/cbasu/Med-EASi |
|
|
split: test |
|
|
metrics: |
|
|
- type: SARI |
|
|
value: 32.68 |
|
|
- type: FKGL |
|
|
value: 10.98 |
|
|
- type: BERTScore |
|
|
value: 45.14 |
|
|
- type: LENS |
|
|
value: 50.55 |
|
|
- type: Identical ratio |
|
|
value: 0.30 |
|
|
- type: Identical ratio (ci) |
|
|
value: 0.30 |
|
|
|
|
|
- task: |
|
|
type: text2text-generation |
|
|
name: Text Simplification |
|
|
dataset: |
|
|
name: OneStopEnglish |
|
|
type: OneStopEnglish |
|
|
url: https://github.com/nishkalavallabhi/OneStopEnglishCorpus |
|
|
split: advanced→elementary |
|
|
metrics: |
|
|
- type: SARI |
|
|
value: 37.07 |
|
|
- type: FKGL |
|
|
value: 8.66 |
|
|
- type: BERTScore |
|
|
value: 77.77 |
|
|
- type: LENS |
|
|
value: 60.97 |
|
|
- type: Identical ratio |
|
|
value: 0.40 |
|
|
- type: Identical ratio (ci) |
|
|
value: 0.40 |
|
|
--- |
|
|
|
|
|
# Model Card for pegasus-xsum-text-simplification |
|
|
|
|
|
This is one of the models fine-tuned for text simplification as part of the [Simplify This](https://github.com/eilamc14/Simplify-This) project. |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
Fine-tuned **sequence-to-sequence (encoder–decoder) Transformer** for **English text simplification**. |
|
|
Trained on the dataset **`eilamc14/wikilarge-clean`** (cleaned WikiLarge-style pairs). |
|
|
|
|
|
- **Model type:** Seq2Seq Transformer (encoder–decoder) |
|
|
- **Language (NLP):** English |
|
|
- **License:** `apache-2.0` |
|
|
- **Finetuned from model:** `google/pegasus-xsum` |
|
|
|
|
|
### Model Sources |
|
|
|
|
|
- **Repository (code):** https://github.com/eilamc14/Simplify-This |
|
|
- **Dataset:** https://huggingface.co/datasets/eilamc14/wikilarge-clean |
|
|
- **Paper:** — |
|
|
- **Demo:** — |
|
|
|
|
|
## Uses |
|
|
|
|
|
### Direct Use |
|
|
|
|
|
The model is intended for **English text simplification**. |
|
|
|
|
|
- **Input format:** `Simplify: <complex sentence>` |
|
|
- **Output:** `<simplified sentence>` |
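
For instance, a minimal inference sketch with the `transformers` pipeline API (the repo id below is taken from this card's model-index name):

```python
from transformers import pipeline

# Repo id assumed from this card's model-index name
simplifier = pipeline(
    "text2text-generation",
    model="eilamc14/pegasus-xsum-text-simplification",
)

result = simplifier(
    "Simplify: The committee deemed the proposal unnecessarily complicated.",
    max_new_tokens=64,
    num_beams=4,
)
print(result[0]["generated_text"])
```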
|
|
|
|
|
**Typical uses** |
|
|
- Research on automatic text simplification |
|
|
- Benchmarking against other simplification systems |
|
|
- Demos/prototypes that require simpler English rewrites |
|
|
|
|
|
### Downstream Use |
|
|
|
|
|
This repository already contains a **fine-tuned** model specialized for text simplification. |
|
|
|
|
|
Further fine-tuning is **optional** and mainly relevant when: |
|
|
- Adapting to a markedly different domain (e.g., medical/legal/news) |
|
|
- Addressing specific failure modes (e.g., over/under-simplification, factual drops) |
|
|
- Distilling/quantizing for deployment constraints |
|
|
|
|
|
When fine-tuning further, keep the same input convention: `Simplify: <...>`. |
|
|
|
|
|
### Out-of-Scope Use |
|
|
|
|
|
Not intended for: |
|
|
- Tasks unrelated to simplification (dialogue, translation, etc.) |
|
|
- Production use without additional safety filtering (no toxicity/bias mitigation) |
|
|
- Languages other than English |
|
|
- High-stakes settings (legal/medical advice, safety-critical decisions) |
|
|
|
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
|
|
The model was trained on **Wikipedia and Simple English Wikipedia** alignments (via WikiLarge). |
|
|
As a result, it inherits the characteristics and limitations of this data: |
|
|
|
|
|
- **Domain bias:** Simplifications may reflect encyclopedic style; performance may degrade on informal, technical, or domain-specific text (e.g., medical/legal/news). |
|
|
- **Content bias:** Wikipedia content itself contains biases in coverage, cultural perspective, and phrasing. Simplified outputs may reflect or amplify these. |
|
|
- **Simplification quality:** The model may: |
|
|
- Over-simplify (drop important details) |
|
|
- Under-simplify (retain complex phrasing) |
|
|
- Produce ungrammatical or awkward rephrasings |
|
|
- **Language limitation:** Only suitable for English. Applying to other languages is unsupported. |
|
|
- **Safety limitation:** The model has not been aligned to avoid toxic, biased, or harmful content. If the input text contains such content, the output may reproduce or modify it without safeguards. |
|
|
|
|
|
|
|
|
### Recommendations |
|
|
|
|
|
- **Evaluation required:** Always evaluate the model in the target domain before deployment. Benchmark simplification quality (e.g., with SARI, FKGL, BERTScore, LENS, or human evaluation); a minimal SARI example follows this list. |
|
|
- **Human oversight:** Use human-in-the-loop review for applications where meaning preservation is critical (education, accessibility tools, etc.). |
|
|
- **Attribution:** Preserve source attribution where required (Wikipedia → CC BY-SA). |
|
|
- **Not for high-stakes use:** Avoid legal, medical, or safety-critical applications without extensive validation and domain adaptation. |
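
As an illustration, a minimal SARI evaluation sketch using the Hugging Face `evaluate` library (the example sentences are hypothetical):

```python
import evaluate

sari = evaluate.load("sari")

sources = ["The committee deemed the proposal unnecessarily complicated."]
predictions = ["The committee thought the proposal was too complicated."]
# Each source may have several human-written simplifications as references
references = [[
    "The committee thought the proposal was too complicated.",
    "The committee found the plan too complex.",
]]

print(sari.compute(sources=sources, predictions=predictions, references=references))
```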
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
Load the model and tokenizer directly from the Hugging Face Hub: |
|
|
|
|
|
```python |
|
|
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer |
|
|
|
|
|
model_id = "eilamc14/pegasus-xsum-text-simplification" |
|
|
tokenizer = AutoTokenizer.from_pretrained(model_id) |
|
|
model = AutoModelForSeq2SeqLM.from_pretrained(model_id) |
|
|
|
|
|
# Example input |
|
|
PREFIX = "Simplify: " |
|
|
text = "The committee deemed the proposal unnecessarily complicated." |
|
|
|
|
|
# Tokenize and generate |
|
|
inputs = tokenizer(PREFIX+text, return_tensors="pt") |
|
|
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4) |
|
|
print(tokenizer.decode(outputs[0], skip_special_tokens=True)) |
|
|
``` |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
[WikiLarge-clean](https://huggingface.co/datasets/eilamc14/wikilarge-clean) Dataset |
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
- **Hardware:** NVIDIA L4 GPU on Google Colab |
|
|
- **Objective:** Standard sequence-to-sequence cross-entropy loss |
|
|
- **Training type:** Full fine-tuning of all parameters (no LoRA/PEFT used) |
|
|
- **Batching:** Dynamic padding with Hugging Face `Trainer` / PyTorch DataLoader |
|
|
- **Evaluation:** Monitored on the `validation` split with metrics (SARI and identical_ratio) |
|
|
- **Stopping criteria:** Early stopping callback based on validation performance (see the sketch below) |
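
A minimal sketch of how the dynamic padding and early stopping pieces are typically wired together, reusing `tokenizer` and `model` from the example above (the patience value is illustrative, not a logged setting):

```python
from transformers import DataCollatorForSeq2Seq, EarlyStoppingCallback

# Pads each batch to the length of its longest sequence instead of a fixed length
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# Stops training once the monitored validation metric stops improving
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)  # patience is illustrative

# Both are passed to Seq2SeqTrainer via data_collator=... and callbacks=[early_stopping]
```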
|
|
|
|
|
#### Preprocessing |
|
|
|
|
|
The dataset was preprocessed by prefixing each source sentence with **"Simplify: "** and tokenizing both the source (inputs) and target (labels). |
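
A minimal sketch of that step, reusing `tokenizer` from the example above (the source/target column names `complex` and `simple` and the max lengths are assumptions for illustration):

```python
from datasets import load_dataset

dataset = load_dataset("eilamc14/wikilarge-clean")

PREFIX = "Simplify: "

def preprocess(batch):
    # Prefix each source sentence, then tokenize inputs and labels
    model_inputs = tokenizer(
        [PREFIX + s for s in batch["complex"]],
        max_length=128,
        truncation=True,
    )
    labels = tokenizer(text_target=batch["simple"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True)
```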
|
|
|
|
|
#### Memory & Checkpointing |
|
|
|
|
|
To reduce VRAM during training, gradient checkpointing was enabled and the KV cache was disabled: |
|
|
|
|
|
```python |
|
|
model.config.use_cache = False # required when using gradient checkpointing |
|
|
model.gradient_checkpointing_enable() # saves memory at the cost of extra compute |
|
|
``` |
|
|
|
|
|
**Notes** |
|
|
- Disabling `use_cache` avoids warnings/conflicts with gradient checkpointing and reduces memory usage in the forward pass. |
|
|
- Gradient checkpointing trades **GPU memory ↓** for **training speed ↓** (extra recomputation). |
|
|
- For **inference/evaluation**, re-enable the cache for faster generation: |
|
|
|
|
|
```python |
|
|
model.config.use_cache = True |
|
|
``` |
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
|
|
The models were trained with Hugging Face `Seq2SeqTrainingArguments`. |
|
|
Hyperparameters varied slightly across models and runs during optimization, and full logs (batch size, steps, exact LR schedule) were not preserved. |
|
|
Below are the **typical defaults** used: |
|
|
|
|
|
- **Epochs:** 5 |
|
|
- **Evaluation strategy:** every 300 steps |
|
|
- **Save strategy:** every 300 steps (keep best model, `eval_loss` as criterion) |
|
|
- **Learning rate:** ~3e-5 |
|
|
- **Batch size:** ~8–64, depending on model size |
|
|
- **Optimizer:** `adamw_torch_fused` |
|
|
- **Precision:** bf16 |
|
|
- **Generation config (during eval):** `max_length=128`, `num_beams=4`, `predict_with_generate=True` |
|
|
- **Other settings:** |
|
|
- Weight decay: 0.01 |
|
|
- Label smoothing: 0.1 |
|
|
- Warmup ratio: 0.1 |
|
|
- Max grad norm: 0.5 |
|
|
- Dataloader workers: 8 (L4 GPU) |
|
|
|
|
|
> Because hyperparameters were adjusted between runs and not all were logged, exact reproduction may differ slightly. |
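
For reference, a sketch of how the typical defaults above map onto `Seq2SeqTrainingArguments` (values are the approximate ones listed, not an exact reproduction; the evaluation-scheduling argument is named `eval_strategy` in recent `transformers` releases and `evaluation_strategy` in older ones):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="pegasus-xsum-text-simplification",  # illustrative
    num_train_epochs=5,
    eval_strategy="steps",
    eval_steps=300,
    save_strategy="steps",
    save_steps=300,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    learning_rate=3e-5,
    per_device_train_batch_size=32,  # varied roughly 8-64 with model size
    per_device_eval_batch_size=32,
    optim="adamw_torch_fused",
    bf16=True,
    weight_decay=0.01,
    label_smoothing_factor=0.1,
    warmup_ratio=0.1,
    max_grad_norm=0.5,
    dataloader_num_workers=8,
    predict_with_generate=True,
    generation_max_length=128,
    generation_num_beams=4,
)
```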
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Testing Data |
|
|
|
|
|
- [**ASSET**](https://huggingface.co/datasets/facebook/asset) (test subset) |
|
|
- [**MEDEASI**](https://huggingface.co/datasets/cbasu/Med-EASi) (test subset) |
|
|
- [**OneStopEnglish**](https://github.com/nishkalavallabhi/OneStopEnglishCorpus) (advanced → elementary) |
|
|
|
|
|
### Metrics |
|
|
|
|
|
- **Identical ratio** — share of outputs identical to the source after basic, language-agnostic normalization (strip, NFKC, collapse whitespace); see the sketch after this list |
|
|
- **Identical ratio (ci)** — case-insensitive variant of the identical ratio |
|
|
- **SARI** — main simplification metric (higher is better) |
|
|
- **FKGL** — readability grade level (lower is simpler) |
|
|
- **BERTScore (F1)** — semantic similarity (higher is better) |
|
|
- **LENS** — composite simplification quality score (higher is better) |
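
A minimal sketch of the normalization used for the identical-ratio metrics, implementing the steps listed above (strip, NFKC, whitespace collapse):

```python
import re
import unicodedata

def normalize(text: str, case_insensitive: bool = False) -> str:
    # Strip, apply NFKC Unicode normalization, and collapse internal whitespace
    text = unicodedata.normalize("NFKC", text.strip())
    text = re.sub(r"\s+", " ", text)
    return text.lower() if case_insensitive else text

def identical_ratio(sources, outputs, case_insensitive=False):
    matches = sum(
        normalize(s, case_insensitive) == normalize(o, case_insensitive)
        for s, o in zip(sources, outputs)
    )
    return matches / len(sources)
```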
|
|
|
|
|
### Generation Arguments |
|
|
|
|
|
```python |
|
|
gen_args = dict( |
|
|
max_new_tokens=64, |
|
|
num_beams=4, |
|
|
length_penalty=1.0, |
|
|
no_repeat_ngram_size=3, |
|
|
early_stopping=True, |
|
|
do_sample=False, |
|
|
) |
|
|
``` |
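
These arguments are passed directly to `generate`, e.g. reusing `model`, `tokenizer`, and `inputs` from the "How to Get Started" example above:

```python
outputs = model.generate(**inputs, **gen_args)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```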
|
|
|
|
|
### Results |
|
|
|
|
|
| Dataset | Identical ratio | Identical ratio (ci) | SARI | FKGL | BERTScore | LENS | |
|
|
|--------------------|----------------:|---------------------:|------:|-----:|----------:|------:| |
|
|
| **ASSET** | 0.29 | 0.29 | 33.80 | 9.23 | 87.54 | 62.46 | |
|
|
| **MEDEASI** | 0.30 | 0.30 | 32.68 | 10.98| 45.14 | 50.55 | |
|
|
| **OneStopEnglish** | 0.40 | 0.40 | 37.07 | 8.66 | 77.77 | 60.97 | |
|
|
|
|
|
|
|
|
## Environmental Impact |
|
|
|
|
|
- **Hardware Type:** Single NVIDIA L4 GPU (Google Colab) |
|
|
- **Hours used:** Approx. 5–10 |
|
|
- **Cloud Provider:** Google Cloud (via Colab) |
|
|
- **Compute Region:** Unknown (Google Colab dynamic allocation) |
|
|
- **Carbon Emitted:** Estimated to be very low (< a few kg CO₂eq), since training was limited to a single GPU for a small number of hours. |
|
|
|
|
|
## Citation |
|
|
|
|
|
**BibTeX:** |
|
|
|
|
|
[More Information Needed] |
|
|
|
|
|
**APA:** |
|
|
|
|
|
[More Information Needed] |