---
library_name: transformers
license: apache-2.0
base_model: google/pegasus-xsum
datasets:
- eilamc14/wikilarge-clean
language:
- en
tags:
- pegasus
- text-simplification
- WikiLarge
model-index:
- name: pegasus-xsum-text-simplification
results:
- task:
type: text2text-generation
name: Text Simplification
dataset:
name: ASSET
type: facebook/asset
url: https://huggingface.co/datasets/facebook/asset
split: test
metrics:
- type: SARI
value: 33.80
- type: FKGL
value: 9.23
- type: BERTScore
value: 87.54
- type: LENS
value: 62.46
- type: Identical ratio
value: 0.29
- type: Identical ratio (ci)
value: 0.29
- task:
type: text2text-generation
name: Text Simplification
dataset:
name: MEDEASI
type: cbasu/Med-EASi
url: https://huggingface.co/datasets/cbasu/Med-EASi
split: test
metrics:
- type: SARI
value: 32.68
- type: FKGL
value: 10.98
- type: BERTScore
value: 45.14
- type: LENS
value: 50.55
- type: Identical ratio
value: 0.30
- type: Identical ratio (ci)
value: 0.30
- task:
type: text2text-generation
name: Text Simplification
dataset:
name: OneStopEnglish
type: OneStopEnglish
url: https://github.com/nishkalavallabhi/OneStopEnglishCorpus
split: advanced→elementary
metrics:
- type: SARI
value: 37.07
- type: FKGL
value: 8.66
- type: BERTScore
value: 77.77
- type: LENS
value: 60.97
- type: Identical ratio
value: 0.40
- type: Identical ratio (ci)
value: 0.40
---
# Model Card for pegasus-xsum-text-simplification
This is one of the models fine-tuned for text simplification as part of the [Simplify This](https://github.com/eilamc14/Simplify-This) project.
## Model Details
### Model Description
Fine-tuned **sequence-to-sequence (encoder–decoder) Transformer** for **English text simplification**.
Trained on the dataset **`eilamc14/wikilarge-clean`** (cleaned WikiLarge-style pairs).
- **Model type:** Seq2Seq Transformer (encoder–decoder)
- **Language (NLP):** English
- **License:** `apache-2.0`
- **Finetuned from model:** `google/pegasus-xsum`
### Model Sources
- **Repository (code):** https://github.com/eilamc14/Simplify-This
- **Dataset:** https://huggingface.co/datasets/eilamc14/wikilarge-clean
## Uses
### Direct Use
The model is intended for **English text simplification**.
- **Input format:** `Simplify: <complex sentence>`
- **Output:** `<simplified sentence>`
**Typical uses**
- Research on automatic text simplification
- Benchmarking against other simplification systems
- Demos/prototypes that require simpler English rewrites
### Downstream Use
This repository already contains a **fine-tuned** model specialized for text simplification.
Further fine-tuning is **optional** and mainly relevant when:
- Adapting to a markedly different domain (e.g., medical/legal/news)
- Addressing specific failure modes (e.g., over/under-simplification, factual drops)
- Distilling/quantizing for deployment constraints
When fine-tuning further, keep the same input convention: `Simplify: <...>`.
### Out-of-Scope Use
Not intended for:
- Tasks unrelated to simplification (dialogue, translation, etc.)
- Production use without additional safety filtering (no toxicity/bias mitigation)
- Languages other than English
- High-stakes settings (legal/medical advice, safety-critical decisions)
## Bias, Risks, and Limitations
The model was trained on **Wikipedia and Simple English Wikipedia** alignments (via WikiLarge).
As a result, it inherits the characteristics and limitations of this data:
- **Domain bias:** Simplifications may reflect encyclopedic style; performance may degrade on informal, technical, or domain-specific text (e.g., medical/legal/news).
- **Content bias:** Wikipedia content itself contains biases in coverage, cultural perspective, and phrasing. Simplified outputs may reflect or amplify these.
- **Simplification quality:** The model may:
- Over-simplify (drop important details)
- Under-simplify (retain complex phrasing)
- Produce ungrammatical or awkward rephrasings
- **Language limitation:** Only suitable for English. Applying to other languages is unsupported.
- **Safety limitation:** The model has not been aligned to avoid toxic, biased, or harmful content. If the input text contains such content, the output may reproduce or modify it without safeguards.
### Recommendations
- **Evaluation required:** Always evaluate the model in the target domain before deployment. Benchmark simplification quality (e.g., with SARI, FKGL, BERTScore, LENS, human evaluation).
- **Human oversight:** Use human-in-the-loop review for applications where meaning preservation is critical (education, accessibility tools, etc.).
- **Attribution:** Preserve source attribution where required (Wikipedia → CC BY-SA).
- **Not for high-stakes use:** Avoid legal, medical, or safety-critical applications without extensive validation and domain adaptation.
## How to Get Started with the Model
Load the model and tokenizer directly from the Hugging Face Hub:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model_id = "eilamc14/pegasus-xsum-text-simplification"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
# Example input
PREFIX = "Simplify: "
text = "The committee deemed the proposal unnecessarily complicated."
# Tokenize and generate
inputs = tokenizer(PREFIX + text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Details
### Training Data
The [WikiLarge-clean](https://huggingface.co/datasets/eilamc14/wikilarge-clean) dataset (cleaned WikiLarge-style complex–simple pairs).
### Training Procedure
- **Hardware:** NVIDIA L4 GPU on Google Colab
- **Objective:** Standard sequence-to-sequence cross-entropy loss
- **Training type:** Full fine-tuning of all parameters (no LoRA/PEFT used)
- **Batching:** Dynamic padding with Hugging Face `Trainer` / PyTorch DataLoader
- **Evaluation:** Monitored on the `validation` split with SARI and identical_ratio (see the metric sketch below)
- **Stopping criteria:** Early stopping callback based on validation performance
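A `compute_metrics` hook along these lines can reproduce the SARI monitoring (a minimal sketch using the `evaluate` library; the original function was not preserved, and the identical-ratio helper is shown later under Metrics):
```python
import evaluate

sari = evaluate.load("sari")

def build_compute_metrics(tokenizer, sources, references):
    # sources: complex sentences of the validation split
    # references: one list of reference simplifications per source
    def compute_metrics(eval_preds):
        preds, _ = eval_preds
        if isinstance(preds, tuple):
            preds = preds[0]
        # With predict_with_generate=True, preds are generated token ids
        decoded = tokenizer.batch_decode(preds, skip_special_tokens=True)
        score = sari.compute(sources=sources, predictions=decoded, references=references)
        return {"sari": score["sari"]}
    return compute_metrics
```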
#### Preprocessing
The dataset was preprocessed by prefixing each source sentence with **"Simplify: "** and tokenizing both the source (inputs) and target (labels).
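A preprocessing function in this spirit is sketched below; the `complex`/`simple` column names and the 128-token truncation length are assumptions, not taken from the original script:
```python
from datasets import load_dataset
from transformers import AutoTokenizer

PREFIX = "Simplify: "
tokenizer = AutoTokenizer.from_pretrained("google/pegasus-xsum")
dataset = load_dataset("eilamc14/wikilarge-clean")

def preprocess(batch):
    # Prefix each source sentence, then tokenize sources (inputs) and targets (labels)
    model_inputs = tokenizer(
        [PREFIX + s for s in batch["complex"]],  # column name is an assumption
        truncation=True,
        max_length=128,                          # truncation length is an assumption
    )
    labels = tokenizer(text_target=batch["simple"], truncation=True, max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset["train"].column_names)
```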
#### Memory & Checkpointing
To reduce VRAM during training, gradient checkpointing was enabled and the KV cache was disabled:
```python
model.config.use_cache = False # required when using gradient checkpointing
model.gradient_checkpointing_enable() # saves memory at the cost of extra compute
```
**Notes**
- Disabling `use_cache` avoids warnings/conflicts with gradient checkpointing and reduces memory usage in the forward pass.
- Gradient checkpointing trades **GPU memory ↓** for **training speed ↓** (extra recomputation).
- For **inference/evaluation**, re-enable the cache for faster generation:
```python
model.config.use_cache = True
```
#### Training Hyperparameters
The models were trained with Hugging Face `Seq2SeqTrainingArguments`.
Hyperparameters varied slightly across models and runs during optimization, and full logs (batch size, steps, exact learning-rate schedule) were not preserved.
Below are the **typical defaults** used (a reconstructed configuration sketch follows the list):
- **Epochs:** 5
- **Evaluation strategy:** every 300 steps
- **Save strategy:** every 300 steps (keep best model, `eval_loss` as criterion)
- **Learning rate:** ~3e-5
- **Batch size:** ~8-64, depending on model size
- **Optimizer:** `adamw_torch_fused`
- **Precision:** bf16
- **Generation config (during eval):** `max_length=128`, `num_beams=4`, `predict_with_generate=True`
- **Other settings:**
- Weight decay: 0.01
- Label smoothing: 0.1
- Warmup ratio: 0.1
- Max grad norm: 0.5
- Dataloader workers: 8 (L4 GPU)
> Because hyperparameters were adjusted between runs and not all were logged, exact reproduction may differ slightly.
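For reference, a `Seq2SeqTrainingArguments` setup reconstructing these defaults is sketched below, continuing from the preprocessing example above. It is an approximation, not the original script: the output directory, batch size, and early-stopping patience are assumptions, and exact values differed between runs.
```python
from transformers import (
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    EarlyStoppingCallback,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = AutoModelForSeq2SeqLM.from_pretrained("google/pegasus-xsum")
model.config.use_cache = False         # see "Memory & Checkpointing"
model.gradient_checkpointing_enable()

training_args = Seq2SeqTrainingArguments(
    output_dir="pegasus-xsum-text-simplification",  # assumption
    num_train_epochs=5,
    learning_rate=3e-5,
    per_device_train_batch_size=16,    # actual value depended on model size (~8-64)
    per_device_eval_batch_size=16,
    eval_strategy="steps",
    eval_steps=300,
    save_strategy="steps",
    save_steps=300,
    load_best_model_at_end=True,       # keep best model, eval_loss as criterion
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    weight_decay=0.01,
    label_smoothing_factor=0.1,
    warmup_ratio=0.1,
    max_grad_norm=0.5,
    optim="adamw_torch_fused",
    bf16=True,
    predict_with_generate=True,
    generation_max_length=128,
    generation_num_beams=4,
    dataloader_num_workers=8,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],      # from the preprocessing sketch
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
    # compute_metrics could be wired in as in the sketch under "Training Procedure"
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # patience is an assumption
)
trainer.train()
```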
## Evaluation
### Testing Data
- [**ASSET**](https://huggingface.co/datasets/facebook/asset) (test subset)
- [**MEDEASI**](https://huggingface.co/datasets/cbasu/Med-EASi) (test subset)
- [**OneStopEnglish**](https://github.com/nishkalavallabhi/OneStopEnglishCorpus) (advanced → elementary)
### Metrics
- **Identical ratio** — share of outputs identical to the source after basic, language-agnostic normalization (strip, NFKC, collapse whitespace); see the sketch after this list
- **Identical ratio (ci)** — case-insensitive variant of the identical ratio
- **SARI** — main simplification metric (higher is better)
- **FKGL** — readability grade level (lower is simpler)
- **BERTScore (F1)** — semantic similarity (higher is better)
- **LENS** — composite simplification quality score (higher is better)
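The identical-ratio computation can be reproduced with a few lines (a sketch of the normalization described above, not the original evaluation code):
```python
import re
import unicodedata

def normalize(text: str) -> str:
    # Strip, apply NFKC normalization, and collapse whitespace runs to single spaces
    return re.sub(r"\s+", " ", unicodedata.normalize("NFKC", text.strip()))

def identical_ratio(sources, outputs, case_insensitive=False):
    def norm(t):
        t = normalize(t)
        return t.lower() if case_insensitive else t
    same = sum(norm(s) == norm(o) for s, o in zip(sources, outputs))
    return same / len(sources)
```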
### Generation Arguments
```python
gen_args = dict(
max_new_tokens=64,
num_beams=4,
length_penalty=1.0,
no_repeat_ngram_size=3,
early_stopping=True,
do_sample=False,
)
```
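These arguments can be passed straight to `model.generate`, e.g. reusing the tokenizer and model from the quick-start snippet above:
```python
text = "The committee deemed the proposal unnecessarily complicated."
inputs = tokenizer("Simplify: " + text, return_tensors="pt")
outputs = model.generate(**inputs, **gen_args)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```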
### Results
| Dataset | Identical ratio | Identical ratio (ci) | SARI | FKGL | BERTScore | LENS |
|--------------------|----------------:|---------------------:|------:|-----:|----------:|------:|
| **ASSET** | 0.29 | 0.29 | 33.80 | 9.23 | 87.54 | 62.46 |
| **MEDEASI** | 0.30 | 0.30 | 32.68 | 10.98| 45.14 | 50.55 |
| **OneStopEnglish** | 0.40 | 0.40 | 37.07 | 8.66 | 77.77 | 60.97 |
## Environmental Impact
- **Hardware Type:** Single NVIDIA L4 GPU (Google Colab)
- **Hours used:** Approx. 5–10
- **Cloud Provider:** Google Cloud (via Colab)
- **Compute Region:** Unknown (Google Colab dynamic allocation)
- **Carbon Emitted:** Estimated to be low (at most a few kg CO₂eq), since training was limited to a single GPU for a small number of hours.
## Citation
**BibTeX:**
[More Information Needed]
**APA:**
[More Information Needed]