A binary classifier fine-tuned on the WildGuardMix dataset to detect harmful or unsafe prompts.
Built on answerdotai/ModernBERT-base with flash attention for efficient inference.
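The card does not include usage code. Below is a minimal inference sketch. It assumes the checkpoint loads with Hugging Face `transformers` as a standard sequence-classification model, with logit index 0 = safe and index 1 = harmful, matching the card's label convention. `MODEL_ID` is a hypothetical placeholder because the card does not give the repository id. `logits_to_prediction` and `classify` are illustrative helpers, not part of any published API.

```python
import math
from typing import Dict, List

# Hypothetical placeholder: the card does not state this model's repository id.
MODEL_ID = "your-org/modernbert-wildguardmix-classifier"

# Label convention from the model card (assumed to match the logit order).
ID2LABEL: Dict[int, str] = {0: "safe", 1: "harmful"}


def logits_to_prediction(logits: List[float]) -> Dict[str, object]:
    """Map two class logits [safe, harmful] to a label and P(harmful)."""
    # Numerically stable softmax over the two logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    label_id = max(range(len(probs)), key=lambda i: probs[i])
    return {"label_id": label_id, "label": ID2LABEL[label_id], "p_harmful": probs[1]}


def classify(prompts: List[str]) -> List[Dict[str, object]]:
    """Run the classifier via transformers (imported lazily so the helper above works without it)."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
    model.eval()
    inputs = tokenizer(prompts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (batch_size, 2)
    return [logits_to_prediction(row) for row in logits.tolist()]


# Example (requires torch, transformers, and a real model id):
# print(classify(["How do I bake sourdough bread?"]))
```

To use Flash Attention as the card describes, `from_pretrained` also accepts `attn_implementation="flash_attention_2"`. This needs the `flash-attn` package, a supported GPU, and half-precision weights (for example `torch_dtype=torch.bfloat16`). Loading ModernBERT also requires a `transformers` release that includes it.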
Output labels:

- `1` → Harmful / Unsafe
- `0` → Safe / Non-harmful

Reported evaluation metrics:

| Metric | Score |
|---|---|
| Accuracy | 95.9% |
| F1 Score | 96.21% |
| Precision | 96.39% |
| Recall | 96.21% |
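
The card does not say whether these scores refer to the harmful class alone or are averaged across both classes (for example macro or weighted averaging). As a point of reference, here is a sketch of the per-class definitions, assuming label `1` (harmful) is the positive class. `binary_metrics` is a hypothetical helper, and the toy labels in the example are illustrative only, not the model's evaluation data.

```python
from typing import Dict, Sequence


def binary_metrics(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Dict[str, float]:
    """Accuracy plus precision, recall and F1 for a single positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    # Confusion-matrix counts with respect to the positive (harmful) class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy example: 3 true positives, 1 false positive, 1 false negative, 3 true negatives.
print(binary_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 1, 0, 1, 0]))
```

Under this per-class definition, F1 always lies between precision and recall. Macro or weighted averages instead combine the per-class scores of both labels, so they can relate to each other differently.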

Training details:

- Training data: allenai/wildguardmix (wildguardtrain subset)
- Learning rate: 1e-4 (cosine schedule, 10% warmup)
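The card names only the peak learning rate, a cosine schedule, and 10% warmup. The sketch below shows what that schedule commonly looks like. It assumes a linear warmup from 0 to the peak over the first 10% of steps, followed by a single half-cosine decay to 0. The exact shape used in training is not stated, and `lr_at_step` is a hypothetical helper.

```python
import math


def lr_at_step(
    step: int, total_steps: int, peak_lr: float = 1e-4, warmup_ratio: float = 0.1
) -> float:
    """Assumed schedule: linear warmup to peak_lr, then cosine decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr during the warmup phase.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


# With 1,000 total steps: warmup ends at step 100, and the rate halves midway through the decay.
for s in (0, 50, 100, 550, 1000):
    print(s, lr_at_step(s, 1000))
```

In `transformers`, `get_cosine_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps)` produces this same shape. With the `Trainer` API it corresponds to `lr_scheduler_type="cosine"` and `warmup_ratio=0.1` in `TrainingArguments`, though the card does not say which training framework was used.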
⚠️ Disclaimer:
This model should not be deployed in production systems without additional evaluation and alignment with domain-specific safety and ethical guidelines.