---
library_name: transformers
language:
- en
- fr
- it
- es
- ru
- uk
- tt
- ar
- hi
- ja
- zh
- he
- am
- de
license: openrail++
datasets:
- textdetox/multilingual_toxicity_dataset
metrics:
- f1
base_model:
- google-bert/bert-base-multilingual-cased
pipeline_tag: text-classification
tags:
- toxic
---

## Multilingual Toxicity Classifier for 15 Languages (2025)

This is an instance of [bert-base-multilingual-cased](https://huggingface.co/google-bert/bert-base-multilingual-cased) fine-tuned for binary toxicity classification on our updated (2025) dataset [textdetox/multilingual_toxicity_dataset](https://huggingface.co/datasets/textdetox/multilingual_toxicity_dataset). The model now covers 15 languages from various language families:

| Language  | Code | F1 Score |
|-----------|------|----------|
| English   | en   | 0.9035   |
| Russian   | ru   | 0.9224   |
| Ukrainian | uk   | 0.9461   |
| German    | de   | 0.5181   |
| Spanish   | es   | 0.7291   |
| Arabic    | ar   | 0.5139   |
| Amharic   | am   | 0.6316   |
| Hindi     | hi   | 0.7268   |
| Chinese   | zh   | 0.6703   |
| Italian   | it   | 0.6485   |
| French    | fr   | 0.9125   |
| Hinglish  | hin  | 0.6850   |
| Hebrew    | he   | 0.8686   |
| Japanese  | ja   | 0.8644   |
| Tatar     | tt   | 0.6170   |

## How to use

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('textdetox/bert-multilingual-toxicity-classifier')
model = AutoModelForSequenceClassification.from_pretrained('textdetox/bert-multilingual-toxicity-classifier')

inputs = tokenizer("You are amazing!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# idx 0 for neutral, idx 1 for toxic
prediction = logits.argmax(dim=-1).item()
```

## Citation

The model was prepared for the [TextDetox 2025 Shared Task](https://pan.webis.de/clef25/pan25-web/text-detoxification.html) evaluation.
```bibtex
@inproceedings{dementieva2025overview,
  title     = {Overview of the Multilingual Text Detoxification Task at PAN 2025},
  author    = {Dementieva, Daryna and Protasov, Vitaly and Babakov, Nikolay and Rizwan, Naquee and Alimova, Ilseyar and Brune, Caroline and Konovalov, Vasily and Muti, Arianna and Liebeskind, Chaya and Litvak, Marina and Nozza, Debora and Shah Khan, Shehryaar and Takeshita, Sotaro and Vanetik, Natalia and Ayele, Abinew Ali and Schneider, Florian and Wang, Xintong and Yimam, Seid Muhie and Elnagar, Ashraf and Mukherjee, Animesh and Panchenko, Alexander},
  booktitle = {Working Notes of CLEF 2025 -- Conference and Labs of the Evaluation Forum},
  editor    = {Guglielmo Faggioli and Nicola Ferro and Paolo Rosso and Damiano Spina},
  month     = sep,
  publisher = {CEUR-WS.org},
  series    = {CEUR Workshop Proceedings},
  address   = {Vienna, Austria},
  url       = {https://ceur-ws.org/Vol-4038/paper_278.pdf},
  year      = {2025}
}
```
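The classifier's raw outputs are logits over the two labels (idx 0 for neutral, idx 1 for toxic). If you need a calibrated-looking toxicity score rather than a hard label, apply a softmax to the logits. A minimal sketch, using a dummy logit tensor in place of a real `model(**inputs).logits` call:

```python
import torch

# Dummy logits standing in for model(**inputs).logits
# (shape: [batch, 2]; idx 0 = neutral, idx 1 = toxic)
logits = torch.tensor([[2.0, -1.5]])

# Softmax over the label dimension gives per-class probabilities
probs = torch.softmax(logits, dim=-1)
toxic_prob = probs[0, 1].item()
label = "toxic" if toxic_prob >= 0.5 else "neutral"
```

The 0.5 threshold is only a default; for moderation use cases you may want to lower it per language, since the F1 scores above vary considerably across languages.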