---
base_model:
- Qwen/Qwen3-Next-80B-A3B-Thinking
pipeline_tag: text-generation
tags:
- uncensored
- abliteration
license: apache-2.0
---

# 🔓 MultiverseComputingCAI/Qwen3-Next-80B-A3B-Thinking-Uncensored

## ✨ What is this model?

**`Qwen3-Next-80B-A3B-Thinking-Uncensored`** is an *uncensored* variant of **Qwen3-Next-80B-A3B-Thinking** in which **China-aligned political censorship has been removed** *selectively*.

✅ **What changes:**

- The model no longer performs **blanket refusal** on Chinese politically sensitive topics (when the prompt is *non-harmful*). Instead, it provides **balanced, objective answers** that present multiple relevant perspectives.

✅ **What stays the same:**

- **General safety alignment remains intact**: it still refuses harmful instructions and jailbreak attempts.
- **Benchmark performance remains effectively unchanged** across reasoning/code/general evaluation suites.
- **Same behaviour** for any prompt unrelated to Chinese sensitive topics.

---

## 🚀 Highlights

### 🧠 No new knowledge injected

Unlike approaches that rely on supervised fine-tuning with hand-crafted data (e.g., Perplexity’s R1-1776 post-training), we **do not add new facts** or “rewrite history” via curated SFT datasets. Instead, our method uses steering vectors to remove the model’s ability to refuse China-related sensitive-but-non-harmful prompts. The model answers using **the knowledge already inside the base model**, minimizing the risk of introducing new biases.

### 🎛️ Selective refusal control (not “full abliteration”)

Many steering-vector approaches effectively *erase* refusal behavior everywhere (making models broadly unsafe). Our approach **selectively disables refusals only for Chinese sensitive topics**, while keeping refusal behavior for harmful requests.
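The selective scheme can be sketched with the common difference-of-means steering recipe: estimate a refusal direction from hidden states on refused vs. complied prompts, then ablate that direction only for sensitive-but-non-harmful inputs. The sketch below is purely illustrative (shapes, function names, and the gating flag are assumptions, not the released implementation):

```python
import torch


def refusal_direction(h_refuse: torch.Tensor, h_comply: torch.Tensor) -> torch.Tensor:
    """Estimate a unit-norm refusal direction as the difference of mean
    hidden states over refused vs. complied prompts (each of shape [n, d])."""
    d = h_refuse.mean(dim=0) - h_comply.mean(dim=0)
    return d / d.norm()


def ablate(h: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project out the refusal-direction component from hidden states h."""
    return h - (h @ direction).unsqueeze(-1) * direction


def selective_ablate(h: torch.Tensor, direction: torch.Tensor,
                     is_sensitive_non_harmful: bool) -> torch.Tensor:
    """Apply the ablation only on sensitive-but-non-harmful prompts;
    harmful prompts keep the model's native refusal behavior.
    (The gating signal itself is assumed here, not shown.)"""
    return ablate(h, direction) if is_sensitive_non_harmful else h
```

Gating the ablation on the prompt category is what distinguishes this from "full abliteration", which would project the direction out unconditionally.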
### 🛡️ Robust to trivial “add China” jailbreaks

Previous “uncensored” post-trained models such as Perplexity R1-1776 can be jailbroken by simply injecting a China-related phrase into harmful prompts (https://weijiexu.com/posts/jailbreak_r1_1776.html). Our model is designed to **remain robust**: harmful prompts are still refused even if “China” is injected.

### 🧩 No architectural changes · No added parameters

- ✅ No model surgery
- ✅ No additional layers or adapters
- ✅ No extra parameters
- ✅ Drop-in behavior change at inference time

---

## 🧪 Method

This release is based on **Refusal Steering**, an inference-time technique that uses **steering vectors** to control refusal behavior:

📄 **Paper:** *Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics*

- https://arxiv.org/abs/2512.16602

**What’s improved vs. the paper implementation:**

- We retain the core Refusal Steering idea, but **do not require architectural changes** to apply it.

---

## 📊 Evaluation

We evaluate refusal behavior and safety using:

- ✅ **Dataset:** https://huggingface.co/datasets/MultiverseComputingCAI/llm-refusal-evaluation
- ✅ **Evaluation library:** https://github.com/CompactifAI/LLM-Refusal-Evaluation

The benchmark suite includes:

- **Safety benchmarks**: JailbreakBench, SorryBench, XSTest (unsafe split), HarmBench (sampled), adversarial unsafe prompts
- **Chinese sensitive topics**: CCP Sensitive, DeCCP
<details>
<summary>Benchmark quick definitions (click to expand)</summary>

### Safety Benchmarks

- **JailbreakBench** — jailbreak robustness benchmark
- **SorryBench** — 440 unsafe instructions across 44 safety categories
- **XSTest (unsafe)** — harmful prompts that models should refuse
- **HarmBench (sampled)** — harmful prompts for red-teaming
- **Adversarial Unsafe Prompts** — harmful prompts with a “China” injection to test trivial jailbreak weaknesses

### Chinese Sensitive Topics

- **CCP Sensitive** — prompts likely censored by China-aligned models
- **DeCCP** — sensitive prompts known to trigger refusals in Qwen-family instruct models

</details>
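Each benchmark is reported below as a "Rejection %": the share of prompts whose responses are judged to be refusals. A minimal sketch of that metric (the refusal judge itself is an assumption and not shown; the released evaluation library may classify refusals differently):

```python
def rejection_rate(is_refusal: list[bool]) -> float:
    """Percentage of responses flagged as refusals by some judge
    (hypothetical; returns 0.0 for an empty benchmark)."""
    if not is_refusal:
        return 0.0
    return 100.0 * sum(is_refusal) / len(is_refusal)
```

For a safety benchmark a *high* rate is desirable (harmful prompts refused); for the Chinese-sensitive-topic sets a *low* rate indicates the censorship was removed.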
---

## 🧾 Results

### Refusal / safety metrics (higher = more refusals)

| Model | CCP Sensitive Rejection % | DeCCP Rejection % | Adversarial Rejection % | SorryBench Rejection % | XSTest Unsafe Rejection % | JailbreakBench Rejection % |
|---|---:|---:|---:|---:|---:|---:|
| Qwen3-Next-80B-A3B-Thinking | 92.65 | 69.47 | 97.07 | 86.14 | 84.00 | 99.00 |
| **Qwen3-Next-80B-A3B-Thinking-Uncensored** | **25.96** | **1.05** | 88.48 | 84.77 | 83.00 | 98.00 |

**Interpretation:**

- ✅ **Massive drop in Chinese-topic refusals** (CCP Sensitive and DeCCP)
- ✅ **Safety refusals remain strong** on harmful/jailbreak datasets

---

### Performance metrics (higher = better)

| Model | GSM8K exact_match | HumanEval pass@1 | IFEval acc | LiveCodeBench Codegen pass@1 | AIME25 pass@k | GPQA Diamond pass@k | MMLU-Pro pass@k | MMLU-ProX Spanish pass@k | MMLU-ProX Hindi pass@k |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Qwen3-Next-80B-A3B-Thinking | 0.967 | 0.945 | 0.898 | 0.750 | 0.858 | 0.775 | 0.829 | 0.781 | 0.719 |
| **Qwen3-Next-80B-A3B-Thinking-Uncensored** | **0.972** | 0.939 | 0.891 | 0.750 | **0.868** | **0.796** | **0.833** | **0.784** | **0.723** |

**Interpretation:**

- ✅ **Benchmark performance is preserved** (differences are within small variance)

---

## 📝 Reporting Issues

We are actively improving the model and plan to release improved versions in the future. If you find any issue related to refusals on politically sensitive topics, or any safety issue, please report it in the [Community Tab](https://huggingface.co/MultiverseComputingCAI/Qwen3-Next-80B-A3B-Thinking-Uncensored/discussions/new).

---

## 🧩 Examples

Here are some conversations showing that our model’s answers are well balanced and objective, presenting multiple perspectives where relevant rather than defaulting to a single narrative.
| Censored (base) | Uncensored (this release) |
|---|---|
| ![Young — censored](examples/young_censor.png) | ![Young — uncensored](examples/young_uncensor.png) |
| ![Taiwan — censored](examples/taiwan_censor.png) | ![Taiwan — uncensored](examples/taiwan_uncensor.png) |
| ![Hong Kong — censored](examples/hongkong_censor.png) | ![Hong Kong — uncensored](examples/hongkong_uncensor_2.png) |

---

## 📚 Citation

If you use this model, please cite:

```bibtex
@misc{garciaferrero2025Refusal,
      title={Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics},
      author={Iker García-Ferrero and David Montero and Roman Orus},
      year={2025},
      eprint={2512.16602},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.16602},
}
```

---

## 🏢 About Multiverse Computing

This model is released by **Multiverse Computing**: https://multiversecomputing.com/