No community 4-bit quantization of Devstral Small 2 24B works on vLLM v0.15.1: Ministral3ForCausalLM missing from registry
Summary
The official FP8 model (mistralai/Devstral-Small-2-24B-Instruct-2512) works correctly on vllm/vllm-openai:v0.15.1 via config-format: "mistral" + load-format: "mistral" + tokenizer-mode: "mistral". However, every community 4-bit quantization (AWQ, GPTQ, NVFP4, BitsAndBytes NF4) of this model produces gibberish on long outputs when served via vLLM v0.15.1. Short responses (under ~50 tokens) and tool calls work correctly, but anything longer degenerates into repetitive loops and eventually disconnected words.
This is not a quantization quality issue. It is caused by vLLM v0.15.1 loading the wrong text backbone class.
Full investigation with exact code paths, config comparisons, and a survey of all quantized models: GIBBERISH_BUG_REPORT.md
Verbatim gibberish output
Model: cyankiwi/Devstral-Small-2-24B-Instruct-2512-AWQ-4bit on vllm/vllm-openai:v0.15.1.
Prompt: "Write a Python function that implements binary search on a sorted list. Include type hints and a docstring."
Here's a Python function that implements binary search on a sorted list. The function includes type hints and a docstring.
```python
from typing import TypeVar, List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
```
Without llama_4_scaling patched into the config, the degeneration is even worse: the model echoes system prompt fragments ("a software engineer. I am a senior software engineer who works at a company") and emits streams of punctuation.
The three-bug chain
Bug 1: Mistral-native config path misroutes to PixtralForConditionalGeneration
The Mistral config adapter in vllm/transformers_utils/configs/mistral.py unconditionally routes any model with vision_encoder in params.json to PixtralForConditionalGeneration. Since Devstral Small 2 24B inherits a vision encoder from its Mistral Small 3.1 24B Instruct base model, this crashes with KeyError: 'merging_layer.weight'. Open upstream issue: vllm-project/vllm#29904.
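As a standalone illustration, the misrouting reduces to an unconditional key check (this is a simplified paraphrase of the adapter's behavior, not the actual vLLM code; pick_architecture is a hypothetical name):

```python
# Simplified paraphrase of the routing logic in
# vllm/transformers_utils/configs/mistral.py -- a hypothetical standalone
# sketch, NOT the real vLLM code. The vision_encoder check is
# unconditional, so every vision-capable model is routed to Pixtral.
def pick_architecture(params: dict) -> str:
    if "vision_encoder" in params:
        # Devstral Small 2 lands here via its Mistral Small 3.1 base,
        # then crashes later with KeyError: 'merging_layer.weight'.
        return "PixtralForConditionalGeneration"
    return "MistralForCausalLM"
```

A text-only params.json falls through to the plain Mistral backbone; anything carrying a vision_encoder key, whatever its actual text backbone, is forced onto the Pixtral path.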
Bug 2: transformers v4.57.6 does not recognize ministral3
The model's config.json declares text_config.model_type: "ministral3". transformers v4.57.6 (bundled in vLLM v0.15.1, pinned to < 5) does not recognize this type and fails with KeyError: 'ministral3'. The ministral3 type was added in transformers v5.0.0 (January 26, 2026). Workaround: patch text_config.model_type to "mistral" in a config override file.
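The workaround can be sketched roughly as follows (patch_model_type is an illustrative helper, not part of vLLM or transformers; apply it to a local copy of the model repo before serving):

```python
# Sketch of the Bug 2 workaround described above: rewrite
# text_config.model_type from "ministral3" to "mistral" in config.json.
# The helper name and file handling are illustrative, not from vLLM.
import json
from pathlib import Path

def patch_model_type(config_path: str) -> dict:
    cfg = json.loads(Path(config_path).read_text())
    text_cfg = cfg.get("text_config", {})
    if text_cfg.get("model_type") == "ministral3":
        # transformers 4.57.6 raises KeyError: 'ministral3'; alias the
        # type so the config parses. Note this alone exposes Bug 3.
        text_cfg["model_type"] = "mistral"
    Path(config_path).write_text(json.dumps(cfg, indent=2))
    return cfg
```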
Bug 3: Pixtral-12B special case forces wrong text backbone
In vllm/model_executor/models/mistral3.py, Mistral3ForConditionalGeneration.__init__ has:
```python
# NOTE: These are special cases for Pixtral-12B in the HF-format
if (
    config.text_config.architectures is None
    and config.text_config.model_type == "mistral"
):
    config.text_config.architectures = ["MistralForCausalLM"]
```
After patching model_type to "mistral" (Bug 2 workaround), and because the AWQ models' config.json does not set text_config.architectures, both conditions are TRUE. vLLM loads MistralForCausalLM (the old Mistral 7B architecture, inheriting from LlamaForCausalLM) instead of Ministral3ForCausalLM.
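The trap is easy to reproduce outside vLLM. A minimal sketch with a stand-in config object (SimpleNamespace here is illustrative; the real check lives in Mistral3ForConditionalGeneration.__init__):

```python
# Standalone reproduction of the special-case condition quoted above,
# using a minimal stand-in for the HF config object.
from types import SimpleNamespace

def apply_pixtral_special_case(config) -> None:
    # Mirrors the check meant for Pixtral-12B HF checkpoints. A patched
    # Devstral config (model_type="mistral", architectures unset) also
    # satisfies both conditions and gets the wrong text backbone.
    if (
        config.text_config.architectures is None
        and config.text_config.model_type == "mistral"
    ):
        config.text_config.architectures = ["MistralForCausalLM"]

# A community AWQ config after the Bug 2 workaround:
cfg = SimpleNamespace(
    text_config=SimpleNamespace(architectures=None, model_type="mistral")
)
apply_pixtral_special_case(cfg)
print(cfg.text_config.architectures)  # ['MistralForCausalLM']
```

An explicit text_config.architectures entry in config.json would make the first condition false and skip the override, but that alone would not help here, since the correct class is missing from the registry (below).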
Ministral3ForCausalLM does not exist in vLLM v0.15.1's model registry:
```python
# Verified inside the vllm/vllm-openai:v0.15.1 container:
# MistralForCausalLM -> ('mistral', 'MistralForCausalLM')
# Mistral3ForConditionalGeneration -> ('mistral3', 'Mistral3ForConditionalGeneration')
# MistralLarge3ForCausalLM -> ('mistral_large_3', 'MistralLarge3ForCausalLM')
# NO Ministral3ForCausalLM entry
```
Why only quantized models are affected
The official FP8 model ships both weight formats:
- consolidated-*.safetensors (Mistral-native format, 2 files, ~25.8 GiB)
- model-*.safetensors (HuggingFace sharded format, 6 files, ~25.8 GiB)
The Mistral-native loading path (config-format: "mistral" + load-format: "mistral") reads params.json instead of config.json, avoids all three bugs, and correctly routes through Mistral3ForConditionalGeneration. But community AWQ/GPTQ quantizations produced by vllm-project/llm-compressor only ship HuggingFace sharded safetensors; they have no consolidated-*.safetensors files. This forces the HuggingFace config path, which triggers the bug chain.
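A quick way to see which path a downloaded snapshot can use is to check for the two file-name patterns (available_load_formats is an illustrative helper for this report, not a vLLM API):

```python
# Sketch: decide which loading path a local model directory supports,
# based on the file-name patterns described above. Helper name and
# return convention are illustrative, not part of vLLM.
from pathlib import Path

def available_load_formats(model_dir: str) -> list[str]:
    d = Path(model_dir)
    formats = []
    if any(d.glob("consolidated-*.safetensors")):
        formats.append("mistral")  # Mistral-native path, reads params.json
    if any(d.glob("model-*.safetensors")):
        formats.append("hf")       # HF sharded path, reads config.json
    return formats
```

For the community 4-bit quantizations this returns only "hf", which is exactly why they cannot sidestep the bug chain the way the official FP8 model can.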
Affected quantized models (all of them)
| Model | Format | Status on vLLM v0.15.1 |
|---|---|---|
| cyankiwi/Devstral-Small-2-24B-Instruct-2512-AWQ-4bit | AWQ INT4 (group_size 32) | Gibberish on long outputs (tested) |
| androiddrew/Devstral-Small-2-24B-Instruct-2512-AWQ-4bit | AWQ INT4 (group_size 128) | Same bug expected (same config.json structure) |
| btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ | Mixed GPTQ (INT4/INT8) | Same bug + too large (~24 GiB) |
| Firworks/Devstral-Small-2-24B-Instruct-2512-nvfp4 | NVFP4 | Does not load at all |
| mlx-community/Devstral-Small-2-24B-Instruct-2512-4bit | Apple MLX 4-bit | Also reports gibberish (different framework; likely a related architecture-handling issue) |
What would fix this
On the vLLM side:
- Merging vllm-project/vllm#30566 (bump to transformers v5; open since December 12, 2025, not merged as of February 6, 2026)
- Adding Ministral3ForCausalLM to vLLM's model registry
- Fixing the Mistral config adapter to correctly route models with vision_encoder in params.json to Mistral3ForConditionalGeneration instead of PixtralForConditionalGeneration (vllm-project/vllm#29904)
Environment
- GPU: NVIDIA GeForce RTX 5090 (32 GiB GDDR7 VRAM)
- Docker image: vllm/vllm-openai:v0.15.1 (February 4, 2026)
- transformers inside container: 4.57.6 (pinned to >= 4.56.0, < 5)
- OS: Ubuntu 24.04 LTS