No community 4-bit quantization of Devstral Small 2 24B works on vLLM v0.15.1 β€” Ministral3ForCausalLM missing from registry

#26
by BigBlueWhale - opened

Summary

The official FP8 model (mistralai/Devstral-Small-2-24B-Instruct-2512) works correctly on vllm/vllm-openai:v0.15.1 via config-format: "mistral" + load-format: "mistral" + tokenizer-mode: "mistral". However, every community 4-bit quantization (AWQ, GPTQ, NVFP4, BitsAndBytes NF4) of this model produces gibberish on long outputs when served via vLLM v0.15.1. Short responses (under ~50 tokens) and tool calls work correctly, but anything longer degenerates into repetitive loops and eventually disconnected words.

This is not a quantization quality issue. It is caused by vLLM v0.15.1 loading the wrong text backbone class.

Full investigation with exact code paths, config comparisons, and a survey of all quantized models: GIBBERISH_BUG_REPORT.md

Verbatim gibberish output

Model: cyankiwi/Devstral-Small-2-24B-Instruct-2512-AWQ-4bit on vllm/vllm-openai:v0.15.1.

Prompt: "Write a Python function that implements binary search on a sorted list. Include type hints and a docstring."

Here's a Python function that implements binary search on a sorted list. The function includes type hints and a docstring.

```python
from typing import TypeVar, List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
```

Without llama_4_scaling patched into the config, the degeneration is even worse β€” the model echoes system prompt fragments ("a software engineer. I am a senior software engineer who works at a company") and emits streams of punctuation.
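For contrast, a correct answer to that prompt is short and non-repetitive. A reference implementation (mine, not model output) looks like:

```python
from typing import List, Optional, TypeVar

T = TypeVar("T")

def binary_search(items: List[T], target: T) -> Optional[int]:
    """Return the index of target in the sorted list items, or None if absent."""
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            lo = mid + 1  # target is in the upper half
        else:
            hi = mid - 1  # target is in the lower half
    return None
```

The quantized model never gets this far; it loops on the import line instead.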

The three-bug chain

Bug 1: Mistral-native config path misroutes to PixtralForConditionalGeneration

The Mistral config adapter in vllm/transformers_utils/configs/mistral.py unconditionally routes any model with vision_encoder in params.json to PixtralForConditionalGeneration. Since Devstral Small 2 24B inherits a vision encoder from its Mistral Small 3.1 24B Instruct base model, this crashes with KeyError: 'merging_layer.weight'. Open upstream issue: vllm-project/vllm#29904.
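The routing decision reduces to roughly this shape (a paraphrase of the adapter logic for illustration, not the verbatim vLLM source):

```python
def pick_architecture(params: dict) -> str:
    """Paraphrase of the Mistral config adapter's routing decision.

    Any params.json containing a vision_encoder key is routed to Pixtral,
    even when the text backbone is Ministral3 -- this is the Bug 1 misroute.
    """
    if "vision_encoder" in params:
        return "PixtralForConditionalGeneration"  # wrong for Devstral Small 2
    return "MistralForCausalLM"

# Devstral Small 2 inherits a vision encoder from Mistral Small 3.1,
# so it takes the Pixtral branch (field values here are illustrative):
devstral_params = {"dim": 5120, "vision_encoder": {"hidden_size": 1024}}
```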

Bug 2: transformers v4.57.6 does not recognize ministral3

The model's config.json declares text_config.model_type: "ministral3". transformers v4.57.6 (bundled in vLLM v0.15.1, pinned to < 5) does not recognize this type β€” KeyError: 'ministral3'. The ministral3 type was added in transformers v5.0.0 (January 26, 2026). Workaround: patch text_config.model_type to "mistral" in a config override file.
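A minimal sketch of that workaround, assuming a locally downloaded copy of the model (the path in the commented-out call is illustrative):

```python
import json
from pathlib import Path

def patch_text_config(config_path: Path) -> None:
    """Rewrite text_config.model_type from 'ministral3' to 'mistral' in place.

    Workaround for transformers v4.57.6 not knowing the 'ministral3' type.
    Note this only gets past Bug 2 -- it then exposes Bug 3 below.
    """
    config = json.loads(config_path.read_text())
    if config.get("text_config", {}).get("model_type") == "ministral3":
        config["text_config"]["model_type"] = "mistral"
        config_path.write_text(json.dumps(config, indent=2))

# patch_text_config(Path("/models/Devstral-Small-2-AWQ-4bit/config.json"))
```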

Bug 3: Pixtral-12B special case forces wrong text backbone

In vllm/model_executor/models/mistral3.py, Mistral3ForConditionalGeneration.__init__ has:

```python
# NOTE: These are special cases for Pixtral-12B in the HF-format
if (
    config.text_config.architectures is None
    and config.text_config.model_type == "mistral"
):
    config.text_config.architectures = ["MistralForCausalLM"]
```

After patching model_type to "mistral" (the Bug 2 workaround), both conditions are true, because the AWQ models' config.json does not set text_config.architectures. vLLM therefore loads MistralForCausalLM (the original Mistral 7B architecture, which inherits from LlamaForCausalLM) instead of Ministral3ForCausalLM.
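The interaction can be reproduced in isolation with a toy config object standing in for vLLM's, to show why the Pixtral-12B guard fires:

```python
from types import SimpleNamespace

# Toy stand-in for the HF config after the Bug 2 workaround: the AWQ
# config.json sets no text_config.architectures, and model_type has been
# hand-patched from "ministral3" to "mistral".
text_config = SimpleNamespace(architectures=None, model_type="mistral")

# The Pixtral-12B special case in Mistral3ForConditionalGeneration.__init__:
if text_config.architectures is None and text_config.model_type == "mistral":
    text_config.architectures = ["MistralForCausalLM"]  # wrong backbone selected

# Had model_type stayed "ministral3", the guard would not have fired --
# but then transformers v4.57.6 would have crashed on the unknown type (Bug 2).
```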

Ministral3ForCausalLM does not exist in vLLM v0.15.1's model registry:

```python
# Verified inside the vllm/vllm-openai:v0.15.1 container:
# MistralForCausalLM                -> ('mistral', 'MistralForCausalLM')
# Mistral3ForConditionalGeneration  -> ('mistral3', 'Mistral3ForConditionalGeneration')
# MistralLarge3ForCausalLM          -> ('mistral_large_3', 'MistralLarge3ForCausalLM')
# NO Ministral3ForCausalLM entry
```

Why only quantized models are affected

The official FP8 model ships both weight formats:

  • consolidated-*.safetensors (Mistral-native format, 2 files, ~25.8 GiB)
  • model-*.safetensors (HuggingFace sharded format, 6 files, ~25.8 GiB)

The Mistral-native loading path (config-format: "mistral" + load-format: "mistral") reads params.json instead of config.json, avoids all three bugs, and correctly routes through Mistral3ForConditionalGeneration. But community AWQ/GPTQ quantizations produced by vllm-project/llm-compressor only ship HuggingFace sharded safetensors β€” they have no consolidated-*.safetensors files. This forces the HuggingFace config path, which triggers the bug chain.
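One way to check whether a given repo ships Mistral-native weights before choosing a load path is a small predicate over the repo's file listing (the file names below are illustrative):

```python
from typing import Iterable

def has_mistral_native_weights(filenames: Iterable[str]) -> bool:
    """True if the listing contains consolidated-*.safetensors, i.e. the
    Mistral-native (--config-format mistral / --load-format mistral) path
    that sidesteps the three-bug chain is available."""
    return any(
        name.startswith("consolidated") and name.endswith(".safetensors")
        for name in filenames
    )

# The official FP8 repo ships both formats; the AWQ repos ship only HF shards.
official_fp8 = ["consolidated-00001-of-00002.safetensors",
                "model-00001-of-00006.safetensors"]
awq_4bit = ["model-00001-of-00004.safetensors", "config.json"]
```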

Affected quantized models (all of them)

| Model | Format | Status on vLLM v0.15.1 |
| --- | --- | --- |
| cyankiwi/Devstral-Small-2-24B-Instruct-2512-AWQ-4bit | AWQ INT4 (group_size 32) | Gibberish on long outputs (tested) |
| androiddrew/Devstral-Small-2-24B-Instruct-2512-AWQ-4bit | AWQ INT4 (group_size 128) | Same bug expected (same config.json structure) |
| btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ | Mixed GPTQ (INT4/INT8) | Same bug, and too large (~24 GiB) |
| Firworks/Devstral-Small-2-24B-Instruct-2512-nvfp4 | NVFP4 | Does not load at all |
| mlx-community/Devstral-Small-2-24B-Instruct-2512-4bit | Apple MLX 4-bit | Also reported to produce gibberish (different framework; likely a related architecture-handling issue) |

What would fix this

On the vLLM side:

  • Merging vllm-project/vllm#30566 (bump to transformers v5, open since December 12, 2025, not merged as of February 6, 2026)
  • Adding Ministral3ForCausalLM to vLLM's model registry
  • Fixing the Mistral config adapter to correctly route models with vision_encoder in params.json to Mistral3ForConditionalGeneration instead of PixtralForConditionalGeneration (vllm-project/vllm#29904)

Environment

  • GPU: NVIDIA GeForce RTX 5090 (32 GiB GDDR7 VRAM)
  • Docker image: vllm/vllm-openai:v0.15.1 (February 4, 2026)
  • transformers inside container: 4.57.6 (pinned to >= 4.56.0, < 5)
  • OS: Ubuntu 24.04 LTS
