nvfp4?
Thanks to the Mistral team for the model launch. Any plans to quantize the model to NVFP4?
Hey, probably not from us. However, thanks to a nice feature of the Hub, you can see the quantized models the community has shared based on Mistral Medium 3.5 here!
I've seen that some of them are NVFP4, so you might want to try them out and see if they meet your expectations :)
I successfully ran this one on 2x H100 GPUs:
https://huggingface.co/zdy1995love/Mistral-Medium-3.5-128B-NVFP4
The other one I couldn't get to work.
However, I can't seem to get a response from Mistral sales about commercial licensing, since you can't use the model unless your monthly company revenue is <$20M.
@SuperbEmphasis I was successful with the other one using the vLLM branch with @juliendenize's tool-call fix, but that may be in nightly already.
This is on 2x Pro 6k
CUDA_VISIBLE_DEVICES=0,1 vllm serve RecViking/Mistral-Medium-3.5-128B-NVFP4 \
--tensor-parallel-size 2 \
--tool-call-parser mistral \
--enable-auto-tool-choice \
--tokenizer-mode auto \
--max_num_batched_tokens 8192 \
--max_num_seqs 32 \
--gpu_memory_utilization 0.90 \
--served-model-name Mistral-Medium-3.5-128B \
--max-model-len auto \
--enable-sleep-mode \
--enable-chunked-prefill \
--safetensors-load-strategy=prefetch \
--limit-mm-per-prompt '{"image":4}' \
--kv-cache-dtype turboquant_k8v4
Gives GPU KV cache size: 673,952 tokens
But I haven't managed to get it working with MTP, so, meh, pick your poison I guess.
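For a rough sense of what that KV cache figure means per request: vLLM shares the KV cache pool across all running sequences, and `--max_num_seqs 32` caps how many can run at once. A minimal sketch of the arithmetic, assuming all 32 slots are busy and ignoring prefix caching and image tokens (the variable names are just for illustration):

```shell
#!/usr/bin/env bash
# Values copied from the post above.
kv_cache_tokens=673952   # "GPU KV cache size" reported by vLLM at startup
max_num_seqs=32          # value passed to --max_num_seqs

# Integer division: average number of KV-cache tokens available per
# sequence when every scheduler slot is occupied at the same time.
tokens_per_seq=$(( kv_cache_tokens / max_num_seqs ))

echo "Average KV budget per sequence at full concurrency: ${tokens_per_seq} tokens"
# prints: Average KV budget per sequence at full concurrency: 21061 tokens
```

So at full concurrency each request gets roughly 21k tokens of context on average; with fewer concurrent requests, individual ones can go much longer.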