nvfp4?

#21
by rhodondendron - opened

Thanks to the mistral team for the model launch. any plan to quantize the model to NVFP4?

Mistral AI_ org

Hey probably not from us however thanks to the nice feature from the hub, you can see models the community shared that performed quantization based on Mistral Medium 3.5 here !
I've seen that some of them are NVFP4 you might want to try them out and see if they meet your expectations :)

I successfully ran this on 2xH100 gpus:
https://huggingface.co/zdy1995love/Mistral-Medium-3.5-128B-NVFP4

The other one I couldnt get to work.

However I cannot seem to get a response from mistral sales about the commercial licensing since you can't use the model unless your monthly company revenue is <$20M

@SuperbEmphasis I was successful with the other one using the vllm branch with @juliendenize toolcall fix - but that may be in nightly already.

This is on 2x Pro 6k

CUDA_VISIBLE_DEVICES=0,1 vllm serve RecViking/Mistral-Medium-3.5-128B-NVFP4 \
  --tensor-parallel-size 2   \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --tokenizer-mode auto \
  --max_num_batched_tokens 8192 \
  --max_num_seqs 32 \
  --gpu_memory_utilization 0.90 \
  --served-model-name Mistral-Medium-3.5-128B \
  --max-model-len auto \
  --enable-sleep-mode \
  --enable-chunked-prefill \ 
  --safetensors-load-strategy=prefetch \ 
  --limit-mm-per-prompt '{"image":4}' \
  --kv-cache-dtype turboquant_k8v4

Gives GPU KV cache size: 673,952 tokens

But I hadn't managed to get it working with MTP - so, meh, pick your poison I guess.

Sign up or log in to comment