nvfp4?

#21

by rhodondendron - opened 21 days ago

Discussion

rhodondendron

21 days ago

Thanks to the mistral team for the model launch. any plan to quantize the model to NVFP4?

juliendenize

Mistral AI_ org 20 days ago

Hey probably not from us however thanks to the nice feature from the hub, you can see models the community shared that performed quantization based on Mistral Medium 3.5 here !
I've seen that some of them are NVFP4 you might want to try them out and see if they meet your expectations :)

SuperbEmphasis

20 days ago

I successfully ran this on 2xH100 gpus:
https://huggingface.co/zdy1995love/Mistral-Medium-3.5-128B-NVFP4

The other one I couldnt get to work.

However I cannot seem to get a response from mistral sales about the commercial licensing since you can't use the model unless your monthly company revenue is <$20M

retowyss

20 days ago

@SuperbEmphasis I was successful with the other one using the vllm branch with @juliendenize toolcall fix - but that may be in nightly already.

This is on 2x Pro 6k

CUDA_VISIBLE_DEVICES=0,1 vllm serve RecViking/Mistral-Medium-3.5-128B-NVFP4 \
  --tensor-parallel-size 2   \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --tokenizer-mode auto \
  --max_num_batched_tokens 8192 \
  --max_num_seqs 32 \
  --gpu_memory_utilization 0.90 \
  --served-model-name Mistral-Medium-3.5-128B \
  --max-model-len auto \
  --enable-sleep-mode \
  --enable-chunked-prefill \ 
  --safetensors-load-strategy=prefetch \ 
  --limit-mm-per-prompt '{"image":4}' \
  --kv-cache-dtype turboquant_k8v4

Gives GPU KV cache size: 673,952 tokens

But I hadn't managed to get it working with MTP - so, meh, pick your poison I guess.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment