Qwen3-4B-f16-GGUF

This is a GGUF-quantized version of the Qwen/Qwen3-4B language model: a powerful 4-billion-parameter LLM from Alibaba's Qwen series, designed for strong reasoning, agentic workflows, and multilingual fluency on consumer-grade hardware.

Converted for use with llama.cpp, LM Studio, OpenWebUI, GPT4All, and more.

Available Quantizations (from f16)

These variants were built from an f16 base model to ensure consistency across quant levels.

NEW: I have added a custom quantization called Q3_HIFI, which outperforms the standard Q3_K_M: it is higher quality, smaller in size, and nearly the same speed.

It is listed under the 'f16' options because it's not an officially recognised type (at the moment).

Q3_HIFI

Pros:

  • ๐Ÿ† Best quality with lowest perplexity of 16.76 (7.2% better than Q3_K_M, 12.2% better than Q3_K_S)
  • ๐Ÿ“ฆ Smaller than Q3_K_M (1.87 vs 1.93 GiB) while being significantly better quality
  • ๐ŸŽฏ Uses intelligent layer-sensitive quantization (Q3_HIFI on sensitive layers, mixed q3_K/q4_K elsewhere)
  • ๐Ÿ“Š Most consistent results (lowest relative standard deviation in perplexity)

Cons:

  • ๐Ÿข Slowest inference at 215.1 TPS (5.5% slower than Q3_K_S)
  • ๐Ÿ”ง Custom quantization may have less community support

Best for: Production deployments where output quality matters, tasks requiring accuracy (reasoning, coding, complex instructions), or when you want the best quality-to-size ratio.

You can read more about how it compares to Q3_K_M and Q3_K_S here: Q3_Quantization_Comparison.md

You can also view a cross-model comparison of the Q3_HIFI type here.
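If you want to verify the perplexity figures yourself, llama.cpp ships a perplexity tool. Here is a minimal sketch, assuming a llama.cpp build as described in the Build notes below and a WikiText-2-style test file; the corpus and context size actually used for the published numbers are not specified here, so your absolute values may differ:

# Assumes wiki.test.raw is a local copy of the WikiText-2 test split (placeholder path).
./build/bin/llama-perplexity -m Qwen3-4B-f16:Q3_HIFI.gguf -f wiki.test.raw -c 2048

The relative ordering between quants is what matters; run the same command against Q3_K_M and Q3_K_S for a fair comparison.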

| Level | Speed | Size | Recommendation |
|-------|-------|------|----------------|
| Q2_K | ⚡ Fastest | 1.9 GB | 🚨 DO NOT USE. Worst results of all the 4B quants. |
| 🥈 Q3_K_S | ⚡ Fast | 2.2 GB | 🥈 Runner-up. A very good model for a wide range of queries. |
| 🥇 Q3_K_M | ⚡ Fast | 2.4 GB | 🥇 Best overall model. Highly recommended for all query types. |
| Q4_K_S | 🚀 Fast | 2.7 GB | A late showing in low-temperature queries. Probably not recommended. |
| Q4_K_M | 🚀 Fast | 2.9 GB | A late showing in high-temperature queries. Probably not recommended. |
| Q5_K_S | 🐢 Medium | 3.3 GB | Did not appear in the top 3 for any question. Not recommended. |
| Q5_K_M | 🐢 Medium | 3.4 GB | One second place in a high-temperature question. Probably not recommended. |
| Q6_K | 🐌 Slow | 3.9 GB | Did not appear in the top 3 for any question. Not recommended. |
| 🥉 Q8_0 | 🐌 Slow | 5.1 GB | 🥉 If you want to play it safe, this is a good option. Good results across a variety of questions. |

Why Use a 4B Model?

The Qwen3-4B model strikes a powerful balance between capability and efficiency, offering:

  • Strong reasoning and language understanding, significantly more capable than sub-1B models
  • Smooth CPU inference with moderate hardware (no high-end GPU required)
  • Memory footprint under ~8GB when quantized (e.g., GGUF Q4_K_M or AWQ)
  • Excellent price-to-performance ratio for local or edge deployment

It's ideal for:

  • Local chatbots with contextual memory and richer responses
  • On-device AI on laptops or mid-tier edge servers
  • Lightweight RAG (Retrieval-Augmented Generation) applications
  • Developers needing a capable yet manageable open-weight model

Choose Qwen3-4B when you need more intelligence than a tiny model can provide, but still want to run offline, avoid cloud costs, or maintain full control over your AI stack.

Build notes

All of these models (including Q3_HIFI) were built using these commands:

mkdir build
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON -DGGML_AVX=ON -DGGML_AVX2=ON -DGGML_CUDA=ON -DGGML_VULKAN=OFF -DLLAMA_CURL=OFF
cmake --build build --config Release -j 

NOTE: The Vulkan backend is specifically turned off here because Vulkan performance was much worse in my testing. If you want Vulkan support, rebuild llama.cpp yourself with it enabled.
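For reference, enabling Vulkan is just a matter of flipping the relevant flags in the same configure step (shown here with CUDA disabled; adjust to suit your hardware):

cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON -DGGML_AVX=ON -DGGML_AVX2=ON -DGGML_CUDA=OFF -DGGML_VULKAN=ON -DLLAMA_CURL=OFF
cmake --build build --config Release -j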

The quantisation for Q3_HIFI also used a 5000-chunk imatrix file for extra precision. You can re-use it here: Qwen3-4B-f16-imatrix-5000.gguf

You can use the Q3_HIFI GitHub repository to build it from source if you're interested (use the Q3_HIFI branch): https://github.com/geoffmunn/llama.cpp.
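If you want to reproduce a quant with the supplied imatrix, the general shape of the commands is below. This is a minimal sketch: the f16 source file name and calibration text are placeholders, and the Q3_HIFI type is only available in the custom branch linked above (standard types such as Q3_K_M work with upstream llama.cpp).

# Build (or re-use) the importance matrix from the f16 source model.
# calibration.txt is a placeholder for your own calibration corpus.
./build/bin/llama-imatrix -m Qwen3-4B-f16.gguf -f calibration.txt -o Qwen3-4B-f16-imatrix-5000.gguf

# Quantise with the imatrix; swap Q3_HIFI for Q3_K_M etc. for the standard types.
./build/bin/llama-quantize --imatrix Qwen3-4B-f16-imatrix-5000.gguf Qwen3-4B-f16.gguf Qwen3-4B-f16:Q3_HIFI.gguf Q3_HIFI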

Model analysis and rankings

NOTE: This analysis does not include Q3_HIFI.

I have run each of these models across 6 questions and ranked them based on the quality of the answers. Qwen3-4B-f16:Q3_K_M (or Qwen3-4B-f16:Q3_HIFI) is the best model across all question types, but if you want to play it safe with a higher-precision model, then you could consider using Qwen3-4B-f16:Q8_0.

You can read the results here: Qwen3-4b-f16-analysis.md

If you find this useful, please give the project a ❤️ like.

Usage

Load this model using:

  • OpenWebUI – self-hosted AI interface with RAG & tools
  • LM Studio – desktop app with GPU support and chat templates
  • GPT4All – private, local AI chatbot (offline-first)
  • Or directly via llama.cpp

Each quantized model includes its own README.md and shares a common MODELFILE for optimal configuration.
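If you run a quant directly via llama.cpp, a minimal command-line invocation looks like this. The file name assumes you have downloaded the Q3_K_M quant (see the wget step below); the sampling values mirror the Modelfile defaults:

./build/bin/llama-cli -m Qwen3-4B-f16:Q3_K_M.gguf -c 4096 --temp 0.6 --top-p 0.95 --top-k 20 --repeat-penalty 1.1 -p "Explain the trade-offs between Q3_K_S and Q3_K_M quantization."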

Importing directly into Ollama should work, but you might encounter this error: Error: invalid character '<' looking for beginning of value. In this case try these steps:

  1. wget https://huggingface.co/geoffmunn/Qwen3-4B-f16/resolve/main/Qwen3-4B-f16%3AQ3_K_M.gguf (replace the quantised version with the one you want)
  2. nano Modelfile and enter these details (again, replacing Q3_K_M with the version you want):
FROM ./Qwen3-4B-f16:Q3_K_M.gguf

# Chat template using ChatML (used by Qwen)
SYSTEM You are a helpful assistant

TEMPLATE "{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>

# Default sampling
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER top_k 20
PARAMETER min_p 0.0
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 4096

The num_ctx value has been lowered to 4096 to increase speed significantly; raise it if you need a longer context window.

  3. Then run this command: ollama create Qwen3-4B-f16:Q3_K_M -f Modelfile

You will now see "Qwen3-4B-f16:Q3_K_M" in your Ollama model list.
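You can give it a quick test straight away, for example:

ollama run Qwen3-4B-f16:Q3_K_M "Summarise what GGUF quantization does in two sentences."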

These import steps are also useful if you want to customise the default parameters or system prompt.

Author

👤 Geoff Munn (@geoffmunn)
🔗 Hugging Face Profile

Disclaimer

This is a community conversion for local inference. Not affiliated with Alibaba Cloud or the Qwen team.
