Qwen3-4B-f16-GGUF
This is a GGUF-quantized version of the Qwen/Qwen3-4B language model, a powerful 4-billion-parameter LLM from Alibaba's Qwen series designed for strong reasoning, agentic workflows, and multilingual fluency on consumer-grade hardware.
Converted for use with llama.cpp, LM Studio, OpenWebUI, GPT4All, and more.
Available Quantizations (from f16)
These variants were built from an f16 base model to ensure consistency across quant levels.
NEW: I have created a custom quantization called Q3_HIFI that beats the standard Q3_K_M model: it is higher quality, smaller in size, and runs at nearly the same speed.
It is listed under the 'f16' options because it's not an officially recognised type (at the moment).
Q3_HIFI
Pros:
- Best quality with lowest perplexity of 16.76 (7.2% better than Q3_K_M, 12.2% better than Q3_K_S)
- Smaller than Q3_K_M (1.87 vs 1.93 GiB) while being significantly better quality
- Uses intelligent layer-sensitive quantization (Q3_HIFI on sensitive layers, mixed q3_K/q4_K elsewhere)
- Most consistent results (lowest relative standard deviation in perplexity)
Cons:
- Slowest inference at 215.1 TPS (5.5% slower than Q3_K_S)
- Custom quantization may have less community support
Best for: Production deployments where output quality matters, tasks requiring accuracy (reasoning, coding, complex instructions), or when you want the best quality-to-size ratio.
You can read more about how it compares to Q3_K_M and Q3_K_S here: Q3_Quantization_Comparison.md
You can also view a cross-model comparison of the Q3_HIFI type here.
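If you want to reproduce this kind of perplexity comparison yourself, llama.cpp ships a perplexity tool. A hypothetical invocation is sketched below (the model path and test corpus are placeholders; use whichever evaluation text you prefer):

```bash
# Hypothetical perplexity run; a lower score means the quant tracks the f16 model more closely.
./build/bin/llama-perplexity \
  -m ./Qwen3-4B-f16:Q3_HIFI.gguf \
  -f wikitext-2-raw/wiki.test.raw
```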
| Level | Speed | Size | Recommendation |
|---|---|---|---|
| Q2_K | Fastest | 1.9 GB | DO NOT USE. Worst results of all the 4B models. |
| Q3_K_S | Fast | 2.2 GB | Runner-up. A very good model for a wide range of queries. |
| Q3_K_M | Fast | 2.4 GB | Best overall model. Highly recommended for all query types. |
| Q4_K_S | Fast | 2.7 GB | Only a late showing in low-temperature queries. Probably not recommended. |
| Q4_K_M | Fast | 2.9 GB | Only a late showing in high-temperature queries. Probably not recommended. |
| Q5_K_S | Medium | 3.3 GB | Did not appear in the top 3 for any question. Not recommended. |
| Q5_K_M | Medium | 3.4 GB | A second place for one high-temperature question; probably not recommended. |
| Q6_K | Slow | 3.9 GB | Did not appear in the top 3 for any question. Not recommended. |
| Q8_0 | Slow | 5.1 GB | If you want to play it safe, this is a good option. Good results across a variety of questions. |
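Once you've picked a quant level from the table, you can fetch the matching file from this repo. A hypothetical download using the Hugging Face CLI is shown below (swap Q3_K_M for the quant you chose; the file naming follows the Qwen3-4B-f16:QUANT.gguf pattern used in this repository):

```bash
# Hypothetical example: download a single quantized GGUF into the current directory.
huggingface-cli download geoffmunn/Qwen3-4B-f16 "Qwen3-4B-f16:Q3_K_M.gguf" --local-dir .
```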
Why Use a 4B Model?
The Qwen3-4B model strikes a powerful balance between capability and efficiency, offering:
- Strong reasoning and language understanding; significantly more capable than sub-1B models
- Smooth CPU inference with moderate hardware (no high-end GPU required)
- Memory footprint under ~8GB when quantized (e.g., GGUF Q4_K_M or AWQ)
- Excellent price-to-performance ratio for local or edge deployment
It's ideal for:
- Local chatbots with contextual memory and richer responses
- On-device AI on laptops or mid-tier edge servers
- Lightweight RAG (Retrieval-Augmented Generation) applications
- Developers needing a capable yet manageable open-weight model
Choose Qwen3-4B when you need more intelligence than a tiny model can provide, but still want to run offline, avoid cloud costs, or maintain full control over your AI stack.
Build notes
All of these models (including Q3_HIFI) were built using these commands:
mkdir build
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON -DGGML_AVX=ON -DGGML_AVX2=ON -DGGML_CUDA=ON -DGGML_VULKAN=OFF -DLLAMA_CURL=OFF
cmake --build build --config Release -j
NOTE: Vulkan support is specifically turned off here because Vulkan performance was much worse. If you want Vulkan support, you can rebuild llama.cpp with -DGGML_VULKAN=ON and produce the models yourself.
The quantisation for Q3_HIFI also used a 5000-chunk imatrix file for extra precision. You can re-use it here: Qwen3-4B-f16-imatrix-5000.gguf
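If you want to reproduce this kind of imatrix-guided quantisation yourself, the general shape of the pipeline looks roughly like the sketch below. The paths, calibration corpus, and output names are placeholders, not the exact ones used for this repo:

```bash
# 1. Convert the original Hugging Face weights to an f16 GGUF (paths are hypothetical).
python convert_hf_to_gguf.py /path/to/Qwen3-4B --outtype f16 --outfile Qwen3-4B-f16.gguf

# 2. Build an importance matrix from a calibration corpus.
./build/bin/llama-imatrix -m Qwen3-4B-f16.gguf -f calibration.txt \
  -o Qwen3-4B-f16-imatrix-5000.gguf --chunks 5000

# 3. Quantise, letting the imatrix guide which weights keep more precision.
./build/bin/llama-quantize --imatrix Qwen3-4B-f16-imatrix-5000.gguf \
  Qwen3-4B-f16.gguf Qwen3-4B-f16:Q3_K_M.gguf Q3_K_M
```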
You can use the Q3_HIFI GitHub repository to build it from source if you're interested (use the Q3_HIFI branch): https://github.com/geoffmunn/llama.cpp.
Model analysis and rankings
NOTE: This analysis does not include Q3_HIFI.
I have run each of these models across 6 questions and ranked them all based on the quality of the answers. Qwen3-4B-f16:Q3_K_M (or Qwen3-4B-f16:Q3_HIFI) is the best model across all question types, but if you want to play it safe with a higher-precision model, then consider Qwen3-4B-f16:Q8_0.
You can read the results here: Qwen3-4b-f16-analysis.md
If you find this useful, please give the project a like.
Usage
Load this model using:
- OpenWebUI: self-hosted AI interface with RAG & tools
- LM Studio: desktop app with GPU support and chat templates
- GPT4All: private, local AI chatbot (offline-first)
- Or directly via llama.cpp (see the example command below)
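For llama.cpp, a command along the lines of the sketch below should work once you have downloaded a quant (the file name and prompt are placeholders; the sampling values mirror the Modelfile defaults further down):

```bash
# Hypothetical llama.cpp invocation; point -m at whichever quant you downloaded.
./build/bin/llama-cli -m ./Qwen3-4B-f16:Q3_K_M.gguf \
  -p "Explain the difference between TCP and UDP in two sentences." \
  -n 256 --temp 0.6 --top-p 0.95 --top-k 20
```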
Each quantized model includes its own README.md and shares a common MODELFILE for optimal configuration.
Importing directly into Ollama should work, but you might encounter this error: Error: invalid character '<' looking for beginning of value.
In this case try these steps:
- Download the quantised file you want (replace Q3_K_M with the version you want):
  wget https://huggingface.co/geoffmunn/Qwen3-4B-f16/resolve/main/Qwen3-4B-f16%3AQ3_K_M.gguf
- Create a Modelfile (nano Modelfile) and enter these details (again, replacing Q3_K_M with the version you want):
FROM ./Qwen3-4B-f16:Q3_K_M.gguf
# Chat template using ChatML (used by Qwen)
SYSTEM You are a helpful assistant
TEMPLATE "{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>
# Default sampling
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER top_k 20
PARAMETER min_p 0.0
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 4096
The num_ctx value has been lowered to 4096 to increase speed significantly.
- Then run this command:
ollama create Qwen3-4B-f16:Q3_K_M -f Modelfile
You will now see "Qwen3-4B-f16:Q3_K_M" in your Ollama model list.
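Once the model shows up in your list, you can sanity-check it with a quick prompt (a hypothetical example; use whichever tag you created):

```bash
ollama run Qwen3-4B-f16:Q3_K_M "Summarise the benefits of a 4B-parameter model in one paragraph."
```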
These import steps are also useful if you want to customise the default parameters or system prompt.
Author
Geoff Munn (@geoffmunn)
Hugging Face Profile
Disclaimer
This is a community conversion for local inference. Not affiliated with Alibaba Cloud or the Qwen team.