File name changes
This view is limited to 50 files because the commit contains too many changes; see the raw diff for the complete change set.
- .gitattributes +7 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL/bench_metrics.json +44 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL/llamabench.md +11 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL/perplexity_code.log +173 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL/perplexity_general.log +173 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL/perplexity_math.log +173 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q5_K-attn_output_Q5_K-attn_q_Q5_K-embeddings_Q5_K-ffn_down_Q5_K-ffn_up_gate_Q5_K/bench_metrics.json +44 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q5_K-attn_output_Q5_K-attn_q_Q5_K-embeddings_Q5_K-ffn_down_Q5_K-ffn_up_gate_Q5_K/llamabench.md +11 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q5_K-attn_output_Q5_K-attn_q_Q5_K-embeddings_Q5_K-ffn_down_Q5_K-ffn_up_gate_Q5_K/perplexity_code.log +172 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q5_K-attn_output_Q5_K-attn_q_Q5_K-embeddings_Q5_K-ffn_down_Q5_K-ffn_up_gate_Q5_K/perplexity_general.log +172 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q5_K-attn_output_Q5_K-attn_q_Q5_K-embeddings_Q5_K-ffn_down_Q5_K-ffn_up_gate_Q5_K/perplexity_math.log +172 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q6_K-attn_output_Q6_K-attn_q_Q6_K-embeddings_Q6_K-ffn_down_Q6_K-ffn_up_gate_Q6_K/bench_metrics.json +44 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q6_K-attn_output_Q6_K-attn_q_Q6_K-embeddings_Q6_K-ffn_down_Q6_K-ffn_up_gate_Q6_K/llamabench.md +11 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q6_K-attn_output_Q6_K-attn_q_Q6_K-embeddings_Q6_K-ffn_down_Q6_K-ffn_up_gate_Q6_K/perplexity_code.log +172 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q6_K-attn_output_Q6_K-attn_q_Q6_K-embeddings_Q6_K-ffn_down_Q6_K-ffn_up_gate_Q6_K/perplexity_general.log +172 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q6_K-attn_output_Q6_K-attn_q_Q6_K-embeddings_Q6_K-ffn_down_Q6_K-ffn_up_gate_Q6_K/perplexity_math.log +172 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/bench_metrics.json +44 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/llamabench.md +11 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/perplexity_code.log +172 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/perplexity_general.log +172 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/perplexity_math.log +172 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_MXFP4/bench_metrics.json +44 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_MXFP4/llamabench.md +11 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_MXFP4/perplexity_code.log +173 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_MXFP4/perplexity_general.log +173 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_MXFP4/perplexity_math.log +173 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_MXFP4-ffn_up_gate_BF16/bench_metrics.json +44 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_MXFP4-ffn_up_gate_BF16/llamabench.md +11 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_MXFP4-ffn_up_gate_BF16/perplexity_code.log +173 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_MXFP4-ffn_up_gate_BF16/perplexity_general.log +173 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_MXFP4-ffn_up_gate_BF16/perplexity_math.log +173 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_MXFP4-ffn_down_BF16-ffn_up_gate_BF16/bench_metrics.json +44 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_MXFP4-ffn_down_BF16-ffn_up_gate_BF16/llamabench.md +11 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_MXFP4-ffn_down_BF16-ffn_up_gate_BF16/perplexity_code.log +173 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_MXFP4-ffn_down_BF16-ffn_up_gate_BF16/perplexity_general.log +173 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_MXFP4-ffn_down_BF16-ffn_up_gate_BF16/perplexity_math.log +173 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_MXFP4-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/bench_metrics.json +44 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_MXFP4-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/llamabench.md +11 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_MXFP4-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/perplexity_code.log +173 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_MXFP4-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/perplexity_general.log +173 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_MXFP4-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/perplexity_math.log +173 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_IQ4_NL-attn_q_Q8_0-embeddings_Q8_0-ffn_down_Q8_0-ffn_up_gate_Q8_0/bench_metrics.json +44 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_IQ4_NL-attn_q_Q8_0-embeddings_Q8_0-ffn_down_Q8_0-ffn_up_gate_Q8_0/llamabench.md +11 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_IQ4_NL-attn_q_Q8_0-embeddings_Q8_0-ffn_down_Q8_0-ffn_up_gate_Q8_0/perplexity_code.log +174 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_IQ4_NL-attn_q_Q8_0-embeddings_Q8_0-ffn_down_Q8_0-ffn_up_gate_Q8_0/perplexity_general.log +174 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_IQ4_NL-attn_q_Q8_0-embeddings_Q8_0-ffn_down_Q8_0-ffn_up_gate_Q8_0/perplexity_math.log +174 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_MXFP4-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/bench_metrics.json +44 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_MXFP4-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/llamabench.md +11 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_MXFP4-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/perplexity_code.log +173 -0
- Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_MXFP4-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/perplexity_general.log +173 -0
.gitattributes
CHANGED
@@ -33,3 +33,10 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
+Qwen3-4B-Instruct-2507-IQ4_NL.gguf filter=lfs diff=lfs merge=lfs -text
+Qwen3-4B-Instruct-2507-mxfp4_moe-EQUD-IQ4NL-KO-MXFP4.gguf filter=lfs diff=lfs merge=lfs -text
+Qwen3-4B-Instruct-2507-mxfp4_moe-K-B16-QO-Q6K-EUD-Q8_0.gguf filter=lfs diff=lfs merge=lfs -text
+Qwen3-4B-Instruct-2507-mxfp4_moe-K-B16-QU-IQ4NL-O-MXFP4-E-Q5K-D-Q6K.gguf filter=lfs diff=lfs merge=lfs -text
+Qwen3-4B-Instruct-2507-mxfp4_moe-O-Q5K-EQKUD-Q6K.gguf filter=lfs diff=lfs merge=lfs -text
+Qwen3-4B-Instruct-2507-mxfp4_moe-QUD-IQ4NL-KO-MXFP4-E-Q8_0.gguf filter=lfs diff=lfs merge=lfs -text
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL/bench_metrics.json
ADDED
@@ -0,0 +1,44 @@
{
  "raw_metrics": {
    "llamabench": {
      "backend": "CUDA",
      "log_path": "Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL/llamabench.md",
      "ngl": "35",
      "raw_row": {
        "backend": "CUDA",
        "model": "qwen3 4B IQ4_NL - 4.5 bpw",
        "ngl": "35",
        "params": "4.02 B",
        "size": "2.20 GiB",
        "t/s": "468.87 \u00b1 9.10",
        "test": "pp8",
        "tps_value": 468.87
      },
      "test": "pp8",
      "tps": 468.87
    },
    "perplexity": {
      "code": {
        "log_path": "Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL/perplexity_code.log",
        "ppl": 1.5649,
        "ppl_error": 0.01235
      },
      "general": {
        "log_path": "Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL/perplexity_general.log",
        "ppl": 9.0168,
        "ppl_error": 0.20767
      },
      "math": {
        "log_path": "Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL/perplexity_math.log",
        "ppl": 6.7694,
        "ppl_error": 0.13664
      }
    }
  },
  "summary": {
    "avg_prec_loss_pct": 1.1921,
    "bench_tps": 468.87,
    "file_size_bytes": 2369546976,
    "file_size_gb": 2.21
  }
}
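Each quantization variant directory in this commit carries a bench_metrics.json with the layout shown above, so the size/speed/perplexity trade-offs can be aggregated in a few lines. A minimal sketch, assuming the repository root contains the Benchmarks/DataCollection tree from this diff (the root path is the only assumption; all key names come from the file above):

```python
import json
from pathlib import Path

# Assumed location of the per-variant directories added in this commit.
ROOT = Path("Benchmarks/DataCollection")

rows = []
for metrics_file in sorted(ROOT.glob("*/bench_metrics.json")):
    data = json.loads(metrics_file.read_text())
    summary = data["summary"]                       # keys as in the JSON above
    ppl = data["raw_metrics"]["perplexity"]
    rows.append({
        "variant": metrics_file.parent.name,
        "size_gb": summary["file_size_gb"],
        "tps": summary["bench_tps"],
        "loss_pct": summary["avg_prec_loss_pct"],
        "ppl_code": ppl["code"]["ppl"],
        "ppl_general": ppl["general"]["ppl"],
        "ppl_math": ppl["math"]["ppl"],
    })

# Smallest average precision loss first.
for r in sorted(rows, key=lambda r: r["loss_pct"]):
    print(f'{r["variant"][:64]:64} {r["size_gb"]:5.2f} GB '
          f'{r["tps"]:7.2f} t/s  loss {r["loss_pct"]:.2f}%')
```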
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL/llamabench.md
ADDED
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3 4B IQ4_NL - 4.5 bpw | 2.20 GiB | 4.02 B | CUDA | 35 | pp8 | 468.87 ± 9.10 |
| qwen3 4B IQ4_NL - 4.5 bpw | 2.20 GiB | 4.02 B | CUDA | 35 | tg128 | 88.29 ± 0.84 |

build: 92bb442ad (7040)
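The llamabench.md files all follow this markdown table layout, so the pp8/tg128 throughput numbers can be recovered without re-running the benchmark. A small parsing sketch under that assumption (the file path in the comment is illustrative only):

```python
import re

def parse_llamabench(md_text: str) -> dict:
    """Extract test-name -> mean tokens/sec from a llama-bench markdown
    table like the one above (the ± spread is discarded)."""
    results = {}
    for line in md_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Data rows have 7 cells: model, size, params, backend, ngl, test, t/s;
        # the separator row (only dashes/colons) and header row are skipped.
        if len(cells) == 7 and not set(cells[1]) <= set("-: "):
            m = re.match(r"([\d.]+)", cells[6])
            if m:
                results[cells[5]] = float(m.group(1))
    return results

# e.g. parse_llamabench(Path(".../llamabench.md").read_text())
# -> {"pp8": 468.87, "tg128": 88.29}
```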
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL/perplexity_code.log
ADDED
@@ -0,0 +1,173 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20307 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
llama_model_loader: loaded meta data with 36 key-value pairs and 398 tensors from /mnt/world8/AI/ToBench/Qwen3-4B-Instruct-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507 Unsloth
llama_model_loader: - kv 3: general.version str = 2507
llama_model_loader: - kv 4: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 5: general.basename str = Qwen3
llama_model_loader: - kv 6: general.size_label str = 4B
llama_model_loader: - kv 7: general.license str = apache-2.0
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
llama_model_loader: - kv 9: general.base_model.count u32 = 1
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 4B Instruct 2507
llama_model_loader: - kv 11: general.base_model.0.version str = 2507
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B-...
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
llama_model_loader: - kv 15: qwen3.block_count u32 = 36
llama_model_loader: - kv 16: qwen3.context_length u32 = 262144
llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 5000000.000000
llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 34: general.quantization_version u32 = 2
llama_model_loader: - kv 35: general.file_type u32 = 25
llama_model_loader: - type f32: 145 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_loader: - type iq4_nl: 252 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = IQ4_NL - 4.5 bpw
print_info: file size = 2.20 GiB (4.70 BPW)
load: printing all EOG tokens:
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3
print_info: vocab_only = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 2560
print_info: n_embd_inp = 2560
print_info: n_layer = 36
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 9728
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: model type = 4B
print_info: model params = 4.02 B
print_info: general.name = Qwen3 4B Instruct 2507 Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/37 layers to GPU
load_tensors: CPU_Mapped model buffer size = 1170.87 MiB
load_tensors: CUDA0 model buffer size = 541.61 MiB
load_tensors: CUDA1 model buffer size = 541.61 MiB
........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 5000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.58 MiB
llama_kv_cache: CPU KV buffer size = 128.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 606.03 MiB
llama_context: CUDA1 compute buffer size = 74.01 MiB
llama_context: CUDA_Host compute buffer size = 9.01 MiB
llama_context: graph nodes = 1267
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 109.937 ms
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 0.97 seconds per pass - ETA 0.70 minutes
[1]3.2056,[2]2.5386,[3]1.8624,[4]1.7201,[5]1.8522,[6]1.9004,[7]1.8533,[8]1.8176,[9]1.7295,[10]1.6713,[11]1.6343,[12]1.6380,[13]1.6018,[14]1.5773,[15]1.6019,[16]1.5786,[17]1.5624,[18]1.5688,[19]1.5533,[20]1.5341,[21]1.5268,[22]1.5223,[23]1.5446,[24]1.5302,[25]1.5352,[26]1.5173,[27]1.5073,[28]1.5053,[29]1.5209,[30]1.5248,[31]1.5144,[32]1.5037,[33]1.5061,[34]1.5035,[35]1.5024,[36]1.5297,[37]1.5402,[38]1.5472,[39]1.5547,[40]1.5551,[41]1.5484,[42]1.5626,[43]1.5635,[44]1.5649,
Final estimate: PPL = 1.5649 +/- 0.01235

llama_perf_context_print: load time = 535.05 ms
llama_perf_context_print: prompt eval time = 32937.91 ms / 90112 tokens ( 0.37 ms per token, 2735.81 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 34131.57 ms / 90113 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 18973 + (1227 = 541 + 80 + 606) + 3914 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 22266 + ( 695 = 541 + 80 + 74) + 1162 |
llama_memory_breakdown_print: | - Host | 1307 = 1170 + 128 + 9 |
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL/perplexity_general.log
ADDED
@@ -0,0 +1,173 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20300 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
llama_model_loader: loaded meta data with 36 key-value pairs and 398 tensors from /mnt/world8/AI/ToBench/Qwen3-4B-Instruct-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507 Unsloth
llama_model_loader: - kv 3: general.version str = 2507
llama_model_loader: - kv 4: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 5: general.basename str = Qwen3
llama_model_loader: - kv 6: general.size_label str = 4B
llama_model_loader: - kv 7: general.license str = apache-2.0
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
llama_model_loader: - kv 9: general.base_model.count u32 = 1
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 4B Instruct 2507
llama_model_loader: - kv 11: general.base_model.0.version str = 2507
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B-...
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
llama_model_loader: - kv 15: qwen3.block_count u32 = 36
llama_model_loader: - kv 16: qwen3.context_length u32 = 262144
llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 5000000.000000
llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 34: general.quantization_version u32 = 2
llama_model_loader: - kv 35: general.file_type u32 = 25
llama_model_loader: - type f32: 145 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_loader: - type iq4_nl: 252 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = IQ4_NL - 4.5 bpw
print_info: file size = 2.20 GiB (4.70 BPW)
load: printing all EOG tokens:
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3
print_info: vocab_only = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 2560
print_info: n_embd_inp = 2560
print_info: n_layer = 36
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 9728
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: model type = 4B
print_info: model params = 4.02 B
print_info: general.name = Qwen3 4B Instruct 2507 Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/37 layers to GPU
load_tensors: CPU_Mapped model buffer size = 1170.87 MiB
load_tensors: CUDA0 model buffer size = 541.61 MiB
load_tensors: CUDA1 model buffer size = 541.61 MiB
........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 5000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.58 MiB
llama_kv_cache: CPU KV buffer size = 128.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 606.03 MiB
llama_context: CUDA1 compute buffer size = 74.01 MiB
llama_context: CUDA_Host compute buffer size = 9.01 MiB
llama_context: graph nodes = 1267
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 46.037 ms
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 0.98 seconds per pass - ETA 0.23 minutes
[1]8.3220,[2]10.4926,[3]10.8352,[4]10.5193,[5]10.2783,[6]8.7451,[7]7.8393,[8]7.8188,[9]8.2655,[10]8.3966,[11]8.4139,[12]8.7684,[13]8.8202,[14]8.9552,[15]9.0168,
Final estimate: PPL = 9.0168 +/- 0.20767

llama_perf_context_print: load time = 539.80 ms
llama_perf_context_print: prompt eval time = 11398.36 ms / 30720 tokens ( 0.37 ms per token, 2695.12 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 11814.19 ms / 30721 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 18979 + (1227 = 541 + 80 + 606) + 3907 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 22266 + ( 695 = 541 + 80 + 74) + 1162 |
llama_memory_breakdown_print: | - Host | 1307 = 1170 + 128 + 9 |
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL/perplexity_math.log
ADDED
@@ -0,0 +1,173 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20304 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
llama_model_loader: loaded meta data with 36 key-value pairs and 398 tensors from /mnt/world8/AI/ToBench/Qwen3-4B-Instruct-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507 Unsloth
llama_model_loader: - kv 3: general.version str = 2507
llama_model_loader: - kv 4: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 5: general.basename str = Qwen3
llama_model_loader: - kv 6: general.size_label str = 4B
llama_model_loader: - kv 7: general.license str = apache-2.0
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
llama_model_loader: - kv 9: general.base_model.count u32 = 1
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 4B Instruct 2507
llama_model_loader: - kv 11: general.base_model.0.version str = 2507
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B-...
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
llama_model_loader: - kv 15: qwen3.block_count u32 = 36
llama_model_loader: - kv 16: qwen3.context_length u32 = 262144
llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 5000000.000000
llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 34: general.quantization_version u32 = 2
llama_model_loader: - kv 35: general.file_type u32 = 25
llama_model_loader: - type f32: 145 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_loader: - type iq4_nl: 252 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = IQ4_NL - 4.5 bpw
print_info: file size = 2.20 GiB (4.70 BPW)
load: printing all EOG tokens:
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3
print_info: vocab_only = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 2560
print_info: n_embd_inp = 2560
print_info: n_layer = 36
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 9728
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: model type = 4B
print_info: model params = 4.02 B
print_info: general.name = Qwen3 4B Instruct 2507 Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/37 layers to GPU
load_tensors: CPU_Mapped model buffer size = 1170.87 MiB
load_tensors: CUDA0 model buffer size = 541.61 MiB
load_tensors: CUDA1 model buffer size = 541.61 MiB
........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 5000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.58 MiB
llama_kv_cache: CPU KV buffer size = 128.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 606.03 MiB
llama_context: CUDA1 compute buffer size = 74.01 MiB
llama_context: CUDA_Host compute buffer size = 9.01 MiB
llama_context: graph nodes = 1267
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 44.195 ms
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 0.97 seconds per pass - ETA 0.25 minutes
[1]5.7561,[2]6.3649,[3]6.5996,[4]6.7573,[5]6.9766,[6]6.9085,[7]6.8450,[8]6.7376,[9]6.7476,[10]6.6983,[11]6.7241,[12]6.7065,[13]6.7872,[14]6.7934,[15]6.7853,[16]6.7694,
Final estimate: PPL = 6.7694 +/- 0.13664

llama_perf_context_print: load time = 539.76 ms
llama_perf_context_print: prompt eval time = 12072.70 ms / 32768 tokens ( 0.37 ms per token, 2714.22 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 12509.73 ms / 32769 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 18980 + (1227 = 541 + 80 + 606) + 3906 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 22266 + ( 695 = 541 + 80 + 74) + 1162 |
llama_memory_breakdown_print: | - Host | 1307 = 1170 + 128 + 9 |
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q5_K-attn_output_Q5_K-attn_q_Q5_K-embeddings_Q5_K-ffn_down_Q5_K-ffn_up_gate_Q5_K/bench_metrics.json
ADDED
@@ -0,0 +1,44 @@
{
  "raw_metrics": {
    "llamabench": {
      "backend": "CUDA",
      "log_path": "Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q5_K-attn_output_Q5_K-attn_q_Q5_K-embeddings_Q5_K-ffn_down_Q5_K-ffn_up_gate_Q5_K/llamabench.md",
      "ngl": "35",
      "raw_row": {
        "backend": "CUDA",
        "model": "qwen3 4B IQ4_NL - 4.5 bpw",
        "ngl": "35",
        "params": "4.02 B",
        "size": "2.58 GiB",
        "t/s": "421.49 \u00b1 3.64",
        "test": "pp8",
        "tps_value": 421.49
      },
      "test": "pp8",
      "tps": 421.49
    },
    "perplexity": {
      "code": {
        "log_path": "Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q5_K-attn_output_Q5_K-attn_q_Q5_K-embeddings_Q5_K-ffn_down_Q5_K-ffn_up_gate_Q5_K/perplexity_code.log",
        "ppl": 1.5575,
        "ppl_error": 0.0124
      },
      "general": {
        "log_path": "Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q5_K-attn_output_Q5_K-attn_q_Q5_K-embeddings_Q5_K-ffn_down_Q5_K-ffn_up_gate_Q5_K/perplexity_general.log",
        "ppl": 9.0853,
        "ppl_error": 0.21171
      },
      "math": {
        "log_path": "Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q5_K-attn_output_Q5_K-attn_q_Q5_K-embeddings_Q5_K-ffn_down_Q5_K-ffn_up_gate_Q5_K/perplexity_math.log",
        "ppl": 6.8722,
        "ppl_error": 0.14187
      }
    }
  },
  "summary": {
    "avg_prec_loss_pct": 1.8004,
    "bench_tps": 421.49,
    "file_size_bytes": 2772053216,
    "file_size_gb": 2.58
  }
}
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q5_K-attn_output_Q5_K-attn_q_Q5_K-embeddings_Q5_K-ffn_down_Q5_K-ffn_up_gate_Q5_K/llamabench.md
ADDED
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3 4B IQ4_NL - 4.5 bpw | 2.58 GiB | 4.02 B | CUDA | 35 | pp8 | 421.49 ± 3.64 |
| qwen3 4B IQ4_NL - 4.5 bpw | 2.58 GiB | 4.02 B | CUDA | 35 | tg128 | 90.39 ± 0.91 |

build: 92bb442ad (7040)
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q5_K-attn_output_Q5_K-attn_q_Q5_K-embeddings_Q5_K-ffn_down_Q5_K-ffn_up_gate_Q5_K/perplexity_code.log
ADDED
@@ -0,0 +1,172 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20035 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
llama_model_loader: loaded meta data with 36 key-value pairs and 398 tensors from /mnt/world8/AI/ToBench/Qwen3-4B-Instruct-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q5_K-attn_output_Q5_K-attn_q_Q5_K-embeddings_Q5_K-ffn_down_Q5_K-ffn_up_gate_Q5_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507 Unsloth
llama_model_loader: - kv 3: general.version str = 2507
llama_model_loader: - kv 4: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 5: general.basename str = Qwen3
llama_model_loader: - kv 6: general.size_label str = 4B
llama_model_loader: - kv 7: general.license str = apache-2.0
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
llama_model_loader: - kv 9: general.base_model.count u32 = 1
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 4B Instruct 2507
llama_model_loader: - kv 11: general.base_model.0.version str = 2507
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B-...
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
llama_model_loader: - kv 15: qwen3.block_count u32 = 36
llama_model_loader: - kv 16: qwen3.context_length u32 = 262144
llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 5000000.000000
llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 34: general.quantization_version u32 = 2
llama_model_loader: - kv 35: general.file_type u32 = 25
llama_model_loader: - type f32: 145 tensors
llama_model_loader: - type q5_K: 253 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = IQ4_NL - 4.5 bpw
print_info: file size = 2.58 GiB (5.50 BPW)
load: printing all EOG tokens:
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3
print_info: vocab_only = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 2560
print_info: n_embd_inp = 2560
print_info: n_layer = 36
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 9728
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: model type = 4B
print_info: model params = 4.02 B
print_info: general.name = Qwen3 4B Instruct 2507 Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/37 layers to GPU
load_tensors: CPU_Mapped model buffer size = 1314.11 MiB
load_tensors: CUDA0 model buffer size = 661.92 MiB
load_tensors: CUDA1 model buffer size = 661.92 MiB
............................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 5000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.58 MiB
llama_kv_cache: CPU KV buffer size = 128.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 556.77 MiB
llama_context: CUDA1 compute buffer size = 74.01 MiB
llama_context: CUDA_Host compute buffer size = 9.01 MiB
llama_context: graph nodes = 1267
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 110.99 ms
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 1.04 seconds per pass - ETA 0.75 minutes
[1]3.1459,[2]2.4764,[3]1.8312,[4]1.6879,[5]1.8171,[6]1.8737,[7]1.8269,[8]1.7987,[9]1.7145,[10]1.6564,[11]1.6220,[12]1.6245,[13]1.5899,[14]1.5659,[15]1.5884,[16]1.5664,[17]1.5538,[18]1.5599,[19]1.5446,[20]1.5252,[21]1.5179,[22]1.5138,[23]1.5352,[24]1.5214,[25]1.5274,[26]1.5097,[27]1.5003,[28]1.4982,[29]1.5134,[30]1.5169,[31]1.5064,[32]1.4954,[33]1.4978,[34]1.4952,[35]1.4945,[36]1.5223,[37]1.5326,[38]1.5389,[39]1.5461,[40]1.5473,[41]1.5408,[42]1.5550,[43]1.5567,[44]1.5575,
Final estimate: PPL = 1.5575 +/- 0.01240

llama_perf_context_print: load time = 590.08 ms
llama_perf_context_print: prompt eval time = 36077.82 ms / 90112 tokens ( 0.40 ms per token, 2497.71 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 37270.20 ms / 90113 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 18641 + (1298 = 661 + 80 + 556) + 4175 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 22146 + ( 815 = 661 + 80 + 74) + 1162 |
llama_memory_breakdown_print: | - Host | 1451 = 1314 + 128 + 9 |
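The `ppl` and `ppl_error` fields in bench_metrics.json correspond to the "Final estimate" line that closes each of these perplexity logs. A minimal sketch of extracting that pair from a log is below; the helper name and log path are illustrative, only the line format comes from the log above.

```python
import re

# Matches e.g. "Final estimate: PPL = 1.5575 +/- 0.01240" as printed by llama-perplexity.
FINAL_RE = re.compile(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)")

def read_final_ppl(log_path: str) -> tuple[float, float]:
    """Return (ppl, ppl_error) from the first 'Final estimate' line in the log."""
    with open(log_path, encoding="utf-8") as fh:
        for line in fh:
            m = FINAL_RE.search(line)
            if m:
                return float(m.group(1)), float(m.group(2))
    raise ValueError(f"no final PPL estimate found in {log_path}")

ppl, err = read_final_ppl("perplexity_code.log")
print(ppl, err)
```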
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q5_K-attn_output_Q5_K-attn_q_Q5_K-embeddings_Q5_K-ffn_down_Q5_K-ffn_up_gate_Q5_K/perplexity_general.log
ADDED
|
@@ -0,0 +1,172 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20037 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 398 tensors from /mnt/world8/AI/ToBench/Qwen3-4B-Instruct-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q5_K-attn_output_Q5_K-attn_q_Q5_K-embeddings_Q5_K-ffn_down_Q5_K-ffn_up_gate_Q5_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507 Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct-unsloth
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.count u32 = 1
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 4B Instruct 2507
|
| 22 |
+
llama_model_loader: - kv 11: general.base_model.0.version str = 2507
|
| 23 |
+
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
|
| 24 |
+
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 25 |
+
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.block_count u32 = 36
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.context_length u32 = 262144
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
|
| 31 |
+
llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
|
| 32 |
+
llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 5000000.000000
|
| 33 |
+
llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 34 |
+
llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151654
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 25
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q5_K: 253 tensors
|
| 49 |
+
print_info: file format = GGUF V3 (latest)
|
| 50 |
+
print_info: file type = IQ4_NL - 4.5 bpw
|
| 51 |
+
print_info: file size = 2.58 GiB (5.50 BPW)
|
| 52 |
+
load: printing all EOG tokens:
|
| 53 |
+
load: - 151643 ('<|endoftext|>')
|
| 54 |
+
load: - 151645 ('<|im_end|>')
|
| 55 |
+
load: - 151662 ('<|fim_pad|>')
|
| 56 |
+
load: - 151663 ('<|repo_name|>')
|
| 57 |
+
load: - 151664 ('<|file_sep|>')
|
| 58 |
+
load: special tokens cache size = 26
|
| 59 |
+
load: token to piece cache size = 0.9311 MB
|
| 60 |
+
print_info: arch = qwen3
|
| 61 |
+
print_info: vocab_only = 0
|
| 62 |
+
print_info: n_ctx_train = 262144
|
| 63 |
+
print_info: n_embd = 2560
|
| 64 |
+
print_info: n_embd_inp = 2560
|
| 65 |
+
print_info: n_layer = 36
|
| 66 |
+
print_info: n_head = 32
|
| 67 |
+
print_info: n_head_kv = 8
|
| 68 |
+
print_info: n_rot = 128
|
| 69 |
+
print_info: n_swa = 0
|
| 70 |
+
print_info: is_swa_any = 0
|
| 71 |
+
print_info: n_embd_head_k = 128
|
| 72 |
+
print_info: n_embd_head_v = 128
|
| 73 |
+
print_info: n_gqa = 4
|
| 74 |
+
print_info: n_embd_k_gqa = 1024
|
| 75 |
+
print_info: n_embd_v_gqa = 1024
|
| 76 |
+
print_info: f_norm_eps = 0.0e+00
|
| 77 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 78 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 79 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 80 |
+
print_info: f_logit_scale = 0.0e+00
|
| 81 |
+
print_info: f_attn_scale = 0.0e+00
|
| 82 |
+
print_info: n_ff = 9728
|
| 83 |
+
print_info: n_expert = 0
|
| 84 |
+
print_info: n_expert_used = 0
|
| 85 |
+
print_info: n_expert_groups = 0
|
| 86 |
+
print_info: n_group_used = 0
|
| 87 |
+
print_info: causal attn = 1
|
| 88 |
+
print_info: pooling type = -1
|
| 89 |
+
print_info: rope type = 2
|
| 90 |
+
print_info: rope scaling = linear
|
| 91 |
+
print_info: freq_base_train = 5000000.0
|
| 92 |
+
print_info: freq_scale_train = 1
|
| 93 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 94 |
+
print_info: rope_finetuned = unknown
|
| 95 |
+
print_info: model type = 4B
|
| 96 |
+
print_info: model params = 4.02 B
|
| 97 |
+
print_info: general.name = Qwen3 4B Instruct 2507 Unsloth
|
| 98 |
+
print_info: vocab type = BPE
|
| 99 |
+
print_info: n_vocab = 151936
|
| 100 |
+
print_info: n_merges = 151387
|
| 101 |
+
print_info: BOS token = 11 ','
|
| 102 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 103 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 104 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 105 |
+
print_info: LF token = 198 'Ċ'
|
| 106 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 107 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 108 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 109 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 110 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 111 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 112 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 113 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 114 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 115 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 116 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 117 |
+
print_info: max token length = 256
|
| 118 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 119 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 120 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 121 |
+
load_tensors: CPU_Mapped model buffer size = 1314.11 MiB
|
| 122 |
+
load_tensors: CUDA0 model buffer size = 661.92 MiB
|
| 123 |
+
load_tensors: CUDA1 model buffer size = 661.92 MiB
|
| 124 |
+
............................................................................................
|
| 125 |
+
llama_context: constructing llama_context
|
| 126 |
+
llama_context: n_seq_max = 1
|
| 127 |
+
llama_context: n_ctx = 2048
|
| 128 |
+
llama_context: n_ctx_seq = 2048
|
| 129 |
+
llama_context: n_batch = 2048
|
| 130 |
+
llama_context: n_ubatch = 512
|
| 131 |
+
llama_context: causal_attn = 1
|
| 132 |
+
llama_context: flash_attn = auto
|
| 133 |
+
llama_context: kv_unified = false
|
| 134 |
+
llama_context: freq_base = 5000000.0
|
| 135 |
+
llama_context: freq_scale = 1
|
| 136 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 137 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 138 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 139 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 140 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 141 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 142 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 143 |
+
llama_context: CUDA0 compute buffer size = 556.77 MiB
|
| 144 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 145 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 146 |
+
llama_context: graph nodes = 1267
|
| 147 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 148 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 149 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 150 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 153 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 154 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 155 |
+
|
| 156 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 157 |
+
perplexity: tokenizing the input ..
|
| 158 |
+
perplexity: tokenization took 48.02 ms
|
| 159 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 160 |
+
perplexity: 1.04 seconds per pass - ETA 0.25 minutes
|
| 161 |
+
[1]8.3552,[2]10.6053,[3]10.9507,[4]10.5876,[5]10.3214,[6]8.7948,[7]7.8974,[8]7.8830,[9]8.3180,[10]8.4666,[11]8.5002,[12]8.8508,[13]8.8980,[14]9.0362,[15]9.0853,
|
| 162 |
+
Final estimate: PPL = 9.0853 +/- 0.21171
|
| 163 |
+
|
| 164 |
+
llama_perf_context_print: load time = 578.36 ms
|
| 165 |
+
llama_perf_context_print: prompt eval time = 12445.33 ms / 30720 tokens ( 0.41 ms per token, 2468.40 tokens per second)
|
| 166 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 167 |
+
llama_perf_context_print: total time = 12864.87 ms / 30721 tokens
|
| 168 |
+
llama_perf_context_print: graphs reused = 0
|
| 169 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 170 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 18648 + (1298 = 661 + 80 + 556) + 4168 |
|
| 171 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 22146 + ( 815 = 661 + 80 + 74) + 1162 |
|
| 172 |
+
llama_memory_breakdown_print: | - Host | 1451 = 1314 + 128 + 9 |
|
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q5_K-attn_output_Q5_K-attn_q_Q5_K-embeddings_Q5_K-ffn_down_Q5_K-ffn_up_gate_Q5_K/perplexity_math.log
ADDED
|
@@ -0,0 +1,172 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20037 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 398 tensors from /mnt/world8/AI/ToBench/Qwen3-4B-Instruct-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q5_K-attn_output_Q5_K-attn_q_Q5_K-embeddings_Q5_K-ffn_down_Q5_K-ffn_up_gate_Q5_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507 Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct-unsloth
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.count u32 = 1
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 4B Instruct 2507
|
| 22 |
+
llama_model_loader: - kv 11: general.base_model.0.version str = 2507
|
| 23 |
+
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
|
| 24 |
+
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 25 |
+
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.block_count u32 = 36
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.context_length u32 = 262144
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
|
| 31 |
+
llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
|
| 32 |
+
llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 5000000.000000
|
| 33 |
+
llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 34 |
+
llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151654
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 25
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q5_K: 253 tensors
|
| 49 |
+
print_info: file format = GGUF V3 (latest)
|
| 50 |
+
print_info: file type = IQ4_NL - 4.5 bpw
|
| 51 |
+
print_info: file size = 2.58 GiB (5.50 BPW)
|
| 52 |
+
load: printing all EOG tokens:
|
| 53 |
+
load: - 151643 ('<|endoftext|>')
|
| 54 |
+
load: - 151645 ('<|im_end|>')
|
| 55 |
+
load: - 151662 ('<|fim_pad|>')
|
| 56 |
+
load: - 151663 ('<|repo_name|>')
|
| 57 |
+
load: - 151664 ('<|file_sep|>')
|
| 58 |
+
load: special tokens cache size = 26
|
| 59 |
+
load: token to piece cache size = 0.9311 MB
|
| 60 |
+
print_info: arch = qwen3
|
| 61 |
+
print_info: vocab_only = 0
|
| 62 |
+
print_info: n_ctx_train = 262144
|
| 63 |
+
print_info: n_embd = 2560
|
| 64 |
+
print_info: n_embd_inp = 2560
|
| 65 |
+
print_info: n_layer = 36
|
| 66 |
+
print_info: n_head = 32
|
| 67 |
+
print_info: n_head_kv = 8
|
| 68 |
+
print_info: n_rot = 128
|
| 69 |
+
print_info: n_swa = 0
|
| 70 |
+
print_info: is_swa_any = 0
|
| 71 |
+
print_info: n_embd_head_k = 128
|
| 72 |
+
print_info: n_embd_head_v = 128
|
| 73 |
+
print_info: n_gqa = 4
|
| 74 |
+
print_info: n_embd_k_gqa = 1024
|
| 75 |
+
print_info: n_embd_v_gqa = 1024
|
| 76 |
+
print_info: f_norm_eps = 0.0e+00
|
| 77 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 78 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 79 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 80 |
+
print_info: f_logit_scale = 0.0e+00
|
| 81 |
+
print_info: f_attn_scale = 0.0e+00
|
| 82 |
+
print_info: n_ff = 9728
|
| 83 |
+
print_info: n_expert = 0
|
| 84 |
+
print_info: n_expert_used = 0
|
| 85 |
+
print_info: n_expert_groups = 0
|
| 86 |
+
print_info: n_group_used = 0
|
| 87 |
+
print_info: causal attn = 1
|
| 88 |
+
print_info: pooling type = -1
|
| 89 |
+
print_info: rope type = 2
|
| 90 |
+
print_info: rope scaling = linear
|
| 91 |
+
print_info: freq_base_train = 5000000.0
|
| 92 |
+
print_info: freq_scale_train = 1
|
| 93 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 94 |
+
print_info: rope_finetuned = unknown
|
| 95 |
+
print_info: model type = 4B
|
| 96 |
+
print_info: model params = 4.02 B
|
| 97 |
+
print_info: general.name = Qwen3 4B Instruct 2507 Unsloth
|
| 98 |
+
print_info: vocab type = BPE
|
| 99 |
+
print_info: n_vocab = 151936
|
| 100 |
+
print_info: n_merges = 151387
|
| 101 |
+
print_info: BOS token = 11 ','
|
| 102 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 103 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 104 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 105 |
+
print_info: LF token = 198 'Ċ'
|
| 106 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 107 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 108 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 109 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 110 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 111 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 112 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 113 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 114 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 115 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 116 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 117 |
+
print_info: max token length = 256
|
| 118 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 119 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 120 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 121 |
+
load_tensors: CPU_Mapped model buffer size = 1314.11 MiB
|
| 122 |
+
load_tensors: CUDA0 model buffer size = 661.92 MiB
|
| 123 |
+
load_tensors: CUDA1 model buffer size = 661.92 MiB
|
| 124 |
+
............................................................................................
|
| 125 |
+
llama_context: constructing llama_context
|
| 126 |
+
llama_context: n_seq_max = 1
|
| 127 |
+
llama_context: n_ctx = 2048
|
| 128 |
+
llama_context: n_ctx_seq = 2048
|
| 129 |
+
llama_context: n_batch = 2048
|
| 130 |
+
llama_context: n_ubatch = 512
|
| 131 |
+
llama_context: causal_attn = 1
|
| 132 |
+
llama_context: flash_attn = auto
|
| 133 |
+
llama_context: kv_unified = false
|
| 134 |
+
llama_context: freq_base = 5000000.0
|
| 135 |
+
llama_context: freq_scale = 1
|
| 136 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 137 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 138 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 139 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 140 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 141 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 142 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 143 |
+
llama_context: CUDA0 compute buffer size = 556.77 MiB
|
| 144 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 145 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 146 |
+
llama_context: graph nodes = 1267
|
| 147 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 148 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 149 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 150 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 153 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 154 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 155 |
+
|
| 156 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 157 |
+
perplexity: tokenizing the input ..
|
| 158 |
+
perplexity: tokenization took 43.208 ms
|
| 159 |
+
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 160 |
+
perplexity: 1.04 seconds per pass - ETA 0.27 minutes
|
| 161 |
+
[1]5.5686,[2]6.2239,[3]6.5091,[4]6.7071,[5]6.9573,[6]6.9005,[7]6.8707,[8]6.7695,[9]6.8001,[10]6.7571,[11]6.8019,[12]6.7915,[13]6.8825,[14]6.8848,[15]6.8838,[16]6.8722,
|
| 162 |
+
Final estimate: PPL = 6.8722 +/- 0.14187
|
| 163 |
+
|
| 164 |
+
llama_perf_context_print: load time = 577.58 ms
|
| 165 |
+
llama_perf_context_print: prompt eval time = 13318.00 ms / 32768 tokens ( 0.41 ms per token, 2460.43 tokens per second)
|
| 166 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 167 |
+
llama_perf_context_print: total time = 13752.86 ms / 32769 tokens
|
| 168 |
+
llama_perf_context_print: graphs reused = 0
|
| 169 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 170 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 18645 + (1298 = 661 + 80 + 556) + 4170 |
|
| 171 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 22146 + ( 815 = 661 + 80 + 74) + 1162 |
|
| 172 |
+
llama_memory_breakdown_print: | - Host | 1451 = 1314 + 128 + 9 |
|
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q6_K-attn_output_Q6_K-attn_q_Q6_K-embeddings_Q6_K-ffn_down_Q6_K-ffn_up_gate_Q6_K/bench_metrics.json
ADDED
@@ -0,0 +1,44 @@
{
  "raw_metrics": {
    "llamabench": {
      "backend": "CUDA",
      "log_path": "Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q6_K-attn_output_Q6_K-attn_q_Q6_K-embeddings_Q6_K-ffn_down_Q6_K-ffn_up_gate_Q6_K/llamabench.md",
      "ngl": "35",
      "raw_row": {
        "backend": "CUDA",
        "model": "qwen3 4B IQ4_NL - 4.5 bpw",
        "ngl": "35",
        "params": "4.02 B",
        "size": "3.07 GiB",
        "t/s": "418.79 \u00b1 8.87",
        "test": "pp8",
        "tps_value": 418.79
      },
      "test": "pp8",
      "tps": 418.79
    },
    "perplexity": {
      "code": {
        "log_path": "Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q6_K-attn_output_Q6_K-attn_q_Q6_K-embeddings_Q6_K-ffn_down_Q6_K-ffn_up_gate_Q6_K/perplexity_code.log",
        "ppl": 1.5452,
        "ppl_error": 0.01212
      },
      "general": {
        "log_path": "Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q6_K-attn_output_Q6_K-attn_q_Q6_K-embeddings_Q6_K-ffn_down_Q6_K-ffn_up_gate_Q6_K/perplexity_general.log",
        "ppl": 8.8441,
        "ppl_error": 0.20336
      },
      "math": {
        "log_path": "Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q6_K-attn_output_Q6_K-attn_q_Q6_K-embeddings_Q6_K-ffn_down_Q6_K-ffn_up_gate_Q6_K/perplexity_math.log",
        "ppl": 6.6952,
        "ppl_error": 0.13573
      }
    }
  },
  "summary": {
    "avg_prec_loss_pct": 0.2492,
    "bench_tps": 418.79,
    "file_size_bytes": 3306261216,
    "file_size_gb": 3.08
  }
}
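With two variants covered so far (the Q5_K file earlier and this Q6_K file), a side-by-side comparison of their metrics could be put together as in the sketch below. The relative paths are hypothetical but follow the directory layout of this commit; the fields read are the ones present in the JSON above.

```python
import json

# Hypothetical relative paths; directory names follow the layout of this commit.
variants = {
    "Q5_K": "Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q5_K-attn_output_Q5_K-attn_q_Q5_K-embeddings_Q5_K-ffn_down_Q5_K-ffn_up_gate_Q5_K/bench_metrics.json",
    "Q6_K": "Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q6_K-attn_output_Q6_K-attn_q_Q6_K-embeddings_Q6_K-ffn_down_Q6_K-ffn_up_gate_Q6_K/bench_metrics.json",
}

for name, path in variants.items():
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    summary = data["summary"]
    general_ppl = data["raw_metrics"]["perplexity"]["general"]["ppl"]
    print(f"{name}: {summary['file_size_gb']:.2f} GB, "
          f"{summary['bench_tps']:.2f} t/s (pp8), general PPL {general_ppl}")
```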
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q6_K-attn_output_Q6_K-attn_q_Q6_K-embeddings_Q6_K-ffn_down_Q6_K-ffn_up_gate_Q6_K/llamabench.md
ADDED
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3 4B IQ4_NL - 4.5 bpw | 3.07 GiB | 4.02 B | CUDA | 35 | pp8 | 418.79 ± 8.87 |
| qwen3 4B IQ4_NL - 4.5 bpw | 3.07 GiB | 4.02 B | CUDA | 35 | tg128 | 76.30 ± 1.03 |

build: 92bb442ad (7040)
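The `size` column reported by llama-bench is in GiB, while bench_metrics.json stores both the raw byte count and a rounded `file_size_gb`. A one-line check relating the two, using the byte counts from the files above (the tiny mismatch against the table, 3.07 vs 3.08, is just rounding versus truncation):

```python
# file_size_bytes values taken from the two bench_metrics.json files above; GiB = bytes / 2**30.
for size_bytes in (2772053216, 3306261216):
    print(f"{size_bytes} bytes = {size_bytes / 2**30:.2f} GiB")
```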
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q6_K-attn_output_Q6_K-attn_q_Q6_K-embeddings_Q6_K-ffn_down_Q6_K-ffn_up_gate_Q6_K/perplexity_code.log
ADDED
|
@@ -0,0 +1,172 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20306 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 398 tensors from /mnt/world8/AI/ToBench/Qwen3-4B-Instruct-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q6_K-attn_output_Q6_K-attn_q_Q6_K-embeddings_Q6_K-ffn_down_Q6_K-ffn_up_gate_Q6_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507 Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct-unsloth
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.count u32 = 1
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 4B Instruct 2507
|
| 22 |
+
llama_model_loader: - kv 11: general.base_model.0.version str = 2507
|
| 23 |
+
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
|
| 24 |
+
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 25 |
+
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.block_count u32 = 36
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.context_length u32 = 262144
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
|
| 31 |
+
llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
|
| 32 |
+
llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 5000000.000000
|
| 33 |
+
llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 34 |
+
llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151654
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 25
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q6_K: 253 tensors
|
| 49 |
+
print_info: file format = GGUF V3 (latest)
|
| 50 |
+
print_info: file type = IQ4_NL - 4.5 bpw
|
| 51 |
+
print_info: file size = 3.07 GiB (6.56 BPW)
|
| 52 |
+
load: printing all EOG tokens:
|
| 53 |
+
load: - 151643 ('<|endoftext|>')
|
| 54 |
+
load: - 151645 ('<|im_end|>')
|
| 55 |
+
load: - 151662 ('<|fim_pad|>')
|
| 56 |
+
load: - 151663 ('<|repo_name|>')
|
| 57 |
+
load: - 151664 ('<|file_sep|>')
|
| 58 |
+
load: special tokens cache size = 26
|
| 59 |
+
load: token to piece cache size = 0.9311 MB
|
| 60 |
+
print_info: arch = qwen3
|
| 61 |
+
print_info: vocab_only = 0
|
| 62 |
+
print_info: n_ctx_train = 262144
|
| 63 |
+
print_info: n_embd = 2560
|
| 64 |
+
print_info: n_embd_inp = 2560
|
| 65 |
+
print_info: n_layer = 36
|
| 66 |
+
print_info: n_head = 32
|
| 67 |
+
print_info: n_head_kv = 8
|
| 68 |
+
print_info: n_rot = 128
|
| 69 |
+
print_info: n_swa = 0
|
| 70 |
+
print_info: is_swa_any = 0
|
| 71 |
+
print_info: n_embd_head_k = 128
|
| 72 |
+
print_info: n_embd_head_v = 128
|
| 73 |
+
print_info: n_gqa = 4
|
| 74 |
+
print_info: n_embd_k_gqa = 1024
|
| 75 |
+
print_info: n_embd_v_gqa = 1024
|
| 76 |
+
print_info: f_norm_eps = 0.0e+00
|
| 77 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 78 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 79 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 80 |
+
print_info: f_logit_scale = 0.0e+00
|
| 81 |
+
print_info: f_attn_scale = 0.0e+00
|
| 82 |
+
print_info: n_ff = 9728
|
| 83 |
+
print_info: n_expert = 0
|
| 84 |
+
print_info: n_expert_used = 0
|
| 85 |
+
print_info: n_expert_groups = 0
|
| 86 |
+
print_info: n_group_used = 0
|
| 87 |
+
print_info: causal attn = 1
|
| 88 |
+
print_info: pooling type = -1
|
| 89 |
+
print_info: rope type = 2
|
| 90 |
+
print_info: rope scaling = linear
|
| 91 |
+
print_info: freq_base_train = 5000000.0
|
| 92 |
+
print_info: freq_scale_train = 1
|
| 93 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 94 |
+
print_info: rope_finetuned = unknown
|
| 95 |
+
print_info: model type = 4B
|
| 96 |
+
print_info: model params = 4.02 B
|
| 97 |
+
print_info: general.name = Qwen3 4B Instruct 2507 Unsloth
|
| 98 |
+
print_info: vocab type = BPE
|
| 99 |
+
print_info: n_vocab = 151936
|
| 100 |
+
print_info: n_merges = 151387
|
| 101 |
+
print_info: BOS token = 11 ','
|
| 102 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 103 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 104 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 105 |
+
print_info: LF token = 198 'Ċ'
|
| 106 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 107 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 108 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 109 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 110 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 111 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 112 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 113 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 114 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 115 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 116 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 117 |
+
print_info: max token length = 256
|
| 118 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 119 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 120 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 121 |
+
load_tensors: CPU_Mapped model buffer size = 1567.90 MiB
|
| 122 |
+
load_tensors: CUDA0 model buffer size = 789.76 MiB
|
| 123 |
+
load_tensors: CUDA1 model buffer size = 789.76 MiB
|
| 124 |
+
............................................................................................
|
| 125 |
+
llama_context: constructing llama_context
|
| 126 |
+
llama_context: n_seq_max = 1
|
| 127 |
+
llama_context: n_ctx = 2048
|
| 128 |
+
llama_context: n_ctx_seq = 2048
|
| 129 |
+
llama_context: n_batch = 2048
|
| 130 |
+
llama_context: n_ubatch = 512
|
| 131 |
+
llama_context: causal_attn = 1
|
| 132 |
+
llama_context: flash_attn = auto
|
| 133 |
+
llama_context: kv_unified = false
|
| 134 |
+
llama_context: freq_base = 5000000.0
|
| 135 |
+
llama_context: freq_scale = 1
|
| 136 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 137 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 138 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 139 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 140 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 141 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 142 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 143 |
+
llama_context: CUDA0 compute buffer size = 606.03 MiB
|
| 144 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 145 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 146 |
+
llama_context: graph nodes = 1267
|
| 147 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 148 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 149 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 150 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 153 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 154 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 155 |
+
|
| 156 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 157 |
+
perplexity: tokenizing the input ..
|
| 158 |
+
perplexity: tokenization took 111.805 ms
|
| 159 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 160 |
+
perplexity: 1.15 seconds per pass - ETA 0.83 minutes
|
| 161 |
+
[1]3.1284,[2]2.4584,[3]1.8227,[4]1.6822,[5]1.7962,[6]1.8485,[7]1.8033,[8]1.7755,[9]1.6932,[10]1.6381,[11]1.6056,[12]1.6073,[13]1.5738,[14]1.5513,[15]1.5733,[16]1.5513,[17]1.5383,[18]1.5451,[19]1.5308,[20]1.5114,[21]1.5036,[22]1.5002,[23]1.5216,[24]1.5086,[25]1.5142,[26]1.4969,[27]1.4877,[28]1.4859,[29]1.5014,[30]1.5050,[31]1.4950,[32]1.4845,[33]1.4873,[34]1.4846,[35]1.4842,[36]1.5111,[37]1.5211,[38]1.5266,[39]1.5340,[40]1.5352,[41]1.5286,[42]1.5429,[43]1.5441,[44]1.5452,
|
| 162 |
+
Final estimate: PPL = 1.5452 +/- 0.01212
|
| 163 |
+
|
| 164 |
+
llama_perf_context_print: load time = 636.43 ms
|
| 165 |
+
llama_perf_context_print: prompt eval time = 40305.31 ms / 90112 tokens ( 0.45 ms per token, 2235.74 tokens per second)
|
| 166 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 167 |
+
llama_perf_context_print: total time = 41496.66 ms / 90113 tokens
|
| 168 |
+
llama_perf_context_print: graphs reused = 0
|
| 169 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 170 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 18730 + (1475 = 789 + 80 + 606) + 3908 |
|
| 171 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 22018 + ( 943 = 789 + 80 + 74) + 1162 |
|
| 172 |
+
llama_memory_breakdown_print: | - Host | 1704 = 1567 + 128 + 9 |
|
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q6_K-attn_output_Q6_K-attn_q_Q6_K-embeddings_Q6_K-ffn_down_Q6_K-ffn_up_gate_Q6_K/perplexity_general.log
ADDED
|
@@ -0,0 +1,172 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20305 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 398 tensors from /mnt/world8/AI/ToBench/Qwen3-4B-Instruct-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q6_K-attn_output_Q6_K-attn_q_Q6_K-embeddings_Q6_K-ffn_down_Q6_K-ffn_up_gate_Q6_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507 Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct-unsloth
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.count u32 = 1
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 4B Instruct 2507
|
| 22 |
+
llama_model_loader: - kv 11: general.base_model.0.version str = 2507
|
| 23 |
+
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
|
| 24 |
+
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 25 |
+
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.block_count u32 = 36
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.context_length u32 = 262144
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
|
| 31 |
+
llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
|
| 32 |
+
llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 5000000.000000
|
| 33 |
+
llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 34 |
+
llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151654
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 25
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q6_K: 253 tensors
|
| 49 |
+
print_info: file format = GGUF V3 (latest)
|
| 50 |
+
print_info: file type = IQ4_NL - 4.5 bpw
|
| 51 |
+
print_info: file size = 3.07 GiB (6.56 BPW)
|
| 52 |
+
load: printing all EOG tokens:
|
| 53 |
+
load: - 151643 ('<|endoftext|>')
|
| 54 |
+
load: - 151645 ('<|im_end|>')
|
| 55 |
+
load: - 151662 ('<|fim_pad|>')
|
| 56 |
+
load: - 151663 ('<|repo_name|>')
|
| 57 |
+
load: - 151664 ('<|file_sep|>')
|
| 58 |
+
load: special tokens cache size = 26
|
| 59 |
+
load: token to piece cache size = 0.9311 MB
|
| 60 |
+
print_info: arch = qwen3
|
| 61 |
+
print_info: vocab_only = 0
|
| 62 |
+
print_info: n_ctx_train = 262144
|
| 63 |
+
print_info: n_embd = 2560
|
| 64 |
+
print_info: n_embd_inp = 2560
|
| 65 |
+
print_info: n_layer = 36
|
| 66 |
+
print_info: n_head = 32
|
| 67 |
+
print_info: n_head_kv = 8
|
| 68 |
+
print_info: n_rot = 128
|
| 69 |
+
print_info: n_swa = 0
|
| 70 |
+
print_info: is_swa_any = 0
|
| 71 |
+
print_info: n_embd_head_k = 128
|
| 72 |
+
print_info: n_embd_head_v = 128
|
| 73 |
+
print_info: n_gqa = 4
|
| 74 |
+
print_info: n_embd_k_gqa = 1024
|
| 75 |
+
print_info: n_embd_v_gqa = 1024
|
| 76 |
+
print_info: f_norm_eps = 0.0e+00
|
| 77 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 78 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 79 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 80 |
+
print_info: f_logit_scale = 0.0e+00
|
| 81 |
+
print_info: f_attn_scale = 0.0e+00
|
| 82 |
+
print_info: n_ff = 9728
|
| 83 |
+
print_info: n_expert = 0
|
| 84 |
+
print_info: n_expert_used = 0
|
| 85 |
+
print_info: n_expert_groups = 0
|
| 86 |
+
print_info: n_group_used = 0
|
| 87 |
+
print_info: causal attn = 1
|
| 88 |
+
print_info: pooling type = -1
|
| 89 |
+
print_info: rope type = 2
|
| 90 |
+
print_info: rope scaling = linear
|
| 91 |
+
print_info: freq_base_train = 5000000.0
|
| 92 |
+
print_info: freq_scale_train = 1
|
| 93 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 94 |
+
print_info: rope_finetuned = unknown
|
| 95 |
+
print_info: model type = 4B
|
| 96 |
+
print_info: model params = 4.02 B
|
| 97 |
+
print_info: general.name = Qwen3 4B Instruct 2507 Unsloth
|
| 98 |
+
print_info: vocab type = BPE
|
| 99 |
+
print_info: n_vocab = 151936
|
| 100 |
+
print_info: n_merges = 151387
|
| 101 |
+
print_info: BOS token = 11 ','
|
| 102 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 103 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 104 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 105 |
+
print_info: LF token = 198 'Ċ'
|
| 106 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 107 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 108 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 109 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 110 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 111 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 112 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 113 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 114 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 115 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 116 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 117 |
+
print_info: max token length = 256
|
| 118 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 119 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 120 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 121 |
+
load_tensors: CPU_Mapped model buffer size = 1567.90 MiB
|
| 122 |
+
load_tensors: CUDA0 model buffer size = 789.76 MiB
|
| 123 |
+
load_tensors: CUDA1 model buffer size = 789.76 MiB
|
| 124 |
+
............................................................................................
|
| 125 |
+
llama_context: constructing llama_context
|
| 126 |
+
llama_context: n_seq_max = 1
|
| 127 |
+
llama_context: n_ctx = 2048
|
| 128 |
+
llama_context: n_ctx_seq = 2048
|
| 129 |
+
llama_context: n_batch = 2048
|
| 130 |
+
llama_context: n_ubatch = 512
|
| 131 |
+
llama_context: causal_attn = 1
|
| 132 |
+
llama_context: flash_attn = auto
|
| 133 |
+
llama_context: kv_unified = false
|
| 134 |
+
llama_context: freq_base = 5000000.0
|
| 135 |
+
llama_context: freq_scale = 1
|
| 136 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 137 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 138 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 139 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 140 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 141 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 142 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 143 |
+
llama_context: CUDA0 compute buffer size = 606.03 MiB
|
| 144 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 145 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 146 |
+
llama_context: graph nodes = 1267
|
| 147 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 148 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 149 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 150 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 153 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 154 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 155 |
+
|
| 156 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 157 |
+
perplexity: tokenizing the input ..
|
| 158 |
+
perplexity: tokenization took 46.443 ms
|
| 159 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 160 |
+
perplexity: 1.16 seconds per pass - ETA 0.28 minutes
|
| 161 |
+
[1]8.2481,[2]10.2801,[3]10.6454,[4]10.3029,[5]10.0287,[6]8.5719,[7]7.7014,[8]7.6887,[9]8.1140,[10]8.2569,[11]8.2735,[12]8.6063,[13]8.6458,[14]8.7777,[15]8.8441,
|
| 162 |
+
Final estimate: PPL = 8.8441 +/- 0.20336
|
| 163 |
+
|
| 164 |
+
llama_perf_context_print: load time = 642.77 ms
|
| 165 |
+
llama_perf_context_print: prompt eval time = 13867.33 ms / 30720 tokens ( 0.45 ms per token, 2215.28 tokens per second)
|
| 166 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 167 |
+
llama_perf_context_print: total time = 14281.18 ms / 30721 tokens
|
| 168 |
+
llama_perf_context_print: graphs reused = 0
|
| 169 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 170 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 18732 + (1475 = 789 + 80 + 606) + 3907 |
|
| 171 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 22018 + ( 943 = 789 + 80 + 74) + 1162 |
|
| 172 |
+
llama_memory_breakdown_print: | - Host | 1704 = 1567 + 128 + 9 |
|
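Note on reading the perplexity output above: each bracketed running value ([1]…[15]) is the perplexity over all chunks processed so far, so the final estimate (8.8441) is simply the last running value. As a minimal sketch of the underlying quantity, perplexity is the exponential of the mean negative log-likelihood per token; the numbers below are illustrative placeholders, not values from these logs.

```python
import math

def perplexity(token_logprobs):
    # Perplexity = exp of the mean negative log-likelihood per token.
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Illustrative only: three tokens with assumed natural-log probabilities.
print(perplexity([-2.1, -1.8, -2.4]))  # ~8.2
```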
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q6_K-attn_output_Q6_K-attn_q_Q6_K-embeddings_Q6_K-ffn_down_Q6_K-ffn_up_gate_Q6_K/perplexity_math.log
ADDED
|
@@ -0,0 +1,172 @@
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20299 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 398 tensors from /mnt/world8/AI/ToBench/Qwen3-4B-Instruct-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-4B-Instruct-2507-unsloth-iq4_nl-attn_kv_Q6_K-attn_output_Q6_K-attn_q_Q6_K-embeddings_Q6_K-ffn_down_Q6_K-ffn_up_gate_Q6_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507 Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct-unsloth
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.count u32 = 1
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 4B Instruct 2507
|
| 22 |
+
llama_model_loader: - kv 11: general.base_model.0.version str = 2507
|
| 23 |
+
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
|
| 24 |
+
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 25 |
+
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.block_count u32 = 36
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.context_length u32 = 262144
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
|
| 31 |
+
llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
|
| 32 |
+
llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 5000000.000000
|
| 33 |
+
llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 34 |
+
llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151654
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 25
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q6_K: 253 tensors
|
| 49 |
+
print_info: file format = GGUF V3 (latest)
|
| 50 |
+
print_info: file type = IQ4_NL - 4.5 bpw
|
| 51 |
+
print_info: file size = 3.07 GiB (6.56 BPW)
|
| 52 |
+
load: printing all EOG tokens:
|
| 53 |
+
load: - 151643 ('<|endoftext|>')
|
| 54 |
+
load: - 151645 ('<|im_end|>')
|
| 55 |
+
load: - 151662 ('<|fim_pad|>')
|
| 56 |
+
load: - 151663 ('<|repo_name|>')
|
| 57 |
+
load: - 151664 ('<|file_sep|>')
|
| 58 |
+
load: special tokens cache size = 26
|
| 59 |
+
load: token to piece cache size = 0.9311 MB
|
| 60 |
+
print_info: arch = qwen3
|
| 61 |
+
print_info: vocab_only = 0
|
| 62 |
+
print_info: n_ctx_train = 262144
|
| 63 |
+
print_info: n_embd = 2560
|
| 64 |
+
print_info: n_embd_inp = 2560
|
| 65 |
+
print_info: n_layer = 36
|
| 66 |
+
print_info: n_head = 32
|
| 67 |
+
print_info: n_head_kv = 8
|
| 68 |
+
print_info: n_rot = 128
|
| 69 |
+
print_info: n_swa = 0
|
| 70 |
+
print_info: is_swa_any = 0
|
| 71 |
+
print_info: n_embd_head_k = 128
|
| 72 |
+
print_info: n_embd_head_v = 128
|
| 73 |
+
print_info: n_gqa = 4
|
| 74 |
+
print_info: n_embd_k_gqa = 1024
|
| 75 |
+
print_info: n_embd_v_gqa = 1024
|
| 76 |
+
print_info: f_norm_eps = 0.0e+00
|
| 77 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 78 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 79 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 80 |
+
print_info: f_logit_scale = 0.0e+00
|
| 81 |
+
print_info: f_attn_scale = 0.0e+00
|
| 82 |
+
print_info: n_ff = 9728
|
| 83 |
+
print_info: n_expert = 0
|
| 84 |
+
print_info: n_expert_used = 0
|
| 85 |
+
print_info: n_expert_groups = 0
|
| 86 |
+
print_info: n_group_used = 0
|
| 87 |
+
print_info: causal attn = 1
|
| 88 |
+
print_info: pooling type = -1
|
| 89 |
+
print_info: rope type = 2
|
| 90 |
+
print_info: rope scaling = linear
|
| 91 |
+
print_info: freq_base_train = 5000000.0
|
| 92 |
+
print_info: freq_scale_train = 1
|
| 93 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 94 |
+
print_info: rope_finetuned = unknown
|
| 95 |
+
print_info: model type = 4B
|
| 96 |
+
print_info: model params = 4.02 B
|
| 97 |
+
print_info: general.name = Qwen3 4B Instruct 2507 Unsloth
|
| 98 |
+
print_info: vocab type = BPE
|
| 99 |
+
print_info: n_vocab = 151936
|
| 100 |
+
print_info: n_merges = 151387
|
| 101 |
+
print_info: BOS token = 11 ','
|
| 102 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 103 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 104 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 105 |
+
print_info: LF token = 198 'Ċ'
|
| 106 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 107 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 108 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 109 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 110 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 111 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 112 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 113 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 114 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 115 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 116 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 117 |
+
print_info: max token length = 256
|
| 118 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 119 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 120 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 121 |
+
load_tensors: CPU_Mapped model buffer size = 1567.90 MiB
|
| 122 |
+
load_tensors: CUDA0 model buffer size = 789.76 MiB
|
| 123 |
+
load_tensors: CUDA1 model buffer size = 789.76 MiB
|
| 124 |
+
............................................................................................
|
| 125 |
+
llama_context: constructing llama_context
|
| 126 |
+
llama_context: n_seq_max = 1
|
| 127 |
+
llama_context: n_ctx = 2048
|
| 128 |
+
llama_context: n_ctx_seq = 2048
|
| 129 |
+
llama_context: n_batch = 2048
|
| 130 |
+
llama_context: n_ubatch = 512
|
| 131 |
+
llama_context: causal_attn = 1
|
| 132 |
+
llama_context: flash_attn = auto
|
| 133 |
+
llama_context: kv_unified = false
|
| 134 |
+
llama_context: freq_base = 5000000.0
|
| 135 |
+
llama_context: freq_scale = 1
|
| 136 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 137 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 138 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 139 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 140 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 141 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 142 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 143 |
+
llama_context: CUDA0 compute buffer size = 606.03 MiB
|
| 144 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 145 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 146 |
+
llama_context: graph nodes = 1267
|
| 147 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 148 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 149 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 150 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 153 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 154 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 155 |
+
|
| 156 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 157 |
+
perplexity: tokenizing the input ..
|
| 158 |
+
perplexity: tokenization took 43.088 ms
|
| 159 |
+
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 160 |
+
perplexity: 1.14 seconds per pass - ETA 0.30 minutes
|
| 161 |
+
[1]5.5439,[2]6.1467,[3]6.4037,[4]6.5914,[5]6.8277,[6]6.7678,[7]6.7185,[8]6.6198,[9]6.6486,[10]6.5964,[11]6.6355,[12]6.6216,[13]6.7035,[14]6.7100,[15]6.7080,[16]6.6952,
|
| 162 |
+
Final estimate: PPL = 6.6952 +/- 0.13573
|
| 163 |
+
|
| 164 |
+
llama_perf_context_print: load time = 635.44 ms
|
| 165 |
+
llama_perf_context_print: prompt eval time = 14811.76 ms / 32768 tokens ( 0.45 ms per token, 2212.30 tokens per second)
|
| 166 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 167 |
+
llama_perf_context_print: total time = 15249.07 ms / 32769 tokens
|
| 168 |
+
llama_perf_context_print: graphs reused = 0
|
| 169 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 170 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 18462 + (1475 = 789 + 80 + 606) + 4176 |
|
| 171 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 22018 + ( 943 = 789 + 80 + 74) + 1162 |
|
| 172 |
+
llama_memory_breakdown_print: | - Host | 1704 = 1567 + 128 + 9 |
|
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/bench_metrics.json
ADDED
|
@@ -0,0 +1,44 @@
|
| 1 |
+
{
|
| 2 |
+
"raw_metrics": {
|
| 3 |
+
"llamabench": {
|
| 4 |
+
"backend": "CUDA",
|
| 5 |
+
"log_path": "Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/llamabench.md",
|
| 6 |
+
"ngl": "35",
|
| 7 |
+
"raw_row": {
|
| 8 |
+
"backend": "CUDA",
|
| 9 |
+
"model": "qwen3 4B MXFP4 MoE",
|
| 10 |
+
"ngl": "35",
|
| 11 |
+
"params": "4.02 B",
|
| 12 |
+
"size": "7.49 GiB",
|
| 13 |
+
"t/s": "264.48 \u00b1 3.49",
|
| 14 |
+
"test": "pp8",
|
| 15 |
+
"tps_value": 264.48
|
| 16 |
+
},
|
| 17 |
+
"test": "pp8",
|
| 18 |
+
"tps": 264.48
|
| 19 |
+
},
|
| 20 |
+
"perplexity": {
|
| 21 |
+
"code": {
|
| 22 |
+
"log_path": "Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/perplexity_code.log",
|
| 23 |
+
"ppl": 1.5469,
|
| 24 |
+
"ppl_error": 0.01221
|
| 25 |
+
},
|
| 26 |
+
"general": {
|
| 27 |
+
"log_path": "Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/perplexity_general.log",
|
| 28 |
+
"ppl": 8.883,
|
| 29 |
+
"ppl_error": 0.20559
|
| 30 |
+
},
|
| 31 |
+
"math": {
|
| 32 |
+
"log_path": "Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/perplexity_math.log",
|
| 33 |
+
"ppl": 6.7086,
|
| 34 |
+
"ppl_error": 0.13691
|
| 35 |
+
}
|
| 36 |
+
}
|
| 37 |
+
},
|
| 38 |
+
"summary": {
|
| 39 |
+
"avg_prec_loss_pct": 0.0,
|
| 40 |
+
"bench_tps": 264.48,
|
| 41 |
+
"file_size_bytes": 8051285216,
|
| 42 |
+
"file_size_gb": 7.5
|
| 43 |
+
}
|
| 44 |
+
}
|
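The `perplexity` and `summary` blocks in this bench_metrics.json appear to be derived from the accompanying artifacts: each `ppl`/`ppl_error` pair matches the `Final estimate: PPL = X +/- Y` line of the perplexity log named in its `log_path`, and `bench_tps` matches the pp8 row of llamabench.md. A hedged sketch of how such a file could be reassembled from the logs — the helper and the run directory are assumptions, not part of the benchmark harness:

```python
import json
import re
from pathlib import Path

def read_final_ppl(log_path: Path) -> dict | None:
    # Pull e.g. "Final estimate: PPL = 8.8830 +/- 0.20559" out of a perplexity log.
    m = re.search(r"Final estimate: PPL = ([\d.]+) \+/- ([\d.]+)", log_path.read_text())
    return {"ppl": float(m.group(1)), "ppl_error": float(m.group(2))} if m else None

base = Path("path/to/run_dir")  # hypothetical: directory holding perplexity_*.log
metrics = {task: read_final_ppl(base / f"perplexity_{task}.log")
           for task in ("code", "general", "math")}
print(json.dumps(metrics, indent=2))
```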
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/llamabench.md
ADDED
|
@@ -0,0 +1,11 @@
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
| model | size | params | backend | ngl | test | t/s |
|
| 7 |
+
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
|
| 8 |
+
| qwen3 4B MXFP4 MoE | 7.49 GiB | 4.02 B | CUDA | 35 | pp8 | 264.48 ± 3.49 |
|
| 9 |
+
| qwen3 4B MXFP4 MoE | 7.49 GiB | 4.02 B | CUDA | 35 | tg128 | 33.65 ± 0.06 |
|
| 10 |
+
|
| 11 |
+
build: 92bb442ad (7040)
|
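For reference, llama-bench names its tests by workload: `pp8` is prompt processing of an 8-token prompt and `tg128` is generation of 128 tokens; the `tps` value recorded in bench_metrics.json corresponds to the pp8 row above. A small sketch (an assumed helper, not part of the harness) that extracts the mean throughput from this markdown table:

```python
def tps_for_test(markdown: str, test: str) -> float:
    # Scan the llamabench.md table for the row whose "test" column matches
    # and return the mean t/s (dropping the "± stddev" suffix).
    for line in markdown.splitlines():
        cols = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cols) >= 7 and cols[5] == test:
            return float(cols[6].split("±")[0])
    raise ValueError(f"no row for test {test!r}")

# Usage: tps_for_test(open("llamabench.md").read(), "pp8") -> 264.48
```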
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/perplexity_code.log
ADDED
|
@@ -0,0 +1,172 @@
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20245 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 398 tensors from /mnt/world8/AI/ToBench/Qwen3-4B-Instruct-2507-unsloth/Magic_Quant/GGUF/dc_round-1_Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507 Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct-unsloth
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.count u32 = 1
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 4B Instruct 2507
|
| 22 |
+
llama_model_loader: - kv 11: general.base_model.0.version str = 2507
|
| 23 |
+
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
|
| 24 |
+
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 25 |
+
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.block_count u32 = 36
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.context_length u32 = 262144
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
|
| 31 |
+
llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
|
| 32 |
+
llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 5000000.000000
|
| 33 |
+
llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 34 |
+
llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151654
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type bf16: 253 tensors
|
| 49 |
+
print_info: file format = GGUF V3 (latest)
|
| 50 |
+
print_info: file type = MXFP4 MoE
|
| 51 |
+
print_info: file size = 7.49 GiB (16.00 BPW)
|
| 52 |
+
load: printing all EOG tokens:
|
| 53 |
+
load: - 151643 ('<|endoftext|>')
|
| 54 |
+
load: - 151645 ('<|im_end|>')
|
| 55 |
+
load: - 151662 ('<|fim_pad|>')
|
| 56 |
+
load: - 151663 ('<|repo_name|>')
|
| 57 |
+
load: - 151664 ('<|file_sep|>')
|
| 58 |
+
load: special tokens cache size = 26
|
| 59 |
+
load: token to piece cache size = 0.9311 MB
|
| 60 |
+
print_info: arch = qwen3
|
| 61 |
+
print_info: vocab_only = 0
|
| 62 |
+
print_info: n_ctx_train = 262144
|
| 63 |
+
print_info: n_embd = 2560
|
| 64 |
+
print_info: n_embd_inp = 2560
|
| 65 |
+
print_info: n_layer = 36
|
| 66 |
+
print_info: n_head = 32
|
| 67 |
+
print_info: n_head_kv = 8
|
| 68 |
+
print_info: n_rot = 128
|
| 69 |
+
print_info: n_swa = 0
|
| 70 |
+
print_info: is_swa_any = 0
|
| 71 |
+
print_info: n_embd_head_k = 128
|
| 72 |
+
print_info: n_embd_head_v = 128
|
| 73 |
+
print_info: n_gqa = 4
|
| 74 |
+
print_info: n_embd_k_gqa = 1024
|
| 75 |
+
print_info: n_embd_v_gqa = 1024
|
| 76 |
+
print_info: f_norm_eps = 0.0e+00
|
| 77 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 78 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 79 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 80 |
+
print_info: f_logit_scale = 0.0e+00
|
| 81 |
+
print_info: f_attn_scale = 0.0e+00
|
| 82 |
+
print_info: n_ff = 9728
|
| 83 |
+
print_info: n_expert = 0
|
| 84 |
+
print_info: n_expert_used = 0
|
| 85 |
+
print_info: n_expert_groups = 0
|
| 86 |
+
print_info: n_group_used = 0
|
| 87 |
+
print_info: causal attn = 1
|
| 88 |
+
print_info: pooling type = -1
|
| 89 |
+
print_info: rope type = 2
|
| 90 |
+
print_info: rope scaling = linear
|
| 91 |
+
print_info: freq_base_train = 5000000.0
|
| 92 |
+
print_info: freq_scale_train = 1
|
| 93 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 94 |
+
print_info: rope_finetuned = unknown
|
| 95 |
+
print_info: model type = 4B
|
| 96 |
+
print_info: model params = 4.02 B
|
| 97 |
+
print_info: general.name = Qwen3 4B Instruct 2507 Unsloth
|
| 98 |
+
print_info: vocab type = BPE
|
| 99 |
+
print_info: n_vocab = 151936
|
| 100 |
+
print_info: n_merges = 151387
|
| 101 |
+
print_info: BOS token = 11 ','
|
| 102 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 103 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 104 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 105 |
+
print_info: LF token = 198 'Ċ'
|
| 106 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 107 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 108 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 109 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 110 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 111 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 112 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 113 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 114 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 115 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 116 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 117 |
+
print_info: max token length = 256
|
| 118 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 119 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 120 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 121 |
+
load_tensors: CPU_Mapped model buffer size = 3822.21 MiB
|
| 122 |
+
load_tensors: CUDA0 model buffer size = 1925.21 MiB
|
| 123 |
+
load_tensors: CUDA1 model buffer size = 1925.21 MiB
|
| 124 |
+
............................................................................................
|
| 125 |
+
llama_context: constructing llama_context
|
| 126 |
+
llama_context: n_seq_max = 1
|
| 127 |
+
llama_context: n_ctx = 2048
|
| 128 |
+
llama_context: n_ctx_seq = 2048
|
| 129 |
+
llama_context: n_batch = 2048
|
| 130 |
+
llama_context: n_ubatch = 512
|
| 131 |
+
llama_context: causal_attn = 1
|
| 132 |
+
llama_context: flash_attn = auto
|
| 133 |
+
llama_context: kv_unified = false
|
| 134 |
+
llama_context: freq_base = 5000000.0
|
| 135 |
+
llama_context: freq_scale = 1
|
| 136 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 137 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 138 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 139 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 140 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 141 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 142 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 143 |
+
llama_context: CUDA0 compute buffer size = 1043.62 MiB
|
| 144 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 145 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 146 |
+
llama_context: graph nodes = 1267
|
| 147 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 148 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 149 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 150 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 153 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 154 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 155 |
+
|
| 156 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 157 |
+
perplexity: tokenizing the input ..
|
| 158 |
+
perplexity: tokenization took 114.531 ms
|
| 159 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 160 |
+
perplexity: 1.92 seconds per pass - ETA 1.40 minutes
|
| 161 |
+
[1]3.1376,[2]2.4676,[3]1.8269,[4]1.6835,[5]1.8008,[6]1.8530,[7]1.8077,[8]1.7792,[9]1.6971,[10]1.6416,[11]1.6086,[12]1.6100,[13]1.5762,[14]1.5537,[15]1.5752,[16]1.5533,[17]1.5403,[18]1.5472,[19]1.5330,[20]1.5133,[21]1.5054,[22]1.5018,[23]1.5232,[24]1.5104,[25]1.5159,[26]1.4986,[27]1.4894,[28]1.4877,[29]1.5031,[30]1.5066,[31]1.4966,[32]1.4860,[33]1.4888,[34]1.4862,[35]1.4855,[36]1.5124,[37]1.5227,[38]1.5284,[39]1.5357,[40]1.5368,[41]1.5303,[42]1.5446,[43]1.5461,[44]1.5469,
|
| 162 |
+
Final estimate: PPL = 1.5469 +/- 0.01221
|
| 163 |
+
|
| 164 |
+
llama_perf_context_print: load time = 1180.96 ms
|
| 165 |
+
llama_perf_context_print: prompt eval time = 72619.45 ms / 90112 tokens ( 0.81 ms per token, 1240.88 tokens per second)
|
| 166 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 167 |
+
llama_perf_context_print: total time = 73909.89 ms / 90113 tokens
|
| 168 |
+
llama_perf_context_print: graphs reused = 0
|
| 169 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 170 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16989 + (3048 = 1925 + 80 + 1043) + 4077 |
|
| 171 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20866 + (2079 = 1925 + 80 + 74) + 1178 |
|
| 172 |
+
llama_memory_breakdown_print: | - Host | 3959 = 3822 + 128 + 9 |
|
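The memory breakdown rows above follow the pattern total = free + self + unaccounted, where self is itemised as model + context + compute. For the CUDA0 row this works out to 16989 + (1925 + 80 + 1043) + 4077 ≈ 24115 MiB; the ~1 MiB gap is rounding in the printed values. A trivial check, using the numbers from that row:

```python
# Values in MiB, copied from the CUDA0 breakdown row above.
free, model, context, compute, unaccounted = 16989, 1925, 80, 1043, 4077

self_mib = model + context + compute      # 3048, the parenthesised "self" term
total = free + self_mib + unaccounted     # 24114, vs. the printed 24115 (rounding)
print(self_mib, total)
```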
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/perplexity_general.log
ADDED
|
@@ -0,0 +1,172 @@
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20243 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 398 tensors from /mnt/world8/AI/ToBench/Qwen3-4B-Instruct-2507-unsloth/Magic_Quant/GGUF/dc_round-1_Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507 Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct-unsloth
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.count u32 = 1
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 4B Instruct 2507
|
| 22 |
+
llama_model_loader: - kv 11: general.base_model.0.version str = 2507
|
| 23 |
+
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
|
| 24 |
+
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 25 |
+
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.block_count u32 = 36
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.context_length u32 = 262144
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
|
| 31 |
+
llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
|
| 32 |
+
llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 5000000.000000
|
| 33 |
+
llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 34 |
+
llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151654
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type bf16: 253 tensors
|
| 49 |
+
print_info: file format = GGUF V3 (latest)
|
| 50 |
+
print_info: file type = MXFP4 MoE
|
| 51 |
+
print_info: file size = 7.49 GiB (16.00 BPW)
|
| 52 |
+
load: printing all EOG tokens:
|
| 53 |
+
load: - 151643 ('<|endoftext|>')
|
| 54 |
+
load: - 151645 ('<|im_end|>')
|
| 55 |
+
load: - 151662 ('<|fim_pad|>')
|
| 56 |
+
load: - 151663 ('<|repo_name|>')
|
| 57 |
+
load: - 151664 ('<|file_sep|>')
|
| 58 |
+
load: special tokens cache size = 26
|
| 59 |
+
load: token to piece cache size = 0.9311 MB
|
| 60 |
+
print_info: arch = qwen3
|
| 61 |
+
print_info: vocab_only = 0
|
| 62 |
+
print_info: n_ctx_train = 262144
|
| 63 |
+
print_info: n_embd = 2560
|
| 64 |
+
print_info: n_embd_inp = 2560
|
| 65 |
+
print_info: n_layer = 36
|
| 66 |
+
print_info: n_head = 32
|
| 67 |
+
print_info: n_head_kv = 8
|
| 68 |
+
print_info: n_rot = 128
|
| 69 |
+
print_info: n_swa = 0
|
| 70 |
+
print_info: is_swa_any = 0
|
| 71 |
+
print_info: n_embd_head_k = 128
|
| 72 |
+
print_info: n_embd_head_v = 128
|
| 73 |
+
print_info: n_gqa = 4
|
| 74 |
+
print_info: n_embd_k_gqa = 1024
|
| 75 |
+
print_info: n_embd_v_gqa = 1024
|
| 76 |
+
print_info: f_norm_eps = 0.0e+00
|
| 77 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 78 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 79 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 80 |
+
print_info: f_logit_scale = 0.0e+00
|
| 81 |
+
print_info: f_attn_scale = 0.0e+00
|
| 82 |
+
print_info: n_ff = 9728
|
| 83 |
+
print_info: n_expert = 0
|
| 84 |
+
print_info: n_expert_used = 0
|
| 85 |
+
print_info: n_expert_groups = 0
|
| 86 |
+
print_info: n_group_used = 0
|
| 87 |
+
print_info: causal attn = 1
|
| 88 |
+
print_info: pooling type = -1
|
| 89 |
+
print_info: rope type = 2
|
| 90 |
+
print_info: rope scaling = linear
|
| 91 |
+
print_info: freq_base_train = 5000000.0
|
| 92 |
+
print_info: freq_scale_train = 1
|
| 93 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 94 |
+
print_info: rope_finetuned = unknown
|
| 95 |
+
print_info: model type = 4B
|
| 96 |
+
print_info: model params = 4.02 B
|
| 97 |
+
print_info: general.name = Qwen3 4B Instruct 2507 Unsloth
|
| 98 |
+
print_info: vocab type = BPE
|
| 99 |
+
print_info: n_vocab = 151936
|
| 100 |
+
print_info: n_merges = 151387
|
| 101 |
+
print_info: BOS token = 11 ','
|
| 102 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 103 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 104 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 105 |
+
print_info: LF token = 198 'Ċ'
|
| 106 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 107 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 108 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 109 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 110 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 111 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 112 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 113 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 114 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 115 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 116 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 117 |
+
print_info: max token length = 256
|
| 118 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 119 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 120 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 121 |
+
load_tensors: CPU_Mapped model buffer size = 3822.21 MiB
|
| 122 |
+
load_tensors: CUDA0 model buffer size = 1925.21 MiB
|
| 123 |
+
load_tensors: CUDA1 model buffer size = 1925.21 MiB
|
| 124 |
+
............................................................................................
|
| 125 |
+
llama_context: constructing llama_context
|
| 126 |
+
llama_context: n_seq_max = 1
|
| 127 |
+
llama_context: n_ctx = 2048
|
| 128 |
+
llama_context: n_ctx_seq = 2048
|
| 129 |
+
llama_context: n_batch = 2048
|
| 130 |
+
llama_context: n_ubatch = 512
|
| 131 |
+
llama_context: causal_attn = 1
|
| 132 |
+
llama_context: flash_attn = auto
|
| 133 |
+
llama_context: kv_unified = false
|
| 134 |
+
llama_context: freq_base = 5000000.0
|
| 135 |
+
llama_context: freq_scale = 1
|
| 136 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 137 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 138 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 139 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 140 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 141 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 142 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 143 |
+
llama_context: CUDA0 compute buffer size = 1043.62 MiB
|
| 144 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 145 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 146 |
+
llama_context: graph nodes = 1267
|
| 147 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 148 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 149 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 150 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 153 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 154 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 155 |
+
|
| 156 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 157 |
+
perplexity: tokenizing the input ..
|
| 158 |
+
perplexity: tokenization took 48.155 ms
|
| 159 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 160 |
+
perplexity: 1.90 seconds per pass - ETA 0.47 minutes
|
| 161 |
+
[1]8.2645,[2]10.3767,[3]10.7396,[4]10.3982,[5]10.1196,[6]8.6313,[7]7.7465,[8]7.7286,[9]8.1491,[10]8.2848,[11]8.3082,[12]8.6423,[13]8.6786,[14]8.8123,[15]8.8830,
|
| 162 |
+
Final estimate: PPL = 8.8830 +/- 0.20559
|
| 163 |
+
|
| 164 |
+
llama_perf_context_print: load time = 1172.62 ms
|
| 165 |
+
llama_perf_context_print: prompt eval time = 24915.93 ms / 30720 tokens ( 0.81 ms per token, 1232.95 tokens per second)
|
| 166 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 167 |
+
llama_perf_context_print: total time = 25358.73 ms / 30721 tokens
|
| 168 |
+
llama_perf_context_print: graphs reused = 0
|
| 169 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 170 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16995 + (3048 = 1925 + 80 + 1043) + 4071 |
|
| 171 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20866 + (2079 = 1925 + 80 + 74) + 1178 |
|
| 172 |
+
llama_memory_breakdown_print: | - Host | 3959 = 3822 + 128 + 9 |
|
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/perplexity_math.log
ADDED
|
@@ -0,0 +1,172 @@
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20230 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 398 tensors from /mnt/world8/AI/ToBench/Qwen3-4B-Instruct-2507-unsloth/Magic_Quant/GGUF/dc_round-1_Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507 Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct-unsloth
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.count u32 = 1
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 4B Instruct 2507
|
| 22 |
+
llama_model_loader: - kv 11: general.base_model.0.version str = 2507
|
| 23 |
+
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
|
| 24 |
+
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 25 |
+
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.block_count u32 = 36
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.context_length u32 = 262144
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
|
| 31 |
+
llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
|
| 32 |
+
llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 5000000.000000
|
| 33 |
+
llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 34 |
+
llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151654
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type bf16: 253 tensors
|
| 49 |
+
print_info: file format = GGUF V3 (latest)
|
| 50 |
+
print_info: file type = MXFP4 MoE
|
| 51 |
+
print_info: file size = 7.49 GiB (16.00 BPW)
|
| 52 |
+
load: printing all EOG tokens:
|
| 53 |
+
load: - 151643 ('<|endoftext|>')
|
| 54 |
+
load: - 151645 ('<|im_end|>')
|
| 55 |
+
load: - 151662 ('<|fim_pad|>')
|
| 56 |
+
load: - 151663 ('<|repo_name|>')
|
| 57 |
+
load: - 151664 ('<|file_sep|>')
|
| 58 |
+
load: special tokens cache size = 26
|
| 59 |
+
load: token to piece cache size = 0.9311 MB
|
| 60 |
+
print_info: arch = qwen3
|
| 61 |
+
print_info: vocab_only = 0
|
| 62 |
+
print_info: n_ctx_train = 262144
|
| 63 |
+
print_info: n_embd = 2560
|
| 64 |
+
print_info: n_embd_inp = 2560
|
| 65 |
+
print_info: n_layer = 36
|
| 66 |
+
print_info: n_head = 32
|
| 67 |
+
print_info: n_head_kv = 8
|
| 68 |
+
print_info: n_rot = 128
|
| 69 |
+
print_info: n_swa = 0
|
| 70 |
+
print_info: is_swa_any = 0
|
| 71 |
+
print_info: n_embd_head_k = 128
|
| 72 |
+
print_info: n_embd_head_v = 128
|
| 73 |
+
print_info: n_gqa = 4
|
| 74 |
+
print_info: n_embd_k_gqa = 1024
|
| 75 |
+
print_info: n_embd_v_gqa = 1024
|
| 76 |
+
print_info: f_norm_eps = 0.0e+00
|
| 77 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 78 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 79 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 80 |
+
print_info: f_logit_scale = 0.0e+00
|
| 81 |
+
print_info: f_attn_scale = 0.0e+00
|
| 82 |
+
print_info: n_ff = 9728
|
| 83 |
+
print_info: n_expert = 0
|
| 84 |
+
print_info: n_expert_used = 0
|
| 85 |
+
print_info: n_expert_groups = 0
|
| 86 |
+
print_info: n_group_used = 0
|
| 87 |
+
print_info: causal attn = 1
|
| 88 |
+
print_info: pooling type = -1
|
| 89 |
+
print_info: rope type = 2
|
| 90 |
+
print_info: rope scaling = linear
|
| 91 |
+
print_info: freq_base_train = 5000000.0
|
| 92 |
+
print_info: freq_scale_train = 1
|
| 93 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 94 |
+
print_info: rope_finetuned = unknown
|
| 95 |
+
print_info: model type = 4B
|
| 96 |
+
print_info: model params = 4.02 B
|
| 97 |
+
print_info: general.name = Qwen3 4B Instruct 2507 Unsloth
|
| 98 |
+
print_info: vocab type = BPE
|
| 99 |
+
print_info: n_vocab = 151936
|
| 100 |
+
print_info: n_merges = 151387
|
| 101 |
+
print_info: BOS token = 11 ','
|
| 102 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 103 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 104 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 105 |
+
print_info: LF token = 198 'Ċ'
|
| 106 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 107 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 108 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 109 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 110 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 111 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 112 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 113 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 114 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 115 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 116 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 117 |
+
print_info: max token length = 256
|
| 118 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 119 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 120 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 121 |
+
load_tensors: CPU_Mapped model buffer size = 3822.21 MiB
|
| 122 |
+
load_tensors: CUDA0 model buffer size = 1925.21 MiB
|
| 123 |
+
load_tensors: CUDA1 model buffer size = 1925.21 MiB
|
| 124 |
+
............................................................................................
|
| 125 |
+
llama_context: constructing llama_context
|
| 126 |
+
llama_context: n_seq_max = 1
|
| 127 |
+
llama_context: n_ctx = 2048
|
| 128 |
+
llama_context: n_ctx_seq = 2048
|
| 129 |
+
llama_context: n_batch = 2048
|
| 130 |
+
llama_context: n_ubatch = 512
|
| 131 |
+
llama_context: causal_attn = 1
|
| 132 |
+
llama_context: flash_attn = auto
|
| 133 |
+
llama_context: kv_unified = false
|
| 134 |
+
llama_context: freq_base = 5000000.0
|
| 135 |
+
llama_context: freq_scale = 1
|
| 136 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 137 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 138 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 139 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 140 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 141 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 142 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 143 |
+
llama_context: CUDA0 compute buffer size = 1043.62 MiB
|
| 144 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 145 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 146 |
+
llama_context: graph nodes = 1267
|
| 147 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 148 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 149 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 150 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 153 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 154 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 155 |
+
|
| 156 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 157 |
+
perplexity: tokenizing the input ..
|
| 158 |
+
perplexity: tokenization took 44.458 ms
|
| 159 |
+
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 160 |
+
perplexity: 1.90 seconds per pass - ETA 0.50 minutes
|
| 161 |
+
[1]5.5359,[2]6.1706,[3]6.4228,[4]6.5973,[5]6.8415,[6]6.7857,[7]6.7389,[8]6.6387,[9]6.6636,[10]6.6115,[11]6.6502,[12]6.6353,[13]6.7208,[14]6.7234,[15]6.7222,[16]6.7086,
|
| 162 |
+
Final estimate: PPL = 6.7086 +/- 0.13691
|
| 163 |
+
|
| 164 |
+
llama_perf_context_print: load time = 1179.78 ms
|
| 165 |
+
llama_perf_context_print: prompt eval time = 26582.96 ms / 32768 tokens ( 0.81 ms per token, 1232.67 tokens per second)
|
| 166 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 167 |
+
llama_perf_context_print: total time = 27053.11 ms / 32769 tokens
|
| 168 |
+
llama_perf_context_print: graphs reused = 0
|
| 169 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 170 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16992 + (3048 = 1925 + 80 + 1043) + 4074 |
|
| 171 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20866 + (2079 = 1925 + 80 + 74) + 1178 |
|
| 172 |
+
llama_memory_breakdown_print: | - Host | 3959 = 3822 + 128 + 9 |
|
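For context on the log above: the bracketed values printed per chunk are running perplexities, and the final estimate is the perplexity over all evaluated tokens. A minimal sketch of the underlying definition (perplexity as the exponential of the mean negative log-likelihood) is shown below; the function and sample values are purely illustrative and are not taken from llama.cpp.

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp of the mean negative log-likelihood over the scored tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Illustrative only: three tokens with natural-log probabilities.
print(perplexity([-1.2, -0.7, -2.3]))  # exp(1.4) ~= 4.06
```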
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_MXFP4/bench_metrics.json
ADDED
|
@@ -0,0 +1,44 @@
|
| 1 |
+
{
|
| 2 |
+
"raw_metrics": {
|
| 3 |
+
"llamabench": {
|
| 4 |
+
"backend": "CUDA",
|
| 5 |
+
"log_path": "Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_MXFP4/llamabench.md",
|
| 6 |
+
"ngl": "35",
|
| 7 |
+
"raw_row": {
|
| 8 |
+
"backend": "CUDA",
|
| 9 |
+
"model": "qwen3 4B MXFP4 MoE",
|
| 10 |
+
"ngl": "35",
|
| 11 |
+
"params": "4.02 B",
|
| 12 |
+
"size": "5.04 GiB",
|
| 13 |
+
"t/s": "272.90 \u00b1 8.99",
|
| 14 |
+
"test": "pp8",
|
| 15 |
+
"tps_value": 272.9
|
| 16 |
+
},
|
| 17 |
+
"test": "pp8",
|
| 18 |
+
"tps": 272.9
|
| 19 |
+
},
|
| 20 |
+
"perplexity": {
|
| 21 |
+
"code": {
|
| 22 |
+
"log_path": "Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_MXFP4/perplexity_code.log",
|
| 23 |
+
"ppl": 1.5795,
|
| 24 |
+
"ppl_error": 0.01281
|
| 25 |
+
},
|
| 26 |
+
"general": {
|
| 27 |
+
"log_path": "Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_MXFP4/perplexity_general.log",
|
| 28 |
+
"ppl": 9.1316,
|
| 29 |
+
"ppl_error": 0.21236
|
| 30 |
+
},
|
| 31 |
+
"math": {
|
| 32 |
+
"log_path": "Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_MXFP4/perplexity_math.log",
|
| 33 |
+
"ppl": 6.8117,
|
| 34 |
+
"ppl_error": 0.1391
|
| 35 |
+
}
|
| 36 |
+
}
|
| 37 |
+
},
|
| 38 |
+
"summary": {
|
| 39 |
+
"avg_prec_loss_pct": 2.1476,
|
| 40 |
+
"bench_tps": 272.9,
|
| 41 |
+
"file_size_bytes": 5417721056,
|
| 42 |
+
"file_size_gb": 5.05
|
| 43 |
+
}
|
| 44 |
+
}
|
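The bench_metrics.json layout above is repeated for every quantization variant in this data collection. Below is a minimal sketch of how one such file could be summarized, assuming only the field names visible above (raw_metrics.perplexity.*.ppl, summary.bench_tps, summary.file_size_gb); the script is illustrative and is not part of this repository.

```python
import json
from pathlib import Path

def summarize(path: Path) -> str:
    """One-line summary of a bench_metrics.json file with the layout shown above."""
    data = json.loads(path.read_text())
    ppl = data["raw_metrics"]["perplexity"]   # code / general / math sub-dicts
    summary = data["summary"]
    ppl_str = ", ".join(f"{name}={entry['ppl']:.4f}" for name, entry in ppl.items())
    return (f"{path.parent.name}: {summary['file_size_gb']} GB, "
            f"{summary['bench_tps']} t/s, PPL[{ppl_str}]")

# Illustrative usage over the DataCollection tree:
# for p in Path("Benchmarks/DataCollection").glob("*/bench_metrics.json"):
#     print(summarize(p))
```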
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_MXFP4/llamabench.md
ADDED
|
@@ -0,0 +1,11 @@
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
| model | size | params | backend | ngl | test | t/s |
|
| 7 |
+
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
|
| 8 |
+
| qwen3 4B MXFP4 MoE | 5.04 GiB | 4.02 B | CUDA | 35 | pp8 | 272.90 ± 8.99 |
|
| 9 |
+
| qwen3 4B MXFP4 MoE | 5.04 GiB | 4.02 B | CUDA | 35 | tg128 | 39.54 ± 0.08 |
|
| 10 |
+
|
| 11 |
+
build: 92bb442ad (7040)
|
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_MXFP4/perplexity_code.log
ADDED
|
@@ -0,0 +1,173 @@
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20262 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 398 tensors from /mnt/world8/AI/ToBench/Qwen3-4B-Instruct-2507-unsloth/Magic_Quant/GGUF/dc_round-1_Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_MXFP4.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507 Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct-unsloth
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.count u32 = 1
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 4B Instruct 2507
|
| 22 |
+
llama_model_loader: - kv 11: general.base_model.0.version str = 2507
|
| 23 |
+
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
|
| 24 |
+
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 25 |
+
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.block_count u32 = 36
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.context_length u32 = 262144
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
|
| 31 |
+
llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
|
| 32 |
+
llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 5000000.000000
|
| 33 |
+
llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 34 |
+
llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151654
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type bf16: 181 tensors
|
| 49 |
+
llama_model_loader: - type mxfp4: 72 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 5.04 GiB (10.76 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 2560
|
| 65 |
+
print_info: n_embd_inp = 2560
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 9728
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = -1
|
| 90 |
+
print_info: rope type = 2
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: model type = 4B
|
| 97 |
+
print_info: model params = 4.02 B
|
| 98 |
+
print_info: general.name = Qwen3 4B Instruct 2507 Unsloth
|
| 99 |
+
print_info: vocab type = BPE
|
| 100 |
+
print_info: n_vocab = 151936
|
| 101 |
+
print_info: n_merges = 151387
|
| 102 |
+
print_info: BOS token = 11 ','
|
| 103 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 104 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 106 |
+
print_info: LF token = 198 'Ċ'
|
| 107 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 108 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 109 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 110 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 111 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 112 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 113 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 114 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 115 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 116 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 117 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 118 |
+
print_info: max token length = 256
|
| 119 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 120 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 121 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 122 |
+
load_tensors: CPU_Mapped model buffer size = 2705.96 MiB
|
| 123 |
+
load_tensors: CUDA0 model buffer size = 1227.55 MiB
|
| 124 |
+
load_tensors: CUDA1 model buffer size = 1227.55 MiB
|
| 125 |
+
.......................................................................................
|
| 126 |
+
llama_context: constructing llama_context
|
| 127 |
+
llama_context: n_seq_max = 1
|
| 128 |
+
llama_context: n_ctx = 2048
|
| 129 |
+
llama_context: n_ctx_seq = 2048
|
| 130 |
+
llama_context: n_batch = 2048
|
| 131 |
+
llama_context: n_ubatch = 512
|
| 132 |
+
llama_context: causal_attn = 1
|
| 133 |
+
llama_context: flash_attn = auto
|
| 134 |
+
llama_context: kv_unified = false
|
| 135 |
+
llama_context: freq_base = 5000000.0
|
| 136 |
+
llama_context: freq_scale = 1
|
| 137 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 138 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 139 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 140 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 143 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 144 |
+
llama_context: CUDA0 compute buffer size = 1043.62 MiB
|
| 145 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 146 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 147 |
+
llama_context: graph nodes = 1267
|
| 148 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 149 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 150 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 154 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 155 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 156 |
+
|
| 157 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 158 |
+
perplexity: tokenizing the input ..
|
| 159 |
+
perplexity: tokenization took 113.051 ms
|
| 160 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 161 |
+
perplexity: 1.50 seconds per pass - ETA 1.10 minutes
|
| 162 |
+
[1]3.2072,[2]2.5677,[3]1.8780,[4]1.7177,[5]1.8519,[6]1.9036,[7]1.8495,[8]1.8157,[9]1.7300,[10]1.6708,[11]1.6336,[12]1.6403,[13]1.6047,[14]1.5805,[15]1.6061,[16]1.5834,[17]1.5704,[18]1.5772,[19]1.5627,[20]1.5430,[21]1.5345,[22]1.5304,[23]1.5532,[24]1.5391,[25]1.5462,[26]1.5277,[27]1.5188,[28]1.5172,[29]1.5329,[30]1.5374,[31]1.5268,[32]1.5153,[33]1.5179,[34]1.5154,[35]1.5148,[36]1.5431,[37]1.5538,[38]1.5601,[39]1.5675,[40]1.5688,[41]1.5621,[42]1.5768,[43]1.5785,[44]1.5795,
|
| 163 |
+
Final estimate: PPL = 1.5795 +/- 0.01281
|
| 164 |
+
|
| 165 |
+
llama_perf_context_print: load time = 894.76 ms
|
| 166 |
+
llama_perf_context_print: prompt eval time = 54461.67 ms / 90112 tokens ( 0.60 ms per token, 1654.60 tokens per second)
|
| 167 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 168 |
+
llama_perf_context_print: total time = 55738.80 ms / 90113 tokens
|
| 169 |
+
llama_perf_context_print: graphs reused = 0
|
| 170 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 171 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 17703 + (2351 = 1227 + 80 + 1043) + 4060 |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21564 + (1381 = 1227 + 80 + 74) + 1178 |
|
| 173 |
+
llama_memory_breakdown_print: | - Host | 2842 = 2705 + 128 + 9 |
|
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_MXFP4/perplexity_general.log
ADDED
|
@@ -0,0 +1,173 @@
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20263 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 398 tensors from /mnt/world8/AI/ToBench/Qwen3-4B-Instruct-2507-unsloth/Magic_Quant/GGUF/dc_round-1_Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_MXFP4.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507 Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct-unsloth
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.count u32 = 1
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 4B Instruct 2507
|
| 22 |
+
llama_model_loader: - kv 11: general.base_model.0.version str = 2507
|
| 23 |
+
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
|
| 24 |
+
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 25 |
+
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.block_count u32 = 36
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.context_length u32 = 262144
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
|
| 31 |
+
llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
|
| 32 |
+
llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 5000000.000000
|
| 33 |
+
llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 34 |
+
llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151654
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type bf16: 181 tensors
|
| 49 |
+
llama_model_loader: - type mxfp4: 72 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 5.04 GiB (10.76 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 2560
|
| 65 |
+
print_info: n_embd_inp = 2560
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 9728
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = -1
|
| 90 |
+
print_info: rope type = 2
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: model type = 4B
|
| 97 |
+
print_info: model params = 4.02 B
|
| 98 |
+
print_info: general.name = Qwen3 4B Instruct 2507 Unsloth
|
| 99 |
+
print_info: vocab type = BPE
|
| 100 |
+
print_info: n_vocab = 151936
|
| 101 |
+
print_info: n_merges = 151387
|
| 102 |
+
print_info: BOS token = 11 ','
|
| 103 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 104 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 106 |
+
print_info: LF token = 198 'Ċ'
|
| 107 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 108 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 109 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 110 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 111 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 112 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 113 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 114 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 115 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 116 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 117 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 118 |
+
print_info: max token length = 256
|
| 119 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 120 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 121 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 122 |
+
load_tensors: CPU_Mapped model buffer size = 2705.96 MiB
|
| 123 |
+
load_tensors: CUDA0 model buffer size = 1227.55 MiB
|
| 124 |
+
load_tensors: CUDA1 model buffer size = 1227.55 MiB
|
| 125 |
+
.......................................................................................
|
| 126 |
+
llama_context: constructing llama_context
|
| 127 |
+
llama_context: n_seq_max = 1
|
| 128 |
+
llama_context: n_ctx = 2048
|
| 129 |
+
llama_context: n_ctx_seq = 2048
|
| 130 |
+
llama_context: n_batch = 2048
|
| 131 |
+
llama_context: n_ubatch = 512
|
| 132 |
+
llama_context: causal_attn = 1
|
| 133 |
+
llama_context: flash_attn = auto
|
| 134 |
+
llama_context: kv_unified = false
|
| 135 |
+
llama_context: freq_base = 5000000.0
|
| 136 |
+
llama_context: freq_scale = 1
|
| 137 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 138 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 139 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 140 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 143 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 144 |
+
llama_context: CUDA0 compute buffer size = 1043.62 MiB
|
| 145 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 146 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 147 |
+
llama_context: graph nodes = 1267
|
| 148 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 149 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 150 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 154 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 155 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 156 |
+
|
| 157 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 158 |
+
perplexity: tokenizing the input ..
|
| 159 |
+
perplexity: tokenization took 48.038 ms
|
| 160 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 161 |
+
perplexity: 1.51 seconds per pass - ETA 0.37 minutes
|
| 162 |
+
[1]8.5699,[2]10.6892,[3]11.0839,[4]10.8094,[5]10.5103,[6]8.9338,[7]8.0102,[8]7.9568,[9]8.3628,[10]8.5234,[11]8.5433,[12]8.8897,[13]8.9283,[14]9.0560,[15]9.1316,
|
| 163 |
+
Final estimate: PPL = 9.1316 +/- 0.21236
|
| 164 |
+
|
| 165 |
+
llama_perf_context_print: load time = 891.85 ms
|
| 166 |
+
llama_perf_context_print: prompt eval time = 18776.24 ms / 30720 tokens ( 0.61 ms per token, 1636.11 tokens per second)
|
| 167 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 168 |
+
llama_perf_context_print: total time = 19220.86 ms / 30721 tokens
|
| 169 |
+
llama_perf_context_print: graphs reused = 0
|
| 170 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 171 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 17712 + (2351 = 1227 + 80 + 1043) + 4051 |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21564 + (1381 = 1227 + 80 + 74) + 1178 |
|
| 173 |
+
llama_memory_breakdown_print: | - Host | 2842 = 2705 + 128 + 9 |
|
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_MXFP4/perplexity_math.log
ADDED
|
@@ -0,0 +1,173 @@
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20259 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 398 tensors from /mnt/world8/AI/ToBench/Qwen3-4B-Instruct-2507-unsloth/Magic_Quant/GGUF/dc_round-1_Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_MXFP4.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507 Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct-unsloth
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.count u32 = 1
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 4B Instruct 2507
|
| 22 |
+
llama_model_loader: - kv 11: general.base_model.0.version str = 2507
|
| 23 |
+
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
|
| 24 |
+
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 25 |
+
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.block_count u32 = 36
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.context_length u32 = 262144
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
|
| 31 |
+
llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
|
| 32 |
+
llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 5000000.000000
|
| 33 |
+
llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 34 |
+
llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151654
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type bf16: 181 tensors
|
| 49 |
+
llama_model_loader: - type mxfp4: 72 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 5.04 GiB (10.76 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 2560
|
| 65 |
+
print_info: n_embd_inp = 2560
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 9728
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = -1
|
| 90 |
+
print_info: rope type = 2
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: model type = 4B
|
| 97 |
+
print_info: model params = 4.02 B
|
| 98 |
+
print_info: general.name = Qwen3 4B Instruct 2507 Unsloth
|
| 99 |
+
print_info: vocab type = BPE
|
| 100 |
+
print_info: n_vocab = 151936
|
| 101 |
+
print_info: n_merges = 151387
|
| 102 |
+
print_info: BOS token = 11 ','
|
| 103 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 104 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 106 |
+
print_info: LF token = 198 'Ċ'
|
| 107 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 108 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 109 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 110 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 111 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 112 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 113 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 114 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 115 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 116 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 117 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 118 |
+
print_info: max token length = 256
|
| 119 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 120 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 121 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 122 |
+
load_tensors: CPU_Mapped model buffer size = 2705.96 MiB
|
| 123 |
+
load_tensors: CUDA0 model buffer size = 1227.55 MiB
|
| 124 |
+
load_tensors: CUDA1 model buffer size = 1227.55 MiB
|
| 125 |
+
.......................................................................................
|
| 126 |
+
llama_context: constructing llama_context
|
| 127 |
+
llama_context: n_seq_max = 1
|
| 128 |
+
llama_context: n_ctx = 2048
|
| 129 |
+
llama_context: n_ctx_seq = 2048
|
| 130 |
+
llama_context: n_batch = 2048
|
| 131 |
+
llama_context: n_ubatch = 512
|
| 132 |
+
llama_context: causal_attn = 1
|
| 133 |
+
llama_context: flash_attn = auto
|
| 134 |
+
llama_context: kv_unified = false
|
| 135 |
+
llama_context: freq_base = 5000000.0
|
| 136 |
+
llama_context: freq_scale = 1
|
| 137 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 138 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 139 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 140 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 143 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 144 |
+
llama_context: CUDA0 compute buffer size = 1043.62 MiB
|
| 145 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 146 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 147 |
+
llama_context: graph nodes = 1267
|
| 148 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 149 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 150 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 154 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 155 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 156 |
+
|
| 157 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 158 |
+
perplexity: tokenizing the input ..
|
| 159 |
+
perplexity: tokenization took 46.265 ms
|
| 160 |
+
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 161 |
+
perplexity: 1.50 seconds per pass - ETA 0.40 minutes
|
| 162 |
+
[1]5.6268,[2]6.2341,[3]6.5286,[4]6.6997,[5]6.9395,[6]6.8761,[7]6.8359,[8]6.7338,[9]6.7589,[10]6.7071,[11]6.7352,[12]6.7257,[13]6.8171,[14]6.8243,[15]6.8253,[16]6.8117,
|
| 163 |
+
Final estimate: PPL = 6.8117 +/- 0.13910
|
| 164 |
+
|
| 165 |
+
llama_perf_context_print: load time = 895.18 ms
|
| 166 |
+
llama_perf_context_print: prompt eval time = 20053.88 ms / 32768 tokens ( 0.61 ms per token, 1634.00 tokens per second)
|
| 167 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 168 |
+
llama_perf_context_print: total time = 20522.38 ms / 32769 tokens
|
| 169 |
+
llama_perf_context_print: graphs reused = 0
|
| 170 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 171 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 17707 + (2351 = 1227 + 80 + 1043) + 4056 |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21564 + (1381 = 1227 + 80 + 74) + 1178 |
|
| 173 |
+
llama_memory_breakdown_print: | - Host | 2842 = 2705 + 128 + 9 |
|
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_MXFP4-ffn_up_gate_BF16/bench_metrics.json
ADDED
|
@@ -0,0 +1,44 @@
|
| 1 |
+
{
|
| 2 |
+
"raw_metrics": {
|
| 3 |
+
"llamabench": {
|
| 4 |
+
"backend": "CUDA",
|
| 5 |
+
"log_path": "Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_MXFP4-ffn_up_gate_BF16/llamabench.md",
|
| 6 |
+
"ngl": "35",
|
| 7 |
+
"raw_row": {
|
| 8 |
+
"backend": "CUDA",
|
| 9 |
+
"model": "qwen3 4B MXFP4 MoE",
|
| 10 |
+
"ngl": "35",
|
| 11 |
+
"params": "4.02 B",
|
| 12 |
+
"size": "6.27 GiB",
|
| 13 |
+
"t/s": "262.63 \u00b1 3.24",
|
| 14 |
+
"test": "pp8",
|
| 15 |
+
"tps_value": 262.63
|
| 16 |
+
},
|
| 17 |
+
"test": "pp8",
|
| 18 |
+
"tps": 262.63
|
| 19 |
+
},
|
| 20 |
+
"perplexity": {
|
| 21 |
+
"code": {
|
| 22 |
+
"log_path": "Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_MXFP4-ffn_up_gate_BF16/perplexity_code.log",
|
| 23 |
+
"ppl": 1.5719,
|
| 24 |
+
"ppl_error": 0.01276
|
| 25 |
+
},
|
| 26 |
+
"general": {
|
| 27 |
+
"log_path": "Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_MXFP4-ffn_up_gate_BF16/perplexity_general.log",
|
| 28 |
+
"ppl": 9.1346,
|
| 29 |
+
"ppl_error": 0.21386
|
| 30 |
+
},
|
| 31 |
+
"math": {
|
| 32 |
+
"log_path": "Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_MXFP4-ffn_up_gate_BF16/perplexity_math.log",
|
| 33 |
+
"ppl": 6.8907,
|
| 34 |
+
"ppl_error": 0.1431
|
| 35 |
+
}
|
| 36 |
+
}
|
| 37 |
+
},
|
| 38 |
+
"summary": {
|
| 39 |
+
"avg_prec_loss_pct": 2.3876,
|
| 40 |
+
"bench_tps": 262.63,
|
| 41 |
+
"file_size_bytes": 6734503136,
|
| 42 |
+
"file_size_gb": 6.27
|
| 43 |
+
}
|
| 44 |
+
}
|
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_MXFP4-ffn_up_gate_BF16/llamabench.md
ADDED
|
@@ -0,0 +1,11 @@
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
| model | size | params | backend | ngl | test | t/s |
|
| 7 |
+
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
|
| 8 |
+
| qwen3 4B MXFP4 MoE | 6.27 GiB | 4.02 B | CUDA | 35 | pp8 | 262.63 ± 3.24 |
|
| 9 |
+
| qwen3 4B MXFP4 MoE | 6.27 GiB | 4.02 B | CUDA | 35 | tg128 | 36.30 ± 0.16 |
|
| 10 |
+
|
| 11 |
+
build: 92bb442ad (7040)
|
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_MXFP4-ffn_up_gate_BF16/perplexity_code.log
ADDED
|
@@ -0,0 +1,173 @@
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20236 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 398 tensors from /mnt/world8/AI/ToBench/Qwen3-4B-Instruct-2507-unsloth/Magic_Quant/GGUF/dc_round-1_Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_MXFP4-ffn_up_gate_BF16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507 Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct-unsloth
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.count u32 = 1
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 4B Instruct 2507
|
| 22 |
+
llama_model_loader: - kv 11: general.base_model.0.version str = 2507
|
| 23 |
+
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
|
| 24 |
+
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 25 |
+
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.block_count u32 = 36
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.context_length u32 = 262144
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
|
| 31 |
+
llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
|
| 32 |
+
llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 5000000.000000
|
| 33 |
+
llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 34 |
+
llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151654
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type bf16: 217 tensors
|
| 49 |
+
llama_model_loader: - type mxfp4: 36 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 6.27 GiB (13.38 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 2560
|
| 65 |
+
print_info: n_embd_inp = 2560
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 9728
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = -1
|
| 90 |
+
print_info: rope type = 2
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: model type = 4B
|
| 97 |
+
print_info: model params = 4.02 B
|
| 98 |
+
print_info: general.name = Qwen3 4B Instruct 2507 Unsloth
|
| 99 |
+
print_info: vocab type = BPE
|
| 100 |
+
print_info: n_vocab = 151936
|
| 101 |
+
print_info: n_merges = 151387
|
| 102 |
+
print_info: BOS token = 11 ','
|
| 103 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 104 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 106 |
+
print_info: LF token = 198 'Ċ'
|
| 107 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 108 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 109 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 110 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 111 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 112 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 113 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 114 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 115 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 116 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 117 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 118 |
+
print_info: max token length = 256
|
| 119 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 120 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 121 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 122 |
+
load_tensors: CPU_Mapped model buffer size = 3264.09 MiB
|
| 123 |
+
load_tensors: CUDA0 model buffer size = 1576.38 MiB
|
| 124 |
+
load_tensors: CUDA1 model buffer size = 1576.38 MiB
|
| 125 |
+
..........................................................................................
|
| 126 |
+
llama_context: constructing llama_context
|
| 127 |
+
llama_context: n_seq_max = 1
|
| 128 |
+
llama_context: n_ctx = 2048
|
| 129 |
+
llama_context: n_ctx_seq = 2048
|
| 130 |
+
llama_context: n_batch = 2048
|
| 131 |
+
llama_context: n_ubatch = 512
|
| 132 |
+
llama_context: causal_attn = 1
|
| 133 |
+
llama_context: flash_attn = auto
|
| 134 |
+
llama_context: kv_unified = false
|
| 135 |
+
llama_context: freq_base = 5000000.0
|
| 136 |
+
llama_context: freq_scale = 1
|
| 137 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 138 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 139 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 140 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 143 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 144 |
+
llama_context: CUDA0 compute buffer size = 1043.62 MiB
|
| 145 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 146 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 147 |
+
llama_context: graph nodes = 1267
|
| 148 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 149 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 150 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 154 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 155 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 156 |
+
|
| 157 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 158 |
+
perplexity: tokenizing the input ..
|
| 159 |
+
perplexity: tokenization took 115.92 ms
|
| 160 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 161 |
+
perplexity: 1.71 seconds per pass - ETA 1.25 minutes
|
| 162 |
+
[1]3.2704,[2]2.5550,[3]1.8724,[4]1.7193,[5]1.8430,[6]1.8963,[7]1.8494,[8]1.8197,[9]1.7328,[10]1.6739,[11]1.6383,[12]1.6382,[13]1.6021,[14]1.5784,[15]1.6027,[16]1.5793,[17]1.5656,[18]1.5724,[19]1.5569,[20]1.5372,[21]1.5296,[22]1.5255,[23]1.5479,[24]1.5336,[25]1.5397,[26]1.5217,[27]1.5129,[28]1.5109,[29]1.5268,[30]1.5309,[31]1.5203,[32]1.5092,[33]1.5117,[34]1.5094,[35]1.5086,[36]1.5361,[37]1.5463,[38]1.5524,[39]1.5598,[40]1.5606,[41]1.5540,[42]1.5685,[43]1.5708,[44]1.5719,
|
| 163 |
+
Final estimate: PPL = 1.5719 +/- 0.01276
|
| 164 |
+
|
| 165 |
+
llama_perf_context_print: load time = 1055.00 ms
|
| 166 |
+
llama_perf_context_print: prompt eval time = 63841.73 ms / 90112 tokens ( 0.71 ms per token, 1411.49 tokens per second)
|
| 167 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 168 |
+
llama_perf_context_print: total time = 65122.67 ms / 90113 tokens
|
| 169 |
+
llama_perf_context_print: graphs reused = 0
|
| 170 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 171 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 17349 + (2700 = 1576 + 80 + 1043) + 4065 |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21214 + (1730 = 1576 + 80 + 74) + 1179 |
|
| 173 |
+
llama_memory_breakdown_print: | - Host | 3401 = 3264 + 128 + 9 |
|
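Each perplexity log in this collection ends with a `Final estimate: PPL = … +/- …` line; that pair of numbers is what later appears as `ppl` / `ppl_error` in the corresponding `bench_metrics.json`. A minimal sketch of pulling it out of a log file (the helper name is hypothetical, and it assumes the logs keep this exact wording):

```python
# Hypothetical helper (not part of this repo): extract the final PPL estimate
# from a llama-perplexity log such as the ones stored above.
import re
from pathlib import Path

PPL_RE = re.compile(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)")

def read_final_ppl(log_path: str) -> tuple[float, float]:
    """Return (ppl, ppl_error) from a perplexity_*.log file."""
    text = Path(log_path).read_text(encoding="utf-8", errors="ignore")
    match = PPL_RE.search(text)
    if match is None:
        raise ValueError(f"no 'Final estimate' line found in {log_path}")
    return float(match.group(1)), float(match.group(2))

# Example: the log above would yield (1.5719, 0.01276) for the code corpus.
```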
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_MXFP4-ffn_up_gate_BF16/perplexity_general.log
ADDED
|
@@ -0,0 +1,173 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20243 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 398 tensors from /mnt/world8/AI/ToBench/Qwen3-4B-Instruct-2507-unsloth/Magic_Quant/GGUF/dc_round-1_Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_MXFP4-ffn_up_gate_BF16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507 Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct-unsloth
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.count u32 = 1
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 4B Instruct 2507
|
| 22 |
+
llama_model_loader: - kv 11: general.base_model.0.version str = 2507
|
| 23 |
+
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
|
| 24 |
+
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 25 |
+
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.block_count u32 = 36
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.context_length u32 = 262144
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
|
| 31 |
+
llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
|
| 32 |
+
llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 5000000.000000
|
| 33 |
+
llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 34 |
+
llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151654
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type bf16: 217 tensors
|
| 49 |
+
llama_model_loader: - type mxfp4: 36 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 6.27 GiB (13.38 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 2560
|
| 65 |
+
print_info: n_embd_inp = 2560
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 9728
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = -1
|
| 90 |
+
print_info: rope type = 2
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: model type = 4B
|
| 97 |
+
print_info: model params = 4.02 B
|
| 98 |
+
print_info: general.name = Qwen3 4B Instruct 2507 Unsloth
|
| 99 |
+
print_info: vocab type = BPE
|
| 100 |
+
print_info: n_vocab = 151936
|
| 101 |
+
print_info: n_merges = 151387
|
| 102 |
+
print_info: BOS token = 11 ','
|
| 103 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 104 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 106 |
+
print_info: LF token = 198 'Ċ'
|
| 107 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 108 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 109 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 110 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 111 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 112 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 113 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 114 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 115 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 116 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 117 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 118 |
+
print_info: max token length = 256
|
| 119 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 120 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 121 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 122 |
+
load_tensors: CPU_Mapped model buffer size = 3264.09 MiB
|
| 123 |
+
load_tensors: CUDA0 model buffer size = 1576.38 MiB
|
| 124 |
+
load_tensors: CUDA1 model buffer size = 1576.38 MiB
|
| 125 |
+
..........................................................................................
|
| 126 |
+
llama_context: constructing llama_context
|
| 127 |
+
llama_context: n_seq_max = 1
|
| 128 |
+
llama_context: n_ctx = 2048
|
| 129 |
+
llama_context: n_ctx_seq = 2048
|
| 130 |
+
llama_context: n_batch = 2048
|
| 131 |
+
llama_context: n_ubatch = 512
|
| 132 |
+
llama_context: causal_attn = 1
|
| 133 |
+
llama_context: flash_attn = auto
|
| 134 |
+
llama_context: kv_unified = false
|
| 135 |
+
llama_context: freq_base = 5000000.0
|
| 136 |
+
llama_context: freq_scale = 1
|
| 137 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 138 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 139 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 140 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 143 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 144 |
+
llama_context: CUDA0 compute buffer size = 1043.62 MiB
|
| 145 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 146 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 147 |
+
llama_context: graph nodes = 1267
|
| 148 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 149 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 150 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 154 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 155 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 156 |
+
|
| 157 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 158 |
+
perplexity: tokenizing the input ..
|
| 159 |
+
perplexity: tokenization took 50.992 ms
|
| 160 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 161 |
+
perplexity: 1.72 seconds per pass - ETA 0.42 minutes
|
| 162 |
+
[1]8.3705,[2]10.6048,[3]11.0418,[4]10.7299,[5]10.4682,[6]8.9327,[7]8.0176,[8]7.9901,[9]8.4180,[10]8.5404,[11]8.5549,[12]8.8980,[13]8.9304,[14]9.0674,[15]9.1346,
|
| 163 |
+
Final estimate: PPL = 9.1346 +/- 0.21386
|
| 164 |
+
|
| 165 |
+
llama_perf_context_print: load time = 1050.07 ms
|
| 166 |
+
llama_perf_context_print: prompt eval time = 21931.81 ms / 30720 tokens ( 0.71 ms per token, 1400.71 tokens per second)
|
| 167 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 168 |
+
llama_perf_context_print: total time = 22378.45 ms / 30721 tokens
|
| 169 |
+
llama_perf_context_print: graphs reused = 0
|
| 170 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 171 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 17336 + (2700 = 1576 + 80 + 1043) + 4079 |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21214 + (1730 = 1576 + 80 + 74) + 1179 |
|
| 173 |
+
llama_memory_breakdown_print: | - Host | 3401 = 3264 + 128 + 9 |
|
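The per-token timing and the tokens-per-second figure in these perf lines are consistent with each other; a quick check against the numbers reported just above (the values are copied from the log, the script itself is illustrative only):

```python
# Consistency check for the perf summary above (general corpus):
# 30720 tokens evaluated in 21931.81 ms of prompt eval.
tokens = 30720
elapsed_ms = 21931.81
print(elapsed_ms / tokens)           # ~0.714 ms per token (log rounds to 0.71)
print(tokens / (elapsed_ms / 1000))  # ~1400.7 tokens per second (log: 1400.71)
```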
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_MXFP4-ffn_up_gate_BF16/perplexity_math.log
ADDED
|
@@ -0,0 +1,173 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20250 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 398 tensors from /mnt/world8/AI/ToBench/Qwen3-4B-Instruct-2507-unsloth/Magic_Quant/GGUF/dc_round-1_Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_MXFP4-ffn_up_gate_BF16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507 Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct-unsloth
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.count u32 = 1
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 4B Instruct 2507
|
| 22 |
+
llama_model_loader: - kv 11: general.base_model.0.version str = 2507
|
| 23 |
+
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
|
| 24 |
+
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 25 |
+
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.block_count u32 = 36
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.context_length u32 = 262144
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
|
| 31 |
+
llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
|
| 32 |
+
llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 5000000.000000
|
| 33 |
+
llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 34 |
+
llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151654
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type bf16: 217 tensors
|
| 49 |
+
llama_model_loader: - type mxfp4: 36 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 6.27 GiB (13.38 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 2560
|
| 65 |
+
print_info: n_embd_inp = 2560
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 9728
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = -1
|
| 90 |
+
print_info: rope type = 2
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: model type = 4B
|
| 97 |
+
print_info: model params = 4.02 B
|
| 98 |
+
print_info: general.name = Qwen3 4B Instruct 2507 Unsloth
|
| 99 |
+
print_info: vocab type = BPE
|
| 100 |
+
print_info: n_vocab = 151936
|
| 101 |
+
print_info: n_merges = 151387
|
| 102 |
+
print_info: BOS token = 11 ','
|
| 103 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 104 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 106 |
+
print_info: LF token = 198 'Ċ'
|
| 107 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 108 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 109 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 110 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 111 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 112 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 113 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 114 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 115 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 116 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 117 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 118 |
+
print_info: max token length = 256
|
| 119 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 120 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 121 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 122 |
+
load_tensors: CPU_Mapped model buffer size = 3264.09 MiB
|
| 123 |
+
load_tensors: CUDA0 model buffer size = 1576.38 MiB
|
| 124 |
+
load_tensors: CUDA1 model buffer size = 1576.38 MiB
|
| 125 |
+
..........................................................................................
|
| 126 |
+
llama_context: constructing llama_context
|
| 127 |
+
llama_context: n_seq_max = 1
|
| 128 |
+
llama_context: n_ctx = 2048
|
| 129 |
+
llama_context: n_ctx_seq = 2048
|
| 130 |
+
llama_context: n_batch = 2048
|
| 131 |
+
llama_context: n_ubatch = 512
|
| 132 |
+
llama_context: causal_attn = 1
|
| 133 |
+
llama_context: flash_attn = auto
|
| 134 |
+
llama_context: kv_unified = false
|
| 135 |
+
llama_context: freq_base = 5000000.0
|
| 136 |
+
llama_context: freq_scale = 1
|
| 137 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 138 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 139 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 140 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 143 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 144 |
+
llama_context: CUDA0 compute buffer size = 1043.62 MiB
|
| 145 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 146 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 147 |
+
llama_context: graph nodes = 1267
|
| 148 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 149 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 150 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 154 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 155 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 156 |
+
|
| 157 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 158 |
+
perplexity: tokenizing the input ..
|
| 159 |
+
perplexity: tokenization took 44.87 ms
|
| 160 |
+
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 161 |
+
perplexity: 1.73 seconds per pass - ETA 0.45 minutes
|
| 162 |
+
[1]5.6772,[2]6.3161,[3]6.6020,[4]6.7702,[5]7.0193,[6]6.9610,[7]6.9225,[8]6.8201,[9]6.8412,[10]6.7921,[11]6.8265,[12]6.8150,[13]6.9035,[14]6.9051,[15]6.9051,[16]6.8907,
|
| 163 |
+
Final estimate: PPL = 6.8907 +/- 0.14310
|
| 164 |
+
|
| 165 |
+
llama_perf_context_print: load time = 1056.58 ms
|
| 166 |
+
llama_perf_context_print: prompt eval time = 23437.41 ms / 32768 tokens ( 0.72 ms per token, 1398.11 tokens per second)
|
| 167 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 168 |
+
llama_perf_context_print: total time = 23907.05 ms / 32769 tokens
|
| 169 |
+
llama_perf_context_print: graphs reused = 0
|
| 170 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 171 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 17348 + (2700 = 1576 + 80 + 1043) + 4066 |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21214 + (1730 = 1576 + 80 + 74) + 1179 |
|
| 173 |
+
llama_memory_breakdown_print: | - Host | 3401 = 3264 + 128 + 9 |
|
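Across the three corpora the chunk counts follow directly from the evaluated token counts and the fixed 2048-token context (assuming non-overlapping chunks, which is what these numbers imply):

```python
# Chunk counts reported in the three perplexity runs above, reproduced from
# the evaluated token counts and n_ctx = 2048 (non-overlapping chunks assumed).
n_ctx = 2048
evaluated_tokens = {"code": 90112, "general": 30720, "math": 32768}
for corpus, n_tokens in evaluated_tokens.items():
    print(corpus, n_tokens // n_ctx)   # -> code 44, general 15, math 16
```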
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_MXFP4-ffn_down_BF16-ffn_up_gate_BF16/bench_metrics.json
ADDED
|
@@ -0,0 +1,44 @@
{
  "raw_metrics": {
    "llamabench": {
      "backend": "CUDA",
      "log_path": "Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_MXFP4-ffn_down_BF16-ffn_up_gate_BF16/llamabench.md",
      "ngl": "35",
      "raw_row": {
        "backend": "CUDA",
        "model": "qwen3 4B MXFP4 MoE",
        "ngl": "35",
        "params": "4.02 B",
        "size": "6.96 GiB",
        "t/s": "397.91 \u00b1 12.54",
        "test": "pp8",
        "tps_value": 397.91
      },
      "test": "pp8",
      "tps": 397.91
    },
    "perplexity": {
      "code": {
        "log_path": "Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_MXFP4-ffn_down_BF16-ffn_up_gate_BF16/perplexity_code.log",
        "ppl": 1.5581,
        "ppl_error": 0.01225
      },
      "general": {
        "log_path": "Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_MXFP4-ffn_down_BF16-ffn_up_gate_BF16/perplexity_general.log",
        "ppl": 9.2059,
        "ppl_error": 0.21164
      },
      "math": {
        "log_path": "Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_MXFP4-ffn_down_BF16-ffn_up_gate_BF16/perplexity_math.log",
        "ppl": 6.9939,
        "ppl_error": 0.14257
      }
    }
  },
  "summary": {
    "avg_prec_loss_pct": 2.8706,
    "bench_tps": 397.91,
    "file_size_bytes": 7480005856,
    "file_size_gb": 6.97
  }
}
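`summary.avg_prec_loss_pct` suggests the three domain perplexities are averaged after being compared against a reference model. The file does not state which baseline was used or the exact formula, so the following is only one plausible reading; the baseline numbers below are placeholders, not values from this repository:

```python
# Sketch of one plausible definition of avg_prec_loss_pct: mean relative PPL
# increase over an unquantized baseline, expressed in percent. The baseline
# values are placeholders -- the actual reference is not given in this file.
def avg_prec_loss_pct(quant: dict, baseline: dict) -> float:
    losses = [(quant[d] - baseline[d]) / baseline[d] * 100.0 for d in quant]
    return sum(losses) / len(losses)

quant_ppl = {"code": 1.5581, "general": 9.2059, "math": 6.9939}   # from this file
baseline_ppl = {"code": 1.54, "general": 9.0, "math": 6.8}        # placeholder baseline
print(round(avg_prec_loss_pct(quant_ppl, baseline_ppl), 4))
```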
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_MXFP4-ffn_down_BF16-ffn_up_gate_BF16/llamabench.md
ADDED
|
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3 4B MXFP4 MoE             |   6.96 GiB |     4.02 B | CUDA       |  35 |             pp8 |       397.91 ± 12.54 |
| qwen3 4B MXFP4 MoE             |   6.96 GiB |     4.02 B | CUDA       |  35 |           tg128 |         53.87 ± 0.38 |

build: 92bb442ad (7040)
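The llama-bench table above is what the `raw_row` object in `bench_metrics.json` is built from. A small, hypothetical parser for that markdown layout (not code from this repository):

```python
# Hypothetical helper: parse a llama-bench markdown table (as stored above)
# into dicts resembling the "raw_row" entries in bench_metrics.json.
def parse_llamabench_table(md_text: str) -> list[dict]:
    header, rows = None, []
    for line in md_text.splitlines():
        if not line.startswith("|") or set(line) <= {"|", "-", ":", " "}:
            continue                      # skip prose lines and the separator row
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if header is None:
            header = cells                # first table row is the header
            continue
        row = dict(zip(header, cells))
        row["tps_value"] = float(row["t/s"].split("±")[0])  # "397.91 ± 12.54" -> 397.91
        rows.append(row)
    return rows
```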
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_MXFP4-ffn_down_BF16-ffn_up_gate_BF16/perplexity_code.log
ADDED
|
@@ -0,0 +1,173 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20232 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 398 tensors from /mnt/world8/AI/ToBench/Qwen3-4B-Instruct-2507-unsloth/Magic_Quant/GGUF/dc_round-1_Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_MXFP4-ffn_down_BF16-ffn_up_gate_BF16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507 Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct-unsloth
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.count u32 = 1
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 4B Instruct 2507
|
| 22 |
+
llama_model_loader: - kv 11: general.base_model.0.version str = 2507
|
| 23 |
+
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
|
| 24 |
+
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 25 |
+
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.block_count u32 = 36
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.context_length u32 = 262144
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
|
| 31 |
+
llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
|
| 32 |
+
llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 5000000.000000
|
| 33 |
+
llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 34 |
+
llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151654
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type bf16: 252 tensors
|
| 49 |
+
llama_model_loader: - type mxfp4: 1 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 6.96 GiB (14.86 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 2560
|
| 65 |
+
print_info: n_embd_inp = 2560
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 9728
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = -1
|
| 90 |
+
print_info: rope type = 2
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: model type = 4B
|
| 97 |
+
print_info: model params = 4.02 B
|
| 98 |
+
print_info: general.name = Qwen3 4B Instruct 2507 Unsloth
|
| 99 |
+
print_info: vocab type = BPE
|
| 100 |
+
print_info: n_vocab = 151936
|
| 101 |
+
print_info: n_merges = 151387
|
| 102 |
+
print_info: BOS token = 11 ','
|
| 103 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 104 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 106 |
+
print_info: LF token = 198 'Ċ'
|
| 107 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 108 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 109 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 110 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 111 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 112 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 113 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 114 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 115 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 116 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 117 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 118 |
+
print_info: max token length = 256
|
| 119 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 120 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 121 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 122 |
+
load_tensors: CPU_Mapped model buffer size = 3277.40 MiB
|
| 123 |
+
load_tensors: CUDA0 model buffer size = 1925.21 MiB
|
| 124 |
+
load_tensors: CUDA1 model buffer size = 1925.21 MiB
|
| 125 |
+
...................................................................................................
|
| 126 |
+
llama_context: constructing llama_context
|
| 127 |
+
llama_context: n_seq_max = 1
|
| 128 |
+
llama_context: n_ctx = 2048
|
| 129 |
+
llama_context: n_ctx_seq = 2048
|
| 130 |
+
llama_context: n_batch = 2048
|
| 131 |
+
llama_context: n_ubatch = 512
|
| 132 |
+
llama_context: causal_attn = 1
|
| 133 |
+
llama_context: flash_attn = auto
|
| 134 |
+
llama_context: kv_unified = false
|
| 135 |
+
llama_context: freq_base = 5000000.0
|
| 136 |
+
llama_context: freq_scale = 1
|
| 137 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 138 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 139 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 140 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 143 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 144 |
+
llama_context: CUDA0 compute buffer size = 498.81 MiB
|
| 145 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 146 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 147 |
+
llama_context: graph nodes = 1267
|
| 148 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 149 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 150 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 154 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 155 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 156 |
+
|
| 157 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 158 |
+
perplexity: tokenizing the input ..
|
| 159 |
+
perplexity: tokenization took 111.736 ms
|
| 160 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 161 |
+
perplexity: 1.84 seconds per pass - ETA 1.33 minutes
|
| 162 |
+
[1]3.2289,[2]2.5128,[3]1.8493,[4]1.6996,[5]1.8176,[6]1.8695,[7]1.8272,[8]1.7991,[9]1.7145,[10]1.6566,[11]1.6217,[12]1.6249,[13]1.5897,[14]1.5659,[15]1.5888,[16]1.5662,[17]1.5529,[18]1.5597,[19]1.5450,[20]1.5253,[21]1.5170,[22]1.5139,[23]1.5358,[24]1.5225,[25]1.5279,[26]1.5099,[27]1.5006,[28]1.4985,[29]1.5137,[30]1.5180,[31]1.5076,[32]1.4968,[33]1.5000,[34]1.4971,[35]1.4959,[36]1.5234,[37]1.5342,[38]1.5398,[39]1.5473,[40]1.5482,[41]1.5413,[42]1.5560,[43]1.5572,[44]1.5581,
|
| 163 |
+
Final estimate: PPL = 1.5581 +/- 0.01225
|
| 164 |
+
|
| 165 |
+
llama_perf_context_print: load time = 1129.41 ms
|
| 166 |
+
llama_perf_context_print: prompt eval time = 68734.47 ms / 90112 tokens ( 0.76 ms per token, 1311.02 tokens per second)
|
| 167 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 168 |
+
llama_perf_context_print: total time = 70010.74 ms / 90113 tokens
|
| 169 |
+
llama_perf_context_print: graphs reused = 0
|
| 170 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 171 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 17605 + (2504 = 1925 + 80 + 498) + 4005 |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20866 + (2079 = 1925 + 80 + 74) + 1178 |
|
| 173 |
+
llama_memory_breakdown_print: | - Host | 3414 = 3277 + 128 + 9 |
|
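The `llama_memory_breakdown_print` rows follow the header's layout, total = free + (self = model + context + compute) + unaccounted; since every value is printed in whole MiB, the identity only holds to within a couple of MiB of rounding. Checking it against the CUDA0 row above:

```python
# CUDA0 row above: 24115 = 17605 + (2504 = 1925 + 80 + 498) + 4005 (all MiB).
total, free, unaccounted = 24115, 17605, 4005
model, context, compute = 1925, 80, 498
self_mib = model + context + compute   # 2503, printed as 2504 after rounding
assert abs(total - (free + self_mib + unaccounted)) <= 2
```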
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_MXFP4-ffn_down_BF16-ffn_up_gate_BF16/perplexity_general.log
ADDED
|
@@ -0,0 +1,173 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20242 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 398 tensors from /mnt/world8/AI/ToBench/Qwen3-4B-Instruct-2507-unsloth/Magic_Quant/GGUF/dc_round-1_Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_MXFP4-ffn_down_BF16-ffn_up_gate_BF16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507 Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct-unsloth
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.count u32 = 1
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 4B Instruct 2507
|
| 22 |
+
llama_model_loader: - kv 11: general.base_model.0.version str = 2507
|
| 23 |
+
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
|
| 24 |
+
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 25 |
+
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.block_count u32 = 36
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.context_length u32 = 262144
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
|
| 31 |
+
llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
|
| 32 |
+
llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 5000000.000000
|
| 33 |
+
llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 34 |
+
llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151654
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type bf16: 252 tensors
|
| 49 |
+
llama_model_loader: - type mxfp4: 1 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 6.96 GiB (14.86 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 2560
|
| 65 |
+
print_info: n_embd_inp = 2560
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 9728
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = -1
|
| 90 |
+
print_info: rope type = 2
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: model type = 4B
|
| 97 |
+
print_info: model params = 4.02 B
|
| 98 |
+
print_info: general.name = Qwen3 4B Instruct 2507 Unsloth
|
| 99 |
+
print_info: vocab type = BPE
|
| 100 |
+
print_info: n_vocab = 151936
|
| 101 |
+
print_info: n_merges = 151387
|
| 102 |
+
print_info: BOS token = 11 ','
|
| 103 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 104 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 106 |
+
print_info: LF token = 198 'Ċ'
|
| 107 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 108 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 109 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 110 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 111 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 112 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 113 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 114 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 115 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 116 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 117 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 118 |
+
print_info: max token length = 256
|
| 119 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 120 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 121 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 122 |
+
load_tensors: CPU_Mapped model buffer size = 3277.40 MiB
|
| 123 |
+
load_tensors: CUDA0 model buffer size = 1925.21 MiB
|
| 124 |
+
load_tensors: CUDA1 model buffer size = 1925.21 MiB
|
| 125 |
+
...................................................................................................
|
| 126 |
+
llama_context: constructing llama_context
|
| 127 |
+
llama_context: n_seq_max = 1
|
| 128 |
+
llama_context: n_ctx = 2048
|
| 129 |
+
llama_context: n_ctx_seq = 2048
|
| 130 |
+
llama_context: n_batch = 2048
|
| 131 |
+
llama_context: n_ubatch = 512
|
| 132 |
+
llama_context: causal_attn = 1
|
| 133 |
+
llama_context: flash_attn = auto
|
| 134 |
+
llama_context: kv_unified = false
|
| 135 |
+
llama_context: freq_base = 5000000.0
|
| 136 |
+
llama_context: freq_scale = 1
|
| 137 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 138 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 139 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 140 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 143 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 144 |
+
llama_context: CUDA0 compute buffer size = 498.81 MiB
|
| 145 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 146 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 147 |
+
llama_context: graph nodes = 1267
|
| 148 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 149 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 150 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 154 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 155 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 156 |
+
|
| 157 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 158 |
+
perplexity: tokenizing the input ..
|
| 159 |
+
perplexity: tokenization took 50.095 ms
|
| 160 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 161 |
+
perplexity: 1.85 seconds per pass - ETA 0.45 minutes
|
| 162 |
+
[1]8.5279,[2]10.8660,[3]11.2124,[4]10.7685,[5]10.4615,[6]8.9192,[7]8.0016,[8]7.9952,[9]8.4381,[10]8.5842,[11]8.5996,[12]8.9555,[13]8.9881,[14]9.1356,[15]9.2059,
|
| 163 |
+
Final estimate: PPL = 9.2059 +/- 0.21164
|
| 164 |
+
|
| 165 |
+
llama_perf_context_print: load time = 1136.29 ms
|
| 166 |
+
llama_perf_context_print: prompt eval time = 23627.94 ms / 30720 tokens ( 0.77 ms per token, 1300.16 tokens per second)
|
| 167 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 168 |
+
llama_perf_context_print: total time = 24074.73 ms / 30721 tokens
|
| 169 |
+
llama_perf_context_print: graphs reused = 0
|
| 170 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 171 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 17615 + (2504 = 1925 + 80 + 498) + 3995 |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20866 + (2079 = 1925 + 80 + 74) + 1178 |
|
| 173 |
+
llama_memory_breakdown_print: | - Host | 3414 = 3277 + 128 + 9 |
|
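In each of these perplexity logs, the bracketed series (e.g. `[1]8.5279, … ,[15]9.2059`) is the cumulative perplexity after each 2048-token chunk, and the `Final estimate: PPL = … +/- …` line is the perplexity over all scored tokens plus an error estimate. A minimal sketch of that bookkeeping, assuming per-token negative log-likelihoods are already available (the tool's exact per-chunk token selection is not reproduced here):

```python
import math

# Minimal sketch (not the tool's actual code): cumulative perplexity bookkeeping
# behind lines like "[1]8.5279,[2]10.8660,..." and
# "Final estimate: PPL = 9.2059 +/- 0.21164".
# Assumes per-token negative log-likelihoods are already available, one list per
# evaluated 2048-token chunk; the tool's token-selection rules are omitted.

def running_perplexity(chunk_nlls):
    total, total_sq, count = 0.0, 0.0, 0
    series = []
    for nlls in chunk_nlls:
        for nll in nlls:
            total += nll
            total_sq += nll * nll
            count += 1
        series.append(math.exp(total / count))   # value printed after each chunk
    mean = total / count
    ppl = math.exp(mean)                          # "Final estimate" PPL
    variance = max(total_sq / count - mean * mean, 0.0)
    ppl_err = ppl * math.sqrt(variance / count)   # "+/-" term (delta-method style estimate)
    return series, ppl, ppl_err
```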
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_MXFP4-ffn_down_BF16-ffn_up_gate_BF16/perplexity_math.log
ADDED
|
@@ -0,0 +1,173 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20223 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 398 tensors from /mnt/world8/AI/ToBench/Qwen3-4B-Instruct-2507-unsloth/Magic_Quant/GGUF/dc_round-1_Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_MXFP4-ffn_down_BF16-ffn_up_gate_BF16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507 Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct-unsloth
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.count u32 = 1
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 4B Instruct 2507
|
| 22 |
+
llama_model_loader: - kv 11: general.base_model.0.version str = 2507
|
| 23 |
+
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
|
| 24 |
+
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 25 |
+
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.block_count u32 = 36
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.context_length u32 = 262144
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
|
| 31 |
+
llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
|
| 32 |
+
llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 5000000.000000
|
| 33 |
+
llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 34 |
+
llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151654
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type bf16: 252 tensors
|
| 49 |
+
llama_model_loader: - type mxfp4: 1 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 6.96 GiB (14.86 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 2560
|
| 65 |
+
print_info: n_embd_inp = 2560
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 9728
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = -1
|
| 90 |
+
print_info: rope type = 2
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: model type = 4B
|
| 97 |
+
print_info: model params = 4.02 B
|
| 98 |
+
print_info: general.name = Qwen3 4B Instruct 2507 Unsloth
|
| 99 |
+
print_info: vocab type = BPE
|
| 100 |
+
print_info: n_vocab = 151936
|
| 101 |
+
print_info: n_merges = 151387
|
| 102 |
+
print_info: BOS token = 11 ','
|
| 103 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 104 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 106 |
+
print_info: LF token = 198 'Ċ'
|
| 107 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 108 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 109 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 110 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 111 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 112 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 113 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 114 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 115 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 116 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 117 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 118 |
+
print_info: max token length = 256
|
| 119 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 120 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 121 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 122 |
+
load_tensors: CPU_Mapped model buffer size = 3277.40 MiB
|
| 123 |
+
load_tensors: CUDA0 model buffer size = 1925.21 MiB
|
| 124 |
+
load_tensors: CUDA1 model buffer size = 1925.21 MiB
|
| 125 |
+
...................................................................................................
|
| 126 |
+
llama_context: constructing llama_context
|
| 127 |
+
llama_context: n_seq_max = 1
|
| 128 |
+
llama_context: n_ctx = 2048
|
| 129 |
+
llama_context: n_ctx_seq = 2048
|
| 130 |
+
llama_context: n_batch = 2048
|
| 131 |
+
llama_context: n_ubatch = 512
|
| 132 |
+
llama_context: causal_attn = 1
|
| 133 |
+
llama_context: flash_attn = auto
|
| 134 |
+
llama_context: kv_unified = false
|
| 135 |
+
llama_context: freq_base = 5000000.0
|
| 136 |
+
llama_context: freq_scale = 1
|
| 137 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 138 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 139 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 140 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 143 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 144 |
+
llama_context: CUDA0 compute buffer size = 498.81 MiB
|
| 145 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 146 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 147 |
+
llama_context: graph nodes = 1267
|
| 148 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 149 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 150 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 154 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 155 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 156 |
+
|
| 157 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 158 |
+
perplexity: tokenizing the input ..
|
| 159 |
+
perplexity: tokenization took 44.239 ms
|
| 160 |
+
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 161 |
+
perplexity: 1.83 seconds per pass - ETA 0.48 minutes
|
| 162 |
+
[1]5.8231,[2]6.5259,[3]6.7480,[4]6.9414,[5]7.1712,[6]7.1102,[7]7.0557,[8]6.9520,[9]6.9691,[10]6.9052,[11]6.9417,[12]6.9300,[13]7.0133,[14]7.0125,[15]7.0114,[16]6.9939,
|
| 163 |
+
Final estimate: PPL = 6.9939 +/- 0.14257
|
| 164 |
+
|
| 165 |
+
llama_perf_context_print: load time = 1132.39 ms
|
| 166 |
+
llama_perf_context_print: prompt eval time = 25182.01 ms / 32768 tokens ( 0.77 ms per token, 1301.25 tokens per second)
|
| 167 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 168 |
+
llama_perf_context_print: total time = 25650.95 ms / 32769 tokens
|
| 169 |
+
llama_perf_context_print: graphs reused = 0
|
| 170 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 171 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 17621 + (2504 = 1925 + 80 + 498) + 3989 |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20866 + (2079 = 1925 + 80 + 74) + 1178 |
|
| 173 |
+
llama_memory_breakdown_print: | - Host | 3414 = 3277 + 128 + 9 |
|
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_MXFP4-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/bench_metrics.json
ADDED
|
@@ -0,0 +1,44 @@
{
  "raw_metrics": {
    "llamabench": {
      "backend": "CUDA",
      "log_path": "Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_MXFP4-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/llamabench.md",
      "ngl": "35",
      "raw_row": {
        "backend": "CUDA",
        "model": "qwen3 4B MXFP4 MoE",
        "ngl": "35",
        "params": "4.02 B",
        "size": "6.98 GiB",
        "t/s": "258.75 \u00b1 5.64",
        "test": "pp8",
        "tps_value": 258.75
      },
      "test": "pp8",
      "tps": 258.75
    },
    "perplexity": {
      "code": {
        "log_path": "Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_MXFP4-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/perplexity_code.log",
        "ppl": 1.5543,
        "ppl_error": 0.01237
      },
      "general": {
        "log_path": "Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_MXFP4-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/perplexity_general.log",
        "ppl": 9.0087,
        "ppl_error": 0.20944
      },
      "math": {
        "log_path": "Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_MXFP4-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/perplexity_math.log",
        "ppl": 6.7266,
        "ppl_error": 0.13709
      }
    }
  },
  "summary": {
    "avg_prec_loss_pct": 0.7206,
    "bench_tps": 258.75,
    "file_size_bytes": 7496850656,
    "file_size_gb": 6.98
  }
}
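bench_metrics.json rolls the three perplexity figures and the llama-bench throughput into one summary. The file does not state how `avg_prec_loss_pct` is derived; one plausible reading, sketched below with placeholder baseline numbers, is the mean percentage increase in perplexity over a full-precision reference:

```python
# Hypothetical reading of the summary field "avg_prec_loss_pct": the mean
# percentage increase of this quant's perplexity over a full-precision baseline.
# The baseline figures below are placeholders for illustration, not measurements.

def avg_prec_loss_pct(quant_ppl, baseline_ppl):
    losses = [
        (quant_ppl[k] - baseline_ppl[k]) / baseline_ppl[k] * 100.0
        for k in quant_ppl
    ]
    return sum(losses) / len(losses)

quant = {"code": 1.5543, "general": 9.0087, "math": 6.7266}   # from this file
baseline = {"code": 1.54, "general": 8.95, "math": 6.70}      # placeholder values
print(round(avg_prec_loss_pct(quant, baseline), 4))
```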
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_MXFP4-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/llamabench.md
ADDED
|
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3 4B MXFP4 MoE | 6.98 GiB | 4.02 B | CUDA | 35 | pp8 | 258.75 ± 5.64 |
| qwen3 4B MXFP4 MoE | 6.98 GiB | 4.02 B | CUDA | 35 | tg128 | 34.52 ± 0.15 |

build: 92bb442ad (7040)
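The `raw_row` entries in bench_metrics.json mirror rows of this llamabench.md table. A small sketch of how such a table can be read back into dicts (the file path and the "t/s" cell format follow the layout shown above):

```python
# Sketch: recover rows like the "raw_row" object in bench_metrics.json from a
# llamabench.md table. The path and the "t/s" cell format ("258.75 ± 5.64")
# are taken from the file shown above.

def parse_llamabench(path="llamabench.md"):
    with open(path, encoding="utf-8") as fh:
        table = [line.strip() for line in fh if line.strip().startswith("|")]
    header = [cell.strip() for cell in table[0].strip("|").split("|")]
    rows = []
    for line in table[2:]:                        # skip header and separator rows
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        row = dict(zip(header, cells))
        row["tps_value"] = float(row["t/s"].split()[0])
        rows.append(row)
    return rows
```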
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_MXFP4-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/perplexity_code.log
ADDED
|
@@ -0,0 +1,173 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20239 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 398 tensors from /mnt/world8/AI/ToBench/Qwen3-4B-Instruct-2507-unsloth/Magic_Quant/GGUF/dc_round-1_Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_MXFP4-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507 Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct-unsloth
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.count u32 = 1
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 4B Instruct 2507
|
| 22 |
+
llama_model_loader: - kv 11: general.base_model.0.version str = 2507
|
| 23 |
+
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
|
| 24 |
+
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 25 |
+
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.block_count u32 = 36
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.context_length u32 = 262144
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
|
| 31 |
+
llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
|
| 32 |
+
llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 5000000.000000
|
| 33 |
+
llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 34 |
+
llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151654
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type bf16: 217 tensors
|
| 49 |
+
llama_model_loader: - type mxfp4: 36 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 6.98 GiB (14.90 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 2560
|
| 65 |
+
print_info: n_embd_inp = 2560
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 9728
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = -1
|
| 90 |
+
print_info: rope type = 2
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: model type = 4B
|
| 97 |
+
print_info: model params = 4.02 B
|
| 98 |
+
print_info: general.name = Qwen3 4B Instruct 2507 Unsloth
|
| 99 |
+
print_info: vocab type = BPE
|
| 100 |
+
print_info: n_vocab = 151936
|
| 101 |
+
print_info: n_merges = 151387
|
| 102 |
+
print_info: BOS token = 11 ','
|
| 103 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 104 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 106 |
+
print_info: LF token = 198 'Ċ'
|
| 107 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 108 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 109 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 110 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 111 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 112 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 113 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 114 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 115 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 116 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 117 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 118 |
+
print_info: max token length = 256
|
| 119 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 120 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 121 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 122 |
+
load_tensors: CPU_Mapped model buffer size = 3587.21 MiB
|
| 123 |
+
load_tensors: CUDA0 model buffer size = 1778.33 MiB
|
| 124 |
+
load_tensors: CUDA1 model buffer size = 1778.33 MiB
|
| 125 |
+
...........................................................................................
|
| 126 |
+
llama_context: constructing llama_context
|
| 127 |
+
llama_context: n_seq_max = 1
|
| 128 |
+
llama_context: n_ctx = 2048
|
| 129 |
+
llama_context: n_ctx_seq = 2048
|
| 130 |
+
llama_context: n_batch = 2048
|
| 131 |
+
llama_context: n_ubatch = 512
|
| 132 |
+
llama_context: causal_attn = 1
|
| 133 |
+
llama_context: flash_attn = auto
|
| 134 |
+
llama_context: kv_unified = false
|
| 135 |
+
llama_context: freq_base = 5000000.0
|
| 136 |
+
llama_context: freq_scale = 1
|
| 137 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 138 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 139 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 140 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 143 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 144 |
+
llama_context: CUDA0 compute buffer size = 1043.62 MiB
|
| 145 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 146 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 147 |
+
llama_context: graph nodes = 1267
|
| 148 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 149 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 150 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 154 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 155 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 156 |
+
|
| 157 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 158 |
+
perplexity: tokenizing the input ..
|
| 159 |
+
perplexity: tokenization took 112.689 ms
|
| 160 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 161 |
+
perplexity: 1.82 seconds per pass - ETA 1.32 minutes
|
| 162 |
+
[1]3.1615,[2]2.4920,[3]1.8395,[4]1.6935,[5]1.8109,[6]1.8663,[7]1.8192,[8]1.7888,[9]1.7049,[10]1.6483,[11]1.6139,[12]1.6156,[13]1.5815,[14]1.5590,[15]1.5806,[16]1.5584,[17]1.5457,[18]1.5527,[19]1.5383,[20]1.5188,[21]1.5115,[22]1.5080,[23]1.5299,[24]1.5169,[25]1.5227,[26]1.5046,[27]1.4954,[28]1.4936,[29]1.5091,[30]1.5130,[31]1.5028,[32]1.4920,[33]1.4947,[34]1.4922,[35]1.4917,[36]1.5195,[37]1.5295,[38]1.5355,[39]1.5429,[40]1.5443,[41]1.5377,[42]1.5520,[43]1.5534,[44]1.5543,
|
| 163 |
+
Final estimate: PPL = 1.5543 +/- 0.01237
|
| 164 |
+
|
| 165 |
+
llama_perf_context_print: load time = 1130.07 ms
|
| 166 |
+
llama_perf_context_print: prompt eval time = 68348.84 ms / 90112 tokens ( 0.76 ms per token, 1318.41 tokens per second)
|
| 167 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 168 |
+
llama_perf_context_print: total time = 69631.62 ms / 90113 tokens
|
| 169 |
+
llama_perf_context_print: graphs reused = 0
|
| 170 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 171 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 17124 + (2901 = 1778 + 80 + 1043) + 4088 |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21012 + (1932 = 1778 + 80 + 74) + 1179 |
|
| 173 |
+
llama_memory_breakdown_print: | - Host | 3724 = 3587 + 128 + 9 |
|
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_MXFP4-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/perplexity_general.log
ADDED
|
@@ -0,0 +1,173 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20236 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 398 tensors from /mnt/world8/AI/ToBench/Qwen3-4B-Instruct-2507-unsloth/Magic_Quant/GGUF/dc_round-1_Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_MXFP4-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507 Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct-unsloth
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.count u32 = 1
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 4B Instruct 2507
|
| 22 |
+
llama_model_loader: - kv 11: general.base_model.0.version str = 2507
|
| 23 |
+
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
|
| 24 |
+
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 25 |
+
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.block_count u32 = 36
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.context_length u32 = 262144
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
|
| 31 |
+
llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
|
| 32 |
+
llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 5000000.000000
|
| 33 |
+
llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 34 |
+
llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151654
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type bf16: 217 tensors
|
| 49 |
+
llama_model_loader: - type mxfp4: 36 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 6.98 GiB (14.90 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 2560
|
| 65 |
+
print_info: n_embd_inp = 2560
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 9728
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = -1
|
| 90 |
+
print_info: rope type = 2
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: model type = 4B
|
| 97 |
+
print_info: model params = 4.02 B
|
| 98 |
+
print_info: general.name = Qwen3 4B Instruct 2507 Unsloth
|
| 99 |
+
print_info: vocab type = BPE
|
| 100 |
+
print_info: n_vocab = 151936
|
| 101 |
+
print_info: n_merges = 151387
|
| 102 |
+
print_info: BOS token = 11 ','
|
| 103 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 104 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 106 |
+
print_info: LF token = 198 'Ċ'
|
| 107 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 108 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 109 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 110 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 111 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 112 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 113 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 114 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 115 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 116 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 117 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 118 |
+
print_info: max token length = 256
|
| 119 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 120 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 121 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 122 |
+
load_tensors: CPU_Mapped model buffer size = 3587.21 MiB
|
| 123 |
+
load_tensors: CUDA0 model buffer size = 1778.33 MiB
|
| 124 |
+
load_tensors: CUDA1 model buffer size = 1778.33 MiB
|
| 125 |
+
...........................................................................................
|
| 126 |
+
llama_context: constructing llama_context
|
| 127 |
+
llama_context: n_seq_max = 1
|
| 128 |
+
llama_context: n_ctx = 2048
|
| 129 |
+
llama_context: n_ctx_seq = 2048
|
| 130 |
+
llama_context: n_batch = 2048
|
| 131 |
+
llama_context: n_ubatch = 512
|
| 132 |
+
llama_context: causal_attn = 1
|
| 133 |
+
llama_context: flash_attn = auto
|
| 134 |
+
llama_context: kv_unified = false
|
| 135 |
+
llama_context: freq_base = 5000000.0
|
| 136 |
+
llama_context: freq_scale = 1
|
| 137 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 138 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 139 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 140 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 143 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 144 |
+
llama_context: CUDA0 compute buffer size = 1043.62 MiB
|
| 145 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 146 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 147 |
+
llama_context: graph nodes = 1267
|
| 148 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 149 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 150 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 154 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 155 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 156 |
+
|
| 157 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 158 |
+
perplexity: tokenizing the input ..
|
| 159 |
+
perplexity: tokenization took 48.138 ms
|
| 160 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 161 |
+
perplexity: 1.83 seconds per pass - ETA 0.45 minutes
|
| 162 |
+
[1]8.3747,[2]10.5787,[3]10.9382,[4]10.5724,[5]10.2969,[6]8.7639,[7]7.8650,[8]7.8456,[9]8.2816,[10]8.4241,[11]8.4404,[12]8.7813,[13]8.8096,[14]8.9343,[15]9.0087,
|
| 163 |
+
Final estimate: PPL = 9.0087 +/- 0.20944
|
| 164 |
+
|
| 165 |
+
llama_perf_context_print: load time = 1119.23 ms
|
| 166 |
+
llama_perf_context_print: prompt eval time = 23569.98 ms / 30720 tokens ( 0.77 ms per token, 1303.35 tokens per second)
|
| 167 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 168 |
+
llama_perf_context_print: total time = 24012.42 ms / 30721 tokens
|
| 169 |
+
llama_perf_context_print: graphs reused = 0
|
| 170 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 171 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 17136 + (2901 = 1778 + 80 + 1043) + 4077 |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21012 + (1932 = 1778 + 80 + 74) + 1179 |
|
| 173 |
+
llama_memory_breakdown_print: | - Host | 3724 = 3587 + 128 + 9 |
|
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_MXFP4-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/perplexity_math.log
ADDED
|
@@ -0,0 +1,173 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20226 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 398 tensors from /mnt/world8/AI/ToBench/Qwen3-4B-Instruct-2507-unsloth/Magic_Quant/GGUF/dc_round-1_Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_BF16-attn_q_MXFP4-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507 Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct-unsloth
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.count u32 = 1
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 4B Instruct 2507
|
| 22 |
+
llama_model_loader: - kv 11: general.base_model.0.version str = 2507
|
| 23 |
+
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
|
| 24 |
+
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 25 |
+
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.block_count u32 = 36
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.context_length u32 = 262144
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
|
| 31 |
+
llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
|
| 32 |
+
llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 5000000.000000
|
| 33 |
+
llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 34 |
+
llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151654
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type bf16: 217 tensors
|
| 49 |
+
llama_model_loader: - type mxfp4: 36 tensors
|
| 50 |
+
print_info: file format = GGUF V3 (latest)
|
| 51 |
+
print_info: file type = MXFP4 MoE
|
| 52 |
+
print_info: file size = 6.98 GiB (14.90 BPW)
|
| 53 |
+
load: printing all EOG tokens:
|
| 54 |
+
load: - 151643 ('<|endoftext|>')
|
| 55 |
+
load: - 151645 ('<|im_end|>')
|
| 56 |
+
load: - 151662 ('<|fim_pad|>')
|
| 57 |
+
load: - 151663 ('<|repo_name|>')
|
| 58 |
+
load: - 151664 ('<|file_sep|>')
|
| 59 |
+
load: special tokens cache size = 26
|
| 60 |
+
load: token to piece cache size = 0.9311 MB
|
| 61 |
+
print_info: arch = qwen3
|
| 62 |
+
print_info: vocab_only = 0
|
| 63 |
+
print_info: n_ctx_train = 262144
|
| 64 |
+
print_info: n_embd = 2560
|
| 65 |
+
print_info: n_embd_inp = 2560
|
| 66 |
+
print_info: n_layer = 36
|
| 67 |
+
print_info: n_head = 32
|
| 68 |
+
print_info: n_head_kv = 8
|
| 69 |
+
print_info: n_rot = 128
|
| 70 |
+
print_info: n_swa = 0
|
| 71 |
+
print_info: is_swa_any = 0
|
| 72 |
+
print_info: n_embd_head_k = 128
|
| 73 |
+
print_info: n_embd_head_v = 128
|
| 74 |
+
print_info: n_gqa = 4
|
| 75 |
+
print_info: n_embd_k_gqa = 1024
|
| 76 |
+
print_info: n_embd_v_gqa = 1024
|
| 77 |
+
print_info: f_norm_eps = 0.0e+00
|
| 78 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 79 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 80 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 81 |
+
print_info: f_logit_scale = 0.0e+00
|
| 82 |
+
print_info: f_attn_scale = 0.0e+00
|
| 83 |
+
print_info: n_ff = 9728
|
| 84 |
+
print_info: n_expert = 0
|
| 85 |
+
print_info: n_expert_used = 0
|
| 86 |
+
print_info: n_expert_groups = 0
|
| 87 |
+
print_info: n_group_used = 0
|
| 88 |
+
print_info: causal attn = 1
|
| 89 |
+
print_info: pooling type = -1
|
| 90 |
+
print_info: rope type = 2
|
| 91 |
+
print_info: rope scaling = linear
|
| 92 |
+
print_info: freq_base_train = 5000000.0
|
| 93 |
+
print_info: freq_scale_train = 1
|
| 94 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 95 |
+
print_info: rope_finetuned = unknown
|
| 96 |
+
print_info: model type = 4B
|
| 97 |
+
print_info: model params = 4.02 B
|
| 98 |
+
print_info: general.name = Qwen3 4B Instruct 2507 Unsloth
|
| 99 |
+
print_info: vocab type = BPE
|
| 100 |
+
print_info: n_vocab = 151936
|
| 101 |
+
print_info: n_merges = 151387
|
| 102 |
+
print_info: BOS token = 11 ','
|
| 103 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 104 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 106 |
+
print_info: LF token = 198 'Ċ'
|
| 107 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 108 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 109 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 110 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 111 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 112 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 113 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 114 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 115 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 116 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 117 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 118 |
+
print_info: max token length = 256
|
| 119 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 120 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 121 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 122 |
+
load_tensors: CPU_Mapped model buffer size = 3587.21 MiB
|
| 123 |
+
load_tensors: CUDA0 model buffer size = 1778.33 MiB
|
| 124 |
+
load_tensors: CUDA1 model buffer size = 1778.33 MiB
|
| 125 |
+
...........................................................................................
|
| 126 |
+
llama_context: constructing llama_context
|
| 127 |
+
llama_context: n_seq_max = 1
|
| 128 |
+
llama_context: n_ctx = 2048
|
| 129 |
+
llama_context: n_ctx_seq = 2048
|
| 130 |
+
llama_context: n_batch = 2048
|
| 131 |
+
llama_context: n_ubatch = 512
|
| 132 |
+
llama_context: causal_attn = 1
|
| 133 |
+
llama_context: flash_attn = auto
|
| 134 |
+
llama_context: kv_unified = false
|
| 135 |
+
llama_context: freq_base = 5000000.0
|
| 136 |
+
llama_context: freq_scale = 1
|
| 137 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 138 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 139 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 140 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 143 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 144 |
+
llama_context: CUDA0 compute buffer size = 1043.62 MiB
|
| 145 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 146 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 147 |
+
llama_context: graph nodes = 1267
|
| 148 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 149 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 150 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 154 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 155 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 156 |
+
|
| 157 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 158 |
+
perplexity: tokenizing the input ..
|
| 159 |
+
perplexity: tokenization took 44.615 ms
|
| 160 |
+
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 161 |
+
perplexity: 1.82 seconds per pass - ETA 0.48 minutes
|
| 162 |
+
[1]5.5670,[2]6.2043,[3]6.4459,[4]6.6287,[5]6.8568,[6]6.8001,[7]6.7590,[8]6.6583,[9]6.6852,[10]6.6316,[11]6.6697,[12]6.6539,[13]6.7366,[14]6.7423,[15]6.7408,[16]6.7266,
|
| 163 |
+
Final estimate: PPL = 6.7266 +/- 0.13709
|
| 164 |
+
|
| 165 |
+
llama_perf_context_print: load time = 1124.66 ms
|
| 166 |
+
llama_perf_context_print: prompt eval time = 25130.11 ms / 32768 tokens ( 0.77 ms per token, 1303.93 tokens per second)
|
| 167 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 168 |
+
llama_perf_context_print: total time = 25599.70 ms / 32769 tokens
|
| 169 |
+
llama_perf_context_print: graphs reused = 0
|
| 170 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 171 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 17133 + (2901 = 1778 + 80 + 1043) + 4079 |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21012 + (1932 = 1778 + 80 + 74) + 1179 |
|
| 173 |
+
llama_memory_breakdown_print: | - Host | 3724 = 3587 + 128 + 9 |
|
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_IQ4_NL-attn_q_Q8_0-embeddings_Q8_0-ffn_down_Q8_0-ffn_up_gate_Q8_0/bench_metrics.json
ADDED
|
@@ -0,0 +1,44 @@
{
  "raw_metrics": {
    "llamabench": {
      "backend": "CUDA",
      "log_path": "Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_IQ4_NL-attn_q_Q8_0-embeddings_Q8_0-ffn_down_Q8_0-ffn_up_gate_Q8_0/llamabench.md",
      "ngl": "35",
      "raw_row": {
        "backend": "CUDA",
        "model": "qwen3 4B MXFP4 MoE",
        "ngl": "35",
        "params": "4.02 B",
        "size": "3.97 GiB",
        "t/s": "358.71 \u00b1 6.70",
        "test": "pp8",
        "tps_value": 358.71
      },
      "test": "pp8",
      "tps": 358.71
    },
    "perplexity": {
      "code": {
        "log_path": "Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_IQ4_NL-attn_q_Q8_0-embeddings_Q8_0-ffn_down_Q8_0-ffn_up_gate_Q8_0/perplexity_code.log",
        "ppl": 1.5476,
        "ppl_error": 0.01221
      },
      "general": {
        "log_path": "Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_IQ4_NL-attn_q_Q8_0-embeddings_Q8_0-ffn_down_Q8_0-ffn_up_gate_Q8_0/perplexity_general.log",
        "ppl": 8.8795,
        "ppl_error": 0.20529
      },
      "math": {
        "log_path": "Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_IQ4_NL-attn_q_Q8_0-embeddings_Q8_0-ffn_down_Q8_0-ffn_up_gate_Q8_0/perplexity_math.log",
        "ppl": 6.7286,
        "ppl_error": 0.13733
      }
    }
  },
  "summary": {
    "avg_prec_loss_pct": 0.1276,
    "bench_tps": 358.71,
    "file_size_bytes": 4268608736,
    "file_size_gb": 3.98
  }
}
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_IQ4_NL-attn_q_Q8_0-embeddings_Q8_0-ffn_down_Q8_0-ffn_up_gate_Q8_0/llamabench.md
ADDED
|
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3 4B MXFP4 MoE | 3.97 GiB | 4.02 B | CUDA | 35 | pp8 | 358.71 ± 6.70 |
| qwen3 4B MXFP4 MoE | 3.97 GiB | 4.02 B | CUDA | 35 | tg128 | 58.99 ± 0.73 |

build: 92bb442ad (7040)
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_IQ4_NL-attn_q_Q8_0-embeddings_Q8_0-ffn_down_Q8_0-ffn_up_gate_Q8_0/perplexity_code.log
ADDED
|
@@ -0,0 +1,174 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20215 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 398 tensors from /mnt/world8/AI/ToBench/Qwen3-4B-Instruct-2507-unsloth/Magic_Quant/GGUF/dc_round3_Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_IQ4_NL-attn_q_Q8_0-embeddings_Q8_0-ffn_down_Q8_0-ffn_up_gate_Q8_0.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507 Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct-unsloth
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.count u32 = 1
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 4B Instruct 2507
|
| 22 |
+
llama_model_loader: - kv 11: general.base_model.0.version str = 2507
|
| 23 |
+
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
|
| 24 |
+
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 25 |
+
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.block_count u32 = 36
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.context_length u32 = 262144
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
|
| 31 |
+
llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
|
| 32 |
+
llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 5000000.000000
|
| 33 |
+
llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 34 |
+
llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151654
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 145 tensors
|
| 49 |
+
llama_model_loader: - type iq4_nl: 36 tensors
|
| 50 |
+
llama_model_loader: - type bf16: 72 tensors
|
| 51 |
+
print_info: file format = GGUF V3 (latest)
|
| 52 |
+
print_info: file type = MXFP4 MoE
|
| 53 |
+
print_info: file size = 3.97 GiB (8.48 BPW)
|
| 54 |
+
load: printing all EOG tokens:
|
| 55 |
+
load: - 151643 ('<|endoftext|>')
|
| 56 |
+
load: - 151645 ('<|im_end|>')
|
| 57 |
+
load: - 151662 ('<|fim_pad|>')
|
| 58 |
+
load: - 151663 ('<|repo_name|>')
|
| 59 |
+
load: - 151664 ('<|file_sep|>')
|
| 60 |
+
load: special tokens cache size = 26
|
| 61 |
+
load: token to piece cache size = 0.9311 MB
|
| 62 |
+
print_info: arch = qwen3
|
| 63 |
+
print_info: vocab_only = 0
|
| 64 |
+
print_info: n_ctx_train = 262144
|
| 65 |
+
print_info: n_embd = 2560
|
| 66 |
+
print_info: n_embd_inp = 2560
|
| 67 |
+
print_info: n_layer = 36
|
| 68 |
+
print_info: n_head = 32
|
| 69 |
+
print_info: n_head_kv = 8
|
| 70 |
+
print_info: n_rot = 128
|
| 71 |
+
print_info: n_swa = 0
|
| 72 |
+
print_info: is_swa_any = 0
|
| 73 |
+
print_info: n_embd_head_k = 128
|
| 74 |
+
print_info: n_embd_head_v = 128
|
| 75 |
+
print_info: n_gqa = 4
|
| 76 |
+
print_info: n_embd_k_gqa = 1024
|
| 77 |
+
print_info: n_embd_v_gqa = 1024
|
| 78 |
+
print_info: f_norm_eps = 0.0e+00
|
| 79 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 80 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 81 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 82 |
+
print_info: f_logit_scale = 0.0e+00
|
| 83 |
+
print_info: f_attn_scale = 0.0e+00
|
| 84 |
+
print_info: n_ff = 9728
|
| 85 |
+
print_info: n_expert = 0
|
| 86 |
+
print_info: n_expert_used = 0
|
| 87 |
+
print_info: n_expert_groups = 0
|
| 88 |
+
print_info: n_group_used = 0
|
| 89 |
+
print_info: causal attn = 1
|
| 90 |
+
print_info: pooling type = -1
|
| 91 |
+
print_info: rope type = 2
|
| 92 |
+
print_info: rope scaling = linear
|
| 93 |
+
print_info: freq_base_train = 5000000.0
|
| 94 |
+
print_info: freq_scale_train = 1
|
| 95 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 96 |
+
print_info: rope_finetuned = unknown
|
| 97 |
+
print_info: model type = 4B
|
| 98 |
+
print_info: model params = 4.02 B
|
| 99 |
+
print_info: general.name = Qwen3 4B Instruct 2507 Unsloth
|
| 100 |
+
print_info: vocab type = BPE
|
| 101 |
+
print_info: n_vocab = 151936
|
| 102 |
+
print_info: n_merges = 151387
|
| 103 |
+
print_info: BOS token = 11 ','
|
| 104 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 106 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 107 |
+
print_info: LF token = 198 'Ċ'
|
| 108 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 109 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 110 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 111 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 115 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 116 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 117 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 118 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 119 |
+
print_info: max token length = 256
|
| 120 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 121 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 122 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 123 |
+
load_tensors: CPU_Mapped model buffer size = 2025.71 MiB
|
| 124 |
+
load_tensors: CUDA0 model buffer size = 1019.74 MiB
|
| 125 |
+
load_tensors: CUDA1 model buffer size = 1019.74 MiB
|
| 126 |
+
............................................................................................
|
| 127 |
+
llama_context: constructing llama_context
|
| 128 |
+
llama_context: n_seq_max = 1
|
| 129 |
+
llama_context: n_ctx = 2048
|
| 130 |
+
llama_context: n_ctx_seq = 2048
|
| 131 |
+
llama_context: n_batch = 2048
|
| 132 |
+
llama_context: n_ubatch = 512
|
| 133 |
+
llama_context: causal_attn = 1
|
| 134 |
+
llama_context: flash_attn = auto
|
| 135 |
+
llama_context: kv_unified = false
|
| 136 |
+
llama_context: freq_base = 5000000.0
|
| 137 |
+
llama_context: freq_scale = 1
|
| 138 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 139 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 140 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 143 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 144 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 145 |
+
llama_context: CUDA0 compute buffer size = 695.87 MiB
|
| 146 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 147 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 148 |
+
llama_context: graph nodes = 1267
|
| 149 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 150 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 154 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 155 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 156 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 157 |
+
|
| 158 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 159 |
+
perplexity: tokenizing the input ..
|
| 160 |
+
perplexity: tokenization took 119.268 ms
|
| 161 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 162 |
+
perplexity: 1.36 seconds per pass - ETA 0.98 minutes
|
| 163 |
+
[1]3.1607,[2]2.4745,[3]1.8303,[4]1.6866,[5]1.8041,[6]1.8549,[7]1.8090,[8]1.7799,[9]1.6975,[10]1.6417,[11]1.6083,[12]1.6106,[13]1.5765,[14]1.5541,[15]1.5762,[16]1.5541,[17]1.5411,[18]1.5480,[19]1.5338,[20]1.5142,[21]1.5069,[22]1.5033,[23]1.5251,[24]1.5119,[25]1.5172,[26]1.4997,[27]1.4904,[28]1.4886,[29]1.5041,[30]1.5075,[31]1.4972,[32]1.4867,[33]1.4895,[34]1.4869,[35]1.4862,[36]1.5131,[37]1.5233,[38]1.5291,[39]1.5365,[40]1.5377,[41]1.5312,[42]1.5453,[43]1.5466,[44]1.5476,
|
| 164 |
+
Final estimate: PPL = 1.5476 +/- 0.01221
|
| 165 |
+
|
| 166 |
+
llama_perf_context_print: load time = 786.71 ms
|
| 167 |
+
llama_perf_context_print: prompt eval time = 46093.83 ms / 90112 tokens ( 0.51 ms per token, 1954.97 tokens per second)
|
| 168 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 169 |
+
llama_perf_context_print: total time = 47388.09 ms / 90113 tokens
|
| 170 |
+
llama_perf_context_print: graphs reused = 0
|
| 171 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 18215 + (1795 = 1019 + 80 + 695) + 4104 |
|
| 173 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21772 + (1173 = 1019 + 80 + 74) + 1178 |
|
| 174 |
+
llama_memory_breakdown_print: | - Host | 2162 = 2025 + 128 + 9 |
|
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_IQ4_NL-attn_q_Q8_0-embeddings_Q8_0-ffn_down_Q8_0-ffn_up_gate_Q8_0/perplexity_general.log
ADDED
|
@@ -0,0 +1,174 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20224 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 398 tensors from /mnt/world8/AI/ToBench/Qwen3-4B-Instruct-2507-unsloth/Magic_Quant/GGUF/dc_round3_Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_IQ4_NL-attn_q_Q8_0-embeddings_Q8_0-ffn_down_Q8_0-ffn_up_gate_Q8_0.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507 Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct-unsloth
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.count u32 = 1
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 4B Instruct 2507
|
| 22 |
+
llama_model_loader: - kv 11: general.base_model.0.version str = 2507
|
| 23 |
+
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
|
| 24 |
+
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 25 |
+
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.block_count u32 = 36
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.context_length u32 = 262144
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
|
| 31 |
+
llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
|
| 32 |
+
llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 5000000.000000
|
| 33 |
+
llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 34 |
+
llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151654
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 145 tensors
|
| 49 |
+
llama_model_loader: - type iq4_nl: 36 tensors
|
| 50 |
+
llama_model_loader: - type bf16: 72 tensors
|
| 51 |
+
print_info: file format = GGUF V3 (latest)
|
| 52 |
+
print_info: file type = MXFP4 MoE
|
| 53 |
+
print_info: file size = 3.97 GiB (8.48 BPW)
|
| 54 |
+
load: printing all EOG tokens:
|
| 55 |
+
load: - 151643 ('<|endoftext|>')
|
| 56 |
+
load: - 151645 ('<|im_end|>')
|
| 57 |
+
load: - 151662 ('<|fim_pad|>')
|
| 58 |
+
load: - 151663 ('<|repo_name|>')
|
| 59 |
+
load: - 151664 ('<|file_sep|>')
|
| 60 |
+
load: special tokens cache size = 26
|
| 61 |
+
load: token to piece cache size = 0.9311 MB
|
| 62 |
+
print_info: arch = qwen3
|
| 63 |
+
print_info: vocab_only = 0
|
| 64 |
+
print_info: n_ctx_train = 262144
|
| 65 |
+
print_info: n_embd = 2560
|
| 66 |
+
print_info: n_embd_inp = 2560
|
| 67 |
+
print_info: n_layer = 36
|
| 68 |
+
print_info: n_head = 32
|
| 69 |
+
print_info: n_head_kv = 8
|
| 70 |
+
print_info: n_rot = 128
|
| 71 |
+
print_info: n_swa = 0
|
| 72 |
+
print_info: is_swa_any = 0
|
| 73 |
+
print_info: n_embd_head_k = 128
|
| 74 |
+
print_info: n_embd_head_v = 128
|
| 75 |
+
print_info: n_gqa = 4
|
| 76 |
+
print_info: n_embd_k_gqa = 1024
|
| 77 |
+
print_info: n_embd_v_gqa = 1024
|
| 78 |
+
print_info: f_norm_eps = 0.0e+00
|
| 79 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 80 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 81 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 82 |
+
print_info: f_logit_scale = 0.0e+00
|
| 83 |
+
print_info: f_attn_scale = 0.0e+00
|
| 84 |
+
print_info: n_ff = 9728
|
| 85 |
+
print_info: n_expert = 0
|
| 86 |
+
print_info: n_expert_used = 0
|
| 87 |
+
print_info: n_expert_groups = 0
|
| 88 |
+
print_info: n_group_used = 0
|
| 89 |
+
print_info: causal attn = 1
|
| 90 |
+
print_info: pooling type = -1
|
| 91 |
+
print_info: rope type = 2
|
| 92 |
+
print_info: rope scaling = linear
|
| 93 |
+
print_info: freq_base_train = 5000000.0
|
| 94 |
+
print_info: freq_scale_train = 1
|
| 95 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 96 |
+
print_info: rope_finetuned = unknown
|
| 97 |
+
print_info: model type = 4B
|
| 98 |
+
print_info: model params = 4.02 B
|
| 99 |
+
print_info: general.name = Qwen3 4B Instruct 2507 Unsloth
|
| 100 |
+
print_info: vocab type = BPE
|
| 101 |
+
print_info: n_vocab = 151936
|
| 102 |
+
print_info: n_merges = 151387
|
| 103 |
+
print_info: BOS token = 11 ','
|
| 104 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 106 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 107 |
+
print_info: LF token = 198 'Ċ'
|
| 108 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 109 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 110 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 111 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 115 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 116 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 117 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 118 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 119 |
+
print_info: max token length = 256
|
| 120 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 121 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 122 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 123 |
+
load_tensors: CPU_Mapped model buffer size = 2025.71 MiB
|
| 124 |
+
load_tensors: CUDA0 model buffer size = 1019.74 MiB
|
| 125 |
+
load_tensors: CUDA1 model buffer size = 1019.74 MiB
|
| 126 |
+
............................................................................................
|
| 127 |
+
llama_context: constructing llama_context
|
| 128 |
+
llama_context: n_seq_max = 1
|
| 129 |
+
llama_context: n_ctx = 2048
|
| 130 |
+
llama_context: n_ctx_seq = 2048
|
| 131 |
+
llama_context: n_batch = 2048
|
| 132 |
+
llama_context: n_ubatch = 512
|
| 133 |
+
llama_context: causal_attn = 1
|
| 134 |
+
llama_context: flash_attn = auto
|
| 135 |
+
llama_context: kv_unified = false
|
| 136 |
+
llama_context: freq_base = 5000000.0
|
| 137 |
+
llama_context: freq_scale = 1
|
| 138 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 139 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 140 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 143 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 144 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 145 |
+
llama_context: CUDA0 compute buffer size = 695.87 MiB
|
| 146 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 147 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 148 |
+
llama_context: graph nodes = 1267
|
| 149 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 150 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 154 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 155 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 156 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 157 |
+
|
| 158 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 159 |
+
perplexity: tokenizing the input ..
|
| 160 |
+
perplexity: tokenization took 48.994 ms
|
| 161 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 162 |
+
perplexity: 1.33 seconds per pass - ETA 0.32 minutes
|
| 163 |
+
[1]8.2649,[2]10.3975,[3]10.7435,[4]10.4007,[5]10.1210,[6]8.6322,[7]7.7425,[8]7.7216,[9]8.1426,[10]8.2797,[11]8.3043,[12]8.6426,[13]8.6798,[14]8.8125,[15]8.8795,
|
| 164 |
+
Final estimate: PPL = 8.8795 +/- 0.20529
|
| 165 |
+
|
| 166 |
+
llama_perf_context_print: load time = 760.28 ms
|
| 167 |
+
llama_perf_context_print: prompt eval time = 15901.61 ms / 30720 tokens ( 0.52 ms per token, 1931.88 tokens per second)
|
| 168 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 169 |
+
llama_perf_context_print: total time = 16361.26 ms / 30721 tokens
|
| 170 |
+
llama_perf_context_print: graphs reused = 0
|
| 171 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 18306 + (1795 = 1019 + 80 + 695) + 4012 |
|
| 173 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21772 + (1173 = 1019 + 80 + 74) + 1178 |
|
| 174 |
+
llama_memory_breakdown_print: | - Host | 2162 = 2025 + 128 + 9 |
|
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_IQ4_NL-attn_q_Q8_0-embeddings_Q8_0-ffn_down_Q8_0-ffn_up_gate_Q8_0/perplexity_math.log
ADDED
|
@@ -0,0 +1,174 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20123 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 36 key-value pairs and 398 tensors from /mnt/world8/AI/ToBench/Qwen3-4B-Instruct-2507-unsloth/Magic_Quant/GGUF/dc_round3_Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_IQ4_NL-attn_q_Q8_0-embeddings_Q8_0-ffn_down_Q8_0-ffn_up_gate_Q8_0.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507 Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct-unsloth
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.count u32 = 1
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 4B Instruct 2507
|
| 22 |
+
llama_model_loader: - kv 11: general.base_model.0.version str = 2507
|
| 23 |
+
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
|
| 24 |
+
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 25 |
+
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.block_count u32 = 36
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.context_length u32 = 262144
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
|
| 31 |
+
llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
|
| 32 |
+
llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 5000000.000000
|
| 33 |
+
llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 34 |
+
llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
|
| 35 |
+
llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151654
|
| 43 |
+
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
|
| 44 |
+
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 45 |
+
llama_model_loader: - kv 34: general.quantization_version u32 = 2
|
| 46 |
+
llama_model_loader: - kv 35: general.file_type u32 = 38
|
| 47 |
+
llama_model_loader: - type f32: 145 tensors
|
| 48 |
+
llama_model_loader: - type q8_0: 145 tensors
|
| 49 |
+
llama_model_loader: - type iq4_nl: 36 tensors
|
| 50 |
+
llama_model_loader: - type bf16: 72 tensors
|
| 51 |
+
print_info: file format = GGUF V3 (latest)
|
| 52 |
+
print_info: file type = MXFP4 MoE
|
| 53 |
+
print_info: file size = 3.97 GiB (8.48 BPW)
|
| 54 |
+
load: printing all EOG tokens:
|
| 55 |
+
load: - 151643 ('<|endoftext|>')
|
| 56 |
+
load: - 151645 ('<|im_end|>')
|
| 57 |
+
load: - 151662 ('<|fim_pad|>')
|
| 58 |
+
load: - 151663 ('<|repo_name|>')
|
| 59 |
+
load: - 151664 ('<|file_sep|>')
|
| 60 |
+
load: special tokens cache size = 26
|
| 61 |
+
load: token to piece cache size = 0.9311 MB
|
| 62 |
+
print_info: arch = qwen3
|
| 63 |
+
print_info: vocab_only = 0
|
| 64 |
+
print_info: n_ctx_train = 262144
|
| 65 |
+
print_info: n_embd = 2560
|
| 66 |
+
print_info: n_embd_inp = 2560
|
| 67 |
+
print_info: n_layer = 36
|
| 68 |
+
print_info: n_head = 32
|
| 69 |
+
print_info: n_head_kv = 8
|
| 70 |
+
print_info: n_rot = 128
|
| 71 |
+
print_info: n_swa = 0
|
| 72 |
+
print_info: is_swa_any = 0
|
| 73 |
+
print_info: n_embd_head_k = 128
|
| 74 |
+
print_info: n_embd_head_v = 128
|
| 75 |
+
print_info: n_gqa = 4
|
| 76 |
+
print_info: n_embd_k_gqa = 1024
|
| 77 |
+
print_info: n_embd_v_gqa = 1024
|
| 78 |
+
print_info: f_norm_eps = 0.0e+00
|
| 79 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 80 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 81 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 82 |
+
print_info: f_logit_scale = 0.0e+00
|
| 83 |
+
print_info: f_attn_scale = 0.0e+00
|
| 84 |
+
print_info: n_ff = 9728
|
| 85 |
+
print_info: n_expert = 0
|
| 86 |
+
print_info: n_expert_used = 0
|
| 87 |
+
print_info: n_expert_groups = 0
|
| 88 |
+
print_info: n_group_used = 0
|
| 89 |
+
print_info: causal attn = 1
|
| 90 |
+
print_info: pooling type = -1
|
| 91 |
+
print_info: rope type = 2
|
| 92 |
+
print_info: rope scaling = linear
|
| 93 |
+
print_info: freq_base_train = 5000000.0
|
| 94 |
+
print_info: freq_scale_train = 1
|
| 95 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 96 |
+
print_info: rope_finetuned = unknown
|
| 97 |
+
print_info: model type = 4B
|
| 98 |
+
print_info: model params = 4.02 B
|
| 99 |
+
print_info: general.name = Qwen3 4B Instruct 2507 Unsloth
|
| 100 |
+
print_info: vocab type = BPE
|
| 101 |
+
print_info: n_vocab = 151936
|
| 102 |
+
print_info: n_merges = 151387
|
| 103 |
+
print_info: BOS token = 11 ','
|
| 104 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 105 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 106 |
+
print_info: PAD token = 151654 '<|vision_pad|>'
|
| 107 |
+
print_info: LF token = 198 'Ċ'
|
| 108 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 109 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 110 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 111 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 115 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 116 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 117 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 118 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 119 |
+
print_info: max token length = 256
|
| 120 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 121 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 122 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 123 |
+
load_tensors: CPU_Mapped model buffer size = 2025.71 MiB
|
| 124 |
+
load_tensors: CUDA0 model buffer size = 1019.74 MiB
|
| 125 |
+
load_tensors: CUDA1 model buffer size = 1019.74 MiB
|
| 126 |
+
............................................................................................
|
| 127 |
+
llama_context: constructing llama_context
|
| 128 |
+
llama_context: n_seq_max = 1
|
| 129 |
+
llama_context: n_ctx = 2048
|
| 130 |
+
llama_context: n_ctx_seq = 2048
|
| 131 |
+
llama_context: n_batch = 2048
|
| 132 |
+
llama_context: n_ubatch = 512
|
| 133 |
+
llama_context: causal_attn = 1
|
| 134 |
+
llama_context: flash_attn = auto
|
| 135 |
+
llama_context: kv_unified = false
|
| 136 |
+
llama_context: freq_base = 5000000.0
|
| 137 |
+
llama_context: freq_scale = 1
|
| 138 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 139 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 140 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 141 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 142 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 143 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 144 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 145 |
+
llama_context: CUDA0 compute buffer size = 695.87 MiB
|
| 146 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 147 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 148 |
+
llama_context: graph nodes = 1267
|
| 149 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 150 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 151 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 152 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 153 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 154 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 155 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 156 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 157 |
+
|
| 158 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 159 |
+
perplexity: tokenizing the input ..
|
| 160 |
+
perplexity: tokenization took 47.851 ms
|
| 161 |
+
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 162 |
+
perplexity: 1.35 seconds per pass - ETA 0.35 minutes
|
| 163 |
+
[1]5.5834,[2]6.2280,[3]6.4773,[4]6.6423,[5]6.8724,[6]6.8126,[7]6.7617,[8]6.6564,[9]6.6810,[10]6.6315,[11]6.6698,[12]6.6584,[13]6.7408,[14]6.7456,[15]6.7412,[16]6.7286,
|
| 164 |
+
Final estimate: PPL = 6.7286 +/- 0.13733
|
| 165 |
+
|
| 166 |
+
llama_perf_context_print: load time = 791.64 ms
|
| 167 |
+
llama_perf_context_print: prompt eval time = 16922.89 ms / 32768 tokens ( 0.52 ms per token, 1936.31 tokens per second)
|
| 168 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 169 |
+
llama_perf_context_print: total time = 17393.38 ms / 32769 tokens
|
| 170 |
+
llama_perf_context_print: graphs reused = 0
|
| 171 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 172 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 18221 + (1795 = 1019 + 80 + 695) + 4098 |
|
| 173 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21772 + (1173 = 1019 + 80 + 74) + 1178 |
|
| 174 |
+
llama_memory_breakdown_print: | - Host | 2162 = 2025 + 128 + 9 |
|
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_MXFP4-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/bench_metrics.json
ADDED
|
@@ -0,0 +1,44 @@
{
  "raw_metrics": {
    "llamabench": {
      "backend": "CUDA",
      "log_path": "Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_MXFP4-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/llamabench.md",
      "ngl": "35",
      "raw_row": {
        "backend": "CUDA",
        "model": "qwen3 4B MXFP4 MoE",
        "ngl": "35",
        "params": "4.02 B",
        "size": "6.98 GiB",
        "t/s": "254.90 \u00b1 7.52",
        "test": "pp8",
        "tps_value": 254.9
      },
      "test": "pp8",
      "tps": 254.9
    },
    "perplexity": {
      "code": {
        "log_path": "Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_MXFP4-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/perplexity_code.log",
        "ppl": 1.5442,
        "ppl_error": 0.01204
      },
      "general": {
        "log_path": "Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_MXFP4-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/perplexity_general.log",
        "ppl": 8.8795,
        "ppl_error": 0.20477
      },
      "math": {
        "log_path": "Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_MXFP4-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/perplexity_math.log",
        "ppl": 6.6896,
        "ppl_error": 0.13572
      }
    }
  },
  "summary": {
    "avg_prec_loss_pct": 0.1657,
    "bench_tps": 254.9,
    "file_size_bytes": 7496850656,
    "file_size_gb": 6.98
  }
}
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_MXFP4-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/llamabench.md
ADDED
|
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3 4B MXFP4 MoE | 6.98 GiB | 4.02 B | CUDA | 35 | pp8 | 254.90 ± 7.52 |
| qwen3 4B MXFP4 MoE | 6.98 GiB | 4.02 B | CUDA | 35 | tg128 | 34.49 ± 0.16 |

build: 92bb442ad (7040)
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_MXFP4-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/perplexity_code.log
ADDED
|
@@ -0,0 +1,173 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20267 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
|
| 9 |
+
+llama_model_loader: loaded meta data with 36 key-value pairs and 398 tensors from /mnt/world8/AI/ToBench/Qwen3-4B-Instruct-2507-unsloth/Magic_Quant/GGUF/dc_round-1_Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_MXFP4-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = qwen3
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507 Unsloth
+llama_model_loader: - kv 3: general.version str = 2507
+llama_model_loader: - kv 4: general.finetune str = Instruct-unsloth
+llama_model_loader: - kv 5: general.basename str = Qwen3
+llama_model_loader: - kv 6: general.size_label str = 4B
+llama_model_loader: - kv 7: general.license str = apache-2.0
+llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
+llama_model_loader: - kv 9: general.base_model.count u32 = 1
+llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 4B Instruct 2507
+llama_model_loader: - kv 11: general.base_model.0.version str = 2507
+llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
+llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B-...
+llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
+llama_model_loader: - kv 15: qwen3.block_count u32 = 36
+llama_model_loader: - kv 16: qwen3.context_length u32 = 262144
+llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
+llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
+llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
+llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
+llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 5000000.000000
+llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
+llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
+llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
+llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
+llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151654
+llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
+llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
+llama_model_loader: - kv 34: general.quantization_version u32 = 2
+llama_model_loader: - kv 35: general.file_type u32 = 38
+llama_model_loader: - type f32: 145 tensors
+llama_model_loader: - type bf16: 217 tensors
+llama_model_loader: - type mxfp4: 36 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type = MXFP4 MoE
+print_info: file size = 6.98 GiB (14.90 BPW)
+load: printing all EOG tokens:
+load: - 151643 ('<|endoftext|>')
+load: - 151645 ('<|im_end|>')
+load: - 151662 ('<|fim_pad|>')
+load: - 151663 ('<|repo_name|>')
+load: - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch = qwen3
+print_info: vocab_only = 0
+print_info: n_ctx_train = 262144
+print_info: n_embd = 2560
+print_info: n_embd_inp = 2560
+print_info: n_layer = 36
+print_info: n_head = 32
+print_info: n_head_kv = 8
+print_info: n_rot = 128
+print_info: n_swa = 0
+print_info: is_swa_any = 0
+print_info: n_embd_head_k = 128
+print_info: n_embd_head_v = 128
+print_info: n_gqa = 4
+print_info: n_embd_k_gqa = 1024
+print_info: n_embd_v_gqa = 1024
+print_info: f_norm_eps = 0.0e+00
+print_info: f_norm_rms_eps = 1.0e-06
+print_info: f_clamp_kqv = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale = 0.0e+00
+print_info: f_attn_scale = 0.0e+00
+print_info: n_ff = 9728
+print_info: n_expert = 0
+print_info: n_expert_used = 0
+print_info: n_expert_groups = 0
+print_info: n_group_used = 0
+print_info: causal attn = 1
+print_info: pooling type = -1
+print_info: rope type = 2
+print_info: rope scaling = linear
+print_info: freq_base_train = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn = 262144
+print_info: rope_finetuned = unknown
+print_info: model type = 4B
+print_info: model params = 4.02 B
+print_info: general.name = Qwen3 4B Instruct 2507 Unsloth
+print_info: vocab type = BPE
+print_info: n_vocab = 151936
+print_info: n_merges = 151387
+print_info: BOS token = 11 ','
+print_info: EOS token = 151645 '<|im_end|>'
+print_info: EOT token = 151645 '<|im_end|>'
+print_info: PAD token = 151654 '<|vision_pad|>'
+print_info: LF token = 198 'Ċ'
+print_info: FIM PRE token = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token = 151661 '<|fim_suffix|>'
+print_info: FIM MID token = 151660 '<|fim_middle|>'
+print_info: FIM PAD token = 151662 '<|fim_pad|>'
+print_info: FIM REP token = 151663 '<|repo_name|>'
+print_info: FIM SEP token = 151664 '<|file_sep|>'
+print_info: EOG token = 151643 '<|endoftext|>'
+print_info: EOG token = 151645 '<|im_end|>'
+print_info: EOG token = 151662 '<|fim_pad|>'
+print_info: EOG token = 151663 '<|repo_name|>'
+print_info: EOG token = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors: CPU_Mapped model buffer size = 3587.21 MiB
+load_tensors: CUDA0 model buffer size = 1778.33 MiB
+load_tensors: CUDA1 model buffer size = 1778.33 MiB
+...........................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max = 1
+llama_context: n_ctx = 2048
+llama_context: n_ctx_seq = 2048
+llama_context: n_batch = 2048
+llama_context: n_ubatch = 512
+llama_context: causal_attn = 1
+llama_context: flash_attn = auto
+llama_context: kv_unified = false
+llama_context: freq_base = 5000000.0
+llama_context: freq_scale = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context: CPU output buffer size = 0.58 MiB
+llama_kv_cache: CPU KV buffer size = 128.00 MiB
+llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context: CUDA0 compute buffer size = 1043.62 MiB
+llama_context: CUDA1 compute buffer size = 74.01 MiB
+llama_context: CUDA_Host compute buffer size = 9.01 MiB
+llama_context: graph nodes = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 112.721 ms
+perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 1.82 seconds per pass - ETA 1.33 minutes
+[1]3.1343,[2]2.4621,[3]1.8240,[4]1.6837,[5]1.7982,[6]1.8483,[7]1.8052,[8]1.7759,[9]1.6942,[10]1.6386,[11]1.6054,[12]1.6079,[13]1.5744,[14]1.5519,[15]1.5735,[16]1.5515,[17]1.5388,[18]1.5451,[19]1.5311,[20]1.5118,[21]1.5038,[22]1.5002,[23]1.5213,[24]1.5083,[25]1.5138,[26]1.4964,[27]1.4872,[28]1.4854,[29]1.5008,[30]1.5042,[31]1.4941,[32]1.4834,[33]1.4858,[34]1.4833,[35]1.4827,[36]1.5100,[37]1.5199,[38]1.5260,[39]1.5334,[40]1.5345,[41]1.5281,[42]1.5421,[43]1.5435,[44]1.5442,
+Final estimate: PPL = 1.5442 +/- 0.01204
+
+llama_perf_context_print: load time = 1125.22 ms
+llama_perf_context_print: prompt eval time = 68813.12 ms / 90112 tokens ( 0.76 ms per token, 1309.52 tokens per second)
+llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+llama_perf_context_print: total time = 70091.81 ms / 90113 tokens
+llama_perf_context_print: graphs reused = 0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 17159 + (2901 = 1778 + 80 + 1043) + 4053 |
+llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21012 + (1932 = 1778 + 80 + 74) + 1179 |
+llama_memory_breakdown_print: | - Host | 3724 = 3587 + 128 + 9 |
Benchmarks/DataCollection/Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_MXFP4-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16/perplexity_general.log
ADDED
@@ -0,0 +1,173 @@
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20260 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
+llama_model_loader: loaded meta data with 36 key-value pairs and 398 tensors from /mnt/world8/AI/ToBench/Qwen3-4B-Instruct-2507-unsloth/Magic_Quant/GGUF/dc_round-1_Qwen3-4B-Instruct-2507-unsloth-mxfp4_moe-attn_kv_BF16-attn_output_MXFP4-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = qwen3
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507 Unsloth
+llama_model_loader: - kv 3: general.version str = 2507
+llama_model_loader: - kv 4: general.finetune str = Instruct-unsloth
+llama_model_loader: - kv 5: general.basename str = Qwen3
+llama_model_loader: - kv 6: general.size_label str = 4B
+llama_model_loader: - kv 7: general.license str = apache-2.0
+llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
+llama_model_loader: - kv 9: general.base_model.count u32 = 1
+llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 4B Instruct 2507
+llama_model_loader: - kv 11: general.base_model.0.version str = 2507
+llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
+llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B-...
+llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
+llama_model_loader: - kv 15: qwen3.block_count u32 = 36
+llama_model_loader: - kv 16: qwen3.context_length u32 = 262144
+llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
+llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
+llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
+llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
+llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 5000000.000000
+llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
+llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
+llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
+llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
+llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151654
+llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
+llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
+llama_model_loader: - kv 34: general.quantization_version u32 = 2
+llama_model_loader: - kv 35: general.file_type u32 = 38
+llama_model_loader: - type f32: 145 tensors
+llama_model_loader: - type bf16: 217 tensors
+llama_model_loader: - type mxfp4: 36 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type = MXFP4 MoE
+print_info: file size = 6.98 GiB (14.90 BPW)
+load: printing all EOG tokens:
+load: - 151643 ('<|endoftext|>')
+load: - 151645 ('<|im_end|>')
+load: - 151662 ('<|fim_pad|>')
+load: - 151663 ('<|repo_name|>')
+load: - 151664 ('<|file_sep|>')
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch = qwen3
+print_info: vocab_only = 0
+print_info: n_ctx_train = 262144
+print_info: n_embd = 2560
+print_info: n_embd_inp = 2560
+print_info: n_layer = 36
+print_info: n_head = 32
+print_info: n_head_kv = 8
+print_info: n_rot = 128
+print_info: n_swa = 0
+print_info: is_swa_any = 0
+print_info: n_embd_head_k = 128
+print_info: n_embd_head_v = 128
+print_info: n_gqa = 4
+print_info: n_embd_k_gqa = 1024
+print_info: n_embd_v_gqa = 1024
+print_info: f_norm_eps = 0.0e+00
+print_info: f_norm_rms_eps = 1.0e-06
+print_info: f_clamp_kqv = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale = 0.0e+00
+print_info: f_attn_scale = 0.0e+00
+print_info: n_ff = 9728
+print_info: n_expert = 0
+print_info: n_expert_used = 0
+print_info: n_expert_groups = 0
+print_info: n_group_used = 0
+print_info: causal attn = 1
+print_info: pooling type = -1
+print_info: rope type = 2
+print_info: rope scaling = linear
+print_info: freq_base_train = 5000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn = 262144
+print_info: rope_finetuned = unknown
+print_info: model type = 4B
+print_info: model params = 4.02 B
+print_info: general.name = Qwen3 4B Instruct 2507 Unsloth
+print_info: vocab type = BPE
+print_info: n_vocab = 151936
+print_info: n_merges = 151387
+print_info: BOS token = 11 ','
+print_info: EOS token = 151645 '<|im_end|>'
+print_info: EOT token = 151645 '<|im_end|>'
+print_info: PAD token = 151654 '<|vision_pad|>'
+print_info: LF token = 198 'Ċ'
+print_info: FIM PRE token = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token = 151661 '<|fim_suffix|>'
+print_info: FIM MID token = 151660 '<|fim_middle|>'
+print_info: FIM PAD token = 151662 '<|fim_pad|>'
+print_info: FIM REP token = 151663 '<|repo_name|>'
+print_info: FIM SEP token = 151664 '<|file_sep|>'
+print_info: EOG token = 151643 '<|endoftext|>'
+print_info: EOG token = 151645 '<|im_end|>'
+print_info: EOG token = 151662 '<|fim_pad|>'
+print_info: EOG token = 151663 '<|repo_name|>'
+print_info: EOG token = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/37 layers to GPU
+load_tensors: CPU_Mapped model buffer size = 3587.21 MiB
+load_tensors: CUDA0 model buffer size = 1778.33 MiB
+load_tensors: CUDA1 model buffer size = 1778.33 MiB
+...........................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max = 1
+llama_context: n_ctx = 2048
+llama_context: n_ctx_seq = 2048
+llama_context: n_batch = 2048
+llama_context: n_ubatch = 512
+llama_context: causal_attn = 1
+llama_context: flash_attn = auto
+llama_context: kv_unified = false
+llama_context: freq_base = 5000000.0
+llama_context: freq_scale = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+llama_context: CPU output buffer size = 0.58 MiB
+llama_kv_cache: CPU KV buffer size = 128.00 MiB
+llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context: CUDA0 compute buffer size = 1043.62 MiB
+llama_context: CUDA1 compute buffer size = 74.01 MiB
+llama_context: CUDA_Host compute buffer size = 9.01 MiB
+llama_context: graph nodes = 1267
+llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
+common_init_from_params: added <|endoftext|> logit bias = -inf
+common_init_from_params: added <|im_end|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: added <|repo_name|> logit bias = -inf
+common_init_from_params: added <|file_sep|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 48.133 ms
+perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 1.82 seconds per pass - ETA 0.45 minutes
+[1]8.2731,[2]10.3482,[3]10.7579,[4]10.3988,[5]10.1216,[6]8.6329,[7]7.7520,[8]7.7283,[9]8.1555,[10]8.2887,[11]8.3150,[12]8.6413,[13]8.6724,[14]8.8149,[15]8.8795,
+Final estimate: PPL = 8.8795 +/- 0.20477
+
+llama_perf_context_print: load time = 1135.43 ms
+llama_perf_context_print: prompt eval time = 23588.29 ms / 30720 tokens ( 0.77 ms per token, 1302.34 tokens per second)
+llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+llama_perf_context_print: total time = 24030.83 ms / 30721 tokens
+llama_perf_context_print: graphs reused = 0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 17155 + (2901 = 1778 + 80 + 1043) + 4058 |
+llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21012 + (1932 = 1778 + 80 + 74) + 1179 |
+llama_memory_breakdown_print: | - Host | 3724 = 3587 + 128 + 9 |