update bmk numbers
README.md
CHANGED
@@ -52,29 +52,29 @@ Performance of Step 3.5 Flash measured across **Reasoning**, **Coding**, and **A
 ### Detailed Benchmarks
 
 | Benchmark | Step 3.5 Flash | DeepSeek V3.2 | Kimi K2 Thinking / K2.5 | GLM-4.7 | MiniMax M2.1 | MiMo-V2 Flash |
-
+| --- | --- | --- | --- | --- | --- | --- |
 | # Activated Params | 11B | 37B | 32B | 32B | 10B | 15B |
 | # Total Params (MoE) | 196B | 671B | 1T | 355B | 230B | 309B |
-| Est. decoding cost
-|
-| τ²-Bench |
-| BrowseComp | 51.6 | 51.4 | 41.5* /
-| BrowseComp (w/ Context Manager) | 69.0 | 67.6 | 60.2
-| BrowseComp-ZH |
-| BrowseComp-ZH (w/ Context Manager) |
-| GAIA (no file) |
-| xbench-DeepSearch (2025.05) |
-| xbench-DeepSearch (2025.10) |
-| ResearchRubrics |
-|
-| AIME 2025 |
-| HMMT 2025 (Feb.) |
-| HMMT 2025 (Nov.) |
-| IMOAnswerBench |
-|
-| LiveCodeBench-V6 |
-| SWE-bench Verified | 74.4 | 73.1 | 71.3
-| Terminal-Bench 2.0 |
+| Est. decoding cost @ 128K context, Hopper GPU** | **1.0x**<br>100 tok/s, MTP-3, EP8 | **6.0x**<br>33 tok/s, MTP-1, EP32 | **18.9x**<br>33 tok/s, no MTP, EP32 | **18.9x**<br>100 tok/s, MTP-3, EP8 | **3.9x**<br>100 tok/s, MTP-3, EP8 | **1.2x**<br>100 tok/s, MTP-3, EP8 |
+| | | | **Agent** | | | |
+| τ²-Bench | 88.2 | 80.3 (85.2*) | 74.3*/85.4* | 87.4 | 86.6* | 80.3 (84.1*) |
+| BrowseComp | 51.6 | 51.4 | 41.5* / 60.6 | 52.0 | 47.4 | 45.4 |
+| BrowseComp (w/ Context Manager) | 69.0 | 67.6 | 60.2/74.9 | 67.5 | 62.0 | 58.3 |
+| BrowseComp-ZH | 66.9 | 65.0 | 62.3 / 62.3* | 66.6 | 47.8* | 51.2* |
+| BrowseComp-ZH (w/ Context Manager) | 73.7 | — | —/— | — | — | — |
+| GAIA (no file) | 84.5 | 75.1* | 75.6*/75.9* | 61.9* | 64.3* | 78.2* |
+| xbench-DeepSearch (2025.05) | 83.7 | 78.0* | 76.0*/76.7* | 72.0* | 68.7* | 69.3* |
+| xbench-DeepSearch (2025.10) | 56.3 | 55.7* | —/40+ | 52.3* | 43.0* | 44.0* |
+| ResearchRubrics | 65.3 | 55.8* | 56.2*/59.5* | 62.0* | 60.2* | 54.3* |
+| | | | **Reasoning** | | | |
+| AIME 2025 | 97.3 | 93.1 | 94.5/96.1 | 95.7 | 83.0 | 94.1 (95.1*) |
+| HMMT 2025 (Feb.) | 98.4 | 92.5 | 89.4/95.4 | 97.1 | 71.0* | 84.4 (95.4*) |
+| HMMT 2025 (Nov.) | 94.0 | 90.2 | 89.2*/— | 93.5 | 74.3* | 91.0* |
+| IMOAnswerBench | 85.4 | 78.3 | 78.6/81.8 | 82.0 | 60.4* | 80.9* |
+| | | | **Coding** | | | |
+| LiveCodeBench-V6 | 86.4 | 83.3 | 83.1/85.0 | 84.9 | — | 80.6 (81.6*) |
+| SWE-bench Verified | 74.4 | 73.1 | 71.3/76.8 | 73.8 | 74.0 | 73.4 |
+| Terminal-Bench 2.0 | 51.0 | 46.4 | 35.7*/50.8 | 41.0 | 47.9 | 38.5 |
 
 **Notes**:
 1. "—" indicates the score is not publicly available or not tested.