| Input price | $/1M tokens | $0.50 | $2.00 ($4.00 above 200k tokens) | $0.30 | $1.25 ($2.50 above 200k tokens) | $3.00 ($6.00 above 200k tokens) | $1.75 | $0.20 |
| Output price | $/1M tokens | $3.00 | $12.00 ($18.00 above 200k tokens) | $2.50 | $10.00 ($15.00 above 200k tokens) | $15.00 ($22.50 above 200k tokens) | $14.00 | $0.50 |
| Academic reasoning: Humanity's Last Exam | No tools | 33.7% | 37.5% | 11.0% | 21.6% | 13.7% | 34.5% | 17.6% |
| Information synthesis from complex charts: CharXiv Reasoning | No tools | 80.3% | 81.4% | 63.7% | 69.6% | 68.5% | 82.1% | — |
| OCR: OmniDocBench 1.5 | Overall edit distance, lower is better | 0.121 | 0.115 | 0.154 | 0.145 | 0.145 | 0.143 | — |
| Competitive coding problems from Codeforces, ICPC, and IOI: LiveCodeBench Pro | Elo rating, higher is better | 2316 | 2439 | 1143 | 1775 | 1418 | 2393 | — |
| Agentic terminal coding: Terminal-Bench 2.0 | Terminus-2 harness | 47.6% | 54.2% | 16.9% | 32.6% | 42.8% | — | — |
| Multi-step workflow automation using MCP: MCP Atlas | | 57.4% | 54.1% | 3.4% | 8.8% | 43.8% | 60.6% | — |
| Factuality benchmarks covering grounding, parametric knowledge, search, and multimodality: FACTS Benchmark Suite | | 61.9% | 70.5% | 50.4% | 63.4% | 48.9% | 61.4% | 42.1% |
| Commonsense reasoning across 100 languages and cultures: Global PIQA | | 92.8% | 93.4% | 90.2% | 91.5% | 90.1% | 91.2% | 85.6% |
| Long-context performance: MRCR v2 (8-needle) | 128k (average) | 67.2% | 77.0% | 54.3% | 58.0% | 47.1% | 81.9% | 54.6% |