Note: Overall leaderboard rankings may not reflect true model quality; individual benchmarks give a clearer picture.

SuperGPQA

19 models

SuperGPQA (Scaling LLM Evaluation across 285 Graduate Disciplines) measures a model's graduate-level knowledge and reasoning across 285 specialized academic disciplines. Its coverage extends well beyond mainstream subjects to include light industry, agriculture, service-oriented fields, and more. It employs a Human-LLM collaborative filtering mechanism with over 80 expert annotators to eliminate trivial or ambiguous questions, yielding a challenging benchmark that reveals significant gaps between current model capabilities and artificial general intelligence.

Top 10 Models Performance

qwen/qwen3.6-plus ######################################## 71.6
qwen/qwen3.5-397b-a17b ####################################### 70.4
qwen/qwen3.5-122b-a10b ##################################### 67.1
orionllm/grm-2.6-plus ##################################### 66.4
qwen/qwen3.5-27b ##################################### 66
qwen/qwen3.5-9b ################################# 58.2
tencent/hy3-preview-base ############################# 51.6
openai/gpt-oss-20b ######################## 42.6
qwen/qwen3-8b ###################### 39.8
tiger-lab/general-reasoner-qwen3-4b ################## 32.5
Rank  Model                                Score
🥇    qwen/qwen3.6-plus                    71.6
🥈    qwen/qwen3.5-397b-a17b               70.4
🥉    qwen/qwen3.5-122b-a10b               67.1
4     orionllm/grm-2.6-plus                66.4
5     qwen/qwen3.5-27b                     66
6     qwen/qwen3.5-9b                      58.2
7     tencent/hy3-preview-base             51.6
8     openai/gpt-oss-20b                   42.6
9     qwen/qwen3-8b                        39.8
10    tiger-lab/general-reasoner-qwen3-4b  32.5
11    essentialai/rnj-1-instruct           28.2
12    qwen/qwen2.5-coder-7b-instruct       23.5
13    openbmb/minicpm5-1b                  23.14
14    qwen/qwen3-1.7b-base                 20.92
15    qwen/qwen2.5-1.5b                    17.64
16    qwen/qwen3.5-0.8b                    16.9
17    qwen/qwen3-0.6b-base                 15.03
18    qwen/qwen2.5-0.5b                    11.3
19    google/gemma-3-1b-pt                 7.19
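
The bars in the Top 10 chart appear to be drawn in proportion to the highest score. The page does not say how it scales them, so the following is only a minimal sketch of one way to regenerate such a chart from the scores listed above. The 40-character maximum width, the round-to-nearest rule, and the names `TOP10`, `MAX_WIDTH`, and `render_bars` are all assumptions made for illustration.

```python
# Top-10 scores copied from the leaderboard table above.
TOP10 = [
    ("qwen/qwen3.6-plus", 71.6),
    ("qwen/qwen3.5-397b-a17b", 70.4),
    ("qwen/qwen3.5-122b-a10b", 67.1),
    ("orionllm/grm-2.6-plus", 66.4),
    ("qwen/qwen3.5-27b", 66.0),
    ("qwen/qwen3.5-9b", 58.2),
    ("tencent/hy3-preview-base", 51.6),
    ("openai/gpt-oss-20b", 42.6),
    ("qwen/qwen3-8b", 39.8),
    ("tiger-lab/general-reasoner-qwen3-4b", 32.5),
]

# Assumed width of the longest bar; the source page does not state it.
MAX_WIDTH = 40


def render_bars(rows, max_width=MAX_WIDTH):
    """Return one text line per model, with bar length proportional to score."""
    top = max(score for _, score in rows)
    lines = []
    for model, score in rows:
        # Scale relative to the best score and round to the nearest character.
        bar = "#" * round(score / top * max_width)
        # ":g" drops a trailing ".0", so 66.0 is printed as 66.
        lines.append(f"{model} {bar} {score:g}")
    return lines


if __name__ == "__main__":
    for line in render_bars(TOP10):
        print(line)
```

Scaling against the top score rather than against 100 keeps the leading bar at full width, which makes small gaps between the leaders easier to see but exaggerates differences relative to the absolute scale.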