SuperGPQA
19 models
SuperGPQA (Scaling LLM Evaluation across 285 Graduate Disciplines) measures a model's graduate-level knowledge and reasoning across 285 specialized academic disciplines, extending well beyond mainstream subjects to cover light industry, agriculture, service-oriented fields, and more. It uses a Human-LLM collaborative filtering mechanism, involving more than 80 expert annotators, to eliminate trivial or ambiguous questions. The result is a challenging benchmark that exposes a significant gap between current model capabilities and artificial general intelligence.
Top 10 Models Performance
| Model | Relative score | Score |
|---|---|---|
| qwen/qwen3.6-plus | ######################################## | 71.6 |
| qwen/qwen3.5-397b-a17b | ####################################### | 70.4 |
| qwen/qwen3.5-122b-a10b | ##################################### | 67.1 |
| orionllm/grm-2.6-plus | ##################################### | 66.4 |
| qwen/qwen3.5-27b | ##################################### | 66 |
| qwen/qwen3.5-9b | ################################# | 58.2 |
| tencent/hy3-preview-base | ############################# | 51.6 |
| openai/gpt-oss-20b | ######################## | 42.6 |
| qwen/qwen3-8b | ###################### | 39.8 |
| tiger-lab/general-reasoner-qwen3-4b | ################## | 32.5 |

| Rank | Model | Score |
|---|---|---|
| 🥇 | qwen/qwen3.6-plus | 71.6 |
| 🥈 | qwen/qwen3.5-397b-a17b | 70.4 |
| 🥉 | qwen/qwen3.5-122b-a10b | 67.1 |
| 4 | orionllm/grm-2.6-plus | 66.4 |
| 5 | qwen/qwen3.5-27b | 66 |
| 6 | qwen/qwen3.5-9b | 58.2 |
| 7 | tencent/hy3-preview-base | 51.6 |
| 8 | openai/gpt-oss-20b | 42.6 |
| 9 | qwen/qwen3-8b | 39.8 |
| 10 | tiger-lab/general-reasoner-qwen3-4b | 32.5 |
| 11 | essentialai/rnj-1-instruct | 28.2 |
| 12 | qwen/qwen2.5-coder-7b-instruct | 23.5 |
| 13 | openbmb/minicpm5-1b | 23.14 |
| 14 | qwen/qwen3-1.7b-base | 20.92 |
| 15 | qwen/qwen2.5-1.5b | 17.64 |
| 16 | qwen/qwen3.5-0.8b | 16.9 |
| 17 | qwen/qwen3-0.6b-base | 15.03 |
| 18 | qwen/qwen2.5-0.5b | 11.3 |
| 19 | google/gemma-3-1b-pt | 7.19 |
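
The bars in the Top 10 chart appear to be scaled against the top score rather than against 100. The page does not document the rule, so the following is only a minimal sketch under that assumption: the helper `bar` and the 40-character width are hypothetical, inferred from the chart, and the scores are copied from the table above.

```python
def bar(score: float, top: float, width: int = 40) -> str:
    """Return a '#' bar whose length is proportional to score / top.

    Assumption: bars are scaled to the highest score and rounded to the
    nearest character; this rule is inferred, not stated by the page.
    """
    return "#" * round(score / top * width)


# Three scores taken from the ranking table, for illustration.
scores = {
    "qwen/qwen3.6-plus": 71.6,
    "qwen/qwen3.5-397b-a17b": 70.4,
    "tiger-lab/general-reasoner-qwen3-4b": 32.5,
}

top = max(scores.values())
for model, score in scores.items():
    print(f"| {model} | {bar(score, top)} | {score} |")
```

Under this scaling the top model always fills the full width, so bar lengths show each model's standing relative to the leader, not its absolute score.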