Note: Overall leaderboard rankings may not reflect true model quality; individual benchmarks give a clearer picture.

WikiText-2 (-ppl)

20 models

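This benchmark measures how well a language model predicts Wikipedia text from the WikiText-2 dataset. Scores are reported as negative perplexity (-ppl), so values closer to zero are better.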
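
As a rough illustration of what a "-ppl" score could mean, the sketch below computes perplexity from per-token natural-log probabilities and negates it so that higher is better. The function name `negative_perplexity` and the sample log-probabilities are assumptions made for illustration. The leaderboard's actual evaluation code, tokenization, and context handling are not described on this page and may differ.

```python
import math
from typing import Sequence


def negative_perplexity(token_logprobs: Sequence[float]) -> float:
    """Return -exp(mean negative log-likelihood).

    token_logprobs: natural-log probabilities the model assigned to each
    reference token (hypothetical input; how they are produced is up to
    the evaluation harness).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # Average negative log-likelihood per token.
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    # Perplexity is exp(mean NLL); negate it so that larger scores are better.
    return -math.exp(mean_nll)


# Hypothetical example: a model that gives every token probability 1/8.
uniform_logprobs = [math.log(1 / 8)] * 5
score = negative_perplexity(uniform_logprobs)
print(round(score, 4))
```

A model that assigns probability 1/8 to every token has a perplexity of 8, so its score under this convention is about -8, which would sit close to the top entries listed on this page.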

Top 10 Models Performance (bar length scales with perplexity; shorter is better)

qwen/qwen3.5-9b ################ -8.324
samfatnassi/kilma-v1-base ################## -9.0167
huggingfacetb/smollm3-3b ################### -9.7604
huggingfacetb/smollm2-360m ######################## -12.1571
huggingfacetb/smollm2-360m-instruct ########################## -13.6027
qwen/qwen2.5-0.5b ############################ -14.3732
qwen/qwen3-0.6b-base ############################ -14.639
google/gemma-3-4b-it ################################ -16.7015
qwen/qwen2.5-0.5b-instruct ################################## -17.493
qwen/qwen3.5-0.8b ######################################## -20.5747
Rank  Model                                Score
🥇    qwen/qwen3.5-9b                      -8.324
🥈    samfatnassi/kilma-v1-base            -9.0167
🥉    huggingfacetb/smollm3-3b             -9.7604
4     huggingfacetb/smollm2-360m           -12.1571
5     huggingfacetb/smollm2-360m-instruct  -13.6027
6     qwen/qwen2.5-0.5b                    -14.3732
7     qwen/qwen3-0.6b-base                 -14.639
8     google/gemma-3-4b-it                 -16.7015
9     qwen/qwen2.5-0.5b-instruct           -17.493
10    qwen/qwen3.5-0.8b                    -20.5747
11    liquid/lfm-2.5-1.2b-instruct         -21.9587
12    google/gemma-4-e4b                   -23.1498
13    qwen/qwen3-0.6b                      -24.0627
14    openai-community/gpt2                -28.8355
15    liquid/lfm-2.5-350m                  -47.55099
16    appvoid/carbono-001                  -51.4911
17    google/gemma-3-270m-it               -65.6058
18    google/gemma-4-e4b-it                -68.255
19    raincandy-u/rain-100m                -107.9683
20    sapbot/toyllama-50m                  -405.2885