Note: Overall leaderboard rankings may not reflect true model quality; individual benchmarks give a clearer picture.
CoderPPL

31 models

CoderPPL is a code perplexity (PPL) benchmark that measures how well a language model predicts real-world, hand-written code across 5 languages (Python, JavaScript, HTML, Lua, Shell) and diverse programming domains. Unlike natural-language PPL, code PPL tests syntax understanding, structural reasoning, and domain-specific patterns. Models are ranked from the highest score to the lowest.
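For readers unfamiliar with perplexity, here is a minimal sketch of the standard definition: the exponential of the average negative log-probability a model assigns to each token. This page does not state how the listed scores are derived from perplexity, so the function name `perplexity` and the sample values below are illustrative assumptions, not the benchmark's own scoring code.

```python
import math


def perplexity(token_logprobs):
    """Standard perplexity from natural-log token probabilities.

    PPL = exp(-mean(log p(token_i | previous tokens)))
    Lower values mean the model found the text more predictable.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical log-probabilities for three code tokens (not real benchmark data).
sample = [math.log(0.5), math.log(0.25), math.log(0.5)]
print(perplexity(sample))
```

A model that assigns probability 0.5 to every token has a perplexity of exactly 2, which is a convenient sanity check for any implementation.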

Top 5 Models Performance (bar chart; the same scores appear in ranks 1 to 5 of the table below)
Rank Model Score
🥇 qwen/qwen3.5-9b -1.6914
🥈 qwen/qwen3.5-9b-base -1.7468
🥉 qwen/qwen3.5-4b -1.7894
4 yandex/gpt-5-lite -2
5 qwen/qwen3.5-2b -2.0063
6 locoremind/locooperator-4b -2.0148
7 deepseek-ai/deepseek-coder-1.3b-instruct -2.0481
8 mistralai/ministral-3-3b-instruct-2512 -2.134
9 qwen/qwen3-4b -2.2233
10 qwen/qwen3.5-0.8b -2.313
11 openai/gpt-oss-20b -2.3451
12 huggingfacetb/smollm3-3b-base -2.4737
13 google/gemma-4-e4b-it -2.6205
14 meta-llama/llama-3.2-3b-instruct -2.7202
15 huggingfacetb/smollm2-360m -2.7327
16 qwen/qwen2.5-0.5b -2.7341
17 qwen/qwen2.5-0.5b-instruct -2.9416
18 openbmb/minicpm5-1b -2.9929
19 liquid/lfm-2.5-1.2b-instruct -3.0582
20 tiiuae/falcon-h1-tiny-coder-90m -3.4801
21 liquid/lfm-2.5-8b-a1b -3.5556
22 sapbot/grok-4-distill-smollm2-135m -3.6157
23 openai-community/gpt2 -4.0491
24 amd/amd-llama-135m -4.0935
25 bigscience/bloom-560m -6.9102
26 tencent/youtu-llm-2b -9.4946
27 liquid/lfm-2.5-350m -10.5455
28 raincandy-u/rain-100m -12.2946
29 pleias/baguettotron -19.0947
30 sapbot/toyllama-50m -37.0286
31 arnir0/tiny-llm -43.49689