CoderPPL
31 models
CoderPPL is a code perplexity (PPL) benchmark that measures how well a language model predicts real-world, hand-written code across 5 languages (Python, JavaScript, HTML, Lua, Shell) and diverse programming domains. Unlike natural-language PPL, code PPL tests syntax understanding, structural reasoning, and domain-specific patterns.
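For readers unfamiliar with the metric, the following is a minimal sketch of the standard perplexity definition: the exponential of the mean negative log-likelihood per token. The function name `perplexity` and its input (a list of natural-log token probabilities) are assumptions made for illustration. The benchmark's exact scoring formula is not stated here, and the negative scores listed on this page suggest it reports a transformed value rather than raw perplexity.

```python
import math


def perplexity(token_logprobs):
    """Standard perplexity from per-token natural-log probabilities.

    Hypothetical helper for illustration; not the benchmark's own code.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # Mean negative log-likelihood per token (cross-entropy in nats).
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    # Perplexity is the exponential of that mean.
    return math.exp(mean_nll)


# A model that assigns probability 0.5 to every token has perplexity 2.
uniform_half = [math.log(0.5)] * 4
print(perplexity(uniform_half))
```

Lower perplexity means the model found the code less surprising; a model that spreads probability evenly over k choices at every step has perplexity k.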
Top 5 Models Performance
| Model | Bar | Score |
| qwen/qwen3.5-9b | ################################## | -1.6914 |
| qwen/qwen3.5-9b-base | ################################### | -1.7468 |
| qwen/qwen3.5-4b | #################################### | -1.7894 |
| yandex/gpt-5-lite | ######################################## | -2 |
| qwen/qwen3.5-2b | ######################################## | -2.0063 |
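The bars above grow longer as the score becomes more negative. One plausible reading, sketched below, is that each bar is scaled by the score's absolute value relative to the largest magnitude shown, with a maximum width of 40 characters. The function `render_bars`, the 40-character width, and the rounding rule are assumptions for illustration, not the site's documented rendering logic.

```python
def render_bars(scores, width=40):
    """Scale each score's magnitude to a bar of '#' characters.

    Hypothetical reconstruction; the page's real rendering rule is unknown.
    """
    largest = max(abs(s) for s in scores.values())
    if largest == 0:
        return {name: "" for name in scores}
    return {
        name: "#" * round(abs(s) / largest * width)
        for name, s in scores.items()
    }


# Scores copied from the rows listed above.
scores = {
    "qwen/qwen3.5-9b": -1.6914,
    "qwen/qwen3.5-9b-base": -1.7468,
    "qwen/qwen3.5-4b": -1.7894,
    "yandex/gpt-5-lite": -2,
    "qwen/qwen3.5-2b": -2.0063,
}

for name, bar in render_bars(scores).items():
    print(f"| {name} | {bar} | {scores[name]} |")
```

Under this assumption the five bars come out at 34, 35, 36, 40, and 40 characters.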