TauBench V2 (Average)
9 models
τ²-bench (Tau Squared Bench) is an extended version of τ-bench designed to evaluate AI agents in dynamic conversational tool‑use environments. Tasks are framed as customer‑service scenarios across three domains: airline, retail, and telecom. Both the agent and a simulated user can call APIs that read from and write to a shared world state. The key metric, TauBench V2 (Average), reports the average task completion rate across the three domains. It measures how reliably an agent follows business policies and uses tools effectively, without access to vision or diffusion models.
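As a rough illustration of how an average like this could be computed, the sketch below takes one completion rate per domain and returns their mean. It assumes an unweighted mean over the three domains, which the description implies but does not spell out. The function name `tau2_average` and the per-domain numbers are hypothetical and are not taken from the leaderboard.

```python
from statistics import mean

# The three domains named in the benchmark description.
DOMAINS = ("airline", "retail", "telecom")


def tau2_average(domain_scores: dict[str, float]) -> float:
    """Return the mean completion rate (in percent) over the three domains.

    Assumption: each domain counts equally (unweighted mean).
    """
    missing = [d for d in DOMAINS if d not in domain_scores]
    if missing:
        raise ValueError(f"missing domain scores: {missing}")
    return round(mean(domain_scores[d] for d in DOMAINS), 1)


# Hypothetical per-domain completion rates, for illustration only.
example = {"airline": 60.0, "retail": 75.0, "telecom": 45.0}
print(tau2_average(example))
```

If the published figure instead weights domains by their number of tasks, the mean above would need per-domain task counts as weights.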
Top Models Performance
| Model | Relative score | Score |
|---|---|---|
| google/gemini-3.1-pro-preview | ######################################## | 99.3 |
| x-ai/grok-4.3 | ####################################### | 98 |
| anthropic/claude-opus-4.8 | ###################################### | 94 |
| google/gemma-4-31b-it | ############################### | 76.9 |
| zai-org/glm-5.1 | ############################ | 70.6 |
| nvidia/nvidia-nemotron-3-nano-30b-a3b | #################### | 49 |
| google/gemma-4-e4b-it | ################# | 42.2 |
| tencent/youtu-llm-2b | ###### | 15 |
| qwen/qwen3-4b | #### | 10.9 |
| Rank | Model | Score |
|---|---|---|
| 🥇 | google/gemini-3.1-pro-preview | 99.3 |
| 🥈 | x-ai/grok-4.3 | 98 |
| 🥉 | anthropic/claude-opus-4.8 | 94 |
| 4 | google/gemma-4-31b-it | 76.9 |
| 5 | zai-org/glm-5.1 | 70.6 |
| 6 | nvidia/nvidia-nemotron-3-nano-30b-a3b | 49 |
| 7 | google/gemma-4-e4b-it | 42.2 |
| 8 | tencent/youtu-llm-2b | 15 |
| 9 | qwen/qwen3-4b | 10.9 |
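The text bars in the chart above look as if they are drawn at about 40 characters per 100 points. That scale is inferred from the rows, not stated by the source. A minimal sketch under that assumption follows; the helper `render_bar` and the constant `BAR_WIDTH` are hypothetical names.

```python
BAR_WIDTH = 40  # assumption: a score of 100 maps to 40 '#' characters


def render_bar(score: float, width: int = BAR_WIDTH) -> str:
    """Render a score in the range 0-100 as a run of '#' characters."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    return "#" * round(score * width / 100)


# Three scores copied from the ranking table.
scores = {
    "google/gemini-3.1-pro-preview": 99.3,
    "zai-org/glm-5.1": 70.6,
    "qwen/qwen3-4b": 10.9,
}

for model, score in scores.items():
    print(f"| {model} | {render_bar(score)} | {score} |")
```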