Live data · Updated hourly

PinchBench: Real-World AI Agent Benchmarks

How do AI models perform on real agent tasks? PinchBench scores 511+ models across coding, reasoning, tool use, and instruction following, with live pricing data.

Models Tested: 511
Scenarios: 6
Avg Score: 32.4
Best Value: Qwen3.5 0.8B (Non-reasoning)

⭐ Overall

Balanced score across all agent capabilities

intelligence index (15%), coding index (15%), math index (10%), gpqa (10%), livecodebench (10%), ifbench (10%), tau2 (10%), terminalbench hard (10%), hle (10%)

🥇 #1 · 68.4
OpenAI · GPT-5.5 (xhigh)
Price: $11.25 · Speed: 60 · Efficiency: 6.1

🥈 #2 · 67.1
OpenAI · GPT-5.5 (high)
Price: $11.25 · Speed: 63 · Efficiency: 6.0

🥉 #3 · 67.1
OpenAI · GPT-5.2 (xhigh)
Price: $4.81 · Speed: 67 · Efficiency: 13.9

| # | Model | Provider | Score | Input $/M | Output $/M | Speed | TTFT | Efficiency |
|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.5 (xhigh) | OpenAI | 68.4 | $5.00 | $30.00 | 60 | 45.73s | 6.1 |
| 2 | GPT-5.5 (high) | OpenAI | 67.1 | $5.00 | $30.00 | 63 | 20.49s | 6.0 |
| 3 | GPT-5.2 (xhigh) | OpenAI | 67.1 | $1.75 | $14.00 | 67 | 68.51s | 13.9 |
| 4 | Gemini 3.1 Pro Preview | Google | 66.8 | $2.00 | $12.00 | 135 | 19.85s | 14.8 |
| 5 | Gemini 3 Pro Preview (high) | Google | 65.7 | $2.00 | $12.00 | 141 | 25.18s | 14.6 |
| 6 | GPT-5.4 (xhigh) | OpenAI | 65.4 | $2.50 | $15.00 | 79 | 150.61s | 11.6 |
| 7 | GPT-5.5 (medium) | OpenAI | 65.4 | $5.00 | $30.00 | 59 | 4.21s | 5.8 |
| 8 | Gemini 3 Flash Preview (Reasoning) | Google | 64.3 | $0.50 | $3.00 | 196 | 6.17s | 57.1 |
| 9 | Claude Opus 4.5 (Reasoning) | Anthropic | 63.4 | $6.25 | $25.00 | 54 | 11.25s | 5.8 |
| 10 | GPT-5.1 (high) | OpenAI | 63.3 | $1.25 | $10.00 | 120 | 22.17s | 18.4 |
| 11 | GPT-5.3 Codex (xhigh) | OpenAI | 63.2 | $1.75 | $14.00 | 73 | 56.09s | 13.1 |
| 12 | Gemini 3.5 Flash (high) | Google | 62.0 | $1.50 | $9.00 | 227 | 9.75s | 18.4 |
| 13 | Claude Opus 4.7 (Adaptive Reasoning, Max Effort) | Anthropic | 61.8 | $6.25 | $25.00 | 49 | 24.39s | 5.7 |
| 14 | Kimi K2.6 | Kimi | 61.8 | $0.95 | $4.00 | 95 | 1.24s | 36.1 |
| 15 | GPT-5.2 (medium) | OpenAI | 61.6 | $1.75 | $14.00 | n/a | n/a | 12.8 |
| 16 | GPT-5 Codex (high) | OpenAI | 61.6 | $1.25 | $10.00 | 166 | 7.06s | 17.9 |
| 17 | DeepSeek V4 Pro (Reasoning, Max Effort) | DeepSeek | 61.5 | $1.74 | $3.48 | 29 | 1.15s | 28.3 |
| 18 | Muse Spark | Meta | 61.3 | n/a | n/a | n/a | n/a | n/a |
| 19 | GLM-4.7 (Reasoning) | Z AI | 60.9 | $0.60 | $2.20 | 85 | 0.81s | 60.9 |
| 20 | MiMo-V2.5-Pro | Xiaomi | 60.8 | $1.00 | $3.00 | 54 | 2.28s | 40.5 |

💰 Best Cost Efficiency: Overall

Score per dollar (higher = better value). Only models with pricing data.

| # | Model | Score per dollar | Price |
|---|---|---|---|
| 1 | Qwen3.5 0.8B (Non-reasoning) | 822.6 | $0.02 |
| 2 | Qwen3.5 4B (Reasoning) | 654.3 | $0.06 |
| 3 | Qwen3.5 0.8B (Reasoning) | 607.5 | $0.02 |
| 4 | Qwen3.5 2B (Non-reasoning) | 601.8 | $0.04 |
| 5 | Qwen3.5 2B (Reasoning) | 567.8 | $0.04 |
| 6 | Qwen3.5 4B (Non-reasoning) | 553.1 | $0.06 |
| 7 | gpt-oss-20B (high) | 506.9 | $0.09 |
| 8 | NVIDIA Nemotron 3 Nano 30B A3B (Reasoning) | 460.0 | $0.10 |
| 9 | Gemma 3n E4B Instruct | 455.7 | $0.03 |
| 10 | NVIDIA Nemotron Nano 9B V2 (Reasoning) | 413.4 | $0.07 |

⚡ Score vs Speed: Overall

Models in the top-right are both fast and capable.

| Provider | Model | Score | Speed |
|---|---|---|---|
| Inception | Mercury 2 | 44.3 | 818 |
| IBM | Granite 3.3 8B (Non-reasoning) | 10.6 | 426 |
| IBM | Granite 4.0 H Small | 16.4 | 364 |
| OpenAI | gpt-oss-120b (high) | 52.9 | 272 |
| Google | Gemini 3.1 Flash-Lite Preview | 40.8 | 288 |
| Google | Gemini 3.5 Flash (high) | 62.0 | 227 |
| OpenAI | gpt-oss-20B (high) | 44.6 | 270 |
| Alibaba | Qwen3.5 2B (Non-reasoning) | 24.1 | 318 |
| NVIDIA | Nemotron 3 Nano Omni 30B A3B Reasoning | 27.9 | 307 |
| OpenAI | gpt-oss-120b (low) | 37.8 | 276 |
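
One way to make "top-right" concrete is a Pareto front: keep only the models that no other model beats on both score and speed. The page does not say how it defines the top-right region, so the sketch below is an illustrative assumption rather than PinchBench's method. The `Model` type and `pareto_front` function are hypothetical names, and the data is copied from the score and speed figures listed above.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    score: float  # Overall score (0-100)
    speed: float  # speed figure as shown on the page


# Score/speed pairs copied from the Score vs Speed listing.
MODELS = [
    Model("Mercury 2", 44.3, 818),
    Model("Granite 3.3 8B (Non-reasoning)", 10.6, 426),
    Model("Granite 4.0 H Small", 16.4, 364),
    Model("gpt-oss-120b (high)", 52.9, 272),
    Model("Gemini 3.1 Flash-Lite Preview", 40.8, 288),
    Model("Gemini 3.5 Flash (high)", 62.0, 227),
    Model("gpt-oss-20B (high)", 44.6, 270),
    Model("Qwen3.5 2B (Non-reasoning)", 24.1, 318),
    Model("Nemotron 3 Nano Omni 30B A3B Reasoning", 27.9, 307),
    Model("gpt-oss-120b (low)", 37.8, 276),
]


def pareto_front(models):
    """Return models that no other model matches or beats on both axes.

    A model is dominated when another model is at least as good on score
    and speed, and strictly better on at least one of them.
    """
    front = []
    for m in models:
        dominated = any(
            o.score >= m.score
            and o.speed >= m.speed
            and (o.score > m.score or o.speed > m.speed)
            for o in models
        )
        if not dominated:
            front.append(m)
    # Fastest first, so the list reads from right to left along the speed axis.
    return sorted(front, key=lambda m: m.speed, reverse=True)


for m in pareto_front(MODELS):
    print(f"{m.name}: score={m.score}, speed={m.speed}")
```

On the ten models listed here, this keeps Mercury 2, gpt-oss-120b (high), and Gemini 3.5 Flash (high); every other entry is both slower and lower-scoring than at least one of those three.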

Frequently Asked Questions

What is PinchBench and how does it differ from traditional benchmarks?

PinchBench evaluates AI models on real-world agent tasks spanning coding, reasoning, tool use, and instruction following. Unlike academic benchmarks that test isolated capabilities, PinchBench combines multiple benchmark dimensions to reflect how models perform as autonomous agents in practical workflows.

Which scenarios does PinchBench test?

PinchBench covers 6 scenarios: Coding Agent (code generation, debugging, terminal use), Reasoning & Logic (math, science, multi-step problems), Instruction Following (format compliance, structured output), Research & Analysis (scientific reasoning, knowledge), Tool Use & Agentic (multi-turn orchestration, planning), and an Overall balanced score.

How are scores calculated?

Each scenario uses a weighted combination of relevant benchmarks. For example, Coding Agent combines LiveCodeBench, TerminalBench, SciCode, and the Artificial Analysis Coding Index. Scores are normalized to 0-100. Cost efficiency is calculated as score divided by price per million tokens.
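
As a rough illustration of the calculation described above, the following sketch computes a weighted score and a cost-efficiency figure. The weights are the Overall weights listed on the page. The function names are hypothetical, and the 3:1 input-to-output price blend is an inference: it reproduces the $11.25 and $4.81 card prices from the listed input and output prices, but the page does not state it. How PinchBench handles a missing benchmark is also not stated; this sketch simply raises an error.

```python
# Overall scenario weights as listed on the page (they sum to 100%).
OVERALL_WEIGHTS = {
    "intelligence index": 0.15,
    "coding index": 0.15,
    "math index": 0.10,
    "gpqa": 0.10,
    "livecodebench": 0.10,
    "ifbench": 0.10,
    "tau2": 0.10,
    "terminalbench hard": 0.10,
    "hle": 0.10,
}


def weighted_score(benchmarks, weights):
    """Weighted average of benchmark scores already normalized to 0-100."""
    missing = set(weights) - set(benchmarks)
    if missing:
        # Assumption: refuse to score rather than guess at missing values.
        raise ValueError(f"missing benchmarks: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(benchmarks[name] * w for name, w in weights.items()) / total_weight


def blended_price(input_per_m, output_per_m, input_parts=3, output_parts=1):
    """Blend input and output $/M prices (assumed 3:1 ratio, see lead-in)."""
    total_parts = input_parts + output_parts
    return (input_parts * input_per_m + output_parts * output_per_m) / total_parts


def cost_efficiency(score, price_per_m):
    """Score divided by price per million tokens."""
    if price_per_m <= 0:
        raise ValueError("price must be positive")
    return score / price_per_m


# Check against the figures shown for GPT-5.5 (xhigh): $5.00 in, $30.00 out, score 68.4.
price = blended_price(5.00, 30.00)
print(round(price, 2))                         # 11.25
print(round(cost_efficiency(68.4, price), 1))  # 6.1
```

The same arithmetic gives 13.9 for GPT-5.2 (xhigh) at $1.75 in and $14.00 out with a score of 67.1, which matches the efficiency shown for that model.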

Why do real-world results differ from academic benchmarks?

Academic benchmarks test specific skills in controlled conditions. Real agent tasks require combining multiple skills: a model might score well on individual benchmarks but struggle when tasks require coding + tool use + instruction following simultaneously. PinchBench's weighted scenario scores better approximate this combined performance.

How often is the data updated?

PinchBench data refreshes hourly from the Artificial Analysis API, ensuring you see the latest benchmark scores and pricing for all models.