PawBench
v1.0 · 150 tasks


How (Model × Harness) combinations perform on production-grade tasks

Agent Performance = f(Model, Harness)

The same 150 tasks are run across every (Model × Harness) combination, so each axis can be read independently to separate the model's contribution from the harness's.

150 Tasks · 6 Sources · 3 Harnesses · 7 Capabilities

Model × Harness Score Matrix

All 150 tasks (text + multimodal)

| Model | Hermes v2026.4.23 | OpenClaw v2026.4.24 | QwenPaw v1.1.3 | Avg | Δ |
|---|---|---|---|---|---|
| claude-opus-4.6 | 78.4 | 76.1 | 78.3 | 77.6 | +2.3 |
| deepseek-v4-pro | 72.1 | 76.9 | 75.6 | 74.9 | +4.9 |
| qwen3.6-max-preview (text-only) | 68.5 | 77.2 | 78.9 | 74.9 | +10.3 |
| qwen3.7-max (text-only) | 72.3 | 72.5 | 77.6 | 74.1 | +5.4 |
| qwen3.6-plus | 71.4 | 73.6 | 76.5 | 73.8 | +5.2 |
| qwen3.6-27b | 69.6 | 73.4 | 73.1 | 72.1 | +3.8 |
| glm-5.1 (text-only) | 66.8 | 68.5 | 76.7 | 70.6 | +9.9 |
| kimi-k2.6 | 67.7 | 69.3 | 68.9 | 68.6 | +1.6 |
| qwen3.6-35b-a3b | 56.7 | 68.3 | 68.3 | 64.4 | +11.5 |
| Avg | 69.3 | 72.9 | 74.9 | 72.3 | |
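Reading each axis independently amounts to averaging along rows (models) and columns (harnesses). The sketch below recomputes those averages from the published cell values. It assumes that Δ is the best-minus-worst harness score for a model, which the page does not state explicitly; because the cells are already rounded, a recomputed Δ can differ from the published one by 0.1. The names `SCORES`, `model_summary`, and `harness_means` are illustrative and not part of PawBench.

```python
from statistics import mean

HARNESSES = ("Hermes", "OpenClaw", "QwenPaw")

# Overall scores copied from the matrix, in HARNESSES order.
SCORES = {
    "claude-opus-4.6": (78.4, 76.1, 78.3),
    "deepseek-v4-pro": (72.1, 76.9, 75.6),
    "qwen3.6-max-preview": (68.5, 77.2, 78.9),
    "qwen3.7-max": (72.3, 72.5, 77.6),
    "qwen3.6-plus": (71.4, 73.6, 76.5),
    "qwen3.6-27b": (69.6, 73.4, 73.1),
    "glm-5.1": (66.8, 68.5, 76.7),
    "kimi-k2.6": (67.7, 69.3, 68.9),
    "qwen3.6-35b-a3b": (56.7, 68.3, 68.3),
}


def model_summary(scores):
    """Map each model to (mean across harnesses, best-minus-worst spread)."""
    return {
        model: (round(mean(row), 1), round(max(row) - min(row), 1))
        for model, row in scores.items()
    }


def harness_means(scores):
    """Map each harness to its mean score across all models."""
    columns = zip(*scores.values())  # transpose rows into per-harness columns
    return {name: round(mean(col), 1) for name, col in zip(HARNESSES, columns)}


if __name__ == "__main__":
    for model, (avg, spread) in model_summary(SCORES).items():
        print(f"{model:22s} avg={avg:5.1f} spread={spread:4.1f}")
    print(harness_means(SCORES))
```

A small spread (kimi-k2.6, claude-opus-4.6) suggests the model's score is largely harness-independent, while a large one (qwen3.6-35b-a3b, qwen3.6-max-preview) suggests the harness accounts for much of the result.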

Leaderboard

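| # | Model | Harness | Overall | Automated | LLM Judge | Tasks | Updated |
|---|---|---|---|---|---|---|---|
| 1 | qwen3.6-max-preview | QwenPaw | 78.9 | 87.2 | 81.1 | 149 | 2026-05-29 |
| 2 | claude-opus-4.6 | Hermes | 78.4 | 82.6 | 90.8 | 150 | 2026-05-29 |
| 3 | claude-opus-4.6 | QwenPaw | 78.3 | 85.3 | 83.9 | 150 | 2026-05-29 |
| 4 | qwen3.7-max | QwenPaw | 77.6 | 84.6 | 82.9 | 150 | 2026-05-29 |
| 5 | qwen3.6-max-preview | OpenClaw | 77.2 | 84.4 | 81.7 | 146 | 2026-05-29 |
| 6 | deepseek-v4-pro | OpenClaw | 76.9 | 83.5 | 80.7 | 147 | 2026-05-29 |
| 7 | glm-5.1 | QwenPaw | 76.7 | 85.0 | 83.0 | 139 | 2026-05-29 |
| 8 | qwen3.6-plus | QwenPaw | 76.5 | 84.6 | 79.1 | 147 | 2026-05-29 |
| 9 | claude-opus-4.6 | OpenClaw | 76.1 | 83.6 | 80.7 | 150 | 2026-05-29 |
| 10 | deepseek-v4-pro | QwenPaw | 75.6 | 83.7 | 80.6 | 150 | 2026-05-29 |
| 11 | qwen3.6-plus | OpenClaw | 73.6 | 82.3 | 77.2 | 150 | 2026-05-29 |
| 12 | qwen3.6-27b | OpenClaw | 73.4 | 82.2 | 77.5 | 149 | 2026-05-29 |
| 13 | qwen3.6-27b | QwenPaw | 73.1 | 83.8 | 77.6 | 149 | 2026-05-29 |
| 14 | qwen3.7-max | OpenClaw | 72.5 | 79.3 | 75.9 | 150 | 2026-05-29 |
| 15 | qwen3.7-max | Hermes | 72.3 | 80.4 | 79.9 | 150 | 2026-05-29 |
| 16 | deepseek-v4-pro | Hermes | 72.1 | 81.2 | 79.0 | 150 | 2026-05-29 |
| 17 | qwen3.6-plus | Hermes | 71.4 | 80.5 | 76.6 | 148 | 2026-05-29 |
| 18 | qwen3.6-27b | Hermes | 69.6 | 78.7 | 75.0 | 147 | 2026-05-29 |
| 19 | kimi-k2.6 | OpenClaw | 69.3 | 79.3 | 70.9 | 144 | 2026-05-29 |
| 20 | kimi-k2.6 | QwenPaw | 68.9 | 80.1 | 69.0 | 145 | 2026-05-29 |
| 21 | qwen3.6-max-preview | Hermes | 68.5 | 76.8 | 77.6 | 149 | 2026-05-29 |
| 22 | glm-5.1 | OpenClaw | 68.5 | 72.6 | 74.5 | 150 | 2026-05-29 |
| 23 | qwen3.6-35b-a3b | QwenPaw | 68.3 | 77.8 | 68.2 | 150 | 2026-05-29 |
| 24 | qwen3.6-35b-a3b | OpenClaw | 68.3 | 77.6 | 70.8 | 149 | 2026-05-29 |
| 25 | kimi-k2.6 | Hermes | 67.7 | 78.7 | 70.3 | 147 | 2026-05-29 |
| 26 | glm-5.1 | Hermes | 66.8 | 75.3 | 71.3 | 142 | 2026-05-29 |
| 27 | qwen3.6-35b-a3b | Hermes | 56.7 | 65.8 | 61.5 | 150 | 2026-05-29 |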

How it works

Three steps from model to score

  1. Pick a model and harness

    Use any OpenAI-compatible endpoint or a local model; the harness can be QwenPaw, OpenClaw, or Hermes.

  2. Run in Docker

    Each task runs in an isolated container with its workspace files mounted, under strict timeouts and with retries.

  3. Automated + LLM grading

    Python grade() functions plus an LLM judge score each task; on hybrid tasks the LLM share is zeroed if the automated score is below 0.75.
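The gating rule in the grading step can be sketched as below. This is one reading of "zero the LLM share", not the benchmark's actual code: it assumes both scores lie in [0, 1], that a hybrid task blends them with a fixed weight, and that a gated LLM share contributes zero points rather than being reweighted. The function name `combine_hybrid_score` and the default `llm_weight=0.5` are hypothetical; only the 0.75 threshold comes from the text above.

```python
def combine_hybrid_score(automated: float, llm_judge: float,
                         llm_weight: float = 0.5,
                         gate: float = 0.75) -> float:
    """Blend an automated score and an LLM-judge score for a hybrid task.

    llm_weight is an assumed value; the source only states the 0.75 gate.
    """
    for name, value in (("automated", automated), ("llm_judge", llm_judge),
                        ("llm_weight", llm_weight)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")

    # Below the gate, the LLM judge contributes nothing to the final score.
    llm_part = llm_judge if automated >= gate else 0.0
    return (1.0 - llm_weight) * automated + llm_weight * llm_part


if __name__ == "__main__":
    print(combine_hybrid_score(0.80, 0.90))  # gate passed, both parts count
    print(combine_hybrid_score(0.70, 0.90))  # gate failed, LLM share zeroed
```

The effect of such a gate is that a fluent answer cannot earn judge credit unless the checkable part of the task mostly passes.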