v1.0 · 150 tasks
PawBench
How (Model × Harness) combinations perform on production-grade tasks
Agent Performance = f(Model, Harness)
The same 150 tasks run across every (Model × Harness) combination, so each axis can be read independently to separate the model's contribution from the harness's.
150 Tasks · 6 Sources · 3 Harnesses · 7 Capabilities
Model × Harness Score Matrix
All 150 tasks (text + multimodal)
| Model | Hermes v2026.4.23 | OpenClaw v2026.4.24 | QwenPaw v1.1.3 | Avg | Δ |
|---|---|---|---|---|---|
| claude-opus-4.6 | 78.4 | 76.1 | 78.3 | 77.6 | +2.3 |
| deepseek-v4-pro | 72.1 | 76.9 | 75.6 | 74.9 | +4.9 |
| qwen3.6-max-preview (text-only) | 68.5 | 77.2 | 78.9 | 74.9 | +10.3 |
| qwen3.7-max (text-only) | 72.3 | 72.5 | 77.6 | 74.1 | +5.4 |
| qwen3.6-plus | 71.4 | 73.6 | 76.5 | 73.8 | +5.2 |
| qwen3.6-27b | 69.6 | 73.4 | 73.1 | 72.1 | +3.8 |
| glm-5.1 (text-only) | 66.8 | 68.5 | 76.7 | 70.6 | +9.9 |
| kimi-k2.6 | 67.7 | 69.3 | 68.9 | 68.6 | +1.6 |
| qwen3.6-35b-a3b | 56.7 | 68.3 | 68.3 | 64.4 | +11.5 |
| Avg | 69.3 | 72.9 | 74.9 | 72.3 | |
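The Avg and Δ columns can be recomputed from the per-harness scores. The sketch below assumes that Avg is the plain mean of a row and that Δ is the spread between a model's best and worst harness (max minus min). The page does not define Δ, so that reading is an inference from the numbers. The names `MATRIX`, `row_summary`, and `harness_average` are illustrative only.

```python
from statistics import mean

# Scores copied from the matrix above, in column order:
# (Hermes, OpenClaw, QwenPaw).
MATRIX = {
    "claude-opus-4.6": (78.4, 76.1, 78.3),
    "deepseek-v4-pro": (72.1, 76.9, 75.6),
    "qwen3.6-max-preview": (68.5, 77.2, 78.9),
    "qwen3.7-max": (72.3, 72.5, 77.6),
    "qwen3.6-plus": (71.4, 73.6, 76.5),
    "qwen3.6-27b": (69.6, 73.4, 73.1),
    "glm-5.1": (66.8, 68.5, 76.7),
    "kimi-k2.6": (67.7, 69.3, 68.9),
    "qwen3.6-35b-a3b": (56.7, 68.3, 68.3),
}


def row_summary(scores):
    """Return (average, spread) for one model's per-harness scores.

    Assumption: spread = max - min, which is how the Δ column appears
    to be derived.
    """
    avg = round(mean(scores), 1)
    spread = round(max(scores) - min(scores), 1)
    return avg, spread


def harness_average(column):
    """Average one harness column (0=Hermes, 1=OpenClaw, 2=QwenPaw)."""
    return round(mean(row[column] for row in MATRIX.values()), 1)


if __name__ == "__main__":
    for model, scores in MATRIX.items():
        avg, spread = row_summary(scores)
        print(f"{model:22s} avg={avg:5.1f} spread={spread:4.1f}")
```

Because the inputs here are already rounded to one decimal, a recomputed spread can differ from the published Δ by 0.1 for some rows. The published values were presumably computed from unrounded scores.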
Leaderboard
| # | Model | Harness | Overall ↓ | Automated | LLM Judge | Tasks | Updated |
|---|---|---|---|---|---|---|---|
| 1 | qwen3.6-max-preview | QwenPaw | 78.9 | 87.2 | 81.1 | 149 | 2026-05-29 |
| 2 | claude-opus-4.6 | Hermes | 78.4 | 82.6 | 90.8 | 150 | 2026-05-29 |
| 3 | claude-opus-4.6 | QwenPaw | 78.3 | 85.3 | 83.9 | 150 | 2026-05-29 |
| 4 | qwen3.7-max | QwenPaw | 77.6 | 84.6 | 82.9 | 150 | 2026-05-29 |
| 5 | qwen3.6-max-preview | OpenClaw | 77.2 | 84.4 | 81.7 | 146 | 2026-05-29 |
| 6 | deepseek-v4-pro | OpenClaw | 76.9 | 83.5 | 80.7 | 147 | 2026-05-29 |
| 7 | glm-5.1 | QwenPaw | 76.7 | 85.0 | 83.0 | 139 | 2026-05-29 |
| 8 | qwen3.6-plus | QwenPaw | 76.5 | 84.6 | 79.1 | 147 | 2026-05-29 |
| 9 | claude-opus-4.6 | OpenClaw | 76.1 | 83.6 | 80.7 | 150 | 2026-05-29 |
| 10 | deepseek-v4-pro | QwenPaw | 75.6 | 83.7 | 80.6 | 150 | 2026-05-29 |
| 11 | qwen3.6-plus | OpenClaw | 73.6 | 82.3 | 77.2 | 150 | 2026-05-29 |
| 12 | qwen3.6-27b | OpenClaw | 73.4 | 82.2 | 77.5 | 149 | 2026-05-29 |
| 13 | qwen3.6-27b | QwenPaw | 73.1 | 83.8 | 77.6 | 149 | 2026-05-29 |
| 14 | qwen3.7-max | OpenClaw | 72.5 | 79.3 | 75.9 | 150 | 2026-05-29 |
| 15 | qwen3.7-max | Hermes | 72.3 | 80.4 | 79.9 | 150 | 2026-05-29 |
| 16 | deepseek-v4-pro | Hermes | 72.1 | 81.2 | 79.0 | 150 | 2026-05-29 |
| 17 | qwen3.6-plus | Hermes | 71.4 | 80.5 | 76.6 | 148 | 2026-05-29 |
| 18 | qwen3.6-27b | Hermes | 69.6 | 78.7 | 75.0 | 147 | 2026-05-29 |
| 19 | kimi-k2.6 | OpenClaw | 69.3 | 79.3 | 70.9 | 144 | 2026-05-29 |
| 20 | kimi-k2.6 | QwenPaw | 68.9 | 80.1 | 69.0 | 145 | 2026-05-29 |
| 21 | qwen3.6-max-preview | Hermes | 68.5 | 76.8 | 77.6 | 149 | 2026-05-29 |
| 22 | glm-5.1 | OpenClaw | 68.5 | 72.6 | 74.5 | 150 | 2026-05-29 |
| 23 | qwen3.6-35b-a3b | QwenPaw | 68.3 | 77.8 | 68.2 | 150 | 2026-05-29 |
| 24 | qwen3.6-35b-a3b | OpenClaw | 68.3 | 77.6 | 70.8 | 149 | 2026-05-29 |
| 25 | kimi-k2.6 | Hermes | 67.7 | 78.7 | 70.3 | 147 | 2026-05-29 |
| 26 | glm-5.1 | Hermes | 66.8 | 75.3 | 71.3 | 142 | 2026-05-29 |
| 27 | qwen3.6-35b-a3b | Hermes | 56.7 | 65.8 | 61.5 | 150 | 2026-05-29 |
How it works
Three steps from model to score
1. Pick a model and harness
   Any OpenAI-compatible endpoint or local model; the harness can be QwenPaw, OpenClaw, or Hermes.
2. Run in Docker
   Each task runs in an isolated container with its workspace files mounted, under strict timeouts and with retries.
3. Automated + LLM grading
   Python grade() functions plus an LLM judge; on hybrid tasks, the LLM judge's share is set to zero when the automated score is below 0.75.
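The gating rule in the grading step can be sketched as follows. Only the 0.75 threshold comes from the text above. The 50/50 weighting, the [0, 1] score range, and the name `hybrid_score` are assumptions made for illustration, not PawBench's actual formula.

```python
def hybrid_score(automated: float, llm_judge: float,
                 llm_weight: float = 0.5, gate: float = 0.75) -> float:
    """Combine an automated score and an LLM-judge score, both in [0, 1].

    llm_weight is a hypothetical weighting; gate is the 0.75 threshold
    stated in the text. If the automated score falls below the gate, the
    LLM judge contributes nothing to the task score.
    """
    for name, value in (("automated", automated), ("llm_judge", llm_judge)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")

    auto_part = (1.0 - llm_weight) * automated
    # The LLM share only counts once the automated checks pass the gate.
    llm_part = llm_weight * llm_judge if automated >= gate else 0.0
    return auto_part + llm_part


if __name__ == "__main__":
    print(hybrid_score(0.80, 1.0))  # gate passed: both parts count
    print(hybrid_score(0.70, 1.0))  # gate failed: LLM share is zeroed
```

Under a rule like this, a well-written answer cannot earn judge credit unless it also passes most of the objective checks.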