v1.0 · 150 tasks
PawBench
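Evaluating (Model × Harness) performance on production-environment tasks
Agent performance = f(model, harness). The same set of 150 tasks is run across multiple (Model × Harness) combinations so that the two axes can be observed independently, separating the model's contribution from the harness's.
150 tasks · 6 sub-datasets · 3 agent harnesses · 7 capability dimensions
Model × Harness Score Matrix
All 150 tasks (text + multimodal)
| Model | Hermes v2026.4.23 | OpenClaw v2026.4.24 | QwenPaw v1.1.3 | Average | Δ |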
|---|---|---|---|---|---|
| claude-opus-4.6 | 78.4 | 76.1 | 78.3 | 77.6 | +2.3 |
| deepseek-v4-pro | 72.1 | 76.9 | 75.6 | 74.9 | +4.9 |
| qwen3.6-max-preview (text-only) | 68.5 | 77.2 | 78.9 | 74.9 | +10.3 |
| qwen3.7-max (text-only) | 72.3 | 72.5 | 77.6 | 74.1 | +5.4 |
| qwen3.6-plus | 71.4 | 73.6 | 76.5 | 73.8 | +5.2 |
| qwen3.6-27b | 69.6 | 73.4 | 73.1 | 72.1 | +3.8 |
| glm-5.1 (text-only) | 66.8 | 68.5 | 76.7 | 70.6 | +9.9 |
| kimi-k2.6 | 67.7 | 69.3 | 68.9 | 68.6 | +1.6 |
| qwen3.6-35b-a3b | 56.7 | 68.3 | 68.3 | 64.4 | +11.5 |
| Average | 69.3 | 72.9 | 74.9 | 72.3 | |
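As a rough sketch of how the summary column and row of the matrix can be reproduced, the Python snippet below recomputes the per-model average and the per-harness average from the published cells. It also computes a max-minus-min spread per model; treating Δ as that spread is an assumption inferred from the numbers, not something the page states, and because the inputs here are already rounded to one decimal the spread can differ from the published Δ by 0.1. The names `SCORES`, `model_summary`, and `harness_means` are illustrative only.

```python
from statistics import mean

HARNESSES = ("Hermes", "OpenClaw", "QwenPaw")

# Per-harness scores copied from the matrix above, in HARNESSES order.
SCORES = {
    "claude-opus-4.6": (78.4, 76.1, 78.3),
    "deepseek-v4-pro": (72.1, 76.9, 75.6),
    "qwen3.6-max-preview": (68.5, 77.2, 78.9),
    "qwen3.7-max": (72.3, 72.5, 77.6),
    "qwen3.6-plus": (71.4, 73.6, 76.5),
    "qwen3.6-27b": (69.6, 73.4, 73.1),
    "glm-5.1": (66.8, 68.5, 76.7),
    "kimi-k2.6": (67.7, 69.3, 68.9),
    "qwen3.6-35b-a3b": (56.7, 68.3, 68.3),
}


def model_summary(scores):
    """Return (average, spread) across harnesses for one model.

    Spread = max - min; equating it with the table's Δ is an assumption.
    """
    return round(mean(scores), 1), round(max(scores) - min(scores), 1)


def harness_means(matrix):
    """Return the mean score of each harness column across all models."""
    columns = zip(*matrix.values())
    return {name: round(mean(col), 1) for name, col in zip(HARNESSES, columns)}


if __name__ == "__main__":
    for model, scores in SCORES.items():
        avg, spread = model_summary(scores)
        print(f"{model:22s} avg={avg:5.1f} spread={spread:4.1f}")
    print(harness_means(SCORES))
```

Reading along a row (the spread) shows how sensitive a model is to the harness, while reading down a column (the harness mean) shows how much a harness lifts or lowers models overall.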
Leaderboard
| # | Model | Harness | Overall ↓ | Automated | LLM Judge | Tasks | Updated |
|---|---|---|---|---|---|---|---|
| 1 | qwen3.6-max-preview | QwenPaw | 78.9 | 87.2 | 81.1 | 149 | 2026-05-29 |
| 2 | claude-opus-4.6 | Hermes | 78.4 | 82.6 | 90.8 | 150 | 2026-05-29 |
| 3 | claude-opus-4.6 | QwenPaw | 78.3 | 85.3 | 83.9 | 150 | 2026-05-29 |
| 4 | qwen3.7-max | QwenPaw | 77.6 | 84.6 | 82.9 | 150 | 2026-05-29 |
| 5 | qwen3.6-max-preview | OpenClaw | 77.2 | 84.4 | 81.7 | 146 | 2026-05-29 |
| 6 | deepseek-v4-pro | OpenClaw | 76.9 | 83.5 | 80.7 | 147 | 2026-05-29 |
| 7 | glm-5.1 | QwenPaw | 76.7 | 85.0 | 83.0 | 139 | 2026-05-29 |
| 8 | qwen3.6-plus | QwenPaw | 76.5 | 84.6 | 79.1 | 147 | 2026-05-29 |
| 9 | claude-opus-4.6 | OpenClaw | 76.1 | 83.6 | 80.7 | 150 | 2026-05-29 |
| 10 | deepseek-v4-pro | QwenPaw | 75.6 | 83.7 | 80.6 | 150 | 2026-05-29 |
| 11 | qwen3.6-plus | OpenClaw | 73.6 | 82.3 | 77.2 | 150 | 2026-05-29 |
| 12 | qwen3.6-27b | OpenClaw | 73.4 | 82.2 | 77.5 | 149 | 2026-05-29 |
| 13 | qwen3.6-27b | QwenPaw | 73.1 | 83.8 | 77.6 | 149 | 2026-05-29 |
| 14 | qwen3.7-max | OpenClaw | 72.5 | 79.3 | 75.9 | 150 | 2026-05-29 |
| 15 | qwen3.7-max | Hermes | 72.3 | 80.4 | 79.9 | 150 | 2026-05-29 |
| 16 | deepseek-v4-pro | Hermes | 72.1 | 81.2 | 79.0 | 150 | 2026-05-29 |
| 17 | qwen3.6-plus | Hermes | 71.4 | 80.5 | 76.6 | 148 | 2026-05-29 |
| 18 | qwen3.6-27b | Hermes | 69.6 | 78.7 | 75.0 | 147 | 2026-05-29 |
| 19 | kimi-k2.6 | OpenClaw | 69.3 | 79.3 | 70.9 | 144 | 2026-05-29 |
| 20 | kimi-k2.6 | QwenPaw | 68.9 | 80.1 | 69.0 | 145 | 2026-05-29 |
| 21 | qwen3.6-max-preview | Hermes | 68.5 | 76.8 | 77.6 | 149 | 2026-05-29 |
| 22 | glm-5.1 | OpenClaw | 68.5 | 72.6 | 74.5 | 150 | 2026-05-29 |
| 23 | qwen3.6-35b-a3b | QwenPaw | 68.3 | 77.8 | 68.2 | 150 | 2026-05-29 |
| 24 | qwen3.6-35b-a3b | OpenClaw | 68.3 | 77.6 | 70.8 | 149 | 2026-05-29 |
| 25 | kimi-k2.6 | Hermes | 67.7 | 78.7 | 70.3 | 147 | 2026-05-29 |
| 26 | glm-5.1 | Hermes | 66.8 | 75.3 | 71.3 | 142 | 2026-05-29 |
| 27 | qwen3.6-35b-a3b | Hermes | 56.7 | 65.8 | 61.5 | 150 | 2026-05-29 |
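Workflow
From model to score in three steps
- 1
Choose a model and harness
Supports OpenAI-compatible APIs and local models; the harness can be QwenPaw, OpenClaw, or Hermes.
- 2
Run in Docker isolation
Each task runs in its own container, with workspace files mounted on demand and strict timeouts and retries.
- 3
Automated + LLM scoring
Each task is scored by a Python grade() function and an LLM judge; hybrid tasks are zeroed outright when the automated score is below 0.75.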
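The gating rule in step 3 can be illustrated with a minimal Python sketch. Only the 0.75 threshold and the zeroing of hybrid tasks come from the description above; the task-type labels, the 0 to 1 score scale, the `final_score` function, and the equal-weight blend of the two scores for hybrid tasks that pass the gate are all assumptions made for illustration, not PawBench's actual formula.

```python
HYBRID_GATE = 0.75  # threshold stated in the workflow description


def final_score(task_type, automated=None, judge=None, judge_weight=0.5):
    """Combine an automated score and an LLM-judge score for one task.

    Hypothetical helper: scores are assumed to lie in [0, 1], and the
    weighted blend used for passing hybrid tasks is an assumption.
    """
    if task_type == "automated":
        return automated
    if task_type == "llm_judge":
        return judge
    if task_type == "hybrid":
        # Gate: a hybrid task whose automated score is below the
        # threshold is zeroed, regardless of the judge's score.
        if automated < HYBRID_GATE:
            return 0.0
        return (1 - judge_weight) * automated + judge_weight * judge
    raise ValueError(f"unknown task type: {task_type!r}")


if __name__ == "__main__":
    print(final_score("hybrid", automated=0.74, judge=1.0))  # gated to 0.0
    print(final_score("hybrid", automated=0.75, judge=0.5))  # passes the gate
```

A gate of this kind means the judge's score cannot rescue a run that fails its automated checks, which is one plausible reason an Overall score on the leaderboard can sit below both the Automated and LLM Judge columns.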