🏆 ClawBench: Web Agent Benchmark

Can AI agents complete everyday online tasks? ClawBench scores agents on real, live websites (booking flights, ordering groceries, submitting job applications). Two corpora: V1, with 153 tasks across 144 websites · V2, with 130 newer tasks across 63 platforms. Every run is graded twice: first by a deterministic HTTP-request interception check (Stage 1, the ranking tiebreak), then by an LLM judge on the intercepted payload (Stage 2 = Reward, the sort key).
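As a rough illustration of the Stage 1 check, the sketch below compares an agent's final intercepted request against a per-task URL/method schema. The class names, the regular-expression form of the schema, and the `shop.example.com` endpoint are all assumptions made for this example; the actual ClawBench schema format may differ.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RequestSchema:
    """Hypothetical per-task schema: expected method plus a URL regex."""
    method: str
    url_pattern: str


@dataclass(frozen=True)
class InterceptedRequest:
    """Hypothetical record of the agent's final HTTP request."""
    method: str
    url: str
    body: str = ""


def stage1_intercepted(request: Optional[InterceptedRequest],
                       schema: RequestSchema) -> bool:
    """Deterministic check: no judge involved, only method and URL matching."""
    if request is None:
        # The agent never issued a final request, so nothing was intercepted.
        return False
    same_method = request.method.upper() == schema.method.upper()
    url_matches = re.fullmatch(schema.url_pattern, request.url) is not None
    return same_method and url_matches


# Illustrative task: placing an order on a made-up shop API.
schema = RequestSchema("POST", r"https://shop\.example\.com/api/orders(\?.*)?")
final_request = InterceptedRequest("post", "https://shop.example.com/api/orders")
print(stage1_intercepted(final_request, schema))
```

Only a run that passes this kind of check has a payload for the Stage 2 judge to read, which is why the two stages are reported as separate columns.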

📖 Paper · 💻 GitHub · 🗂 Dataset · 🎞 Traces V1 · 🎞 Traces V2 · 🌐 Site

Intercepted (Stage 1): the agent's final HTTP request matched the task's URL/method schema; deterministic, no judge. Reward (lenient) (Stage 2, headline metric, default sort key): the judge confirms that the intercepted payload fulfilled the instruction under the default rubric (no explicit contradiction → match). Reward (strict): the same judge (default deepseek/deepseek-v4-pro) under the stricter rubric (ambiguous → mismatch), shown for ablation. Rows are ranked by Reward (lenient) descending, with Intercepted descending as the tiebreak. V2 is Hermes-only; alternative harnesses are evaluated separately. Partial: the batch attempted fewer tasks than the full corpus (mid-run abort or queue cap); rates are computed over attempted tasks, not over the corpus.
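The ranking rule and the "rates over attempted" convention can be sketched as follows. The `Run` record, its field names, and the agent names and counts are hypothetical values chosen for illustration; they are not leaderboard results.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Run:
    """Hypothetical per-row record; counts are numbers of tasks."""
    agent: str
    corpus_size: int      # e.g. 153 for V1
    attempted: int        # tasks actually attempted in this batch
    intercepted: int      # Stage 1 passes
    reward_lenient: int   # Stage 2 passes under the default rubric
    reward_strict: int    # Stage 2 passes under the stricter rubric

    def rate(self, count: int) -> float:
        # Rates are over attempted tasks, not over the whole corpus.
        return count / self.attempted if self.attempted else 0.0

    @property
    def partial(self) -> bool:
        return self.attempted < self.corpus_size


def rank(runs: List[Run]) -> List[Run]:
    """Sort by Reward (lenient) descending, then Intercepted descending."""
    return sorted(
        runs,
        key=lambda r: (-r.rate(r.reward_lenient), -r.rate(r.intercepted)),
    )


# Made-up rows for illustration only.
runs = [
    Run("agent-a", 153, 153, 90, 60, 50),
    Run("agent-b", 153, 100, 70, 50, 40),  # partial batch
    Run("agent-c", 153, 153, 80, 60, 45),
]
ranked = rank(runs)
for r in ranked:
    flag = " (partial)" if r.partial else ""
    print(f"{r.agent}{flag}: lenient={r.rate(r.reward_lenient):.3f} "
          f"intercepted={r.rate(r.intercepted):.3f}")
```

In this example the partial batch ranks first because its lenient rate is computed over 100 attempted tasks rather than 153, which is why Partial rows should be read with that caveat in mind.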
