πŸ† ClawBench β€” Web Agent Benchmark

Can AI agents complete everyday online tasks? ClawBench scores agents on real, live websites (booking flights, ordering groceries, submitting job applications). It ships two corpora: V1 (153 tasks across 144 websites) and V2 (130 newer tasks across 63 platforms). Every run is graded twice: Stage 1 is a deterministic HTTP-request interception check and serves as the sort key; Stage 2 passes the intercepted payload to an LLM judge, producing the Reward score.

πŸ“– Paper Β· πŸ’» GitHub Β· πŸ—‚ Dataset Β· 🎞 Traces V1 Β· 🎞 Traces V2 Β· 🌐 Site

- **Intercepted** (sort key): the agent's final HTTP request matched the task's URL/method schema. Stage 1 is deterministic; no judge is involved.
- **Reward**: additionally requires the LLM judge (default `deepseek/deepseek-v4-pro`) to confirm that the intercepted payload fulfilled the instruction (Stage 2).
- Rows are ranked by Intercepted, normalized to the full corpus size (e.g. intercepted / 130 for V2, so partial batches can't outrank complete ones), with Reward as the tiebreak. β€” = no Stage-2 data yet.
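The Stage-1 check above (final request matched against the task's URL/method schema) can be sketched roughly as follows. This is a minimal illustration, not ClawBench's actual code; the field names (`method`, `url`, `url_pattern`) and the prefix-matching rule are assumptions.

```python
from urllib.parse import urlparse

def stage1_intercepted(final_request: dict, schema: dict) -> bool:
    """Deterministic Stage-1 check: does the agent's final HTTP request
    match the task's expected method and URL pattern?
    Field names and matching rule are illustrative, not the real schema."""
    if final_request["method"].upper() != schema["method"].upper():
        return False
    url = urlparse(final_request["url"])
    expected = urlparse(schema["url_pattern"])
    # Require an exact host match and the expected path as a prefix.
    return url.netloc == expected.netloc and url.path.startswith(expected.path)
```

Because this stage is pure string/URL matching, it needs no judge model and is fully reproducible across runs.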

| Rank | Model | Harness | Corpus | Intercepted | Reward | Pass | Total | Wall (h) |
|-----:|-------|---------|--------|------------:|-------:|-----:|------:|---------:|
| 1 | openrouter-owl-alpha | openclaw | v2 | 48.46% | 18.46% | 24 | 130 | β€” |
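The ranking rule (corpus-normalized Intercepted rate, Reward as tiebreak) can be sketched as below. The row dictionaries and field names are illustrative assumptions, not the leaderboard's real data model.

```python
def rank_rows(rows: list[dict]) -> list[dict]:
    """Sort leaderboard rows by intercepted / corpus total (descending),
    breaking ties on reward. Dividing by the full corpus size (e.g. 130
    for V2) means a partial batch can't outrank a complete one."""
    return sorted(
        rows,
        key=lambda r: (r["intercepted"] / r["total"], r["reward"]),
        reverse=True,
    )
```

Normalizing by the fixed corpus total rather than by tasks attempted is what makes the sort key comparable across complete and partial runs.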