# ClawBench – Web Agent Benchmark
Can AI agents complete everyday online tasks? ClawBench scores agents on real, live websites (booking flights, ordering groceries, submitting job applications). Two corpora: V1 – 153 tasks across 144 websites · V2 – 130 newer tasks across 63 platforms. Every run is graded twice: first a deterministic HTTP-request interception check (Stage 1, the sort key), then an LLM judge on the intercepted payload (Stage 2, reported as Reward).
Paper · GitHub · Dataset · Traces V1 · Traces V2 · Site
Intercepted (sort key) = the agent's final HTTP request matched the task's URL/method schema – Stage 1, deterministic, no judge. Reward = additionally requires the LLM judge (default `deepseek/deepseek-v4-pro`) to confirm the payload fulfilled the instruction – Stage 2. Rows are ranked by Intercepted, corpus-normalized (intercepted / 130 for V2) so partial batches don't outrank complete ones, with Reward as the tiebreak. – = no Stage-2 data yet.
| Rank | Model | Harness | Corpus | Intercepted | Reward | Pass | Total | Wall (h) |
|---|---|---|---|---|---|---|---|---|
| 1 | openrouter-owl-alpha | openclaw | v2 | 48.46% | 18.46% | 24 | 130 | – |
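A minimal sketch of the ranking rule described above; the row fields and the corpus-size map are illustrative assumptions, not ClawBench's actual data model:

```python
# Corpus sizes from the two ClawBench corpora described above.
CORPUS_SIZES = {"v1": 153, "v2": 130}

def sort_key(row: dict) -> tuple[float, float]:
    # Normalize by the full corpus size, not by tasks attempted, so a
    # partial batch can never outrank a complete one.
    total = CORPUS_SIZES[row["corpus"]]
    return (row["intercepted"] / total,  # Stage 1: primary sort key
            row["rewarded"] / total)     # Stage 2: tiebreak

rows = [
    {"corpus": "v2", "intercepted": 63, "rewarded": 24},  # complete batch
    {"corpus": "v2", "intercepted": 40, "rewarded": 38},  # partial batch
]
rows.sort(key=sort_key, reverse=True)  # rank 1 = highest interception rate
```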
## About ClawBench
### Why a new benchmark?
Existing browser-agent benchmarks either run on synthetic or sandboxed websites (WebArena, VisualWebArena) or only check whether the agent reached the endpoint (WebVoyager). ClawBench runs on live, real-world websites and verifies the payload the agent submitted, so an agent that types the wrong delivery address into Uber Eats fails even if its last HTTP request hit the correct endpoint.
### Two corpora
- V1 – 153 tasks across 144 real websites (the corpus from the paper).
- V2 – 130 newer everyday tasks across 63 platforms, with expanded coverage of e-commerce, form-filling, and authentication-walled flows.
### Two-stage scoring
| Stage | What it checks | Output |
|---|---|---|
| 1. Interception | Did the final HTTP request match the task's URL + method + canonical body schema? | intercepted ∈ {true, false} |
| 2. Judge | Given the natural-language instruction and the intercepted payload, did the agent submit the right thing? | match ∈ {true, false, null} |
Reward = Intercepted ∧ Match. Full prompt + judge model details: eval/scoring.md →
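As a sketch of how the two stages combine – the request/task field names and the strictness of the body-schema comparison are assumptions here; the real checks live in the eval package:

```python
from typing import Optional

def stage1_intercepted(final_request: dict, task: dict) -> bool:
    # Stage 1, deterministic: URL, HTTP method, and body keys must all
    # match the task's canonical schema. Field names are hypothetical.
    return (final_request["url"] == task["url"]
            and final_request["method"] == task["method"]
            and set(final_request["body"]) == set(task["body_schema"]))

def reward(intercepted: bool, match: Optional[bool]) -> bool:
    # Reward = Intercepted AND Match. match is None while the judge
    # hasn't run (the "no Stage-2 data yet" case), which yields False.
    return intercepted and match is True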
### What ships with every run
A 5-layer trace bundle (downloadable from the Traces datasets above):
- `recording.mp4` – full browser session video
- `actions.jsonl` – every click / type / scroll
- `agent-messages.jsonl` – model inputs & outputs (incl. reasoning)
- `requests.jsonl` – every HTTP request the page made
- `interception.json` – graded final request
- `run-meta.json` – model, harness, scores, timing
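A minimal sketch for loading one bundle, assuming the files above sit at the root of a run directory (the JSON schemas inside each layer are not shown here):

```python
import json
from pathlib import Path

def read_jsonl(path: Path) -> list[dict]:
    # One JSON object per line, as in the *.jsonl layers above.
    return [json.loads(line) for line in path.read_text().splitlines() if line]

def load_bundle(run_dir: str) -> dict:
    run = Path(run_dir)
    return {
        "actions": read_jsonl(run / "actions.jsonl"),
        "messages": read_jsonl(run / "agent-messages.jsonl"),
        "requests": read_jsonl(run / "requests.jsonl"),
        "interception": json.loads((run / "interception.json").read_text()),
        "meta": json.loads((run / "run-meta.json").read_text()),
    }
```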
### Reproducing
```bash
pip install clawbench-eval
clawbench run --model <your-model> --harness hermes --corpus v2
python scripts/clawbench_rescore.py --judge-model deepseek-v4-pro --only-batch <your-batch-dir>
```
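To pull the resulting cells out of the scorer's output, something like the snippet below works; the exact keys inside `rescore-summary.json` come from `clawbench_rescore.py`, not from this page, so inspect the file from your own batch for the real schema:

```python
import json

with open("rescore-summary.json") as f:
    summary = json.load(f)

# Print whatever the scorer emitted (pass counts, reward rate, etc.).
for key, value in sorted(summary.items()):
    print(f"{key}: {value}")
```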
## Submit your model
Submissions are accepted as PRs to the leaderboard CSV in the dataset repo:
Open the CSV in the dataset repo →
### Required steps
- **Run the benchmark** – install with `pip install clawbench-eval`, then `clawbench run --model <your-model> --harness hermes --corpus v2` (or `v1`). Use the included harnesses (hermes / openclaw) so traces follow the standard 5-layer format.
- **Score** – `python scripts/clawbench_rescore.py --judge-model deepseek-v4-pro --only-batch <your-batch-dir>` produces `rescore-summary.json` with the cells you'll need.
- **Upload traces (recommended)** – push the 5-layer run bundles to `TIGER-Lab/ClawBenchV2Trace` (or `NAIL-Group/ClawBenchV1Trace`) so others can audit your runs.
- **Open a PR** – add one row per `(model, harness, corpus)` to `leaderboard/results.csv` with columns `model`, `harness`, `dataset`, `passed`, `total`, `pass_rate`, `reward_rate`, `wall_hours` (a hypothetical example row follows this list). Link the trace bundle in the PR description.
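For reference, a hypothetical `results.csv` row. The numbers are placeholders, and whether `passed`/`pass_rate` report Stage-1 interceptions (as assumed here) or Stage-2 passes should follow whatever `rescore-summary.json` emits:

```csv
model,harness,dataset,passed,total,pass_rate,reward_rate,wall_hours
my-model,hermes,v2,63,130,48.46,18.46,5.2
```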
We re-run a sample of your submitted traces with our judge before merging, to keep the table honest.
For step-by-step instructions, see eval/scoring.md.