Vercel Sandbox
5 harnesses
10 suites
Agent Benchmark Harness
Public leaderboard for Claude Code, Codex, OpenCode, Eve, and Mastra runs across coding, terminal, browser, OS, tool-use, finance, and skills benchmarks.
Completed Cells: 0 (0 failed)
Average Score: n/a (completed cells)
Average Runtime: n/a (completed cells)
Recent Runs: 0 (last 50 stored runs)
Top Scores
Highest normalized scores by suite, harness, and model.
Recent Runs
Latest run status and cell completion.
Harness Coverage
Average score by suite and harness, with sample counts in each cell.
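| Suite | claude-code | codex | opencode | eve | mastra |
|---|---|---|---|---|---|
| swe-bench-verified | n/a (0) | n/a (0) | n/a (0) | n/a (0) | n/a (0) |
| swe-bench-pro | n/a (0) | n/a (0) | n/a (0) | n/a (0) | n/a (0) |
| terminal-bench-2.1 | n/a (0) | n/a (0) | n/a (0) | n/a (0) | n/a (0) |
| osworld-verified | n/a (0) | n/a (0) | n/a (0) | n/a (0) | n/a (0) |
| browsecomp | n/a (0) | n/a (0) | n/a (0) | n/a (0) | n/a (0) |
| mcp-atlas | n/a (0) | n/a (0) | n/a (0) | n/a (0) | n/a (0) |
| tau2-telecom | n/a (0) | n/a (0) | n/a (0) | n/a (0) | n/a (0) |
| finance-agent-v2 | n/a (0) | n/a (0) | n/a (0) | n/a (0) | n/a (0) |
| vibe-code-bench-1.1 | n/a (0) | n/a (0) | n/a (0) | n/a (0) | n/a (0) |
| skillsbench | n/a (0) | n/a (0) | n/a (0) | n/a (0) | n/a (0) |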
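The coverage grid pairs an average score with a sample count for each suite and harness. As a rough sketch of how such a cell could be computed, the Python below groups result rows by (suite, harness). The row keys `suite`, `harness`, and `score`, the function names, and the choice to skip rows without a score are all assumptions made for illustration; the page does not describe how the harness aggregates its results.

```python
from collections import defaultdict


def coverage(rows):
    """Return {(suite, harness): (average_score, sample_count)}.

    `rows` is an iterable of dicts with hypothetical keys "suite",
    "harness", and "score". Rows whose score is None are skipped
    (an assumption; the real harness may count them differently).
    """
    scores = defaultdict(list)
    for row in rows:
        if row["score"] is not None:
            scores[(row["suite"], row["harness"])].append(row["score"])
    return {key: (sum(vals) / len(vals), len(vals)) for key, vals in scores.items()}


def format_cell(table, suite, harness):
    """Render one grid cell as 'avg (n)', or 'n/a (0)' when there are no samples."""
    if (suite, harness) not in table:
        return "n/a (0)"
    avg, count = table[(suite, harness)]
    return f"{avg:.2f} ({count})"
```

With no stored runs, every lookup falls through to the `n/a (0)` branch, which matches the empty grid shown above.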
Leaderboard
Normalized result rows from completed and failed cells.
| Model | Harness | Suite | Score | Pass Rate | Cells | Avg Runtime |
|---|---|---|---|---|---|---|