Agent Benchmark Harness

Public leaderboard for Claude Code, Codex, OpenCode, Eve, and Mastra runs across coding, terminal, browser, OS, tool-use, finance, and skills benchmarks.

Completed Cells: 0 failed
Average Score: n/a (completed cells)
Average Runtime: n/a (completed cells)
Recent Runs: last 50 stored runs

Top Scores

Highest normalized scores by suite, harness, and model.
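
As a rough illustration of how a "top scores" view like this could be derived, the sketch below keeps the highest normalized score per (suite, harness, model) group. The row fields (`suite`, `harness`, `model`, `score`) and the convention that unscored cells carry `None` are assumptions made for this example, not the harness's actual schema.

```python
def top_scores(rows):
    """Return the best score per (suite, harness, model), highest first.

    `rows` is an iterable of dicts with hypothetical keys:
    "suite", "harness", "model", and "score" (None when unscored).
    """
    best = {}
    for row in rows:
        score = row.get("score")
        if score is None:
            # Assumed convention: failed or unscored cells have no score.
            continue
        key = (row["suite"], row["harness"], row["model"])
        if key not in best or score > best[key]:
            best[key] = score
    # Sort groups by their best score, descending.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

With no completed cells, as on this page, the function returns an empty list, which matches an empty Top Scores panel.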

Recent Runs

Latest run status and cell completion.

Harness Coverage

Average score by suite and harness, with sample counts in each cell.

Suite	claude-code	codex	opencode	eve	mastra
swe-bench-verified	n/a (0 samples)
swe-bench-pro	n/a (0 samples)
terminal-bench-2.1	n/a (0 samples)
osworld-verified	n/a (0 samples)
browsecomp	n/a (0 samples)
mcp-atlas	n/a (0 samples)
tau2-telecom	n/a (0 samples)
finance-agent-v2	n/a (0 samples)
vibe-code-bench-1.1	n/a (0 samples)
skillsbench	n/a (0 samples)
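
The coverage grid described above (average score by suite and harness, with a sample count in each cell) could be computed along the following lines. This is a minimal sketch: the row fields, the `None` convention for unscored cells, and the `format_cell` rendering are all assumptions for illustration, not the harness's real implementation.

```python
from collections import defaultdict


def coverage(rows, suites, harnesses):
    """Map each (suite, harness) pair to (average score or None, sample count).

    `rows` uses hypothetical keys "suite", "harness", and "score".
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        if row.get("score") is None:
            # Assumed convention: unscored cells do not count as samples.
            continue
        key = (row["suite"], row["harness"])
        sums[key] += row["score"]
        counts[key] += 1

    table = {}
    for suite in suites:
        for harness in harnesses:
            key = (suite, harness)
            n = counts.get(key, 0)
            average = sums[key] / n if n else None
            table[key] = (average, n)
    return table


def format_cell(average, n):
    """Render one cell; cells without samples show "n/a"."""
    if average is None:
        return f"n/a ({n})"
    return f"{average:.2f} ({n})"
```

A cell with no samples renders as `n/a (0)`, which corresponds to the empty cells shown in the grid.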

Leaderboard

Normalized result rows from completed and failed cells.

Model	Harness	Suite	Score	Pass Rate	Cells	Avg Runtime