Benchmark — Prove it holds up
The Benchmark proves which agent and model actually hold up on the work your teams really do, on your own hardware, before you commit. It evaluates complete agent systems, not individual models in isolation.
Coming soon
What it measures
Six dimensions, end to end
Every build task is scored across six dimensions, so a result is more than just "did it run".
01
Correctness
Does it build and do its tests pass? Compilation, the share of passing tests and coverage against the pinned baseline.
02
Architecture
How clean is the structure? Static analysis, cyclomatic complexity, maintainability and adherence to the plan's guidelines.
03
Robustness
Does it hold up beyond the happy path? Type safety, dependency health and resilience: the qualities that passing tests alone can miss.
04
Efficiency
Is the solution lean? Lines of code, complexity and runtime behaviour, so a working result isn't needlessly bloated.
05
Auditability
Can a human follow what was done? Docstring coverage, dead-code checks and a full trail of transcripts and diffs.
06
Security
Is it safe to ship? An automated security review plus scans on the produced patch, kept separate from the functional score.
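As a rough illustration of how a six-dimension result might be recorded, the sketch below keeps each dimension as its own value and leaves security out of the functional aggregate. The class name `Scorecard`, the 0 to 1 scale and the unweighted mean are assumptions made for illustration; the page does not describe the actual scoring formula.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Scorecard:
    """Hypothetical per-task result; every field is an assumed score in 0..1."""

    correctness: float
    architecture: float
    robustness: float
    efficiency: float
    auditability: float
    security: float  # reported on its own, never folded into functional()

    def functional(self) -> float:
        """Unweighted mean of the five non-security dimensions (an assumption)."""
        return mean([
            self.correctness,
            self.architecture,
            self.robustness,
            self.efficiency,
            self.auditability,
        ])

    def report(self) -> dict[str, float]:
        """Expose the functional and security scores side by side, not blended."""
        return {
            "functional": round(self.functional(), 3),
            "security": self.security,
        }


# Illustrative values only; these are not measured results.
card = Scorecard(
    correctness=1.0,
    architecture=0.8,
    robustness=0.6,
    efficiency=0.7,
    auditability=0.9,
    security=0.5,
)
print(card.report())
```

Keeping `security` as a separate key means a patch that builds and passes its tests cannot hide a weak security result behind a high average.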
Agent, model, hardware, task
Step 1
Define the task: prompt, expected files, build and tests.
Step 2
Pick an agent, a model and a compute target.
Step 3
Run the agent in an isolated workspace.
Step 4
Score with hard metrics and a separate LLM-as-a-judge review.
Step 5
Compare on the leaderboard and over time.
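The five steps above can be sketched as plain data plus one function. All names here (`Task`, `RunConfig`, `run_benchmark` and the stub callables) are hypothetical, and the agent and scorers are stand-ins; the sketch only illustrates the flow and is not the product's interface.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Task:
    """Step 1: prompt, expected files, build and test commands."""

    prompt: str
    expected_files: tuple[str, ...]
    build_cmd: str  # recorded only; this sketch does not execute it
    test_cmd: str   # recorded only; this sketch does not execute it


@dataclass(frozen=True)
class RunConfig:
    """Step 2: one agent, one model, one compute target."""

    agent: str
    model: str
    compute_target: str


def run_benchmark(task, config, agent_fn, hard_metrics_fn, review_fn) -> dict:
    """Steps 3 and 4: run the agent in a throwaway workspace, then score it."""
    # A temporary directory stands in for a real isolated workspace.
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        agent_fn(task, config, workspace)
        # Hard metrics and review stay under separate keys, never merged.
        return {
            "config": config,
            "hard": hard_metrics_fn(task, workspace),
            "review": review_fn(task, workspace),
        }


def stub_agent(task, config, workspace):
    """Stand-in agent: writes an empty placeholder for each expected file."""
    for name in task.expected_files:
        (workspace / name).write_text(f"# generated by {config.agent}\n")


def files_present(task, workspace):
    """Stand-in hard metric: share of expected files that exist."""
    found = sum((workspace / name).is_file() for name in task.expected_files)
    return {"expected_files_present": found / len(task.expected_files)}


def stub_review(task, workspace):
    """Stand-in for the judgement-based review."""
    return {"note": "review not implemented in this sketch"}


task = Task("Add a health endpoint", ("app.py", "test_app.py"),
            "make build", "make test")
configs = [
    RunConfig("agent-a", "model-x", "local-gpu"),
    RunConfig("agent-b", "model-y", "cloud"),
]
results = [
    run_benchmark(task, c, stub_agent, files_present, stub_review)
    for c in configs
]

# Step 5: a minimal leaderboard, ordered by the hard metric only.
leaderboard = sorted(
    results,
    key=lambda r: r["hard"]["expected_files_present"],
    reverse=True,
)
```

Passing the agent and scorers in as callables keeps the flow itself independent of any particular agent, model or compute target.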
Capabilities
Built for an honest decision
01
Agent + model + hardware
Every run pairs a coding agent, a model and a compute target, because the system as a whole is what matters, not the model alone.
02
Hard metrics vs. judgement
Mechanical checks (build, tests, static analysis, security scans) are kept separate from LLM-as-a-judge review, never blended into one misleading number.
03
Scenario tiers
Run the same use case at rising levels of difficulty and abstraction, to see where a model starts to struggle.
04
Help tiers
Run a model with no context, some context or maximum context. The spread shows how much hand-holding it needs, which explains why a model suits one team and not another.
05
Your own tasks, kept private
Bring your own repositories and coding challenges. Testing on your private stack sidesteps the training-data contamination that affects public leaderboards.
06
Reproducible & versioned
Pinned commits and a saved configuration snapshot make every run reproducible, so you can re-test on each new model release.
07
Read-only and write channels
Observe shared GPUs without disturbing them, or deploy a model to local or cloud compute just for a run.
08
Leaderboard & trends
Compare models and agents side by side and track how they change over time, with results you can automate via an agent interface.
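One way to make the "pinned commits and a saved configuration snapshot" idea concrete is to serialise the run configuration deterministically and hash it, so two runs can be checked for an identical setup. The field names, the placeholder commit value and the choice of SHA-256 over canonical JSON are assumptions for illustration, not a description of how the Benchmark stores its snapshots.

```python
import hashlib
import json


def snapshot_fingerprint(snapshot: dict) -> str:
    """Hash a configuration snapshot independently of key order."""
    # sort_keys and fixed separators give one canonical text form per snapshot.
    canonical = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical snapshot; every key and value here is a placeholder.
snapshot = {
    "agent": "agent-a",
    "model": "model-x",
    "compute_target": "local-gpu",
    "help_tier": "none",
    "scenario_tier": 1,
    "task_repo_commit": "0123abc",  # placeholder, not a real commit
}

# The same settings in a different key order yield the same fingerprint.
reordered = dict(reversed(list(snapshot.items())))
same = snapshot_fingerprint(snapshot) == snapshot_fingerprint(reordered)

# Changing any single setting, such as the model, yields a new fingerprint.
changed = snapshot_fingerprint({**snapshot, "model": "model-y"})
print(same, changed != snapshot_fingerprint(snapshot))
```

Storing the fingerprint next to each result makes it easy to confirm that a re-test on a new model release changed only the model and nothing else.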
What it answers
The decisions it resolves
In active development
The Benchmark is part of an informational preview. A login to use it in the browser is planned.