Benchmark — Prove it holds up

Benchmark

The Benchmark proves which agent and model actually hold up on the work your teams really do, on your own hardware, before you commit. It evaluates complete agent systems, not individual models in isolation.

Coming soon
The ESACA Benchmark dashboard: an agent leaderboard and a radar chart comparing coding agents across scoring dimensions.

What it measures

Six dimensions, end to end

Every build task is scored across six dimensions, so a result is more than just "did it run".

01

Correctness

Does it build and do its tests pass? Compilation, the share of passing tests and coverage against the pinned baseline.

02

Architecture

How clean is the structure? Static analysis, cyclomatic complexity, maintainability and adherence to the plan's guidelines.

03

Robustness

Does it hold up beyond the happy path? Type safety, dependency health and resilience that passing tests alone miss.

04

Efficiency

Is the solution lean? Lines of code, complexity and runtime behaviour, so a working result isn't needlessly bloated.

05

Auditability

Can a human follow what was done? Docstring coverage, dead-code checks and a full trail of transcripts and diffs.

06

Security

Is it safe to ship? An automated security review plus scans on the produced patch, kept separate from the functional score.
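
A six-dimension result can be pictured as one small record per run. The sketch below is illustrative only: the class name `DimensionScores`, the 0-to-1 scale and the unweighted mean are assumptions, not the Benchmark's actual schema or scoring formula. It shows the one rule stated above, that the security result stays outside the functional score.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class DimensionScores:
    """Hypothetical per-run result; every field is assumed to lie in [0, 1]."""
    correctness: float
    architecture: float
    robustness: float
    efficiency: float
    auditability: float
    security: float  # reported on its own, never folded into the functional score

    def functional_score(self) -> float:
        # Assumption: an unweighted mean of the five non-security dimensions.
        return mean([
            self.correctness,
            self.architecture,
            self.robustness,
            self.efficiency,
            self.auditability,
        ])


# Made-up numbers: a functionally decent run with a poor security result.
run = DimensionScores(0.9, 0.7, 0.8, 0.6, 0.5, security=0.2)
print(round(run.functional_score(), 2), run.security)
```

With this layout a run cannot hide a weak security result behind a strong functional score, because the two values are never combined.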

Agent, model, hardware, task

1. Define the task: prompt, expected files, build and tests.

2. Pick an agent, a model and a compute target.

3. Run the agent in an isolated workspace.

4. Score with hard metrics and human-style review.

5. Compare on the leaderboard and over time.
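
The first two steps can be sketched as plain data: a task definition and the agent, model and compute target chosen for a run. Everything here is hypothetical, including the names `Task`, `RunConfig` and `missing_files` and all example values. The Benchmark's real task format is not documented on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Step 1: what the agent is asked to do and how the result is checked."""
    prompt: str
    expected_files: tuple[str, ...]
    build_command: str
    test_command: str


@dataclass(frozen=True)
class RunConfig:
    """Step 2: the agent, model and compute target paired for one run."""
    agent: str
    model: str
    compute_target: str


def missing_files(task: Task, produced: set[str]) -> list[str]:
    """Expected files the agent did not produce, in the order the task lists them."""
    return [path for path in task.expected_files if path not in produced]


# Placeholder values, not real agents, models or commands.
task = Task(
    prompt="Add a /health endpoint that returns HTTP 200.",
    expected_files=("app/health.py", "tests/test_health.py"),
    build_command="make build",
    test_command="make test",
)
config = RunConfig(agent="example-agent", model="example-model", compute_target="local-gpu")

print(missing_files(task, produced={"app/health.py"}))
```

Keeping the task separate from the run configuration means the same task can be replayed against any combination of agent, model and hardware.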

Capabilities

Built for an honest decision

01

Agent + model + hardware

Every run pairs a coding agent, a model and a compute target, because the system as a whole is what matters, not the model alone.

02

Hard metrics vs. judgement

Mechanical checks (build, tests, static analysis, security scans) are kept separate from LLM-as-a-judge review, never blended into one misleading number.

03

Scenario tiers

Run the same use case at rising levels of difficulty and abstraction, to see where a model starts to struggle.
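
As a minimal sketch of how tiered results might be read, the function below finds the first tier at which a score drops below a chosen threshold. The function name, the scores and the threshold are all made up for illustration. The Benchmark may define "struggling" differently.

```python
from typing import Optional, Sequence


def first_struggling_tier(scores: Sequence[float], threshold: float) -> Optional[int]:
    """Return the 1-based tier at which the score first drops below the threshold.

    `scores` is ordered from the easiest tier to the hardest. Returns None if
    every tier meets the threshold.
    """
    for tier, score in enumerate(scores, start=1):
        if score < threshold:
            return tier
    return None


# Hypothetical scores for one use case at four rising difficulty tiers.
tier_scores = [0.95, 0.88, 0.61, 0.40]
print(first_struggling_tier(tier_scores, threshold=0.75))  # prints 3
```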

04

Help tiers

Run a model with no, some, or maximum context. The spread shows how much hand-holding it needs, which explains why a model suits one team and not another.
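
The spread between help tiers can be illustrated with a few lines of arithmetic. This is a sketch under assumptions: the tier labels `none`, `some` and `maximum` mirror the wording above, while the function name and the example scores are invented.

```python
HELP_TIERS = ("none", "some", "maximum")


def help_spread(scores: dict[str, float]) -> float:
    """Score gained when going from no context to maximum context.

    A large spread suggests the model depends heavily on hand-holding.
    """
    missing = [tier for tier in HELP_TIERS if tier not in scores]
    if missing:
        raise ValueError(f"missing help tiers: {missing}")
    return scores["maximum"] - scores["none"]


# Two hypothetical models that reach the same score with maximum context.
self_sufficient = {"none": 0.80, "some": 0.85, "maximum": 0.90}
needs_guidance = {"none": 0.30, "some": 0.65, "maximum": 0.90}

print(round(help_spread(self_sufficient), 2))
print(round(help_spread(needs_guidance), 2))
```

Both example models look identical at maximum context. Only the spread shows that the second one would suit a team that writes detailed instructions and frustrate one that does not.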

05

Your own tasks, kept private

Bring your own repositories and coding challenges. Testing on your private stack sidesteps the contamination of public leaderboards.

06

Reproducible & versioned

Pinned commits and a saved configuration snapshot make every run reproducible, so you can re-test on each new model release.
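
One common way to make a saved configuration comparable across runs is to fingerprint it. The sketch below hashes a canonical JSON form of a snapshot using only the Python standard library. The snapshot fields and the function name are assumptions, and nothing here is taken from the Benchmark's implementation.

```python
import hashlib
import json


def snapshot_fingerprint(snapshot: dict) -> str:
    """SHA-256 hex digest of a snapshot, independent of key order."""
    # sort_keys and fixed separators give the same JSON text for equal snapshots.
    canonical = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical snapshot: a pinned commit plus the run's agent, model and target.
snapshot = {
    "task_commit": "abc123",
    "agent": "example-agent",
    "model": "example-model",
    "compute_target": "local-gpu",
}
print(snapshot_fingerprint(snapshot))
```

Two runs with the same fingerprint were configured identically, so a difference in their scores points to the model release or to run-to-run variance rather than to a changed setup.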

07

Read-only and write channels

Observe shared GPUs without disturbing them, or deploy a model to local or cloud compute just for a run.

08

Leaderboard & trends

Compare models and agents side by side and track how they change over time. An agent interface lets you automate working with the results.

What it answers

The decisions it resolves

Which model is worth hosting on your own infrastructure.
How a model handles your own coding challenges before you commit.
Why a model works for one team but not for another.
Whether a new release is actually better, tested the same way each time.
When a model is ready to promote into the Workbench.

In active development

The Benchmark is part of an informational preview. Browser access via login is planned.

An initiative by
mgm technology partners and Fsas Technologies, a Fujitsu company