© 2026 Banja Labs. All rights reserved.


Model benchmarks.

We give frontier AI models the same work we ship: code, UI components, full websites, SVG graphics, and Australian legal and accounting questions. What can be checked is graded by execution; the rest we review by eye.

Code | UI | SVG | Web | Law | Acct

The board is split into two tracks. An API run is a single-shot prompt to completion; an agent run uses a coding harness with tool use and iteration. The two are different kinds of run, so composites across the tracks are not directly comparable - each track is ranked on its own. Within a track, runs are grouped by how many tasks they cover, because a composite is only directly comparable within the same task set.
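The grouping rule above can be sketched in a few lines. This is a minimal illustration, not the site's actual code: the `Run` type, the model names, and the composite values are all hypothetical, and it assumes a higher composite is better.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    model: str        # hypothetical model label
    track: str        # "api" or "agent"
    task_count: int   # number of tasks the composite covers
    composite: float  # made-up value; assumed higher is better


def rank_runs(runs):
    """Rank runs only against others in the same track and task set."""
    groups = defaultdict(list)
    for run in runs:
        groups[(run.track, run.task_count)].append(run)
    return {
        key: sorted(group, key=lambda r: r.composite, reverse=True)
        for key, group in groups.items()
    }


runs = [
    Run("model-a", "api", 87, 0.71),
    Run("model-b", "api", 87, 0.64),
    Run("model-a", "agent", 87, 0.80),
    Run("model-c", "api", 37, 0.90),
]
boards = rank_runs(runs)
```

Here `model-c` is never ranked against the 87-task runs even though its composite is numerically higher, because it sits in a different task set.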

Open any model to inspect its outputs, then open a task to see every model’s output for it side by side.

Track 1

API - single shot

One prompt in, one completion out - no tools, no iteration. Ranked independently of the other track.

Expanded suite

87 tasks

Includes the new hard benchmarks: AU statutory facts, pixel / responsive / keyboard-accessibility build fidelity, diagram and screenshot-to-code, and the agentic BAS quarter.

API single-shot | High reasoning
composite over 87 tasks
221,019 tokens | ~$8.8851 | 2026-07-01

API single-shot | High reasoning
composite over 87 tasks
264,495 tokens | ~$7.0338 | 2026-06-23

claude-opus-4-8
API single-shot | High reasoning
composite over 87 tasks
198,676 tokens | ~$3.8694 | 2026-06-25

gemini-3.1-pro-preview
API single-shot | High reasoning
composite over 87 tasks
490,151 tokens | ~$5.5057 | 2026-06-23

claude-sonnet-5
API single-shot | High reasoning
composite over 87 tasks
275,088 tokens | ~$3.4766 | 2026-06-30

deepseek-v4-pro
API single-shot | default reasoning
composite over 87 tasks
452,628 tokens | ~$0.3775 | 2026-06-25

API single-shot | High reasoning
composite over 87 tasks
936,673 tokens | ~$4.0799 | 2026-06-23

API single-shot | default reasoning
composite over 87 tasks
190,758 tokens | ~$0.3628 | 2026-06-23

API single-shot | default reasoning
composite over 87 tasks
129,215 tokens | ~$0.2103 | 2026-06-25

claude-sonnet-4-6
API single-shot | High reasoning
composite over 87 tasks
442,101 tokens | ~$6.1452 | 2026-06-25

gemini-3.5-flash
API single-shot | default reasoning
composite over 87 tasks
512,588 tokens | ~$4.3312 | 2026-06-23

API single-shot | default reasoning
composite over 87 tasks
145,730 tokens | ~$0.3093 | 2026-06-23

deepseek-v4-flash
API single-shot | default reasoning
composite over 87 tasks
312,336 tokens | ~$0.0825 | 2026-06-25

grok-4.20-reasoning
API single-shot | default reasoning
composite over 87 tasks
219,591 tokens | ~$0.4885 | 2026-06-25

gemini-3.1-flash-lite
API single-shot | default reasoning
composite over 87 tasks
79,278 tokens | ~$0.0715 | 2026-06-23

API single-shot | default reasoning
composite over 87 tasks
99,587 tokens | ~$0.1885 | 2026-06-25

claude-haiku-4-5
API single-shot | default reasoning
composite over 87 tasks
147,446 tokens | ~$0.5779 | 2026-06-25

Core suite

37 tasks

The original core suite. These models have not yet been re-run on the new hard suites.

API single-shot | High reasoning
composite over 37 tasks
317,658 tokens | ~$55.6394 | 2026-06-23

Track 2

Agent - coding harness

Run inside a coding harness with tool use and iteration, not a single API call. Ranked independently of the other track.

Expanded suite

87 tasks

Includes the new hard benchmarks: AU statutory facts, pixel / responsive / keyboard-accessibility build fidelity, diagram and screenshot-to-code, and the agentic BAS quarter.

grok-composer-2.5-fast
Agent harness | default reasoning
composite over 87 tasks
161,568 tokens | ~$3.07 | 2026-06-25

claude-opus-4-8
Agent harness | Extra-high reasoning
composite over 87 tasks
196,310 tokens | ~$3.73 | 2026-06-24

claude-sonnet-4-6
Agent harness | High reasoning
composite over 87 tasks
199,350 tokens | ~$2.27 | 2026-06-24

claude-sonnet-5
Agent harness | High reasoning
composite over 87 tasks
208,524 tokens | ~$3.96 | 2026-06-30

claude-opus-4-8
Agent harness | High reasoning
composite over 87 tasks
194,553 tokens | ~$3.7 | 2026-06-24

Agent harness | High reasoning
composite over 87 tasks
200,633 tokens | ~$7.62 | 2026-07-01

claude-opus-4-8
Agent harness | Low reasoning
composite over 87 tasks
163,926 tokens | ~$3.11 | 2026-06-24

claude-opus-4-8
Agent harness | Max reasoning
composite over 87 tasks
251,203 tokens | ~$4.77 | 2026-06-24

claude-opus-4-8
Agent harness | Medium reasoning
composite over 87 tasks
181,820 tokens | ~$3.45 | 2026-06-24

claude-haiku-4-5
Agent harness | High reasoning
composite over 87 tasks
171,973 tokens | ~$0.65 | 2026-06-24

A new card lands here each time we benchmark another model or reasoning effort.

How it works

What the benchmark measures, and how.

What is the Banja Lab model benchmark?

It measures how AI models perform on the real work Banja ships: writing code, building UI components and full websites, drawing SVG graphics, and answering Australian legal and accounting questions. Each entry is one model at one reasoning effort.

How are the models scored?

Objective tasks are graded by execution (unit tests run, numbers checked to tolerance, citations verified) - no model in the loop. The handful of visual builds are scored by a four-family vision panel (Anthropic, OpenAI, Google, xAI) against a fixed rubric, and the published number is leave-one-family-out, so a model is never judged by its own family. Every task is gate-validated first: a known-good answer must beat a battery of known-bad ones before it can score.
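To make the gate-validation idea concrete, here is a minimal sketch of one way such a check could look. The function names, the 1% tolerance, and the sample figures are assumptions for illustration only; the benchmark's real graders and thresholds are not published on this page.

```python
def within_tolerance(answer, expected, rel_tol=0.01):
    """Numeric check: answer must be within rel_tol of expected (assumed 1%)."""
    return abs(answer - expected) <= rel_tol * abs(expected)


def passes_gate(grade, known_good, known_bad):
    """A task is usable only if the known-good answer outscores every known-bad one."""
    good_score = grade(known_good)
    return all(good_score > grade(bad) for bad in known_bad)


# Hypothetical task: the expected figure is 1100.0.
def strict_grader(answer):
    return 1.0 if within_tolerance(answer, 1100.0) else 0.0


# A broken grader that accepts anything should fail the gate.
def lenient_grader(answer):
    return 1.0


known_good = 1100.0
known_bad = [1000.0, 110.0, 0.0]

strict_ok = passes_gate(strict_grader, known_good, known_bad)
lenient_ok = passes_gate(lenient_grader, known_good, known_bad)
```

With these inputs the strict grader passes the gate and the lenient one does not, which is the property the gate is meant to enforce: a grader that cannot tell good answers from bad ones never reaches the leaderboard.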

Why are API and agent runs ranked separately?

An API run is a single-shot prompt to completion: one call, no tools, no iteration. An agent run uses a coding harness that lets the model use tools and iterate before it answers. Those are different kinds of run, so the leaderboard splits into two tracks - "API - single shot" and "Agent - coding harness" - and ranks each independently. A composite from one track is not directly comparable to a composite from the other.

Which models have been benchmarked?

claude-opus-4-8 at low reasoning; claude-opus-4-8 at medium reasoning; claude-opus-4-8 at high reasoning; claude-opus-4-8 at extra-high reasoning; claude-opus-4-8 at max reasoning; claude-sonnet-4-6 at high reasoning; claude-sonnet-5 at high reasoning; claude-fable-5 at high reasoning; claude-haiku-4-5 at high reasoning; glm-5.2 at default reasoning; kimi-k2.7-code at default reasoning; gpt-5.5 at high reasoning; gpt-5.5-pro at high reasoning; gpt-5.4-mini at high reasoning; gemini-3.1-pro-preview at high reasoning; gemini-3.5-flash at default reasoning; gemini-3.1-flash-lite at default reasoning; grok-4.3 at default reasoning; grok-4.20-reasoning at default reasoning; grok-build-0.1 at default reasoning; grok-composer-2.5-fast at default reasoning; claude-opus-4-8 at high reasoning; claude-sonnet-4-6 at high reasoning; claude-sonnet-5 at high reasoning; claude-fable-5 at high reasoning; claude-haiku-4-5 at default reasoning; deepseek-v4-pro at default reasoning; deepseek-v4-flash at default reasoning.

Can I see the actual outputs?

Yes. Open a result and you can inspect every rendered output - the actual webpage, UI component, or SVG the model produced - alongside its score.

How we judge, and its limits

The soft-signal score comes from a cross-family panel, and we measured its bias.

Each soft-signal visual build is scored by a four-family vision panel: Anthropic (claude-opus-4-8, run keyless on the agent rail), OpenAI (gpt-5.5), Google (gemini-3.1-pro-preview), and xAI (grok-4.3). All four judges score the same full-page screenshot against the same four-criterion showcase rubric (brief fidelity, visual design, craft, impact) on a 1-5 scale, with an identical prompt, so the scores are comparable. The published judge axis is leave-one-family-out (LOO): a model is never scored by a judge of its own family, which removes same-family self-preference structurally. The old single-judge (claude-opus-4-8 only) score is retained alongside for comparison.
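A leave-one-family-out score can be sketched as follows. The judge-to-family mapping comes from the panel described above; everything else is an assumption: that each judge yields one 1-5 score per build, that scores are mapped linearly onto 0-1, and that the kept judges are averaged with equal weight.

```python
from statistics import mean

# Judge model -> family, as listed for the panel above.
JUDGE_FAMILY = {
    "claude-opus-4-8": "anthropic",
    "gpt-5.5": "openai",
    "gemini-3.1-pro-preview": "google",
    "grok-4.3": "xai",
}


def normalise(score_1_to_5):
    """Assumed linear mapping from the 1-5 rubric scale onto 0-1."""
    return (score_1_to_5 - 1) / 4


def loo_score(model_family, judge_scores):
    """Average the normalised scores of every judge outside the model's family."""
    kept = [
        normalise(score)
        for judge, score in judge_scores.items()
        if JUDGE_FAMILY[judge] != model_family
    ]
    return mean(kept)


# Made-up scores for a single build.
scores = {
    "claude-opus-4-8": 4.0,
    "gpt-5.5": 5.0,
    "gemini-3.1-pro-preview": 3.0,
    "grok-4.3": 4.0,
}
openai_build = loo_score("openai", scores)      # gpt-5.5's score is dropped
deepseek_build = loo_score("deepseek", scores)  # no judge shares this family
```

In this toy case the OpenAI-family build loses its one 5.0 and lands lower than it would under a plain four-judge average, while a build from a family with no judge on the panel keeps all four scores.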

Anthropic judge

claude-opus-4-8

-4.1 pts

self-preference

Scores its own family lower than the other judges do (own 0.58 vs others 0.62, n=60).

OpenAI judge

gpt-5.5

+11.8 pts

self-preference

Scores its own family higher than the other judges do (own 0.82 vs others 0.70, n=18).

Google judge

gemini-3.1-pro-preview

-13.1 pts

self-preference

Scores its own family lower than the other judges do (own 0.49 vs others 0.62, n=18).

xAI judge

grok-4.3

+21.5 pts

self-preference

Scores its own family higher than the other judges do (own 0.79 vs others 0.58, n=24).

Inter-judge agreement (n=168): Pearson 0.7563 | Spearman 0.7038 | Krippendorff alpha 0.5627
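As a rough illustration of how a mean pairwise Pearson or Spearman agreement figure can be computed, the sketch below correlates every pair of judges over the same items and averages the results. The judge labels and scores are made up, it assumes every judge scored the same items in the same order, and Krippendorff's alpha is left out.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson applied to the ranks."""
    return pearson(ranks(x), ranks(y))


def mean_pairwise(scores_by_judge, corr):
    """Average a correlation over every unordered pair of judges."""
    pairs = combinations(sorted(scores_by_judge), 2)
    return mean(corr(scores_by_judge[a], scores_by_judge[b]) for a, b in pairs)


# Made-up 0-1 scores from three hypothetical judges on four builds.
panel = {
    "judge_a": [0.2, 0.4, 0.6, 0.8],
    "judge_b": [0.3, 0.5, 0.7, 0.9],
    "judge_c": [0.1, 0.5, 0.4, 0.9],
}
mean_pearson = mean_pairwise(panel, pearson)
mean_spearman = mean_pairwise(panel, spearman)
```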

Measured self-preference: the Anthropic judge scored Anthropic-family models 0.041 lower (on the 0-1 axis) than the other judges scored the same builds (own mean 0.578 vs others 0.619, n=60 build-task pairs). The OpenAI judge's self-preference delta was +0.118 (own 0.823 vs others 0.705, n=18). The Google judge's self-preference delta was -0.131 (own 0.490 vs others 0.620, n=18). The xAI judge's self-preference delta was +0.215 (own 0.79 vs others 0.58, n=24). Inter-judge agreement across 168 build-task pairs: mean pairwise Pearson 0.7563, Spearman 0.7038, Krippendorff alpha 0.5627 (interval). To remove the bias from the published number, the soft-signal axis uses a leave-one-family-out panel: a model is never scored by a judge of its own family. The single-judge score is kept beside it so the correction is visible. The bulk of the benchmark (the objective suites) has no model in the loop at all.
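The self-preference delta described above is a difference of two means. The sketch below shows one plausible way to compute it; the row layout and toy scores are hypothetical, and pooling all other judges' scores into a single mean is an assumption (the page does not say whether the comparison mean is pooled or averaged per judge).

```python
from statistics import mean


def self_preference_delta(rows, family):
    """Own-judge mean minus other-judge mean, over builds from `family` models."""
    own = [
        r["score"] for r in rows
        if r["model_family"] == family and r["judge_family"] == family
    ]
    others = [
        r["score"] for r in rows
        if r["model_family"] == family and r["judge_family"] != family
    ]
    return mean(own) - mean(others)


# Made-up 0-1 scores for two builds by an OpenAI-family model.
rows = [
    {"model_family": "openai", "judge_family": "openai", "score": 0.9},
    {"model_family": "openai", "judge_family": "openai", "score": 0.8},
    {"model_family": "openai", "judge_family": "anthropic", "score": 0.7},
    {"model_family": "openai", "judge_family": "anthropic", "score": 0.7},
    {"model_family": "openai", "judge_family": "google", "score": 0.8},
    {"model_family": "openai", "judge_family": "google", "score": 0.6},
]
toy_delta = self_preference_delta(rows, "openai")

# The same subtraction applied to the published Anthropic means.
anthropic_delta = round(0.578 - 0.619, 3)
```

A positive delta means the judge rates its own family higher than the rest of the panel does; a negative delta means it rates them lower. Applying the subtraction to the rounded Google means gives -0.130 rather than the published -0.131, presumably because the published delta was computed from unrounded means.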