© 2026 Banja Labs. All rights reserved.


Banja Lab / Benchmarks

Model benchmarks.

We give frontier AI models the same work we ship: code, UI components, full websites, SVG graphics, and Australian legal and accounting questions. What can be checked is graded by execution; the rest we review by eye.

Code | UI | SVG | Web | Law | Acct

The board is split into two tracks. An API run is a single-shot prompt to completion; an agent run uses a coding harness with tool use and iteration. The two are different kinds of run, so composites across the tracks are not directly comparable - each track is ranked on its own. Within a track, runs are grouped by how many tasks they cover, because a composite is only directly comparable within the same task set.

Open any model to inspect its outputs, then open a task to see every model’s output for it side by side.

Track 1

API - single shot

One prompt in, one completion out - no tools, no iteration. Ranked independently of the other track.

Expanded suite

87 tasks

Includes the new hard benchmarks: AU statutory facts, pixel / responsive / keyboard-accessibility build fidelity, diagram and screenshot-to-code, and the agentic BAS quarter.

Provider | Model | Run type | Reasoning | Composite | Tasks | Tokens | Est. cost | Date
Anthropic | claude-fable-5 | API single-shot | High | 82.1% | 87 | 221,019 | ~$8.8851 | 2026-07-01
OpenAI | gpt-5.5 | API single-shot | High | 79.1% | 87 | 264,495 | ~$7.0338 | 2026-06-23
Anthropic | claude-opus-4-8 | API single-shot | High | 77.8% | 87 | 198,676 | ~$3.8694 | 2026-06-25
Google | gemini-3.1-pro-preview | API single-shot | High | 77.7% | 87 | 490,151 | ~$5.5057 | 2026-06-23
Anthropic | claude-sonnet-5 | API single-shot | High | 77.3% | 87 | 275,088 | ~$3.4766 | 2026-06-30
DeepSeek | deepseek-v4-pro | API single-shot | default | 76.5% | 87 | 452,628 | ~$0.3775 | 2026-06-25
OpenAI | gpt-5.4-mini | API single-shot | High | 76.0% | 87 | 936,673 | ~$4.0799 | 2026-06-23
Zhipu | glm-5.2 | API single-shot | default | 75.0% | 87 | 190,758 | ~$0.3628 | 2026-06-23
xAI | grok-build-0.1 | API single-shot | default | 74.6% | 87 | 129,215 | ~$0.2103 | 2026-06-25
Anthropic | claude-sonnet-4-6 | API single-shot | High | 73.4% | 87 | 442,101 | ~$6.1452 | 2026-06-25
Google | gemini-3.5-flash | API single-shot | default | 72.9% | 87 | 512,588 | ~$4.3312 | 2026-06-23
Moonshot | kimi-k2.7-code | API single-shot | default | 72.3% | 87 | 145,730 | ~$0.3093 | 2026-06-23
DeepSeek | deepseek-v4-flash | API single-shot | default | 71.4% | 87 | 312,336 | ~$0.0825 | 2026-06-25
xAI | grok-4.20-reasoning | API single-shot | default | 71.4% | 87 | 219,591 | ~$0.4885 | 2026-06-25
Google | gemini-3.1-flash-lite | API single-shot | default | 69.5% | 87 | 79,278 | ~$0.0715 | 2026-06-23
xAI | grok-4.3 | API single-shot | default | 67.1% | 87 | 99,587 | ~$0.1885 | 2026-06-25
Anthropic | claude-haiku-4-5 | API single-shot | default | 60.4% | 87 | 147,446 | ~$0.5779 | 2026-06-25

Core suite

37 tasks

The original core suite. These models have not yet been re-run on the new hard suites.

Provider | Model | Run type | Reasoning | Composite | Tasks | Tokens | Est. cost | Date
OpenAI | gpt-5.5-pro | API single-shot | High | 94.1% | 37 | 317,658 | ~$55.6394 | 2026-06-23

Track 2

Agent - coding harness

Run inside a coding harness with tool use and iteration, not a single API call. Ranked independently of the other track.

Expanded suite

87 tasks

Includes the new hard benchmarks: AU statutory facts, pixel / responsive / keyboard-accessibility build fidelity, diagram and screenshot-to-code, and the agentic BAS quarter.

Provider | Model | Run type | Reasoning | Composite | Tasks | Tokens | Est. cost | Date
xAI | grok-composer-2.5-fast | Agent harness | default | 85.0% | 87 | 161,568 | ~$3.07 | 2026-06-25
Anthropic | claude-opus-4-8 | Agent harness | Extra-high | 84.2% | 87 | 196,310 | ~$3.73 | 2026-06-24
Anthropic | claude-sonnet-4-6 | Agent harness | High | 83.9% | 87 | 199,350 | ~$2.27 | 2026-06-24
Anthropic | claude-sonnet-5 | Agent harness | High | 83.0% | 87 | 208,524 | ~$3.96 | 2026-06-30
Anthropic | claude-opus-4-8 | Agent harness | High | 82.4% | 87 | 194,553 | ~$3.7 | 2026-06-24
Anthropic | claude-fable-5 | Agent harness | High | 81.9% | 87 | 200,633 | ~$7.62 | 2026-07-01
Anthropic | claude-opus-4-8 | Agent harness | Low | 81.8% | 87 | 163,926 | ~$3.11 | 2026-06-24
Anthropic | claude-opus-4-8 | Agent harness | Max | 80.7% | 87 | 251,203 | ~$4.77 | 2026-06-24
Anthropic | claude-opus-4-8 | Agent harness | Medium | 79.1% | 87 | 181,820 | ~$3.45 | 2026-06-24
Anthropic | claude-haiku-4-5 | Agent harness | High | 58.5% | 87 | 171,973 | ~$0.65 | 2026-06-24

A new card lands here each time we benchmark another model or reasoning effort.

How it works

What the benchmark measures, and how.

What is the Banja Lab model benchmark?

It measures how AI models perform on the real work Banja ships: writing code, building UI components and full websites, drawing SVG graphics, and answering Australian legal and accounting questions. Each entry is one model at one reasoning effort.

How are the models scored?

Objective tasks are graded by execution (unit tests run, numbers checked to tolerance, citations verified) - no model in the loop. The handful of visual builds are scored by a four-family vision panel (Anthropic, OpenAI, Google, xAI) against a fixed rubric, and the published number is leave-one-family-out, so a model is never judged by its own family. Every task is gate-validated first: a known-good answer must beat a battery of known-bad ones before it can score.
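As a rough illustration of the gate described above, the sketch below pairs a tolerance-based numeric grader with a gate check. It is a minimal sketch, not Banja's grader: the names `make_numeric_grader` and `passes_gate`, the 0.01 tolerance, and the sample answers are all assumptions made for the example.

```python
import math
from typing import Callable, Sequence

Grader = Callable[[str], float]


def make_numeric_grader(expected: float, abs_tol: float = 0.01) -> Grader:
    """Build a grader that awards 1.0 when the answer parses to `expected`
    within `abs_tol`, and 0.0 otherwise. The tolerance is an assumed value."""

    def grade(answer: str) -> float:
        try:
            value = float(answer.replace(",", "").replace("$", "").strip())
        except ValueError:
            return 0.0  # unparseable answers score zero
        close = math.isclose(value, expected, rel_tol=0.0, abs_tol=abs_tol)
        return 1.0 if close else 0.0

    return grade


def passes_gate(grader: Grader, known_good: str, known_bad: Sequence[str]) -> bool:
    """A task is admitted only if the known-good answer strictly outscores
    every known-bad answer under the task's own grader."""
    good_score = grader(known_good)
    return all(good_score > grader(bad) for bad in known_bad)


# Hypothetical task: the expected figure and the bad answers are invented.
grader = make_numeric_grader(expected=1250.00)
gate_ok = passes_gate(grader, "$1,250.00", ["1,205.00", "n/a", "12500"])

# A grader that rewards everything cannot separate good from bad, so it fails.
lenient_gate_ok = passes_gate(lambda _answer: 1.0, "$1,250.00", ["n/a"])

print(gate_ok, lenient_gate_ok)
```

The point of the strict comparison is that a grader which cannot tell a correct answer from a wrong one never reaches the scoring stage.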

Why are API and agent runs ranked separately?

An API run is a single-shot prompt to completion: one call, no tools, no iteration. An agent run uses a coding harness that lets the model use tools and iterate before it answers. Those are different kinds of run, so the leaderboard splits into two tracks - "API - single shot" and "Agent - coding harness" - and ranks each independently. A composite from one track is not directly comparable to a composite from the other.
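The separation can be pictured as grouping runs by track and task count before sorting, so that no composite is ranked against one from a different track or task set. The sketch below is an assumption about how such a grouping might look, not the site's code: the `Run` type and `rank_runs` function are hypothetical, and the sample composites are taken from the board above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Run:
    model: str
    track: str       # "api" or "agent"
    effort: str      # reasoning effort label
    tasks: int       # number of tasks the composite covers
    composite: float


def rank_runs(runs: List[Run]) -> Dict[Tuple[str, int], List[Run]]:
    """Group runs by (track, task count), then sort each group by composite,
    highest first. Runs in different groups are never compared."""
    groups: Dict[Tuple[str, int], List[Run]] = defaultdict(list)
    for run in runs:
        groups[(run.track, run.tasks)].append(run)
    return {
        key: sorted(group, key=lambda r: r.composite, reverse=True)
        for key, group in groups.items()
    }


runs = [
    Run("claude-opus-4-8", "api", "high", 87, 77.8),
    Run("gpt-5.5", "api", "high", 87, 79.1),
    Run("gpt-5.5-pro", "api", "high", 37, 94.1),
    Run("claude-opus-4-8", "agent", "extra-high", 87, 84.2),
    Run("grok-composer-2.5-fast", "agent", "default", 87, 85.0),
]
ranked = rank_runs(runs)

for (track, tasks), group in ranked.items():
    print(track, tasks, [(r.model, r.composite) for r in group])
```

Under this grouping, the 94.1% core-suite composite sits in its own 37-task group and is not placed above the 87-task runs.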

Which models have been benchmarked?

API track: claude-fable-5, claude-opus-4-8, claude-sonnet-5, claude-sonnet-4-6, gpt-5.5, gpt-5.5-pro, gpt-5.4-mini, and gemini-3.1-pro-preview at high reasoning; claude-haiku-4-5, deepseek-v4-pro, deepseek-v4-flash, glm-5.2, kimi-k2.7-code, gemini-3.5-flash, gemini-3.1-flash-lite, grok-build-0.1, grok-4.20-reasoning, and grok-4.3 at default reasoning.

Agent track: claude-opus-4-8 at low, medium, high, extra-high, and max reasoning; claude-sonnet-4-6, claude-sonnet-5, claude-fable-5, and claude-haiku-4-5 at high reasoning; grok-composer-2.5-fast at default reasoning.

Can I see the actual outputs?

Yes. Open a result and you can inspect every rendered output - the actual webpage, UI component, or SVG the model produced - alongside its score.

How we judge, and its limits

The soft-signal score is a cross-family panel, and we measured its bias.

Each soft-signal visual build is scored by a four-family vision panel: Anthropic (claude-opus-4-8, run keyless on the agent rail), OpenAI (gpt-5.5), Google (gemini-3.1-pro-preview), and xAI (grok-4.3). All four judges score the same full-page screenshot against the same four-criterion showcase rubric (brief fidelity, visual design, craft, impact) on a 1-5 scale, with an identical prompt, so the scores are comparable. The published judge axis is leave-one-family-out (LOO): a model is never scored by a judge of its own family, which removes same-family self-preference structurally. The old single-judge (claude-opus-4-8 only) score is retained alongside for comparison.
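A leave-one-family-out score can be sketched as follows. Two details here are assumptions rather than anything stated above: that the remaining judges are combined with a plain mean, and that the 1-5 rubric score is mapped to the 0-1 axis as (score - 1) / 4. The function names are hypothetical.

```python
from statistics import mean
from typing import Dict


def loo_score(model_family: str, judge_scores: Dict[str, float]) -> float:
    """Average the panel's 1-5 scores, skipping the judge whose family
    matches the model under test. Averaging by mean is an assumption."""
    eligible = [
        score for judge_family, score in judge_scores.items()
        if judge_family != model_family
    ]
    if not eligible:
        raise ValueError("no judge outside the model's own family")
    return mean(eligible)


def normalise(score_1_to_5: float) -> float:
    """Map a 1-5 rubric score onto 0-1 (assumed linear mapping)."""
    return (score_1_to_5 - 1.0) / 4.0


# Invented panel scores for one build, keyed by judge family.
panel = {"anthropic": 4.0, "openai": 3.0, "google": 4.0, "xai": 5.0}

# An xAI model is scored by the other three judges only.
xai_model_score = normalise(loo_score("xai", panel))

# A model from a family with no judge on the panel keeps all four scores.
deepseek_model_score = normalise(loo_score("deepseek", panel))

print(round(xai_model_score, 4), round(deepseek_model_score, 4))
```

In this toy case the xAI judge's 5.0 is dropped for the xAI model, so a generous own-family score cannot lift the published number.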

Anthropic judge (claude-opus-4-8): -4.1 pts self-preference. Scores its own family lower than the other judges do (own 0.58 vs others 0.62, n=60).

OpenAI judge (gpt-5.5): +11.8 pts self-preference. Scores its own family higher than the other judges do (own 0.82 vs others 0.70, n=18).

Google judge (gemini-3.1-pro-preview): -13.1 pts self-preference. Scores its own family lower than the other judges do (own 0.49 vs others 0.62, n=18).

xAI judge (grok-4.3): +21.5 pts self-preference. Scores its own family higher than the other judges do (own 0.79 vs others 0.58, n=24).

Inter-judge agreement (n=168): Pearson 0.7563, Spearman 0.7038, Krippendorff alpha 0.5627.
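For readers unfamiliar with the Pearson figure, a mean pairwise Pearson correlation can be computed along these lines. This is a sketch under assumptions: the function names are hypothetical, the toy scores are invented, and it does not reproduce the Spearman or Krippendorff alpha figures.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def mean_pairwise_pearson(scores_by_judge: Dict[str, Sequence[float]]) -> float:
    """Average Pearson correlation over every unordered pair of judges.
    Each list holds one judge's scores for the same builds, in the same order."""
    pairs = list(combinations(sorted(scores_by_judge), 2))
    return sum(
        pearson(scores_by_judge[a], scores_by_judge[b]) for a, b in pairs
    ) / len(pairs)


# Invented toy scores: judges a and b agree perfectly, c is reversed.
toy = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 3.0, 2.0, 1.0],
}
agreement = mean_pairwise_pearson(toy)
print(round(agreement, 4))
```

With four judges there are six pairs, so the published figure would be an average over six correlations if it is computed this way.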

Measured self-preference: the Anthropic judge scored Anthropic-family models 0.041 lower (on the 0-1 axis) than the other judges scored the same builds (own mean 0.578 vs others 0.619, n=60 build-task pairs). The OpenAI judge's self-preference delta was +0.118 (own 0.823 vs others 0.705, n=18). The Google judge's self-preference delta was -0.131 (own 0.490 vs others 0.620, n=18). The xAI judge's self-preference delta was +0.215 (own 0.79 vs others 0.58, n=24). Inter-judge agreement across 168 build-task pairs: mean pairwise Pearson 0.7563, Spearman 0.7038, Krippendorff alpha 0.5627 (interval). To remove the bias from the published number, the soft-signal axis uses a leave-one-family-out panel: a model is never scored by a judge of its own family. The single-judge score is kept beside it so the correction is visible. The bulk of the benchmark (the objective suites) has no model in the loop at all.
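The self-preference delta reported above is the judge's mean score on its own family's builds minus the other judges' mean score on the same builds. The sketch below shows one way to compute it, assuming the other judges' ratings are pooled into a single mean; the `Rating` type, the function name, and the toy ratings are invented for illustration.

```python
from statistics import mean
from typing import List, NamedTuple


class Rating(NamedTuple):
    judge_family: str
    model_family: str
    build_id: str
    score: float  # on the 0-1 axis


def self_preference_delta(ratings: List[Rating], judge_family: str) -> float:
    """Mean score the judge gives its own family's builds, minus the mean
    score the other judges give those same builds. Positive means the judge
    favours its own family. Pooling the other judges is an assumption."""
    own_builds = {r.build_id for r in ratings if r.model_family == judge_family}
    own = [
        r.score for r in ratings
        if r.judge_family == judge_family and r.build_id in own_builds
    ]
    others = [
        r.score for r in ratings
        if r.judge_family != judge_family and r.build_id in own_builds
    ]
    if not own or not others:
        raise ValueError("need ratings from the judge and from other judges")
    return mean(own) - mean(others)


# Invented ratings for one xAI-family build (b1) and one OpenAI-family build (b2).
ratings = [
    Rating("xai", "xai", "b1", 0.8),
    Rating("openai", "xai", "b1", 0.6),
    Rating("google", "xai", "b1", 0.5),
    Rating("anthropic", "xai", "b1", 0.7),
    Rating("xai", "openai", "b2", 0.4),
    Rating("openai", "openai", "b2", 0.9),
]
xai_delta = self_preference_delta(ratings, "xai")
print(round(xai_delta, 3))
```

Only builds from the judge's own family enter the calculation, so the xAI judge's rating of the OpenAI build (b2) does not affect the xAI delta.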