Banja Lab / Benchmarks
We give frontier AI models the same work we ship: code, UI components, full websites, SVG graphics, and Australian legal and accounting questions. What can be checked is graded by execution; the rest we review by eye.
The board is split into two tracks. An API run is a single-shot prompt to completion; an agent run uses a coding harness with tool use and iteration. The two are different kinds of run, so composites across the tracks are not directly comparable - each track is ranked on its own. Within a track, runs are grouped by how many tasks they cover, because a composite is only directly comparable within the same task set.
Open any model to inspect its outputs, then open a task to see every model’s output for it side by side.
One prompt in, one completion out - no tools, no iteration. Ranked independently of the other track.
Includes the new hard benchmarks: AU statutory facts, pixel / responsive / keyboard-accessibility build fidelity, diagram and screenshot-to-code, and the agentic BAS quarter.
The original core suite. These models have not yet been re-run on the new hard suites.
Run inside a coding harness with tool use and iteration, not a single API call. Ranked independently of the other track.
Includes the new hard benchmarks: AU statutory facts, pixel / responsive / keyboard-accessibility build fidelity, diagram and screenshot-to-code, and the agentic BAS quarter.
A new card lands here each time we benchmark another model or reasoning effort.
The benchmark measures how AI models perform on the real work Banja ships: writing code, building UI components and full websites, drawing SVG graphics, and answering Australian legal and accounting questions. Each entry is one model at one reasoning effort.
Objective tasks are graded by execution (unit tests run, numbers checked to tolerance, citations verified) - no model in the loop. The handful of visual builds are scored by a four-family vision panel (Anthropic, OpenAI, Google, xAI) against a fixed rubric, and the published number is leave-one-family-out, so a model is never judged by its own family. Every task is gate-validated first: a known-good answer must beat a battery of known-bad ones before it can score.
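The gate and the tolerance check described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: the function names, the 0.005 tolerance, and the GST example task are all assumptions made for the sketch, and "beat" is read here as "score strictly higher than every known-bad answer".

```python
import math
from typing import Callable, Sequence


def within_tolerance(actual: float, expected: float, abs_tol: float = 0.005) -> bool:
    """Numeric check to tolerance. The 0.005 default is illustrative only."""
    return math.isclose(actual, expected, rel_tol=0.0, abs_tol=abs_tol)


def passes_gate(
    grader: Callable[[float], float],
    known_good: float,
    known_bad: Sequence[float],
) -> bool:
    """A task is usable only if the known-good answer outscores every known-bad one."""
    good_score = grader(known_good)
    return all(good_score > grader(bad) for bad in known_bad)


# Hypothetical task: the GST component of a $110.00 GST-inclusive price (one eleventh).
def gst_grader(answer: float) -> float:
    return 1.0 if within_tolerance(answer, 110.00 / 11) else 0.0


# A deliberately broken grader that accepts anything; the gate should reject it.
def lenient_grader(answer: float) -> float:
    return 1.0


print(passes_gate(gst_grader, 10.00, [11.00, 0.0, 100.00]))  # True
print(passes_gate(lenient_grader, 10.00, [11.00]))            # False
```

The point of the gate is that a grader which cannot tell a correct answer from a wrong one never reaches the scoring stage.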
An API run is a single-shot prompt to completion: one call, no tools, no iteration. An agent run uses a coding harness that lets the model use tools and iterate before it answers. Those are different kinds of run, so the leaderboard splits into two tracks - "API - single shot" and "Agent - coding harness" - and ranks each independently. A composite from one track is not directly comparable to a composite from the other.
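One way to picture the split is to rank runs only inside a (track, task count) group, so composites are never compared across tracks or across different task sets. The sketch below assumes a hypothetical `Run` record; the model names and composite values are made-up placeholders, not benchmark results.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Run:
    model: str
    effort: str
    track: str        # "api" (single shot) or "agent" (coding harness)
    task_count: int   # how many tasks the run covers
    composite: float  # composite score; comparable only within one group


def rank_runs(runs: List[Run]) -> Dict[Tuple[str, int], List[Run]]:
    """Group runs by (track, task_count) and sort each group by composite, best first."""
    groups: Dict[Tuple[str, int], List[Run]] = defaultdict(list)
    for run in runs:
        groups[(run.track, run.task_count)].append(run)
    return {
        key: sorted(members, key=lambda r: r.composite, reverse=True)
        for key, members in groups.items()
    }


# Placeholder data for illustration only.
runs = [
    Run("model-a", "high", "api", 40, 0.71),
    Run("model-b", "high", "api", 40, 0.78),
    Run("model-a", "high", "agent", 40, 0.83),
    Run("model-c", "default", "api", 25, 0.90),
]

for (track, task_count), members in rank_runs(runs).items():
    print(track, task_count, [m.model for m in members])
```

In this sketch `model-c` is never placed above `model-b` despite its higher composite, because the two runs cover different task sets.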
claude-opus-4-8 at low reasoning; claude-opus-4-8 at medium reasoning; claude-opus-4-8 at high reasoning; claude-opus-4-8 at extra-high reasoning; claude-opus-4-8 at max reasoning; claude-sonnet-4-6 at high reasoning; claude-sonnet-5 at high reasoning; claude-fable-5 at high reasoning; claude-haiku-4-5 at high reasoning; glm-5.2 at default reasoning; kimi-k2.7-code at default reasoning; gpt-5.5 at high reasoning; gpt-5.5-pro at high reasoning; gpt-5.4-mini at high reasoning; gemini-3.1-pro-preview at high reasoning; gemini-3.5-flash at default reasoning; gemini-3.1-flash-lite at default reasoning; grok-4.3 at default reasoning; grok-4.20-reasoning at default reasoning; grok-build-0.1 at default reasoning; grok-composer-2.5-fast at default reasoning; claude-opus-4-8 at high reasoning; claude-sonnet-4-6 at high reasoning; claude-sonnet-5 at high reasoning; claude-fable-5 at high reasoning; claude-haiku-4-5 at default reasoning; deepseek-v4-pro at default reasoning; deepseek-v4-flash at default reasoning.
Every rendered output can be inspected. Open a result to see the actual webpage, UI component, or SVG the model produced, alongside its score.
Each soft-signal visual build is scored by a four-family vision panel: Anthropic (claude-opus-4-8, run keyless on the agent rail), OpenAI (gpt-5.5), Google (gemini-3.1-pro), and xAI (grok-4.3). All four judges score the same full-page screenshot against the same four-criterion showcase rubric (brief fidelity, visual design, craft, impact) on a 1-5 scale, with an identical prompt, so the scores are comparable. The published judge axis is leave-one-family-out (LOO): a model is never scored by a judge of its own family, which removes same-family self-preference structurally. The old single-judge (claude-opus-4-8 only) score is retained alongside for comparison.
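The leave-one-family-out rule can be sketched as below. Two things are assumptions of this sketch rather than documented behaviour: each judge is represented by a single overall score on the 1-5 scale (how the four rubric criteria are combined is not stated here), and the 1-5 scale is mapped linearly onto the 0-1 axis.

```python
from statistics import mean
from typing import Mapping


def normalise(score_1_to_5: float) -> float:
    """Map a 1-5 rubric score onto 0-1. The linear mapping is an assumption."""
    return (score_1_to_5 - 1) / 4


def loo_score(judge_scores: Mapping[str, float], model_family: str) -> float:
    """Average the judges' scores, dropping any judge from the model's own family."""
    kept = [
        normalise(score)
        for family, score in judge_scores.items()
        if family != model_family
    ]
    if not kept:
        raise ValueError("no judge left after excluding the model's own family")
    return mean(kept)


# Hypothetical per-judge scores for one build, keyed by judge family.
scores = {"anthropic": 4.0, "openai": 5.0, "google": 3.0, "xai": 4.0}

print(loo_score(scores, "openai"))    # the OpenAI judge is excluded, three judges remain
print(loo_score(scores, "deepseek"))  # no judge shares this family, all four count
```

A model whose family has no judge on the panel is scored by all four judges; a model from a panel family is scored by the remaining three.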
Anthropic judge: scores its own family lower than the other judges do (own 0.58 vs others 0.62, n=60).
OpenAI judge: scores its own family higher than the other judges do (own 0.82 vs others 0.70, n=18).
Google judge: scores its own family lower than the other judges do (own 0.49 vs others 0.62, n=18).
xAI judge: scores its own family higher than the other judges do (own 0.79 vs others 0.58, n=24).
Measured self-preference: the Anthropic judge scored Anthropic-family models 0.041 lower (on the 0-1 axis) than the other judges scored the same builds (own mean 0.578 vs others 0.619, n=60 build-task pairs). The OpenAI judge's self-preference delta was +0.118 (own 0.823 vs others 0.705, n=18). The Google judge's self-preference delta was -0.131 (own 0.490 vs others 0.620, n=18). Inter-judge agreement across 168 build-task pairs: mean pairwise Pearson 0.7563, Spearman 0.7038, Krippendorff alpha 0.5627 (interval). To remove the bias from the published number, the soft-signal axis uses a leave-one-family-out panel: a model is never scored by a judge of its own family. The single-judge score is kept beside it so the correction is visible. The bulk of the benchmark (the objective suites) has no model in the loop at all.
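The two statistics quoted above can be sketched as follows, assuming the self-preference delta is a judge's mean score on its own family's builds minus the other judges' mean on those same builds, and that agreement is the plain average of Pearson correlations over every pair of judges. The function names and the toy score lists are illustrative, not the benchmark's code or data.

```python
from itertools import combinations
from math import sqrt
from statistics import mean
from typing import Mapping, Sequence


def self_preference_delta(own: Sequence[float], others: Sequence[float]) -> float:
    """Positive means the judge favours its own family; negative means it is harsher."""
    return mean(own) - mean(others)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mean_pairwise_pearson(scores_by_judge: Mapping[str, Sequence[float]]) -> float:
    """Average Pearson correlation over every unordered pair of judges."""
    return mean(
        pearson(scores_by_judge[a], scores_by_judge[b])
        for a, b in combinations(scores_by_judge, 2)
    )


# Plugging in the published Anthropic means reproduces the quoted delta.
print(round(self_preference_delta([0.578], [0.619]), 3))  # -0.041

# Toy scores: judges a and b agree perfectly, judge c ranks the builds in reverse.
toy = {"a": [0.2, 0.4, 0.6], "b": [0.3, 0.5, 0.7], "c": [0.6, 0.4, 0.2]}
print(mean_pairwise_pearson(toy))
```

The toy panel averages one correlation of +1 with two of -1, which shows how a single dissenting judge pulls the mean pairwise figure down.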