Banja Lab

Measured in the open.

The things we build to answer our own questions: experiments, research, and tools, shared in the open. We start with model benchmarks, run on the same work we ship for clients. The current run covers 28 models across 87 tasks, topped by gpt-5.5-pro at 94.1% composite.

Composite: 94.1%

Code, UI, SVG, Web, Law, Acct

Benchmarks

Model benchmarks

How frontier AI models perform on the real build and reasoning work we do: code, UI, full websites, marketing pages, dashboards, animations, SVG, and Australian legal and accounting tasks. Every result is measured, and every output is available to inspect and compare.

Models: 28
Tasks each: 87
Scoring: Execution + vision

View benchmarks

More experiments land here as we build them.