The same task, run on 27 models. Compare the outputs side by side, or open any one in a popup to inspect it.
Top result: claude-opus-4-8 (low reasoning) at 100.0% composite. Lowest: deepseek-v4-flash, also at 100.0%, so the top and bottom of the ranking are tied on composite score.
Build a single self-contained HTML file (`index.html`) that renders one call-to-action button, with no build step and no network calls (inline all CSS, no external fonts or scripts). The grader measures the rendered button box and computed style, so match the spec exactly.

Button `<button id="cta">`:
- width: 220px
- height: 52px
- border-radius: 26px (a full pill at this height)
- font-size: 17px, font-weight 600
- background colour: #16a34a
- text colour: #ffffff
- text: Get started free

The 220 by 52 box, the 26px radius, the 17px font-size, and the #16a34a fill are the contract. A visually similar green pill that is off by a few px or a few hex points fails.
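As an illustration of the spec above, here is a minimal sketch of one `index.html` that should satisfy it. It is not any model's actual submission. Several declarations are assumptions added to keep the measured box at exactly 220 by 52: `box-sizing: border-box`, `padding: 0`, and `border: 0` are not in the spec, and the `font-family`, `cursor`, `type="button"`, and `<title>` values are arbitrary choices. How the grader reads the values is not stated; the sketch assumes it inspects the rendered bounding box and the computed style.

```html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Get started free</title>
<style>
  /* All CSS lives in this file: no external fonts, scripts, or stylesheets. */
  #cta {
    /* Assumption: keep padding and border inside the 220 x 52 box. */
    box-sizing: border-box;
    padding: 0;
    border: 0;

    /* Values taken directly from the spec. */
    width: 220px;
    height: 52px;
    border-radius: 26px;
    font-size: 17px;
    font-weight: 600;
    background-color: #16a34a;
    color: #ffffff;

    /* Assumption: a system font stack, so nothing is fetched over the network. */
    font-family: system-ui, sans-serif;
    cursor: pointer;
  }
</style>
</head>
<body>
<button id="cta" type="button">Get started free</button>
</body>
</html>
```

Browsers report computed colours in `rgb()` form, so a check against `getComputedStyle` would see the fill `#16a34a` as `rgb(22, 163, 74)` and the text colour as `rgb(255, 255, 255)`. Resetting the default button border and padding matters because either one can change the rendered box or the visible fill.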